Linear Regression and Distribution-Free Tests
Chi-squared Test
The Chi-squared test is a non-parametric test used to determine if there is a significant association between categorical variables. It compares the observed frequencies in each category to the frequencies we would expect if there were no association.
The test statistic is calculated as:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

- $O$ = observed frequency
- $E$ = expected frequency, calculated as $E = \dfrac{\text{row total} \times \text{column total}}{\text{grand total}}$
Suppose we want to test if there is an association between gender (male, female) and preference for a product (like, dislike). We collect the following data:
| | Like | Dislike | Total |
|---|---|---|---|
| Male | 30 | 10 | 40 |
| Female | 20 | 40 | 60 |
| Total | 50 | 50 | 100 |
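Working the table through by hand, the expected count in each cell under independence is (row total × column total) / 100. A minimal pure-Python sketch of the computation (for reference, `scipy.stats.chi2_contingency` with `correction=False` reports the same statistic):

```python
# Chi-squared test of independence for the gender/preference table above.
observed = [[30, 10],   # Male:   Like, Dislike
            [20, 40]]   # Female: Like, Dislike

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 16.67
```

With $(2-1)(2-1) = 1$ degree of freedom, 16.67 far exceeds the 0.05 critical value of 3.84, so we would reject the null hypothesis of no association.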
Fixed Level Testing
In fixed level testing, the significance level ($\alpha$) is predetermined, typically set at 0.05 or 0.01. This level is the probability of making a Type I error.
If $\alpha = 0.05$, we reject the null hypothesis if the p-value is less than 0.05.
Type I and Type II Errors
- Type I Error ($\alpha$): The probability of rejecting the null hypothesis when it is true. For example, concluding that a new drug is effective when it is not.
- Type II Error ($\beta$): The probability of failing to reject the null hypothesis when it is false. For example, concluding that a new drug is not effective when it actually is.
Power of a Test
The power of a test is the probability of correctly rejecting the null hypothesis when it is false. It is influenced by several factors:
Factors Affecting Power
- Sample Size (n): Increasing the sample size reduces variability and increases power.
- Effect Size: Larger effect sizes (the difference between the null hypothesis and the true value) increase power.
- Significance Level ($\alpha$): Increasing $\alpha$ increases power but also increases the risk of a Type I error.
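The effect of sample size on power can be seen with a small Monte Carlo sketch. The setup here is an illustrative assumption, not from the notes: a two-sided one-sample z-test of $H_0\colon \mu = 0$ with known $\sigma = 1$, where the true mean is `effect`.

```python
import random
import statistics

def power_estimate(effect, n, sims=2000, seed=1):
    """Estimate power of a two-sided z-test at alpha = 0.05 by simulation."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided critical value for alpha = 0.05
    rejections = 0
    for _ in range(sims):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = statistics.mean(sample) / (1.0 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / sims

# Larger n gives higher power for the same effect size:
print(power_estimate(0.5, n=10))
print(power_estimate(0.5, n=40))
```

Running this shows the rejection rate climbing as `n` grows, matching the first factor in the list above.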
Simple Linear Regression
Simple linear regression is used to model the relationship between a dependent variable $y$ and an independent variable $x$. The goal is to find the best-fitting line through the data points.
Correlation
The correlation coefficient ($r$) quantifies the strength and direction of a linear relationship between two variables. It ranges from $-1$ to $1$.
- $r = 1$: Perfect positive correlation
- $r = -1$: Perfect negative correlation
- $r = 0$: No linear correlation
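The coefficient can be computed directly from its definition, $r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. A short sketch on made-up illustrative data:

```python
import statistics

# Illustrative data (assumed for this sketch, not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / (sxx * syy) ** 0.5
print(round(r, 3))  # 0.775
```

A value of about 0.77 indicates a fairly strong positive linear relationship in this toy data set.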
The Least Squares Line
The least squares method minimizes the sum of the squared differences between the observed values and the values predicted by the line:

$$\text{SSE} = \sum_i (y_i - \hat{y}_i)^2$$
Regression Equation
The regression line is given by:

$$\hat{y} = b_0 + b_1 x$$

- $b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ (slope)
- $b_0 = \bar{y} - b_1 \bar{x}$ (y-intercept)
Predictions using Regression Models
To make predictions, substitute the value of $x$ into the regression equation. The predicted value is calculated as:

$$\hat{y} = b_0 + b_1 x$$
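The slope and intercept formulas, followed by a prediction, can be sketched end to end on illustrative data (the data set is an assumption for this example):

```python
import statistics

# Illustrative data (assumed for this sketch, not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# Slope: sum of cross-deviations over sum of squared x-deviations.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
# Intercept from b0 = y_bar - b1 * x_bar.
b0 = y_bar - b1 * x_bar

print(round(b1, 1), round(b0, 1))  # 0.6 2.2

# Predict y at a new x by substituting into the fitted equation.
y_hat = b0 + b1 * 6
print(round(y_hat, 1))  # 5.8
```

Note that predictions are most trustworthy for $x$ values inside the range of the observed data; extrapolating far beyond it (as at $x = 6$ here) assumes the linear trend continues.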