Linear Regression and Distribution Free Tests

Chi-squared Test

The Chi-squared test is a non-parametric test used to determine if there is a significant association between categorical variables. It compares the observed frequencies in each category to the frequencies we would expect if there were no association.

The test statistic is calculated as:

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

where:

  • \( O_i \) = observed frequency
  • \( E_i \) = expected frequency, calculated as \( E_i = \dfrac{\text{row total} \times \text{column total}}{\text{grand total}} \)

Suppose we want to test if there is an association between gender (male, female) and preference for a product (like, dislike). We collect the following data:

           Like   Dislike   Total
  Male      30      10        40
  Female    20      40        60
  Total     50      50       100
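As a minimal sketch in plain Python (using the table values above), we can compute the expected frequencies under independence and the chi-squared statistic:

```python
# Observed counts: rows = Male, Female; columns = Like, Dislike (table above).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]        # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
grand_total = sum(row_totals)                      # 100

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under independence: row total * column total / grand total
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 16.67
```

In practice a library routine (e.g. a contingency-table test in a statistics package) would also report degrees of freedom and a p-value; the loop above only shows where the statistic comes from.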

Fixed Level Testing

In fixed level testing, the significance level (α) is predetermined, typically set at 0.05 or 0.01. This level is the probability of making a Type I error.

If α = 0.05, we reject the null hypothesis when the p-value is less than 0.05.
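Equivalently, one can compare the test statistic to a critical value. For a chi-squared test on a 2×2 table (1 degree of freedom), the cutoff follows from the standard normal quantile, since a chi-squared variable with 1 df is the square of a standard normal. A quick sketch:

```python
from statistics import NormalDist

alpha = 0.05
# chi-squared (1 df) critical value = (z_{1 - alpha/2})^2
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
print(round(critical, 3))  # 3.841

# Fixed-level decision rule: reject H0 if the test statistic exceeds `critical`
# (equivalently, if the p-value is below alpha).
```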

Type I and Type II Errors

  • Type I Error (α): The probability of rejecting the null hypothesis when it is true. For example, concluding that a new drug is effective when it is not.
  • Type II Error (β): The probability of failing to reject the null hypothesis when it is false. For example, concluding that a new drug is not effective when it actually is.

Power of a Test

The power of a test is the probability of correctly rejecting the null hypothesis when it is false, i.e. power = 1 − β. It is influenced by several factors:

Factors Affecting Power

  1. Sample Size (n): Increasing the sample size reduces variability and increases power.
  2. Effect Size: Larger effect sizes (the difference between the null hypothesis and the true value) increase power.
  3. Significance Level (α): Increasing α increases power but also increases the risk of a Type I error.
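These three effects can be checked numerically. As a sketch (not part of the original notes), the power of a two-sided one-sample z-test with known standard deviation has a closed form that a few lines of Python can evaluate:

```python
from statistics import NormalDist

def ztest_power(n, effect, alpha=0.05):
    """Power of a two-sided z-test for a true mean shift of `effect` (in sigma units)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)   # critical value z_{1 - alpha/2}
    shift = effect * n ** 0.5       # mean shift scaled by sqrt(n)
    return nd.cdf(shift - z) + nd.cdf(-shift - z)

# Power grows with sample size, effect size, and alpha:
print(round(ztest_power(20, 0.5), 3))
print(round(ztest_power(50, 0.5), 3))              # larger n -> more power
print(round(ztest_power(20, 0.8), 3))              # larger effect -> more power
print(round(ztest_power(20, 0.5, alpha=0.10), 3))  # larger alpha -> more power
```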

Simple Linear Regression

Simple linear regression is used to model the relationship between a dependent variable y and an independent variable x. The goal is to find the best-fitting line through the data points.

Correlation

The correlation coefficient (r) quantifies the strength and direction of a linear relationship between two variables. It ranges from -1 to 1.

  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation
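A small sketch of computing r from its definition (plain Python lists; the data and function name are illustrative):

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # exactly linear: 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # exactly inverse: -1.0
```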

The Least Squares Line

The least squares method minimizes the sum of the squared differences between observed values and predicted values.

Regression Equation

The regression line is given by:

\[ \hat{y} = b_0 + b_1 x \]

  • \( b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \) (slope)
  • \( b_0 = \bar{y} - b_1 \bar{x} \) (y-intercept)

Predictions using Regression Models

To make predictions, substitute the value of x into the regression equation: the predicted value is \( \hat{y} = b_0 + b_1 x \).
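Putting the pieces together, here is a sketch of fitting the least squares line and predicting at a new x value (the toy data are invented for illustration):

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x  # intercept: line passes through (mean_x, mean_y)
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
print(round(b0 + b1 * 6, 2))       # prediction at x = 6: 5.8
```

Note that predictions are only trustworthy near the observed range of x; extrapolating far beyond it assumes the linear pattern continues.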