Statistical Tests Simplified

Part 2: Chi-Square, ANOVA, t-Test, Regression Analysis

Yamac Eren Ay
10 min read · Aug 9, 2024

This is the second part of my article series “Statistical Tests Simplified.”

In the first article, I explained the basics and the generic framework of statistical testing, provided a quick introduction to types of variables, and applied this framework to analyze the relationship between two binary categorical variables using Fisher’s Exact Test.

In this part, we'll consider how to handle larger datasets, variables with more than two categories, and continuous variables, where other types of statistical tests become more appropriate and efficient for assessing associations between variables. Most importantly, all the statistical tests covered in this article share one concept that we need to address first: degrees of freedom!

Degrees of Freedom

Degrees of freedom (DoF) is a concept that plays a crucial role in various statistical tests. It can be thought of as the number of independent values or quantities that can vary in the analysis without breaking any constraints. In simpler terms, degrees of freedom represent the amount of information available for estimating statistical parameters.

Degrees of Freedom in Chi-Square Tests.

In the context of statistical tests, degrees of freedom are essential because they influence the shape of the sampling distribution used to calculate test statistics and p-values. Different statistical tests have their own specific ways of calculating degrees of freedom, depending on the structure of the data and the test’s design. This requires us to slightly modify the hypothesis testing framework, because the decision-making not only depends on the significance level but also on the degrees of freedom.

Generic Algorithm for Hypothesis Testing

  1. State the Hypotheses
  2. Choose the Significance Level
  3. Collect and Summarize the Data
  4. Choose the Appropriate Test and Calculate the Test Statistic
  5. Determine the Degrees of Freedom and Find the Critical Value: Calculate the degrees of freedom for the test and use the significance level to find the critical value from the appropriate statistical distribution table.
  6. Compare the Test Statistic to the Critical Value and Interpret the Results: Compare the test statistic to the critical value. If the test statistic is greater than the critical value, reject the null hypothesis. Otherwise, fail to reject the null hypothesis. (Steps 5 and 6 are sketched in code right after this list.)
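To make steps 5 and 6 concrete, here is a minimal Python sketch (assuming scipy is available) that looks up critical values programmatically instead of from a printed distribution table. The degrees of freedom below are the ones that will come up in this article's examples:

```python
from scipy import stats

alpha = 0.05  # significance level

# The critical value is the (1 - alpha) quantile of the test's sampling
# distribution, obtained via the percent-point function (ppf).
chi2_crit = stats.chi2.ppf(1 - alpha, df=4)      # Chi-Square, 4 DoF  -> ~9.488
f_crit = stats.f.ppf(1 - alpha, dfn=2, dfd=6)    # F-distribution, (2, 6) DoF -> ~5.14
t_crit = stats.t.ppf(1 - alpha / 2, df=9)        # two-tailed t, 9 DoF -> ~2.262

print(chi2_crit, f_crit, t_crit)
```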

Now, we’re ready to explore the rest of the statistical tests.

Chi-Square Test

Key Characteristics: Larger sample sizes, categorical data, contingency tables (e.g., 2x2, 2x3, 3x3, …).

The Chi-Square Test is used to assess whether there is a significant association between two categorical variables. It compares the observed frequencies in each category to the frequencies expected if the variables were independent.

Imagine you’re studying the relationship between X = education level (“high school,” “bachelor’s,” “master’s”) and Y = job satisfaction (“satisfied,” “neutral,” “dissatisfied”). The Chi-Square Test can help determine if the distribution of job satisfaction varies significantly across different education levels.

Our null hypothesis is that job satisfaction is independent of education level, and we choose a significance level (say 0.05).

Given below is the contingency table for this task, where each cell of the contingency table at the i-th row (education level) and j-th column (job satisfaction) corresponds to the observed frequency O{ij}:

A sample dataset with categorical variables.

As you may recall, the null hypothesis assumes that both categorical variables are independent of each other. Based on this, let's define the expected frequency E{ij} as “the row total times the column total, divided by the overall total”:

E_{ij} = \frac{R_i \cdot C_j}{N}

Expected frequency E{ij}, where R_i is the i-th row total, C_j is the j-th column total, and N is the grand total.

The more likely it is that the null hypothesis holds, the closer E{ij} (expected frequency) should be to O{ij}​ (observed frequency). Ideally, if both variables are perfectly independent, the observed and expected frequencies for any cell should be equal. To measure the distance between O{ij}​ and E{ij}​, we use the squared error. By normalizing this error by dividing it by the expected frequency, we obtain a universal test statistic known as the Chi-Square Test statistic (χ2):

\chi^2 = \sum_{i} \sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}

Chi-Square test statistic

Next, let’s derive the degrees of freedom. Intuitively, in a contingency table, the row and column totals are fixed based on the data collected. In a 2x2 table, for example, once you know the values in three of the cells, the fourth cell’s value is determined by the row and column totals. Generally, the number of independent values is given by (m×n)−(m+n−1), where m and n are the number of categories in X and Y, respectively. This formula simplifies to (m−1)×(n−1).
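Applied to our 3×3 example with m = n = 3, this gives (3×3) − (3+3−1) = 9 − 5 = 4, matching (3−1)×(3−1) = 4 degrees of freedom.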

Using this, we can look up the critical value determined by the significance level and the degrees of freedom. In our case, with 4 degrees of freedom, the critical value from the distribution table is 9.488.

Chi-Square, Finding the critical value

Given that our test statistic is approximately 12.109, which is greater than the critical value, we reject the null hypothesis. Therefore, we conclude that education level and job satisfaction are statistically related.
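Since the contingency table above was shown as an image, here is a hedged end-to-end sketch of the same kind of test in Python, using hypothetical counts (not the article's actual data) and scipy's chi2_contingency:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 observed frequencies O_ij -- illustrative only.
# Rows: high school / bachelor's / master's
# Columns: satisfied / neutral / dissatisfied
observed = np.array([
    [30, 25, 15],
    [40, 30, 10],
    [35, 20,  5],
])

# Expected frequencies under independence: row total * column total / grand total
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

chi2, p, dof, expected = chi2_contingency(observed)
assert np.allclose(expected, expected_manual)  # matches the E_{ij} formula above

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
# Reject H0 at alpha = 0.05 if chi2 exceeds the critical value (9.488 for 4 DoF)
```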

Next, we will analyze the case where the dependent variable is continuous, and enter the ANOVA approach.

ANOVA (Analysis of Variance)

Key Characteristics: Comparing three or more groups, continuous dependent variable, categorical independent variable.

ANOVA is used to compare means across three or more groups to see if at least one group mean is different from the others. It helps in determining if variations in data can be attributed to the grouping variable or if they are simply due to random chance.

Imagine we want to compare the average test scores of students taught using n = 3 different teaching methods, and we want to find out whether there are statistically significant differences in their scores.

Our null hypothesis is that the means of all groups are equal, i.e., µ1 = µ2 = µ3. If at least one group mean is different, we reject the null hypothesis. As before, we pick the common choice of 0.05 for the significance level (α). We gather data and organize it into groups as below:

List of numbers for all groups

Now, we calculate the mean for each group (E[X1] = 87.67, E[X2] = 81.67, E[X3] = 94.67), as well as the overall mean (E[X] = 88.67).

E[X_i] = \frac{1}{N_i} \sum_{j=1}^{N_i} X_{ij}, \qquad E[X] = \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{N_i} X_{ij}

Left: expected value for the i-th group; right: overall expected value.

If the group means differ from one another, this should result in a high test statistic, allowing us to reject the null hypothesis. Luckily, the between-group mean square V{between} provides exactly this information; and we can divide it by the within-group mean square V{within} to obtain a fully normalized test statistic (called the F-value):

V_{between} = \frac{1}{n-1} \sum_{i=1}^{n} N_i \left(E[X_i] - E[X]\right)^2, \qquad V_{within} = \frac{1}{N-n} \sum_{i=1}^{n} \sum_{j=1}^{N_i} \left(X_{ij} - E[X_i]\right)^2

V{between} as the variance across different groups and V{within} as the sum of variances within the same group.

F = \frac{V_{between}}{V_{within}}

F-value as the ratio of the between-group mean square to the within-group mean square.

With this in mind, let's determine the DoFs of the test statistic, i.e., the DoFs of both V{between} and V{within}. Fortunately, this is easy: take the reciprocal of the leading factor of each term, which gives the following:

  • The DoF for V{between}​ is n−1.
  • The DoF for V{within}​ is N−n.

Using these DoFs and the significance level (α), we can look up the critical value in the F-distribution table by finding the intersection of the (N−n)-th row and the (n−1)-th column for the given α.

ANOVA, Finding the critical value

In our case, the F-value is approximately 13.82 and the critical value is 5.14. Since the F-value exceeds the critical value, we reject the null hypothesis, which means that there is a significant difference between at least one pair of groups.
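As a minimal sketch (the scores below are hypothetical stand-ins for the data shown in the image above), the whole ANOVA can be reproduced with scipy's f_oneway:

```python
from scipy.stats import f_oneway

# Hypothetical test scores for three teaching methods -- illustrative only.
method_1 = [85, 88, 90]
method_2 = [78, 82, 85]
method_3 = [92, 95, 97]

f_value, p_value = f_oneway(method_1, method_2, method_3)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")
# DoF: between = n - 1 = 2, within = N - n = 6.
# Reject H0 if F exceeds the critical value (5.14 at alpha = 0.05).
```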

This time, we’ll consider a special case with a similar setup as before, but now focusing on a scenario where only two groups are compared to each other.

t-Test

Key Characteristics: Comparing two groups, continuous dependent variable, independent or related groups (depending on the t-Test type).

The t-Test is used to compare the means of two groups to find out if they are significantly different from each other. One type is the independent samples t-Test, used for comparing two separate groups. The other is the paired samples t-Test, used for comparing two related groups whose test subjects match exactly. In this article, we'll only cover the first one.

Independent samples t-test (left) vs. Paired samples t-test (right)

Let’s take the example shown above: The body weights of two groups of people are measured (in kg), where one (test) group works out and the other (control) group doesn’t. The null hypothesis is that there is no difference in the mean body weights.

Using the formula below, we can estimate the mean and the standard deviation of both groups as follows: E[X1] = 75.5, E[X2] = 76.8, S[X1] ≈ 4.09, S[X2] ≈ 9.74.

S[X] = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(X_i - E[X]\right)^2}

Estimation of the (sample) standard deviation

Intuitively, the test statistic should be proportional to E[X1]-E[X2], because the higher the mean difference, the higher the test statistic and the less likely that both groups have the same mean. For a better scaling, we can divide this term by the pooled standard deviation, which is a weighted standard deviation of both groups, assuming that both groups are of equal variance:

S_p = \sqrt{\frac{(N_1 - 1)\, S[X_1]^2 + (N_2 - 1)\, S[X_2]^2}{N_1 + N_2 - 2}}

Pooled standard deviation

Note that the denominator N{1} + N{2} − 2 is exactly the DoF: the number of free parameters (N{1} + N{2}) minus the number of fixed quantities (2, one estimated mean per group).

After a slight modification, we can get the test statistic t as follows:

t = \frac{E[X_1] - E[X_2]}{S_p \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}

t-Test statistic

We can look up the critical value from the table by the DoF = 9 and α = 0.05 as usual:

t-Test distribution table

The critical value is approximately ±2.262 (minus for the left tail and plus for the right tail), and the test statistic t = −0.30 falls within the range [−2.262, +2.262], so we fail to reject the null hypothesis and conclude that there is no evidence of a difference between the group means.
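Here is a hedged sketch of the independent samples t-Test in Python. The weights are hypothetical (not the article's data); with group sizes 6 and 5, the DoF comes out to 9 as above:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical body weights in kg -- illustrative only.
workout_group = [70, 73, 75, 77, 79, 79]  # N1 = 6
control_group = [65, 70, 76, 84, 89]      # N2 = 5

# Pooled standard deviation computed by hand, as in the formula above:
n1, n2 = len(workout_group), len(control_group)
s1, s2 = np.std(workout_group, ddof=1), np.std(control_group, ddof=1)
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# equal_var=True tells scipy to use exactly this pooled estimate.
t_stat, p_value = ttest_ind(workout_group, control_group, equal_var=True)
print(f"s_pooled = {s_pooled:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
# DoF = N1 + N2 - 2 = 9; fail to reject H0 if |t| < 2.262 at alpha = 0.05.
```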

Now, in Regression Analysis, we transition from the straightforward statistical testing approach to a more sophisticated method, one that aligns well with Machine Learning techniques.

Regression Analysis

Key Characteristics: Continuous dependent variable, one or more independent variables (continuous or categorical), modeling and prediction.

Regression Analysis is a statistical method for modeling the relationship between a dependent variable and one or more independent variables, all continuous (or encoded as continuous using One-Hot Encoding, Label Encoding, etc.). It involves predicting the dependent variable based on the values of the independent variables.

For example, consider modeling the relationship between house size (in square feet) as the predictor X and house prices as the outcome Y. The null hypothesis in this context is that there is no relationship between X and Y. Below is a sample dataset:

A sample dataset

The goal is to learn the underlying linear model:

Y = aX + b + \epsilon

True linear model

where a is the slope, b is the intercept, and ϵ represents the noise. To evaluate the goodness of the model, we construct an evaluation score as follows:

  • The lower the residual variance (MSE between the actual and predicted values, reflecting the closeness of predictions to the actual values), the higher the score.
  • The higher the total variance of the actual values (MSE between the actual values and their mean), the lower the score. This can be viewed as the residual variance of a dummy prediction with zero slope and mean outcome as intercept.
  • The explained variance, which represents the increase in knowledge after applying the regression model, is calculated as “total variance minus residual variance.”
  • In the best case, the explained variance equals the total variance, so we normalize this by dividing it by the total variance, which yields a score ranging from 0 to 1. Enter R2 (R-Squared) score.
R^2 = \frac{\text{explained variance}}{\text{total variance}} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}

R2-score
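The score is straightforward to compute directly. Here is a minimal sketch of a hypothetical helper (r_squared is my own name, not a library function):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = explained variance / total variance = 1 - residual / total."""
    residual = np.sum((y_true - y_pred) ** 2)        # residual variance (unnormalized)
    total = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance (unnormalized)
    return 1 - residual / total
```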

If the R2-score exceeds a predefined threshold (say, 0.90), we reject the null hypothesis. Now, let's try to find the optimal regression model.

If the noise is Gaussian-distributed with a zero mean and the predictor values are assumed to be exact, we aim to minimize the Mean Square Error (MSE), which reflects the difference between the actual and predicted values:

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (a\, x_i + b)\right)^2

Mean Square Error to be minimized

Using Maximum Likelihood Estimation (MLE), we can estimate the parameters in a statistically optimal way:

\hat{a} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\, \bar{x}

Estimation of the slope and intercept

The results from the plot are promising:

House Price vs. Size Plot

In this case, the optimal model yields an R2-score of approximately 0.938, which indicates a strong relationship between house prices and house sizes.
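As a hedged end-to-end sketch with hypothetical house data (not the article's dataset), scipy's linregress returns the least-squares slope and intercept, which coincide with the MLE under Gaussian noise; R² is the squared correlation:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical house sizes (sq ft) and prices (USD) -- illustrative only.
size = np.array([800, 1000, 1200, 1500, 1800, 2000, 2400])
price = np.array([150_000, 175_000, 215_000, 255_000, 310_000, 330_000, 405_000])

result = linregress(size, price)
print(f"slope a = {result.slope:.1f}, intercept b = {result.intercept:.1f}")
print(f"R^2 = {result.rvalue ** 2:.3f}")
# Reject H0 (no relationship) if R^2 exceeds the chosen threshold, e.g. 0.90.
```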

Final Thoughts

Understanding the appropriate use of statistical tests is crucial for accurate data analysis. Fisher’s Exact Test, Chi-Square Test, ANOVA, t-Test, and Regression Analysis each have their specific applications and assumptions, but they’re not the only methods available. In fact, there are many other tests, each excelling in different scenarios.

A well-known real-life application of such methods is A/B testing. This technique is commonly used by data-driven companies to increase click rates on a website or to predict the market impact of a new feature based on customer reviews. By supporting more reliable and valid inferences, these methods make smarter decision-making and genuinely scientific insights much less far-fetched.

The key takeaway is this: it all starts with precisely formulating your goals and selecting the right test for your use case. The rest involves mathematical and computational techniques. Stay tuned for more content like this!

