Statistical Tests Simplified
Part 1: Types of Variables, Hypothesis Testing, Fisher’s Exact Test
Hypothesis testing is a cornerstone of scientific research and a powerful tool for representing the current belief system of science. It allows us to articulate a belief that can be either retained or disproven by others, embodying the scientific process of “thesis — antithesis — synthesis”. I find it quite amusing to compare it to politics, where ideologies and beliefs are debated through tone, arguments, and gestures. In contrast, science relies on hypothesis testing, a unique and universal tool that (almost) all researchers from diverse backgrounds agree upon. As a result, scientific ideas stand out based on empirical evidence rather than persuasive rhetoric or mind tricks.
What’s in it for you? Learning this scientific framework can improve your decision-making and help you convince others that your method outperforms the baselines. For example, you may want to know:
- How does one variable relate to another under certain assumptions?
- Is the test group different from the control group?
- How likely is the outcome of one variable given the state of another variable?
In this article series, we’ll not only explore this scientific framework, but also apply it to various use cases using statistical tests that suit our needs, such as Fisher’s Exact Test, the Chi-Square Test, ANOVA, the t-Test, and Regression Analysis.
But before we jump right into the interesting part, first we have to address the elephant in the room: Types of variables.
Crash Course: Types of Variables
Categorical, Nominal: Nominal variables can only be compared based on equality. For example, names like “Brandon” and “Kyle” can be compared to see if they are the same or different, but we cannot determine whether one is larger than the other. When there are only two values, such as in yes/no questions, it’s called a binary variable. The mode, or the most frequent value, is a good measure of central tendency for nominal variables.
Categorical, Ordinal: Ordinal variables not only allow for equality comparisons but also can be ordered. For example, AC levels such as “cold” (0), “warm” (1), and “hot” (2) can be ranked based on the perceived temperature. However, the exact intervals between these levels are not known, and perceptions of “warm” can vary between individuals. The median, which is the middle value of a distribution, is a good measure of central tendency for ordinal variables.
Continuous, Cardinal: Cardinal variables, also known as interval variables, have meaningful intervals between values, allowing for the calculation of differences. However, they lack a true zero point, so ratios of values are not meaningful. For example, temperature in Celsius is measured on a scale where the difference between 20 and 30 degrees is the same as between 30 and 40 degrees, yet 40 degrees is not “twice as hot” as 20 degrees. The mean, which is the sum of all values divided by the number of samples, is a good measure of central tendency for cardinal variables.
Continuous, Ratio: Ratio variables have all the properties of cardinal variables but also include an absolute zero point, making division of values meaningful. For example, absolute temperature measured in Kelvin has a true zero point, which allows for meaningful ratios, such as comparing temperatures to understand energy differences. Another great example is length on the metric scale, where zero means the complete absence of length.
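To make these four types concrete, here is a minimal Python sketch (with made-up values) showing which measure of central tendency fits which type of variable:

```python
import statistics

# Nominal: only equality comparisons are meaningful -> use the mode.
names = ["Brandon", "Kyle", "Brandon", "Ava"]
print(statistics.mode(names))             # Brandon

# Ordinal: values can be ranked, but intervals are unknown -> use the median.
ac_levels = [0, 1, 1, 2, 2]               # cold=0, warm=1, hot=2
print(statistics.median(ac_levels))       # 1 (warm)

# Cardinal (interval): differences are meaningful -> use the mean.
temps_celsius = [20.0, 25.0, 30.0]
print(statistics.mean(temps_celsius))     # 25.0

# Ratio: a true zero exists, so division is also meaningful.
temps_kelvin = [150.0, 300.0]
print(temps_kelvin[1] / temps_kelvin[0])  # 2.0 -> "twice as hot"
```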
Now, it’s time to explore the scientific framework of hypothesis testing!
Generic Algorithm of Statistical Testing
State the Hypotheses: Formulate two statements that contradict each other, with only one being true. The first is the Null Hypothesis (H0), which represents the status quo or baseline assumption (no effect or difference). The second is the Alternative Hypothesis (H1), which contradicts the null hypothesis and represents what you aim to provide evidence for.
Choose the Significance Level: The significance level (α, “alpha”) is the probability of rejecting the null hypothesis when it is actually true. It acts as a threshold for the maximum allowable p-value. Common choices for α are 0.05 and 0.01.
Collect and Summarize the Data: Gather and organize sample data into a structured format, such as a contingency table for categorical data or a list of numerical values for continuous data.
Choose the Appropriate Test and Calculate the Test Statistic: Select the appropriate test based on the nature of the data and the hypotheses. Each test has its own formula for the test statistic, a single number that summarizes the sample and whose distribution under the null hypothesis is known.
Determine the p-value: The p-value is the probability of obtaining test results as extreme as the observed results, assuming the null hypothesis is correct. Its computation can be very difficult and typically requires specialized statistical software or numerical methods, since it often involves integrating complex probability distributions, which cannot be easily solved by hand or with simple formulas.
This is where the previous step becomes useful: instead of deriving the p-value for an arbitrary data distribution each time, the distribution of the test statistic under the null hypothesis is derived once and tabulated, so this pre-computation can be reused later. By leveraging such test statistic tables (or their software equivalents), we can perform hypothesis testing efficiently and accurately.
Compare the p-value to the Significance Level and Interpret the Results: If the p-value ≤ α, reject the null hypothesis (evidence supports the alternative hypothesis); otherwise, fail to reject the null hypothesis (insufficient evidence to support the alternative hypothesis).
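To see how these six steps fit together, here is a minimal end-to-end sketch with made-up numbers. It assumes a simple one-sample z-test (population standard deviation known), chosen only to keep the example short; the tests discussed later follow the same pattern:

```python
from statistics import mean
from scipy.stats import norm

# 1. State the hypotheses: H0: population mean = 100, H1: mean > 100.
mu0, sigma = 100.0, 15.0   # sigma assumed known for a z-test

# 2. Choose the significance level.
alpha = 0.05

# 3. Collect and summarize the data (placeholder sample).
sample = [104, 110, 98, 107, 112, 103, 109, 101]
n, x_bar = len(sample), mean(sample)

# 4. Calculate the test statistic: standardized distance from mu0.
z = (x_bar - mu0) / (sigma / n ** 0.5)

# 5. Determine the p-value from the precomputed N(0, 1) distribution of z.
p_value = norm.sf(z)  # survival function = 1 - CDF

# 6. Compare to alpha and interpret.
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```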
In the following, we will go through some interesting cases and introduce statistical tests one by one. In this article, I will only cover Fisher’s Exact Test and explain the remaining tests in the second part.
Fisher’s Exact Test
Key Characteristics: Small sample sizes, Categorical data, 2x2 contingency tables.
Fisher’s Exact Test is a powerful tool for examining how two (binary) categorical variables are associated, especially when sample sizes are small. As the name indicates, Fisher’s Exact Test calculates the exact probability of observing the data under the null hypothesis.
Consider the 2x2 contingency table below, with X as the treatment variable (drug vs. placebo) and Y as the outcome variable (improved vs. not improved). Our goal is to find out whether the treatment works.
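Here, a, b, c, and d denote the four cell counts:

| | Drug (X = 1) | Placebo (X = 0) | Total |
| --- | --- | --- | --- |
| Improved (Y = 1) | a | b | a + b |
| Not improved (Y = 0) | c | d | c + d |
| Total | a + c | b + d | a + b + c + d |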
So, the null hypothesis is that the probability of improvement is the same for treated and untreated individuals, i.e., that the two variables are independent: P(Y=1 | X=1) = P(Y=1 | X=0). Likewise, the alternative hypothesis is that the healing probability is higher for those receiving the treatment: P(Y=1 | X=1) > P(Y=1 | X=0).
Given the binomial coefficient formula B(n, k) = n! / [k! (n − k)!] (“n choose k”), we can count the number of possible combinations for the following cases:
- Among a + c people who received the drug, there are B(a + c, a) distinct combinations with exactly a people showing improvement. Left marginal factor.
- Among b + d people who received the placebo, there are B(b + d, b) distinct combinations with exactly b people showing improvement. Right marginal factor.
- Among a + b + c + d people, there are B(a + b + c + d, a + b) distinct combinations with exactly a + b people showing improvement. Joint factor.
The product of the left and right marginal factors, divided by the joint factor, P = B(a + c, a) · B(b + d, b) / B(a + b + c + d, a + b), is the hypergeometric probability of observing exactly this table given its margins, and it acts as an independence measure: for a fixed total of a + b improved people, it peaks when a / (a + c) = b / (b + d), i.e., when the improvement rates are the same across both groups.
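Using placeholder counts (assumed purely for illustration), this probability is straightforward to compute with Python’s built-in math.comb:

```python
from math import comb

def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of a 2x2 table with fixed margins."""
    left = comb(a + c, a)                # left marginal factor
    right = comb(b + d, b)               # right marginal factor
    joint = comb(a + b + c + d, a + b)   # joint factor
    return left * right / joint

# Placeholder counts: 9 of 10 improved on the drug, 2 of 10 on the placebo.
print(table_probability(a=9, b=2, c=1, d=8))  # ~0.00268
```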
So it is a perfect candidate for the test statistic! To obtain the p-value for the one-sided alternative P(Y=1 | X=1) > P(Y=1 | X=0), we sum these probabilities over all tables that are at least as extreme as the observed one. In practice, we can run the following Python script:
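The original cell counts are not reproduced here, so this minimal sketch reuses the placeholder values from above with SciPy’s fisher_exact; it will therefore not reproduce the exact p-value quoted next, which belongs to the article’s original data:

```python
from scipy.stats import fisher_exact

# Placeholder 2x2 table (rows: improved / not improved; columns: drug / placebo).
a, b = 9, 2
c, d = 1, 8

# alternative="greater" runs the one-sided test with
# H1: improvement is more likely in the drug group than in the placebo group.
odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
print(f"odds ratio = {odds_ratio:.1f}, p-value = {p_value:.5f}")
```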
Since the obtained p-value of 8.3 × 10⁻¹³ is very low, we can reject the null hypothesis in favor of the alternative hypothesis and conclude that the drug treatment is significantly associated with an improvement.
In the second part of this article series, I will cover the remaining tests: the Chi-Square Test, ANOVA, the t-Test, and Regression Analysis. Thank you for reading so far, and see you in my next article!