
What is the Chi-square Test for Contingency Table?

Published in Statistical Hypothesis Test

The Chi-square test for contingency tables is a powerful statistical tool used to determine if there is a statistically significant association between two categorical variables. It helps researchers understand whether observed patterns in data are likely due to a real relationship or simply random chance.

Purpose and Application

At its core, Pearson's chi-squared test asks whether the frequencies you observe in the cells of a contingency table (observed frequencies) differ from the frequencies you would expect if there were no relationship between the variables (expected frequencies) by more than random chance would explain.

This test is particularly useful for analyzing qualitative data, such as survey responses, demographic information, or experimental outcomes where variables fall into distinct categories. For instance, you might use it to investigate if there's an association between a person's gender and their preference for a certain type of movie, or between a treatment group and a specific outcome in a medical study.

It's important to consider sample size: for contingency tables involving smaller sample sizes, Fisher's exact test is often preferred over the Chi-square test to ensure the validity of the results.

How It Works

The Chi-square test operates by comparing the actual counts in each cell of a contingency table to the counts that would be expected if the two variables were completely independent.

1. Setting Up Hypotheses:

  • Null Hypothesis (H₀): There is no association between the two categorical variables; they are independent.
  • Alternative Hypothesis (H₁): There is an association between the two categorical variables; they are dependent.

2. Constructing a Contingency Table:

Data is organized into a contingency table (also known as a cross-tabulation or cross-tab), which displays the frequency distribution of the variables.

                Category A1     Category A2     Total (Rows)
Category B1     Observed N₁₁    Observed N₁₂    Row Total 1
Category B2     Observed N₂₁    Observed N₂₂    Row Total 2
Total (Cols)    Col Total 1     Col Total 2     Grand Total N
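
As a rough illustration of how such a table might be tallied from raw data, the sketch below counts paired observations with Python's standard library. The function name `build_contingency_table` and the sample observations are hypothetical and serve only to show the layout.

```python
from collections import Counter


def build_contingency_table(pairs, row_levels, col_levels):
    """Count how often each (row level, column level) pair occurs."""
    counts = Counter(pairs)
    # Counter returns 0 for combinations that never occur.
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


# Hypothetical raw data: one (B category, A category) pair per subject.
observations = [
    ("B1", "A1"), ("B1", "A2"), ("B2", "A1"),
    ("B1", "A1"), ("B2", "A2"), ("B2", "A2"),
]

table = build_contingency_table(observations, ["B1", "B2"], ["A1", "A2"])
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

print(table)        # [[2, 1], [1, 2]]
print(row_totals)   # [3, 3]
print(col_totals)   # [3, 3]
print(grand_total)  # 6
```

Six observations are far too few for a real test; the point here is only how the cells and margins are arranged.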

3. Calculating Expected Frequencies:

For each cell in the table, the expected frequency (E) is calculated under the assumption that the null hypothesis is true (i.e., no association).

E = (Row Total × Column Total) / Grand Total
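
The formula above can be applied to every cell at once. The following is a minimal sketch, assuming the observed counts are stored as a list of rows; the function name `expected_frequencies` and the counts are made up for illustration.

```python
def expected_frequencies(observed):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts (rows: B1, B2; columns: A1, A2).
observed = [[30, 10], [20, 40]]
expected = expected_frequencies(observed)

print(expected)  # [[20.0, 20.0], [30.0, 30.0]]
```

Note that the expected counts keep the same row and column totals as the observed table.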

4. Calculating the Chi-square Statistic (χ²):

The Chi-square statistic quantifies the difference between observed and expected frequencies across all cells.

χ² = Σ [(Observed - Expected)² / Expected]

A larger χ² value indicates a greater discrepancy between observed and expected frequencies, and therefore stronger evidence against independence. It does not by itself measure how strong the relationship is, because χ² also grows with sample size.
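
The sum can be sketched in a few lines. The function name `chi_square_statistic` and the counts below are hypothetical, and the expected values are assumed to have been computed with the row-total × column-total rule from step 3.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Hypothetical observed counts and their expected counts under independence.
observed = [[30, 10], [20, 40]]
expected = [[20.0, 20.0], [30.0, 30.0]]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 2))  # 16.67
```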

5. Determining Degrees of Freedom (df):

The degrees of freedom reflect the number of values in the final calculation of a statistic that are free to vary. For a contingency table:

df = (Number of Rows - 1) × (Number of Columns - 1)
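
For concreteness, here is a small sketch of that formula applied to two table shapes. The helper name `degrees_of_freedom` and the placeholder counts are assumptions for illustration; only the table dimensions matter.

```python
def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a rectangular contingency table."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


# Placeholder counts; the result depends only on the shape.
df_2x2 = degrees_of_freedom([[1, 2], [3, 4]])
df_3x4 = degrees_of_freedom([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(df_2x2)  # 1
print(df_3x4)  # 6
```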

6. Interpreting the P-value:

Once the χ² value and degrees of freedom are calculated, they are used to determine a p-value.

  • If the p-value is less than the chosen significance level (commonly 0.05), you reject the null hypothesis, concluding that there is a statistically significant association between the variables.
  • If the p-value is greater than or equal to the significance level, you fail to reject the null hypothesis, indicating insufficient evidence of an association. This is not proof that the variables are independent.
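
To make the decision rule concrete, the sketch below computes the upper-tail probability of the chi-square distribution using only Python's standard library and compares it with a significance level. The function name `chi2_sf`, the statistic 6.0, and df = 2 are assumptions chosen for illustration. The closed-form sums used here hold only for whole-number degrees of freedom; in practice a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) times a finite sum of y^i / i!
        m = df // 2
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(m))
    # Odd df: complementary error function plus a finite correction sum.
    m = (df - 1) // 2
    tail = sum(y ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail


alpha = 0.05               # chosen significance level
p_value = chi2_sf(6.0, 2)  # hypothetical statistic and df; equals exp(-3)
reject_null = p_value < alpha

print(round(p_value, 4))   # 0.0498
print(reject_null)         # True
```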

When to Use It

The Chi-square test for contingency tables is appropriate when:

  • You have two categorical variables.
  • You want to determine if there is a relationship or association between these two variables.
  • The data consists of frequencies or counts for each category.
  • The expected frequencies in each cell are not too small. As a rule of thumb, most cells should have an expected count of at least 5 for the Chi-square approximation to be valid; otherwise, Fisher's exact test is preferred.
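
The last condition can be checked before running the test. The sketch below is one possible reading of "most cells": it assumes at least 80% of cells need an expected count of 5 or more and that no cell falls below 1. Those two thresholds are a commonly quoted rule of thumb, not something stated above, and the function name `expected_counts_adequate` is hypothetical.

```python
def expected_counts_adequate(expected, minimum=5.0, required_share=0.8, floor=1.0):
    """Rule-of-thumb check on expected counts (thresholds are assumptions)."""
    cells = [e for row in expected for e in row]
    share_large_enough = sum(e >= minimum for e in cells) / len(cells)
    return share_large_enough >= required_share and min(cells) >= floor


# Hypothetical expected counts for two different tables.
large_sample = [[40.0, 60.0], [40.0, 60.0]]
small_sample = [[2.0, 3.0], [4.0, 30.0]]

print(expected_counts_adequate(large_sample))  # True
print(expected_counts_adequate(small_sample))  # False
```

When the check fails, an exact method such as Fisher's exact test is the usual alternative.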

Practical Example

Imagine a researcher wants to know if there's a relationship between a person's smoking status (Smoker/Non-smoker) and their likelihood of developing a chronic cough (Yes/No). They collect data from 200 individuals:

Observed Frequencies:

                   Chronic Cough: Yes    Chronic Cough: No    Total
Smoker             60                    40                   100
Non-smoker         20                    80                   100
Total (Columns)    80                    120                  200

Calculation Steps:

  1. Expected Frequencies:

    • Expected Smoker & Cough Yes = (100 * 80) / 200 = 40
    • Expected Smoker & Cough No = (100 * 120) / 200 = 60
    • Expected Non-smoker & Cough Yes = (100 * 80) / 200 = 40
    • Expected Non-smoker & Cough No = (100 * 120) / 200 = 60
  2. Chi-square Calculation:

    • χ² = [(60-40)²/40] + [(40-60)²/60] + [(20-40)²/40] + [(80-60)²/60]
    • χ² = [400/40] + [400/60] + [400/40] + [400/60]
    • χ² ≈ 10 + 6.67 + 10 + 6.67 ≈ 33.33 (the exact value is 100/3; adding the rounded terms gives 33.34)
  3. Degrees of Freedom:

    • df = (2-1) * (2-1) = 1
  4. P-value Interpretation:

    • Consulting a Chi-square distribution table or statistical software with χ² ≈ 33.33 and df = 1 yields a p-value far less than 0.001. This extremely small p-value leads to the rejection of the null hypothesis.
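
The calculation steps above can be reproduced with a short script. This is a minimal sketch using only the standard library; the variable names are illustrative, and the p-value line relies on the fact that, for df = 1 only, the upper tail of the chi-square distribution equals erfc(√(χ²/2)).

```python
import math

# Observed counts from the example (rows: Smoker, Non-smoker; columns: Yes, No).
observed = [[60, 40], [20, 80]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 1: expected counts under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 2: Pearson chi-square statistic, without any continuity correction.
statistic = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Step 3: degrees of freedom.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 4: p-value; this shortcut is valid for df = 1 only.
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)             # [[40.0, 60.0], [40.0, 60.0]]
print(round(statistic, 2))  # 33.33
print(df)                   # 1
print(p_value < 0.001)      # True
```

If SciPy is available, `scipy.stats.chi2_contingency(observed, correction=False)` should give the same statistic. By default that function applies Yates' continuity correction to 2×2 tables, which produces a somewhat smaller value than the hand calculation.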

Conclusion:

Based on this analysis, there is a statistically significant association between smoking status and the presence of a chronic cough. In this sample, 60% of smokers reported a chronic cough compared with 20% of non-smokers. The test establishes an association, not that smoking causes the cough.

For more in-depth information, you can explore the Chi-squared test on Wikipedia.