18 Chi-square

Learning Outcomes

In this chapter, you will learn how to:

  • Identify when it is appropriate to run a chi-square test of goodness-of-fit or a chi-square test for independence.
  • Describe the concept of a contingency table for categorical data.
  • Complete a hypothesis test for the chi-square test of goodness-of-fit and the chi-square test for independence.
  • Compute and interpret effect size for chi-square.
  • Describe Simpson’s paradox and why it is important for categorical data analysis.

We come at last to our final statistic: chi-square (χ2). This test is a special form of analysis called a nonparametric test, so the structure of it will look a little bit different from what we have done so far. However, the logic of hypothesis testing remains unchanged. The purpose of chi-square is to understand the frequency distribution of a single categorical variable or to find a relationship between two categorical variables, which is frequently a very useful way to look at our data.

Categories and Frequency Tables

Our data for the χ2 test are categorical, specifically nominal, variables. Recall from Unit 1 that nominal variables have no specified order and can only be described by their names and the frequencies with which they occur in the dataset. Thus, unlike the other variables that we have tested, we cannot describe our data for the χ2 test using means and standard deviations. Instead, we will use frequency tables.

Cat Dog Other Total
Observed 14 17 5 36
Expected 12 12 12 36

Table 1. Pet Preferences

Table 1 gives an example of a frequency table used for a χ2 test. The columns represent the different categories within our single variable, which in this example is pet preference. The χ2 test can assess as few as two categories, and there is no technical upper limit on how many categories can be included in our variable, although, as with ANOVA, having too many categories makes our computations long and our interpretation difficult. The final column in the table is the total number of observations, or N. The χ2 test assumes that each observation comes from only one person and that each person will provide only one observation, so our total observations will always equal our sample size.

There are two rows in this table. The first row gives the observed frequencies of each category from our dataset; in this example, 14 people reported preferring cats as pets, 17 people reported preferring dogs, and 5 people reported preferring a different animal. The second row gives expected values; expected values are what would be found if each category had equal representation. Note: Chi-square tests should not be used when an expected value for any cell will be less than 5.

Calculation for Expected Value

[latex]E=\frac{N}{C}[/latex]

Where:

[latex]E=[/latex] the expected value or what we would expect if there was no preference or the preferences were equal

[latex]N=[/latex] the total number of people in our sample

[latex]C=[/latex] the number of categories in our variable (also the number of columns in our table, not including the Total column)

The expected values correspond with the null hypothesis for χ2 tests: equal representation of categories. Our first of two χ2 tests, the Goodness-of-Fit test, will assess how well our data lines up with, or deviates from, this assumption.
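As a quick check of the expected-value formula, here is a minimal Python sketch that computes E = N / C for the pet preference data in Table 1 and confirms that the expected value meets the minimum of 5. The function name `expected_equal` is our own choice for illustration, not part of any library.

```python
def expected_equal(observed):
    """Expected frequency per category under equal representation: E = N / C."""
    n_total = sum(observed)       # N: total number of observations
    n_categories = len(observed)  # C: number of categories
    return n_total / n_categories


# Observed row of Table 1 (Cat, Dog, Other).
observed = [14, 17, 5]
expected = expected_equal(observed)

print(expected)       # 12.0
print(expected >= 5)  # True: the expected value is large enough for the test
```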

Goodness-of-Fit Test

The first of our two χ2 tests assesses one categorical variable against a null hypothesis of equally sized frequencies. Equal frequency distributions are what we would expect to get if placement into a particular category was completely random. We could, in theory, also test against a specific distribution of category sizes if we have a good reason to (e.g., we have a solid foundation of how the regular population is distributed). For example, if you know that for every 2 cats available for adoption there are 5 dogs available, you might expect more dogs than cats in your expected values row. However, this is less common, so we will not deal with it in this text.

Step 1 – State the Hypotheses

All χ2 tests, including the goodness-of-fit test, are nonparametric. This means that there is no population parameter we are estimating or testing against; we are working only with our sample data. Because of this, there are no mathematical statements for χ2 hypotheses. This should make sense because the mathematical hypothesis statements were always about population parameters (e.g., μ), so if our test is nonparametric, we have no parameters and therefore no statistical notation hypothesis statements.

We do, however, still state our hypotheses in written form. For goodness-of-fit χ2 tests, our null hypothesis is that there is an equal number of observations in each category. That is, there is no difference between the categories in how prevalent they are. Our alternative hypothesis says that the categories do differ in their frequency. We do not have specific directions or one-tailed tests for χ2, matching our lack of mathematical statement. That is:

H0: There is no difference in number of observations between categories

HA: There is a difference in number of observations between categories

Step 2 – Find the Critical Value

Our degrees of freedom for the χ2 test are based on the number of categories we have in our variable, not on the number of people or observations like it was for our other tests. Luckily, they are still as simple to calculate.

Degrees of Freedom for χ2 Goodness of Fit Test

[latex]df=C-1[/latex]

So for our pet preference example, we have 3 categories, so we have 3-1=2 degrees of freedom. Our degrees of freedom, along with our significance level (still defaulted to α = 0.05), are used to find our critical values in a χ2 table, which is shown in Figure 1. Because we do not have directional hypotheses for χ2 tests (notice the number is a squared value so it can never be negative), we do not need to differentiate between critical values for 1- or 2-tailed tests. Just like our F tests for regression and ANOVA, all χ2 tests are 1-tailed tests. According to Figure 1, our critical value would be χ2crit = 5.99.
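If a printed table is not handy, the critical value for this particular case can be computed directly. The sketch below assumes df = 2, where the χ2 distribution has a simple closed form: the probability of exceeding a value x is exp(−x/2), so the critical value is −2 ln(α). This shortcut does not carry over to other degrees of freedom, and the helper name `chi2_critical_df2` is hypothetical.

```python
import math


def chi2_critical_df2(alpha=0.05):
    """Critical value of the chi-square distribution for df = 2 only.

    For df = 2, P(chi-square > x) = exp(-x / 2); solving for x at alpha
    gives x = -2 * ln(alpha).
    """
    return -2.0 * math.log(alpha)


n_categories = 3       # Cat, Dog, Other
df = n_categories - 1  # df = C - 1
critical = chi2_critical_df2(0.05)

print(df)                  # 2
print(round(critical, 2))  # 5.99
```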

Figure 1. First 10 rows of a χ2 table

Step 3 – Calculate the Test Statistic

The calculations for our test statistic in χ2 tests combine our information from our observed frequencies (O) and our expected frequencies (E) for each level of our categorical variable. For each cell (category) we find the difference between the observed and expected values, square them, and divide by the expected values. We then sum this value across cells for our test statistic.

χ2 Goodness of Fit Test Statistic

[latex]\chi^2=\sum\frac{(O-E)^2}{E}[/latex]

Where:

[latex]O=[/latex] the observed frequency for a category

[latex]E=[/latex] the expected frequency for a category

If we do this for our pet preference data, we would have the following.

Cat Dog Other Total
Observed 14 17 5 36
Expected 12 12 12 36

Table 2. Pet Preferences

[latex]\chi^2=\frac{(14-12)^2}{12}+\frac{(17-12)^2}{12}+\frac{(5-12)^2}{12}[/latex]

[latex]\chi^2=0.33+2.08+4.08=6.49[/latex]
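The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula above using the values from Table 2; the function name `chi_square` is our own.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E across all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [14, 17, 5]    # Cat, Dog, Other
expected = [12, 12, 12]   # E = N / C = 36 / 3

# Contribution of each category to the test statistic.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = chi_square(observed, expected)

print([round(t, 2) for t in terms])  # [0.33, 2.08, 4.08]
print(round(statistic, 2))           # 6.5
```

Carrying full precision gives 78/12 = 6.50; the 6.49 in the hand calculation comes from adding the three rounded terms, so the small difference is due only to rounding.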

Step 4 – Make a Decision and Interpret the Results

Now we can compare the test statistic of 6.49 to the critical value of 5.99. Because 6.49 is greater than 5.99, we can reject the null hypothesis and state that pet preference is different from random chance. So, let’s interpret this in APA style.

The sample of 36 people showed a significant preference for type of pet, χ2(2) = 6.49, p < .05.

Finding a Relationship between Two Categorical Variables

The goodness-of-fit test is a useful tool for assessing a single categorical variable. However, what is more common is wanting to know if two categorical variables are related to one another. This type of analysis is similar to a correlation, the only difference being that we are working with nominal data, which violates the assumptions of traditional correlation coefficients. This is where the χ2 test for independence comes in handy.

As noted above, our only description for nominal data is frequency, so we will again present our observations in a frequency table. When we have two categorical variables, our frequency table is crossed. That is, each combination of levels from the two categorical variables is presented. This type of frequency table is called a contingency table because it shows the frequency of each category in one variable contingent upon the specific level of the other variable.

An example contingency table is shown in Table 3, which displays whether or not 168 college students watched college sports growing up (Yes/No) and whether the students’ final choice of which college to attend was influenced by the college’s sports teams (Yes – Primary, Yes – Somewhat, No).

College Sports: Watched (rows) by Affected Decision (columns)
Watched Primary Somewhat No Total
Yes 47 26 14 87
No 21 23 37 81
Total 68 49 51 168

Table 3. Contingency table of college sports and decision making

In contrast to the frequency table for our goodness-of-fit test, our contingency table does not contain expected values, only observed data. Within our table, wherever our rows and columns cross, we have a “cell”. A cell contains the frequency of observing its corresponding specific levels of each variable at the same time. The top left cell in Table 3 shows us that 47 people in our study watched college sports as a child AND had college sports as their primary deciding factor in which college to attend.

Cells are numbered based on which row they are in (rows are numbered top to bottom) and which column they are in (columns are numbered left to right). We always name the cell using (R, C), with the row first and the column second. Based on this convention, the top left cell containing our 47 participants who watched college sports as a child and had sports as a primary criterion is cell (1, 1). The cell next to it, which has 26 people who watched college sports as a child but had sports only somewhat affect their decision, is cell (1, 2), and so on. We only number the cells where our categories cross. We do not number our total cells, which have their own special name: marginal values. Marginal values are the total values for a single category of one variable, added up across levels of the other variable. In Table 3, the marginal values are the entries in the Total row and the Total column. We can see that, in total, 87 of our participants (47+26+14) watched college sports growing up and 81 (21+23+37) did not. The total of these two marginal values is 168, the total number of people in our study. Likewise, 68 people used sports as a primary criterion for deciding which college to attend, 49 considered it somewhat, and 51 did not use it as a criterion at all. The total of these marginal values is also 168, our total number of people. The marginal values for rows and columns will always both add up to the total number of participants, N, in the study. If they do not, then a calculation error was made and you must go back and check your work.
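The check on the marginal values can be sketched in Python. The nested list below is simply our own way of laying out the observed cells of Table 3; the variable names are illustrative.

```python
# Observed frequencies from Table 3.
# Rows: watched college sports growing up (Yes, No).
# Columns: sports affected the college decision (Primary, Somewhat, No).
observed = [
    [47, 26, 14],
    [21, 23, 37],
]

row_totals = [sum(row) for row in observed]        # marginal values for rows
col_totals = [sum(col) for col in zip(*observed)]  # marginal values for columns
n_total = sum(row_totals)                          # N

# Both sets of marginal values must add up to N.
if sum(col_totals) != n_total:
    raise ValueError("Row and column totals disagree; check the table.")

print(row_totals)  # [87, 81]
print(col_totals)  # [68, 49, 51]
print(n_total)     # 168
```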

Expected Values of Contingency Tables

Our expected values for contingency tables are based on the same logic as they were for frequency tables, but now we must incorporate information about how frequently each row and column was observed (the marginal values) and how many people were in the sample overall (N) to find what random chance would have made the frequencies out to be.

Calculating Expected Values for Contingency Tables

[latex]E=\frac{n_c*n_r}{N}[/latex]

Where:

[latex]n_c=[/latex] the number of people in the column

[latex]n_r=[/latex] the number of people in the row

[latex]N=[/latex] the total number of people in the entire sample

So, for our data we would calculate expected values for each cell of observed values as follows.

College Sports: Watched (rows) by Affected Decision (columns)
Watched Primary Somewhat No Total
Yes [latex]\frac{87*68}{168}=35.21[/latex] [latex]\frac{87*49}{168}=25.38[/latex] [latex]\frac{87*51}{168}=26.41[/latex] 87
No [latex]\frac{81*68}{168}=32.79[/latex] [latex]\frac{81*49}{168}=23.62[/latex] [latex]\frac{81*51}{168}=24.59[/latex] 81
Total 68 49 51 168

Table 4. Contingency table of expected value calculations (with observed totals included for calculations) of college sports and decision making

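Notice that the expected values in each row and column still add up to the same marginal totals as before. This is because the expected frequencies simply spread each row total across the columns in proportion to the column totals (and, equivalently, each column total across the rows in proportion to the row totals). The expected values across all cells also add up to the same total N.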
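The expected-value formula can be applied to every cell at once. The following Python sketch (function and variable names are our own) computes the expected frequencies from the observed cells of the college sports example and confirms that the rows of expected values still add up to the observed row totals.

```python
def expected_counts(observed):
    """Expected frequency for each cell: E = (column total * row total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n_total = sum(row_totals)
    return [[n_r * n_c / n_total for n_c in col_totals] for n_r in row_totals]


# Rows: watched (Yes, No); columns: Primary, Somewhat, No.
observed = [[47, 26, 14], [21, 23, 37]]
expected = expected_counts(observed)

# Print each row of expected values rounded to two decimals.
for row in expected:
    print([round(e, 2) for e in row])

# The expected values in each row add back up to the observed row totals.
print([round(sum(row), 2) for row in expected])  # [87.0, 81.0]
```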

The observed and expected frequencies can be used to calculate the same χ2 statistic as we did for the goodness-of-fit test. Before we get there, though, we should look at the hypotheses and degrees of freedom used for contingency tables.

χ2 Test for Independence

The χ2 test performed on contingency tables is known as the test for independence. In this analysis, we are looking to see if the values of each categorical variable (that is, the frequencies of their levels) are related to or independent of the values of the other categorical variable. Let’s do the four-step test for this example.

Step 1 – State the Hypotheses

Because we are still doing a χ2 test, which is nonparametric, we still do not have statistical versions of our hypotheses. The actual interpretations of the hypotheses are quite simple: the null hypothesis says that the variables are independent or not related, and the alternative says that they are not independent or that they are related.

H0: Watching college sports as a child is not related to choosing college.

OR

H0: Watching college sports as a child is independent of college choice.

HA: Watching college sports as a child is related to choosing college.

OR

HA: Watching college sports as a child is not independent of college choice.

Step 2 – Find the Critical Value

For step 2, the only change is the degrees of freedom formula. Our critical value will come from the same table that we used for the goodness-of-fit test, but our degrees of freedom will change. Because we now have rows and columns (instead of just columns), our new degrees of freedom are found by multiplying two numbers together.

Degrees of Freedom for Test for Independence

[latex]df=(R-1)(C-1)[/latex]

Where:

[latex]R=[/latex] the number of rows in the contingency table (or the number of categories in the first categorical variable)

[latex]C=[/latex] the number of columns in the contingency table (or the number of categories in the second categorical variable)

So, for our example, we have

[latex]df=(2-1)(3-1)=1*2=2[/latex]

If we go back to Figure 1 and look at df = 2 for α = .05, we get a critical value of 5.99 again.
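As a small illustration, the degrees of freedom can be read straight off the shape of the contingency table. The function name below is hypothetical, and the marginal Total row and column are not counted.

```python
def df_independence(n_rows, n_cols):
    """Degrees of freedom for the test for independence: (R - 1)(C - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Observed cells only: 2 rows (watched Yes/No) by 3 columns (Primary/Somewhat/No).
observed = [[47, 26, 14], [21, 23, 37]]
df = df_independence(len(observed), len(observed[0]))

print(df)  # 2
```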

Step 3 – Calculate the Test Statistic

As you can see below, we calculate our test statistic for the test for independence the same way we did for the goodness of fit test. Our equation just gets a bit longer because we now have to do it for every cell in the contingency table. So, for our 2×3 contingency table, we will have 6 components in our equation.

Test for Independence Test Statistic

[latex]\chi^2=\sum\frac{(O-E)^2}{E}[/latex]

Where:

[latex]O=[/latex] the observed frequency for a cell

[latex]E=[/latex] the expected frequency for a cell, found from the column total, row total, and sample size as [latex]E=\frac{n_c*n_r}{N}[/latex]

College Sports: Watched (rows) by Affected Decision (columns)
Watched Primary Somewhat No Total
Yes 47 (35.21) 26 (25.38) 14 (26.41) 87
No 21 (32.79) 23 (23.62) 37 (24.59) 81
Total 68 49 51 168

Table 5. Contingency table with observed (and expected) values of college sports and decision making

With the observed and expected values found in Table 5, we can calculate our χ2 test statistic for this example.

[latex]\chi^2=\frac{(47-35.21)^2}{35.21}+\frac{(26-25.38)^2}{25.38}+\frac{(14-26.41)^2}{26.41}+\frac{(21-32.79)^2}{32.79}+\frac{(23-23.62)^2}{23.62}+\frac{(37-24.59)^2}{24.59}[/latex]

[latex]\chi^2=3.94+0.02+5.83+4.24+0.02+6.26=20.31[/latex]
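To double-check the hand calculation, here is a minimal Python sketch that computes the expected value for each cell and adds up the six components. It uses unrounded expected values, and the function name is our own.

```python
def chi_square_independence(observed):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n_total  # expected value for cell (i, j)
            statistic += (o - e) ** 2 / e                # this cell's component
    return statistic


# Rows: watched (Yes, No); columns: Primary, Somewhat, No.
observed = [[47, 26, 14], [21, 23, 37]]
chi2 = chi_square_independence(observed)

print(round(chi2, 2))  # 20.31
```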

Step 4 – Make a Decision and Interpret the Results

The final decision for our test of independence is still based on our observed value (20.31) and our critical value (5.99). Because our observed value is greater than our critical value, we can reject the null hypothesis.

Reject H0. Based on our data from 168 people, we can say that there is a statistically significant relationship between whether or not someone watches college sports growing up and how much a college’s sports teams factor into that person’s decision on which college to attend, χ2(2) = 20.31, p < .05.

Effect Size for χ2

Like all other significance tests, χ2 tests – both goodness-of-fit and tests for independence – have effect sizes that can and should be calculated for statistically significant results. There are many options for which effect size to use, and the ultimate decision is based on the type of data, the structure of your frequency or contingency table, and the types of conclusions you would like to draw. For the purpose of our introductory course, we will focus only on a single effect size that is simple and flexible: Cramer’s V. This is appropriate to use when the χ2 test involves a matrix larger than 2 × 2. Cramer’s V is a type of correlation coefficient that can be computed on categorical data.

Cramer’s V

[latex]V=\sqrt{\frac{\chi^2}{N(k-1)}}[/latex]

Where:

[latex]N=[/latex] the total number of people in the sample

[latex]k=[/latex] the smaller value of either R (the number of rows) or C (the number of columns), that is, the number of categories for the variable with the smallest number of categories

[latex]\chi^2=[/latex] the test statistic calculated in Step 3

So, for our example above, we can calculate an effect size because we found a significant relationship in our 2×3 contingency table.

We know that N = 168 and k would either be 2 or 3, but we want the smaller of the two, so k = 2. And, finally, χ2 = 20.31.

[latex]V=\sqrt{\frac{20.31}{168(2-1)}}=\sqrt{\frac{20.31}{168}}=\sqrt{0.121}=0.35[/latex]
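The effect size calculation can be sketched as a short Python function. The name `cramers_v` and its argument order are our own choices for illustration.

```python
import math


def cramers_v(chi2, n_total, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (N * (k - 1))), where k is the smaller of R and C."""
    k = min(n_rows, n_cols)
    return math.sqrt(chi2 / (n_total * (k - 1)))


# College sports example: chi-square = 20.31, N = 168, 2 rows by 3 columns.
v = cramers_v(20.31, 168, 2, 3)

print(round(v, 2))  # 0.35
```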

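Like other effect sizes, Cramer’s V has cutoffs for small, medium, and large effects, shown in Table 6. For Cramer’s V, the df in this table is k − 1 (here, 2 − 1 = 1), not the degrees of freedom of the χ2 test. Our V of 0.35 falls between the medium (0.30) and large (0.50) cutoffs for df = 1, so the statistically significant relation between our variables was moderately strong.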

Small Medium Large
df = 1 0.10 0.30 0.50
df = 2 0.07 0.21 0.35
df = 3 0.06 0.17 0.29

Table 6. Effect size ranges for Cramer’s V

Additional Thoughts

Beyond Pearson’s Chi-Square Test: Standardized Residuals

For a more applicable example, let’s take the question of whether a Black driver is more likely to be searched when they are pulled over by a police officer, compared to a white driver. The Stanford Open Policing Project (https://openpolicing.stanford.edu/) has studied this, and provides data that we can use to analyze the question. We will use the data from the State of Connecticut since they are fairly small and thus easier to analyze.

The standard way to represent data from a categorical analysis is through a contingency table, which presents the number or proportion of observations falling into each possible combination of values for each of the variables. Table 7 below shows the contingency table for the police search data. It can also be useful to look at the contingency table using proportions rather than raw numbers, since they are easier to compare visually, so we include both absolute and relative numbers here.

searched Black White Black (relative) White (relative)
FALSE 36244 239241 0.13 0.86
TRUE 1219 3108 0.00 0.01

Table 7. Contingency Table for Police Search Data

The Pearson chi-squared test (discussed above) allows us to test whether observed frequencies are different from expected frequencies, so we need to determine what frequencies we would expect in each cell if searches and race were unrelated – which we can define as being independent. We can perform this test easily using our statistical software: χ2(1) = 828, p < .001. This shows that the observed data would be highly unlikely if there was truly no relationship between race and police searches, and thus we should reject the null hypothesis of independence.

When we find a significant effect with the chi-squared test, this tells us that the data are unlikely under the null hypothesis, but it doesn’t tell us how the data differ. To get a deeper insight into how the data differ from what we would expect under the null hypothesis, we can examine the residuals from the model, which reflect the deviation of the data (i.e., the observed frequencies) from the model (i.e., the expected frequencies) in each cell. Rather than looking at the raw residuals (which will vary simply depending on the number of observations in the data), it’s more common to look at the standardized residuals (sometimes called Pearson residuals).

Table 8 shows these for the police stop data from the χ2 test above. Remember that we examined the question of whether a Black driver is more likely to be searched when they are pulled over by a police officer, compared to a white driver. These standardized residuals can be interpreted as Z scores – in this case, we see that the number of searches for Black individuals is substantially higher than expected based on independence, and the number of searches for white individuals is substantially lower than expected. This provides us with the context that we need to interpret the significant chi-squared result.

searched driver_race Standardized residuals
FALSE Black -3.3
TRUE Black 26.6
FALSE White 1.3
TRUE White -10.4

Table 8. Summary of standardized residuals for police stop data
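The residuals in Table 8 can be reproduced from the counts in Table 7. The sketch below assumes each residual is computed as (O − E)/√E, with E found from the row and column totals as in the test for independence; this assumption matches the tabled values to one decimal place. The dictionary layout and variable names are our own.

```python
import math

# Observed counts from Table 7, keyed by (searched, driver_race).
observed = {
    (False, "Black"): 36244,
    (True, "Black"): 1219,
    (False, "White"): 239241,
    (True, "White"): 3108,
}

# Marginal totals for each level of "searched" and each level of "driver_race".
n_total = sum(observed.values())
searched_totals = {}
race_totals = {}
for (searched, race), count in observed.items():
    searched_totals[searched] = searched_totals.get(searched, 0) + count
    race_totals[race] = race_totals.get(race, 0) + count

# Residual for each cell: (O - E) / sqrt(E).
residuals = {}
for (searched, race), o in observed.items():
    e = searched_totals[searched] * race_totals[race] / n_total
    residuals[(searched, race)] = (o - e) / math.sqrt(e)

for cell, value in residuals.items():
    print(cell, round(value, 1))

# The squared residuals add up to the chi-square statistic.
chi2 = sum(r ** 2 for r in residuals.values())
print(round(chi2))  # 828
```

Because each squared residual is one cell's (O − E)²/E component, the residuals show which cells contribute most to the overall χ2.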

Beware of Simpson’s paradox

Contingency tables represent summaries of large numbers of observations, but summaries can sometimes be misleading. Let’s take an example from baseball. Table 9 shows the batting data (hits/at bats and batting average) for Derek Jeter and David Justice over the years 1995-1997:

Player 1995 1996 1997 Combined
Derek Jeter 12/48 .250 183/582 .314 190/654 .291 385/1284 .300
David Justice 104/411 .253 45/140 .321 163/495 .329 312/1046 .298

Table 9. Player Batting data for 2 baseball players

If you look closely, you will see that something odd is going on: In each individual year Justice had a higher batting average than Jeter, but when we combine the data across all three years, Jeter’s average is actually higher than Justice’s! This is an example of a phenomenon known as Simpson’s paradox, in which a pattern that is present in a combined dataset may not be present in any of the subsets of the data. This occurs when there is another variable that may be changing across the different subsets – in this case, the number of at-bats varies across years, with Justice batting many more times in 1995 (when batting averages were low). We refer to this as a lurking variable, and it’s always important to be attentive to such variables whenever one examines categorical data.
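The reversal can be verified directly from the hits and at-bats in Table 9. This Python sketch (with our own data layout and function names) compares the two players year by year and then on the combined totals.

```python
# (hits, at bats) for each season, taken from Table 9.
batting = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)},
    "David Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
}


def batting_average(hits, at_bats):
    return hits / at_bats


def combined_average(seasons):
    """Average over all seasons pooled: total hits divided by total at bats."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return total_hits / total_at_bats


jeter = batting["Derek Jeter"]
justice = batting["David Justice"]

# Justice has the higher average within every individual year...
justice_higher_each_year = all(
    batting_average(*justice[year]) > batting_average(*jeter[year])
    for year in jeter
)
# ...but Jeter has the higher average once the years are pooled.
jeter_higher_combined = combined_average(jeter) > combined_average(justice)

print(justice_higher_each_year)  # True
print(jeter_higher_combined)     # True
```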

License

Introduction to Statistics for the Social Sciences Copyright © 2021 by Jennifer Ivie; Alicia MacKay is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
