13 Independent Measures ANOVA
Learning outcomes
In this chapter, you will learn how to:
- Identify when to use an ANOVA
- Conduct an independent measures ANOVA hypothesis test
- Evaluate effect size for an independent measures ANOVA
- Conduct post hoc tests for an independent measures ANOVA
- Identify the advantages of an independent measures ANOVA
Analysis of variance, often abbreviated to ANOVA, serves the same purpose as the t-tests we learned in the last unit: it tests for differences in group means. ANOVA is more flexible in that it can handle any number of groups, unlike t-tests which are limited to two groups (independent samples) or two time points (paired samples). Thus, the purpose and interpretation of ANOVA will be the same as it was for t-tests, as will the hypothesis testing procedure. However, ANOVA will, at first glance, look much different from a mathematical perspective, though as we will see, the basic logic behind the test statistic for ANOVA is actually the same.
Observing and Interpreting Variability
We have seen time and again that scores, be they individual data or group means, will differ naturally. Sometimes this is due to random chance, and other times it is due to actual differences. Our job as scientists, researchers, and data analysts is to determine if the observed differences are systematic and meaningful (via a hypothesis test) and, if so, what is causing those differences. Through this, it becomes clear that, although we are usually interested in the mean or average score, it is the variability in the scores that is key.
Take a look at Figure 1, which shows scores for many people on a test of skill used as part of a job application. The x-axis has each individual person, in no particular order, and the y-axis contains the score each person received on the test. As we can see, the job applicants differed quite a bit in their performance, and understanding why that is the case would be extremely useful information. However, there’s no interpretable pattern in the data, especially because we only have information on the test, not on any other variable (remember that the x-axis here only shows individual people and is not ordered or interpretable).
Figure 1. Scores on a job test
Our goal is to explain this variability that we are seeing in the dataset. Let’s assume that as part of the job application procedure we also collected data on the highest degree each applicant earned. With knowledge of what the job requires, we could sort our applicants into three groups: those applicants who have a college degree related to the job, those applicants who have a college degree that is not related to the job, and those applicants who did not earn a college degree. This is a common way that job applicants are sorted, and we can use ANOVA to test if these groups are actually different. Figure 2 presents the same job applicant scores, but now they are color coded by group membership (i.e. which group they belong in). Now that we can differentiate between applicants this way, a pattern starts to emerge: those applicants with a relevant degree (coded red) tend to be near the top, those applicants with no college degree (coded black) tend to be near the bottom, and the applicants with an unrelated degree (coded green) tend to fall into the middle. However, even within these groups, there is still some variability, as shown in Figure 2.
Figure 2. Applicant scores coded by degree earned
This pattern is even easier to see when the applicants are sorted and organized into their respective groups, as shown in Figure 3.
Figure 3. Applicant scores by group
Now that we have our data visualized into an easily interpretable format, we can clearly see that our applicants’ scores differ largely along group lines. Those applicants who do not have a college degree received the lowest scores, those who had a degree relevant to the job received the highest scores, and those who did have a degree but one that is not related to the job tended to fall somewhere in the middle. Thus, we have systematic differences (or variance) between our groups.
We can also clearly see that within each group, our applicants’ scores differed from one another. Those applicants without a degree tended to score very similarly, since the scores are clustered close together. Our group of applicants with relevant degrees varied a little bit more than that, and our group of applicants with unrelated degrees varied quite a bit. It may be that there are other factors that cause the observed score differences within each group, or they could just be due to random chance. Because we do not have any other explanatory data in our dataset, the variability we observe within our groups is considered random error, with any deviations between a person and that person’s group mean caused only by chance. Thus, we have unsystematic (random) variance within our groups.
The process and analyses used in ANOVA will take these two sources of variance (systematic variance between groups and random error within groups, or how much groups differ from each other and how much people differ within each group) and compare them to one another to determine if the groups have any explanatory value in our outcome variable. By doing this, we will test for statistically significant differences between the group means, just like we did for t-tests. We will go step by step to break down the math to see how ANOVA actually works.
Sources of Variance
ANOVA is all about looking at the different sources of variance (i.e. the reasons that scores differ from one another) in a dataset. Fortunately, the way we calculate these sources of variance takes a very familiar form: the Sum of Squares. Before we get into the calculations themselves, we must first lay out some important terminology and notation.
In ANOVA, we are working with two variables, a grouping or explanatory variable and a continuous outcome variable. The grouping variable is our predictor (it predicts or explains the values in the outcome variable) or, in experimental terms, our independent variable, and it is made up of k groups, with k being any whole number 2 or greater. That is, ANOVA requires two or more groups to work, and it is usually conducted with three or more. In ANOVA, we refer to groups as “levels”, so the number of levels is just the number of groups, which again is k. In the above example, our grouping variable was education, which had 3 levels, so k = 3. When we report any descriptive value (e.g. mean, sample size, standard deviation) for a specific group, we will use a subscript to denote which group it refers to. For example, if we have three groups and want to report the standard deviation s for each group, we would report them as s1, s2, and s3.
Our second variable is our outcome variable. This is the variable on which people differ, and we are trying to explain or account for those differences based on group membership. In the example above, our outcome was the score each person earned on the test. Our outcome variable will still use X for scores as before. When describing the outcome variable using means, we will use subscripts to refer to specific group means. So if we have k = 3 groups, our means will be M1, M2, and M3. These different means will be how we calculate our sums of squares.
Finally, we now have to differentiate between several different sample sizes. Our data will now have sample sizes for each group, and we will denote these with a lower case “n” and a subscript, just like with our other descriptive statistics: n1, n2, and n3. We also have the overall sample size in our dataset, and we will denote this with a capital N. The total sample size is just the group sample sizes added together.
Between Groups Sum of Squares
One source of variability we identified in Figure 3 of the above example was differences or variability between the groups. That is, the groups clearly had different average levels. The variability arising from these differences is known as the between groups variability, and it is quantified using Between Groups Sum of Squares.
Our calculations for sums of squares in ANOVA will take on the same form as they did for regular calculations of variance. Each observation, in this case the group means, is compared to the overall mean, in this case the grand mean, to calculate a deviation score. These deviation scores are squared so that they do not cancel each other out and sum to zero. The squared deviations are then added up, or summed. There is, however, one small difference. Because each group mean represents a group composed of multiple people, before we sum the squared deviation scores we must multiply each one by the number of people within that group.
As you can see, the only difference between this equation and the familiar sum of squares for variance is that we are adding in the sample size. Everything else logically fits together in the same way. We also have the option of using the computational formula as we did before, which can be easier to use for some. But either is an option.
Calculating Between Groups Sums of Squares
Computational Formula
[latex]SS_{between}=\displaystyle\sum_{j=1}^{k}\frac{T_j^2}{n_j}-\frac{G^2}{N}[/latex]
Where:
j = “jth” group where j = 1…k to keep track of which group mean and sample size we are working with
Tj = sum of scores within treatment j
nj = number of scores within treatment j
G = sum of all scores across all treatments 1 through k
N = number of all scores across all treatments 1 through k
Definitional Formula
[latex]SS_{between}=\displaystyle\sum_{j=1}^{k}\left[\left(M_j-M_{grand}\right)^2*n_j\right][/latex]
Where:
j = “jth” group where j = 1…k to keep track of which group mean and sample size we are working with
nj = number of scores within treatment j
Mj = mean of scores within treatment j
Mgrand = mean of scores across all treatments 1 through k
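As a rough check that the two formulas agree, we can sketch both in a few lines of Python. The three small groups below are made-up numbers chosen for easy arithmetic (they are not the job applicant data), and the function names are our own.

```python
# Hypothetical scores for k = 3 groups (invented for illustration only).
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]

def ss_between_definitional(groups):
    """Sum over groups of n_j * (M_j - M_grand)^2."""
    all_scores = [x for g in groups for x in g]
    m_grand = sum(all_scores) / len(all_scores)
    return sum(len(g) * (sum(g) / len(g) - m_grand) ** 2 for g in groups)

def ss_between_computational(groups):
    """Sum over groups of T_j^2 / n_j, minus G^2 / N."""
    G = sum(sum(g) for g in groups)   # sum of all scores
    N = sum(len(g) for g in groups)   # number of all scores
    return sum(sum(g) ** 2 / len(g) for g in groups) - G ** 2 / N

print(ss_between_definitional(groups))   # 54.0
print(ss_between_computational(groups))  # 54.0
```

Here the group means are 4, 7, and 10 and the grand mean is 7, so each formula gives 3(9) + 3(0) + 3(9) = 54.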
Within Groups Sum of Squares
The other source of variability in the figures comes from differences that occur within each group. That is, each individual deviates a little bit from their respective group mean, just like the group means differed from the grand mean. We therefore label this source the Within Groups Sum of Squares. Because we are trying to account for variance based on group-level means, any deviation from the group means indicates an inaccuracy or error. Thus, our within groups variability represents our error in independent measures ANOVA.
Calculating Within Groups Sums of Squares
[latex]SS_{within}=\displaystyle\sum_{j=1}^{k}SS_j[/latex]
[latex]SS_j=\sum(X_j-M_j)^2[/latex]
Where:
j = “jth” group where j = 1…k to keep track of which group mean and sample size we are working with
Xj = a score within treatment j
Mj = mean of scores within treatment j
The formula for this sum of squares is again going to take on the same form and logic. What we are looking for is the distance between each individual person and the mean of the group to which they belong. We calculate this deviation score, square it so that they can be added together, then sum all of them into one overall value.
In this instance, because we are calculating this deviation score for each individual person, there is no need to multiply by how many people we have. It is important to remember that the deviation score for each person is only calculated relative to their group mean: do not calculate these scores relative to the other group means.
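The same idea can be sketched in Python. Again, the scores are made-up numbers for illustration and the function name is our own; the point is only that each score is compared to its own group mean.

```python
# Hypothetical scores for k = 3 groups (invented for illustration only).
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]

def ss_within(groups):
    """Add up SS_j for every group, where SS_j = sum of (X - M_j)^2."""
    total = 0.0
    for g in groups:
        m_j = sum(g) / len(g)                    # this group's own mean
        total += sum((x - m_j) ** 2 for x in g)  # deviations from that mean only
    return total

print(ss_within(groups))  # 24.0
```

Each of these groups has SSj = 4 + 0 + 4 = 8, so the three groups together give 24.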
Total Sum of Squares
The Between Groups and Within Groups Sums of Squares represent all variability in our dataset. We also refer to the total variability as the Total Sum of Squares, representing the overall variability with a single number. The calculation for this score is exactly the same as it would be if we were calculating the overall variance in the dataset (because that’s what we are interested in explaining) without worrying about or even knowing about the groups into which our scores fall.
Calculating Total Sums of Squares
Computational Formula
[latex]SS_{total}=\sum(X^2)-\frac{(G)^2}{N}[/latex]
Where:
X = each individual score within each treatment 1 through k
G = sum of all scores across all treatments 1 through k
N = number of scores across all treatments 1 through k
Definitional Formula
[latex]SS_{total}=\sum(X-M_{grand})^2[/latex]
Where:
X = each individual score within each treatment 1 through k
Mgrand = mean of all scores across all treatments 1 through k
Finally, quite simply, the total sum of squares for independent measures ANOVA is just the between groups and the within groups added together.
[latex]SS_{total}=SS_{between}+SS_{within}[/latex]
We can see that our Total Sum of Squares is built from each individual score’s deviation from the grand mean, squared and then summed. As with our Within Groups Sum of Squares, we are calculating a deviation score for each individual person, so we do not need to multiply anything by the sample size; that is only done for Between Groups Sum of Squares.
An important feature of the sums of squares in ANOVA is that they all fit together. We could work through the algebra to demonstrate that if we added together the formulas for SSBetween and SSWithin, we would end up with the formula for SSTotal. This will prove to be very convenient, because if we know the values of any two of our sums of squares, it is very quick and easy to find the value of the third. It is also a good way to check calculations: if you calculate each SS by hand, you can make sure that they all fit together as shown above, and if not, you know that you made a math mistake somewhere.
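Rather than working through the algebra, we can at least confirm the relationship numerically. The sketch below uses made-up scores (not the chapter's data) and computes the total both ways, then checks that the between and within pieces add up to it.

```python
# Hypothetical scores for k = 3 groups (invented for illustration only).
groups = [[2, 4, 6], [5, 7, 9], [8, 10, 12]]

all_scores = [x for g in groups for x in g]
N = len(all_scores)   # number of all scores
G = sum(all_scores)   # sum of all scores
m_grand = G / N

# Total sum of squares: definitional and computational forms.
ss_total_def = sum((x - m_grand) ** 2 for x in all_scores)
ss_total_comp = sum(x ** 2 for x in all_scores) - G ** 2 / N

# The two pieces that should add up to the total.
ss_between = sum(len(g) * (sum(g) / len(g) - m_grand) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

print(ss_total_def, ss_total_comp)  # 78.0 78.0
print(ss_between + ss_within)       # 78.0
```

If the last line ever disagrees with the first, one of the three sums of squares was calculated incorrectly.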
ANOVA Table
All of our sources of variability fit together in meaningful, interpretable ways as we saw above, and the easiest way to do this is to organize them into a table. The ANOVA table, shown in Table 1, is how we calculate our test statistic.
Source | SS | df | MS | F |
Between | SSbetween | k-1 | [latex]\frac{SS_{between}}{df_{between}}[/latex] | [latex]\frac{MS_{between}}{MS_{within}}[/latex] |
Within | SSwithin | N-k | [latex]\frac{SS_{within}}{df_{within}}[/latex] | |
Total | SStotal | N-1 | ||
The first column of the ANOVA table, labeled “Source”, indicates which of our sources of variability we are using: between groups, within groups, or total. The second column, labeled “SS”, contains our values for the sums of squares that we learned to calculate above. Remember that the Total is the sum of the other two, in case you are only given two SS values and need to calculate the third.
The next column, labeled “df”, is our degrees of freedom. As with the sums of squares, there is a different df for each source of variability, and the formulas are presented in the table. Notice that the total degrees of freedom, N – 1, is the same as it was for our regular variance. This matches the SST formulation to again indicate that we are simply taking our familiar variance term and breaking it up into different sources. Also remember that the capital N in the df calculations refers to the overall sample size, not a specific group sample size. Notice that the total row for degrees of freedom, just like for sums of squares, is just the Between and Within rows added together. This is a convenient way to quickly check your calculations.
The fourth column, labeled “MS”, is our Mean Squares for each source of variance. A “mean square” is just another way to say variance. Each mean square is calculated by dividing the sum of squares by its corresponding degrees of freedom. Notice that we do this for the Between row and the Within row, but not for the Total row. There are two reasons for this. First, our Total Mean Square would just be the variance in the full dataset (put together the formulas to see this for yourself), so it would not be new information. Second, the Mean Square values for Between and Within would not add up to equal the Mean Square Total because they are divided by different denominators. This is in contrast to the SS and df columns, where the Total row was both the conceptual total (i.e. the overall variance and degrees of freedom) and the literal total of the other two rows.
The final column in the ANOVA table, labeled “F”, is our test statistic for ANOVA. The F statistic, just like a t- or z-statistic, is compared to a critical value to see whether we can reject or fail to reject a null hypothesis. Thus, although the calculations look different for ANOVA, we are still doing the same thing that we did in previous units. We are simply using a new type of data to test our hypotheses. We will see what these hypotheses look like shortly, but first, we must take a moment to address why we are doing our calculations this way.
ANOVA and Type I Error
You may be wondering why we do not just use another t-test to test our hypotheses about three or more groups the way we did in Unit 2. After all, we are still just looking at group mean differences. The reason is that our t-statistic formula can only handle up to two groups, one minus the other. With only two groups, we can move our population parameters for the group means around in our null hypothesis and still get the same interpretation: the means are equal, which can also be concluded if one mean minus the other mean is equal to zero. However, if we tried adding a third mean, we would no longer be able to do this. So, in order to use t-tests to compare three or more means, we would have to run a series of individual group comparisons.
For only three groups, we would have three t-tests: group 1 vs group 2, group 1 vs group 3, and group 2 vs group 3. This may not sound like a lot, especially with the advances in technology that have made running an analysis very fast, but it quickly scales up. With just one additional group, bringing our total to four, we would have six comparisons: group 1 vs group 2, group 1 vs group 3, group 1 vs group 4, group 2 vs group 3, group 2 vs group 4, and group 3 vs group 4. This makes for a logistical and computational nightmare for five or more groups.
A bigger issue, however, is our probability of committing a Type I Error. Remember that a Type I error is a false positive, and the chance of committing a Type I error is equal to our significance level, α. This is true if we are only running a single analysis (such as a t-test with only two groups) on a single dataset.
However, when we start running multiple analyses on the same dataset, our Type I error rate increases, raising the probability that we are capitalizing on random chance and rejecting a null hypothesis when we should not. ANOVA, by comparing all groups simultaneously with a single analysis, averts this issue and keeps our error rate at the α we set.
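To get a feel for how quickly the error rate can grow, the sketch below counts the pairwise comparisons for k groups and estimates the chance of at least one false positive across them. The estimate 1 − (1 − α)^c is not from this chapter and assumes the c comparisons are independent of one another, which pairwise t-tests on the same groups are not exactly, so treat the numbers as a rough approximation. The function names are our own.

```python
def n_pairwise(k):
    """Number of distinct pairs that can be formed from k groups."""
    return k * (k - 1) // 2

def familywise_error(alpha, n_tests):
    """Approximate chance of at least one Type I error across n_tests
    comparisons, ASSUMING the comparisons are independent."""
    return 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    c = n_pairwise(k)
    print(k, c, round(familywise_error(0.05, c), 3))
# 3 3 0.143
# 4 6 0.265
# 5 10 0.401
```

Under that assumption, even three comparisons at α = .05 carry roughly a 14% chance of at least one false positive.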
Hypotheses in ANOVA
So far we have seen what ANOVA is used for, why we use it, and how we use it. Now we can turn to the formal hypotheses we will be testing. As before, we have a null and an alternative hypothesis to lay out. Our null hypothesis is still the idea of “no difference” in our data. Because we have multiple group means, we simply list them out as equal to each other:
H0: There is no difference in the group means
H0: μ1 = μ2 = μ3
We list as many μ parameters as groups we have. In the example above, we have three groups to test, so we have three parameters in our null hypothesis. If we had more groups, say, four, we would simply add another μ to the list and give it the appropriate subscript, giving us:
H0: There is no difference in the group means
H0: μ1 = μ2 = μ3 = μ4
Notice that we do not say that the means are all equal to zero, we only say that they are equal to one another; it does not matter what the actual value is, so long as it holds for all groups equally.
Our alternative hypothesis for ANOVA is a little bit different. Let’s take a look at it and then dive deeper into what it means:
HA: At least one mean is different
The first difference is obvious: there is no mathematical statement of the alternative hypothesis in ANOVA. This is due to the second difference: we are not saying which group is going to be different, only that at least one will be. Because we do not hypothesize about which mean will be different, there is no way to write it mathematically. Related to this, we do not have directional hypotheses (greater than or less than) like we did before. Due to this, our alternative hypothesis is always exactly the same: at least one mean is different.
In the previous unit, we saw that, if we reject the null hypothesis, we can adopt the alternative, and this made it easy to understand what the differences looked like. In ANOVA, we will still adopt the alternative hypothesis as the best explanation of our data if we reject the null hypothesis. However, when we look at the alternative hypothesis, we can see that it does not give us much information. We will know that a difference exists somewhere, but we will not know where that difference is. Is only group 1 different but groups 2 and 3 the same? Is it only group 2? Are all three of them different? Based on just our alternative hypothesis, there is no way to be sure. We will come back to this issue later and see how to find out specific differences. For now, just remember that we are testing for any difference in group means, and it does not matter where that difference occurs.
Now that we have our hypotheses for ANOVA, let’s work through an example. We will continue to use the data from Figures 1 through 3 for continuity.
Example: Scores on Job Application Tests
Our data come from three groups of 10 people each, all of whom applied for a single job opening: those with no college degree, those with a college degree that is not related to the job opening, and those with a college degree from a relevant field. We want to know if we can use this group membership to account for our observed variability and, by doing so, test if there is a difference between our three group means. We will start, as always, with our hypotheses.
Step 1: State the Hypotheses
Our hypotheses are concerned with the means of groups based on education level, so:
H0: There is no difference between the means of the education groups
H0: μ1 = μ2 = μ3
HA: At least one mean is different
Again, we phrase our null hypothesis in terms of what we are actually looking for, and we use a number of population parameters equal to our number of groups. Our alternative hypothesis is always exactly the same.
Step 2: Find the Critical Values
Our test statistic for ANOVA, as we saw above, is F. Because we are using a new test statistic, we will get a new table: the F distribution table.
There are now two degrees of freedom we must use to find our critical value. These correspond to the numerator and denominator of our test statistic, which for the independent-measures ANOVA are our Between Groups and Within Groups rows, respectively. The dfB is the Degrees of Freedom: Between because it is the degrees of freedom value used to calculate the Mean Square Between, which in turn was the numerator of our F statistic. Likewise, the dfW is the Degrees of Freedom: Within because it is the degrees of freedom value used to calculate the Mean Square Within, which was our denominator for F.
The steps below will locate the critical value on most F distribution tables. For our example, we will use α = .05. The formula for dfB is k – 1, where k is the number of groups we are assessing. In this example, k = 3, so dfB = 2. This tells us that we will use the second column, the one labeled 2, to find our critical value. To find the proper row, we simply calculate dfW, which is N – k. The original prompt told us that we have “three groups of 10 people each,” so our total sample size is 30. This makes dfW = 27. If we follow the second column down to the row for 27, we find that our critical value is 3.35. We use this critical value the same way as we did before: it is our criterion against which we will compare our obtained test statistic to determine statistical significance.
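If no table is handy, the lookup can be reproduced for this particular example. The sketch below relies on a shortcut that is not part of this chapter: when the numerator degrees of freedom are exactly 2, the upper-tail area of the F distribution has a simple closed form that can be solved for the critical value. For any other dfB this shortcut does not apply, and a table or statistics software is needed. The function name is our own.

```python
def f_critical_df1_equals_2(alpha, df_within):
    """Critical F value, valid ONLY when df_between is exactly 2.

    With 2 numerator degrees of freedom,
        P(F > x) = (1 + 2x / df_within) ** (-df_within / 2),
    and setting that equal to alpha and solving for x gives the line below.
    """
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)

k, N = 3, 30                      # three groups of 10 people each
df_between, df_within = k - 1, N - k
print(df_between, df_within)      # 2 27
print(round(f_critical_df1_equals_2(0.05, df_within), 2))  # 3.35
```

The result matches the tabled critical value of 3.35 for 2 and 27 degrees of freedom.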
Because F values are calculated from variances, you will never have a negative F value. An F-distribution is a positively skewed distribution that starts at zero, peaks around 1.00, and continues to infinity (see Figure 4).
Figure 4. F distribution
Step 3: Calculate the Test Statistic
Now that we have our hypotheses and the criterion we will use to test them, we can calculate our test statistic. To do this, we will fill in the ANOVA table. When we do so, we will work our way from left to right, filling in each cell to get our final answer. We will assume that we are given the SS values as shown below:
Source | SS | df | MS | F |
Between | 8246 | | | |
Within | 3020 | | | |
Total | | | | |
These may seem like random numbers, but remember that they are based on the distances between the groups themselves and within each group. Figure 5 shows the plot of the data with the group means and grand mean included. If we wanted to, we could use this information, combined with our earlier information that each group has 10 people, to calculate the Between Groups Sum of Squares by hand.
However, doing so would take some time, and without the specific values of the data points, we would not be able to calculate our Within Groups Sum of Squares, so we will trust that these values are the correct ones.
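For readers who want reassurance about the Between Groups value, here is one way it could be checked. The actual group means are not listed in the text, so this sketch instead uses the mean differences reported in the post hoc section later in this chapter (19.50 and 40.60 relative to the no-degree group). Placing the no-degree group at 0 is purely an assumption for convenience; the Between Groups Sum of Squares depends only on the gaps between the means, not on where they sit.

```python
# Group means written relative to the "no degree" group. The baseline of 0.0
# is an assumption; only the gaps between the means affect SS_between.
relative_means = {"none": 0.0, "unrelated": 19.50, "relevant": 40.60}
n_per_group = 10

# With equal group sizes, the grand mean is the simple average of the group means.
m_grand = sum(relative_means.values()) / len(relative_means)

ss_between = sum(n_per_group * (m - m_grand) ** 2 for m in relative_means.values())
print(round(ss_between))  # 8246
```

The result agrees with the SS value given for the Between row.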
Figure 5. Means
We were given the sums of squares values for our first two rows, so we can use those to calculate the Total Sum of Squares (SSB + SSW = SST; 8246 + 3020 = 11266).
Source | SS | df | MS | F |
Between | 8246 | | | |
Within | 3020 | | | |
Total | 11266 | | | |
We also calculated our degrees of freedom earlier, so we can fill in those values. Additionally, we know that the total degrees of freedom is N – 1, which is 29. This value of 29 is also the sum of the other two degrees of freedom, so everything checks out.
Source | SS | df | MS | F |
Between | 8246 | 2 | | |
Within | 3020 | 27 | | |
Total | 11266 | 29 | | |
Now we have everything we need to calculate our mean squares. Our MS values for each row are just the SS divided by the df for that row, giving us:
Source | SS | df | MS | F |
Between | 8246 | 2 | 4123 | |
Within | 3020 | 27 | 111.85 | |
Total | 11266 | 29 | | |
Remember that we do not calculate a Total Mean Square, so we leave that cell blank. Finally, we have the information we need to calculate our test statistic. F is our MSB divided by MSW.
Source | SS | df | MS | F |
Between | 8246 | 2 | 4123 | 36.86 |
Within | 3020 | 27 | 111.85 | |
Total | 11266 | 29 | | |
So, working our way through the table given only two SS values, the total sample size, and the number of groups, we calculate our test statistic to be F = 36.86, which we will compare to the critical value in step 4.
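The left-to-right process of filling in the table can also be sketched as a short Python function. The function name and the dictionary layout are our own choices for illustration; the inputs are the two SS values, the number of groups, and the total sample size from this example.

```python
def anova_table(ss_between, ss_within, k, N):
    """Fill in an independent measures ANOVA table from two SS values."""
    df_between, df_within = k - 1, N - k
    ms_between = ss_between / df_between   # MS = SS / df for each row
    ms_within = ss_within / df_within
    return {
        "SS": (ss_between, ss_within, ss_between + ss_within),
        "df": (df_between, df_within, df_between + df_within),
        "MS": (ms_between, ms_within),     # no Total mean square
        "F": ms_between / ms_within,
    }

table = anova_table(8246, 3020, k=3, N=30)
print(table["SS"])  # (8246, 3020, 11266)
print(table["df"])  # (2, 27, 29)
print(round(table["MS"][0], 2), round(table["MS"][1], 2))  # 4123.0 111.85
print(round(table["F"], 2))  # 36.86
```

Each tuple lists the Between, Within, and (where it exists) Total entries in that order.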
Step 4: Make and Interpret the Decision
Our test statistic was calculated to be Ftest = 36.86 and our critical value was found to be Fcrit = 3.35. Our test statistic is larger than our critical value, so we can reject the null hypothesis.
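The decision rule, and the p < .05 that appears in the write-up, can be checked with a short sketch. As an assumption outside this chapter, it uses the closed form for the upper-tail area of the F distribution that holds only when the numerator degrees of freedom are exactly 2, as they are here; the function name is our own.

```python
def f_upper_tail_df1_equals_2(f_value, df_within):
    """P(F > f_value), valid ONLY when df_between is exactly 2."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)

f_obtained, f_critical, alpha = 36.86, 3.35, 0.05
p_value = f_upper_tail_df1_equals_2(f_obtained, df_within=27)

print(f_obtained > f_critical)  # True, so we reject the null hypothesis
print(p_value < alpha)          # True, consistent with reporting p < .05
```

Both comparisons lead to the same decision, as they always will: a test statistic beyond the critical value is the same thing as a p-value below α.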
Reject H0. Based on our 3 groups of 10 people, we can conclude that job test scores are statistically significantly different based on education level, F(2, 27) = 36.86, p < .05.
Notice that when we report F, we include both degrees of freedom. We always report the degrees of freedom between, then the degrees of freedom within, separated by a comma. We must also note that, because we were only testing for any difference, we cannot yet conclude which groups are different from the others. We will do so shortly, but first, because we found a statistically significant result, we need to calculate an effect size to see how big of an effect we found.
Effect Size: Variance Explained
Recall that the purpose of ANOVA is to take observed variability and see if we can explain those differences based on group membership. To that end, our effect size will be just that: the variance explained. You can think of variance explained as the proportion or percent of the differences we are able to account for based on our groups. We know that the overall observed differences are quantified as the Total Sum of Squares, and that our observed effect of group membership is the Between Groups Sum of Squares. Our effect size, therefore, is the ratio of these two sums of squares.
Calculating Eta-Squared
[latex]η^2=\frac{SS_{between}}{SS_{total}}[/latex]
The effect size η2 is called “eta-squared” and represents variance explained. For our example, our values give an effect size of:
[latex]η^2=\frac{8246}{11266}=0.73[/latex]
So, we are able to explain 73% of the variance in job test scores based on education. This is, in fact, a huge effect size, and most of the time we will not explain nearly that much variance. Our guidelines for the size of our effects are:
η2 | Size |
0.01 | Small |
0.09 | Medium |
0.25 | Large |
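The calculation and the guidelines can be put together in a small sketch. The function names are our own, and treating each guideline value as the lower cutoff for its label is an assumption made for illustration; the guidelines are rough benchmarks rather than strict boundaries.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variability explained by group membership."""
    return ss_between / ss_total

def effect_size_label(eta_sq):
    """Label by the largest guideline value reached (0.01, 0.09, 0.25).
    Using the guidelines as hard cutoffs is an assumption."""
    if eta_sq >= 0.25:
        return "large"
    if eta_sq >= 0.09:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"

eta_sq = eta_squared(8246, 8246 + 3020)
print(round(eta_sq, 2), effect_size_label(eta_sq))  # 0.73 large
```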
So, we found that not only do we have a statistically significant result, but also that our observed effect was very large! Now we can add this information to our APA write-up of our results.
Reject H0. Based on our 3 groups of 10 people, we can conclude that job test scores are statistically significantly different based on education level, F(2, 27) = 36.86, p < .05, η2 = 0.73.
However, we still do not know specifically which groups are different from each other. It could be that they are all different, or that only those who have a relevant degree are different from the others, or that only those who have no degree are different from the others. To find out which is true, we need to do a special analysis called a post hoc test.
Post Hoc Tests
A post hoc test is used only after we find a statistically significant result and need to determine where our differences truly came from. The term “post hoc” comes from the Latin for “after the event”. There are many different post hoc tests that have been developed, and most of them will give us similar answers. We will only focus here on the most commonly used ones. We will also only discuss the concepts behind each and will not worry about calculations.
Bonferroni Test
A Bonferroni test is perhaps the simplest post hoc analysis. It is a series of t-tests performed on each pair of groups. As we discussed earlier, the number of comparisons grows quickly as the number of groups increases, which inflates the Type I error rate. To avoid this, a Bonferroni test divides our significance level α by the number of comparisons we are making so that, when all of the comparisons are run, their error rates sum back up to our original Type I error rate. Once we have our new significance level, we simply run independent samples t-tests to look for differences between our pairs of groups. This adjustment is sometimes called a Bonferroni Correction. It is easy to do by hand if we want to compare obtained p-values to our new corrected α level, but it is more difficult to do when using critical values like we do for our analyses, so we will leave our discussion of it at that.
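The correction itself is a single division, as the sketch below shows. The function name is our own, and the two p-values are invented purely to show how the comparison would work; they are not results from the job applicant data.

```python
def bonferroni_alpha(alpha, k):
    """Corrected significance level when all pairs of k groups are compared."""
    n_comparisons = k * (k - 1) // 2
    return alpha / n_comparisons

corrected = bonferroni_alpha(0.05, k=3)
print(round(corrected, 4))  # 0.0167

# Hypothetical p-values (made up for illustration).
hypothetical_p_values = {"pair A": 0.003, "pair B": 0.030}
for pair, p in hypothetical_p_values.items():
    print(pair, p < corrected)
# pair A True
# pair B False
```

Notice that the second made-up p-value would have counted as significant at the original α = .05 but does not survive the correction.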
Tukey’s Honestly Significant Difference
Tukey’s Honestly Significant Difference (HSD) is a very popular post hoc analysis. This analysis, like Bonferroni’s, makes adjustments based on the number of comparisons, but it makes adjustments to the test statistic when running the comparisons of two groups. This test allows you to test whether the mean differences between pairs of groups are “honestly significantly different” by comparing them to a newly calculated value, the HSD.
Calculating Tukey’s HSD
[latex]HSD=q\sqrt{\frac{MS_{within}}{n}}[/latex]
This first equation works when you have equal sample sizes. However, when your sample sizes are not equal, you must adjust your equation as such:
[latex]HSD=q\sqrt{\frac{\displaystyle\sum_{j=1}^{k}\frac{MS_{within}}{n_j}}{k}}[/latex]
Where q is a value derived from a Studentized Range critical value table using alpha level, dfwithin, and k.
The differences between the group means are shown in the table below, and because we know that each group has 10 people, we can calculate HSD using the first equation above. With an alpha level of .05, dfwithin = 27, and k = 3, a Studentized Range critical value table gives us q = 3.523.
[latex]HSD=q\sqrt{\frac{MS_{within}}{n}}=3.523\sqrt{\frac{111.85}{10}}[/latex]
[latex]HSD=3.523\sqrt{11.185}=3.523*3.344=11.78[/latex]
Now, we compare each paired comparison’s mean difference to this value of 11.78. If it is larger than this value (if the numerical difference is bigger than this number, signs do not matter), then we can say that the two groups are significantly different from each other.
Comparison | Mean Difference | Tukey’s HSD = 11.78 |
None vs Relevant | 40.60 | 40.60 > 11.78 ** |
None vs Unrelated | 19.50 | 19.50 > 11.78 ** |
Relevant vs Unrelated | 21.10 | 21.10 > 11.78 ** |
As we can see, all of the paired comparisons have mean differences greater than Tukey’s HSD, so we can conclude that all three groups are different from one another.
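The HSD calculation and the three comparisons can be reproduced with a short Python sketch. The function names are our own; the value of q is taken from the table lookup described above rather than computed, and the second function follows the unequal sample size equation given earlier.

```python
import math

def tukey_hsd_equal_n(q, ms_within, n):
    """HSD when every group has the same sample size n."""
    return q * math.sqrt(ms_within / n)

def tukey_hsd_unequal_n(q, ms_within, group_sizes):
    """HSD using the average of MS_within / n_j across the k groups."""
    k = len(group_sizes)
    return q * math.sqrt(sum(ms_within / n_j for n_j in group_sizes) / k)

q, ms_within = 3.523, 111.85   # q from a Studentized Range table
hsd = tukey_hsd_equal_n(q, ms_within, n=10)
print(round(hsd, 2))  # 11.78

mean_differences = {
    "None vs Relevant": 40.60,
    "None vs Unrelated": 19.50,
    "Relevant vs Unrelated": 21.10,
}
for pair, diff in mean_differences.items():
    # Signs do not matter, so compare the absolute difference to HSD.
    print(pair, abs(diff) > hsd)
# None vs Relevant True
# None vs Unrelated True
# Relevant vs Unrelated True
```

When all group sizes are equal, the unequal sample size version reduces to the same value, so `tukey_hsd_unequal_n(q, ms_within, [10, 10, 10])` also gives about 11.78.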
Scheffe’s Test
Another common post hoc test is Scheffe’s Test. Like Tukey’s HSD, Scheffe’s test adjusts the test statistic for how many comparisons are made, but it does so in a slightly different way. The result is a test that is “conservative,” which means that it is less likely to commit a Type I Error, but this comes at the cost of less power to detect effects.
There are many more post hoc tests than just these three, and they all approach the task in different ways, with some being more conservative and others being more powerful. In general, though, they will give highly similar answers. What is important here is to be able to interpret a post hoc analysis. If you are given post hoc results as confidence intervals for the difference between two group means, read them the same way we read confidence intervals in previous chapters: if the interval contains zero, there is no significant difference; if it does not contain zero, there is a difference.