
Ch. 21 Resources
Chapter 21: More About Tests
Let's work through another example of a hypothesis test concerning a single proportion. This time we'll look at some additional terminology and techniques associated with hypothesis tests.
Is red better than blue?
Researchers Russell A. Hill and Robert A. Barton from the Evolutionary Anthropology Research Group at the University of Durham in the United Kingdom analyzed data from four combat sports at the 2004 Olympic Games in Athens (boxing, tae kwon do, Greco-Roman wrestling and freestyle wrestling). In each contest, one contestant was randomly assigned a red uniform (or body protector) and the other a blue uniform (or body protector). After excluding matches in which there was a forfeit or disqualification, they found that the red contestant won 242 out of 441 times (or about 54.9% of the time). If there were no advantage to wearing red, we would expect that the contestant wearing red would win about 50% of the time. Does a sample proportion of nearly 55% provide evidence that contestants wearing red have some sort of advantage?
Here are links to a PDF of Hill and Barton's article as printed in the British journal Nature, along with their actual data set, a document detailing their methodology, and an NPR report discussing their conclusions. (You should briefly look over these documents, then listen to the NPR story before continuing through this example.)
- "Study: Red Is the Color of Olympic Victory"
  An NPR story about Hill and Barton's study.
- "Red enhances human performance in contests" (PDF, 135K)
  Russell A. Hill and Robert A. Barton. Nature, Vol. 435, 19 May 2005.
- Hill and Barton's methods (PDF, 12K)
- Hill and Barton's data (Excel, 66K)
Let's test Hill and Barton's hypothesis at a significance level of α = 0.05. We first state our null and alternative hypotheses:
H0: `p = 0.500`
HA: `p > 0.500`
Here p represents the proportion of all matches in which a contestant wearing red wins (not just those at the 2004 Olympic Games).
Next we check the conditions.
Independence Assumption: It is plausible that each match is independent of the others, at least as far as the color of each contestant's uniform is concerned.
Randomization: The contestants were not randomly selected, but the colors of their uniforms in each match were.
10% Condition: The data includes nearly 100% of matches in these sports at the 2004 Olympic Games, but we may consider these to be 441 instances of an infinite number of trials in which red and blue uniforms are randomly assigned to contestants.
Success/Failure Condition: We check that `np_0 = 441(0.500) = 220.5 ge 10` and `nq_0 = 441(0.500) = 220.5 ge 10`, so this condition is satisfied.
We are hypothesizing that `p = 0.500`, so `E( hat p ) = p = 0.5` and `SD( hat p ) = sqrt((p_0 q_0)/(n)) = sqrt(((0.5)(0.5))/(441)) approx 0.0238`
so we will use the model N(0.5, 0.0238).
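The condition check and the standard deviation above can also be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the variable names are ours and do not come from Hill and Barton's materials.

```python
from math import sqrt

n = 441      # matches analyzed
p0 = 0.5     # proportion of red wins under the null hypothesis
q0 = 1 - p0

# Success/Failure Condition: expected successes and failures are both at least 10
expected_successes = n * p0
expected_failures = n * q0
condition_ok = expected_successes >= 10 and expected_failures >= 10

# Standard deviation of the sample proportion under the null hypothesis
sd = sqrt(p0 * q0 / n)

print(expected_successes, expected_failures, condition_ok)
print(round(sd, 4))
```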
Given that `p = 0.500`, the probability of observing a sample proportion of `hat p = 0.549` or greater is about 0.02; in other words, the P-value for this test is
normalcdf(242/441,1E99,0.5,√(0.5*0.5/441)) ≈ 0.0203
Since P ≈ 0.02 < 0.05 = α, we reject the null hypothesis and conclude that there is evidence (P = 0.02) to support the claim that contestants wearing red uniforms are more likely to win.
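If you don't have a calculator with normalcdf handy, the same upper-tail area can be sketched in Python with the standard library's `statistics.NormalDist`. The variable names here are ours; the numbers are the ones used above.

```python
from math import sqrt
from statistics import NormalDist

n, wins, p0 = 441, 242, 0.5
p_hat = wins / n

# Normal model for the sample proportion, assuming the null hypothesis is true
model = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

# Upper-tail area: P(p_hat >= observed value | p = 0.5)
p_value = 1 - model.cdf(p_hat)

alpha = 0.05
reject = p_value < alpha
print(round(p_value, 4), reject)
```

The printed P-value should agree with the normalcdf result above to about three decimal places.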
P-values
It's always important to state the P-value in the conclusion of a hypothesis test, but especially so in a case like this. If in fact the null hypothesis is true, there is a 2% chance that among 441 randomly selected matches we would observe a sample proportion of red wins of 54.9% or bigger. In other words,
`P( hat p ge 0.549 | p = 0.500) approx 0.02`
or:
P(observing a `hat p` at least this extreme | H0 is true) ≈ 0.02
Notice that the P-value is a conditional probability. We don't know whether or not the condition holds, but if we assume that it does we can compute the probability of observing a sample proportion at least this extreme.
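To see the conditional probability more concretely, we can compute it without the Normal model at all. The sketch below assumes only that, under H0, each match is an independent fair coin flip, and adds up the exact binomial probabilities of 242 or more red wins out of 441. This is a different technique from the one used in this chapter, shown only as a cross-check.

```python
from math import comb

n, wins = 441, 242

# Under H0 every sequence of n outcomes has probability 0.5**n, so the tail
# probability is (number of sequences with at least 242 red wins) / 2**n.
tail_count = sum(comb(n, k) for k in range(wins, n + 1))
exact_p = tail_count / 2**n

print(round(exact_p, 3))
```

The exact value comes out close to the 0.02 found with the Normal model, which is what the Success/Failure Condition leads us to expect.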
Type I and Type II Errors
Did we make the right decision when we rejected the null hypothesis? If not (in other words, if the null hypothesis is true but we rejected it anyway) then we made a Type I error. In this example, if a red uniform actually has no bearing on the outcome, Hill and Barton made a Type I error by concluding that it did.
Sports statistician Scott Berry, who is interviewed in the NPR story, did not find the evidence convincing enough to conclude that athletes wearing red uniforms are more likely to win. He might have been using a significance level of α = 0.01; if so he would have retained the null hypothesis since P ≈ 0.02 > 0.01 = α. Berry concluded that there was not sufficient evidence to support the claim that contestants wearing red are more likely to win. The P-value is still the same, but he came to a different conclusion based on a different standard of evidence. In this case, if Berry made the wrong decision in retaining the null hypothesis (in other words, if the null hypothesis were not true but he retained it anyway) he made a Type II error.
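The way one P-value can lead to two different decisions is easy to sketch in code. The helper `decide` below is a hypothetical name of ours, and α = 0.01 is only the significance level we supposed Berry might have used.

```python
def decide(p_value: float, alpha: float) -> str:
    """Return the test decision for a given P-value and significance level."""
    return "reject H0" if p_value < alpha else "retain H0"

p_value = 0.0203  # P-value found earlier in this example

for alpha in (0.05, 0.01):
    print(alpha, decide(p_value, alpha))
```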
Power
In our original solution we had α = 0.05, so:
P(Type I error) = P(rejecting H0 | H0 is true) = α = 0.05
If `p = 0.500`, we would (incorrectly) reject the null hypothesis about 5% of the time, or whenever `hat p ge` invNorm(0.95,0.5,0.0238) ≈ 0.539.
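The cutoff for rejecting H0 can be sketched in Python as well; `NormalDist.inv_cdf` plays the role of invNorm. The variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

n, p0, alpha = 441, 0.5, 0.05
null_model = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

# Smallest sample proportion that leads us to reject H0 at level alpha
critical = null_model.inv_cdf(1 - alpha)

# Check: the area above the cutoff is the Type I error rate
type_i_rate = 1 - null_model.cdf(critical)

print(round(critical, 3), round(type_i_rate, 3))
```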
How do we find the probability of a Type II error?
P(Type II error) = P(retaining H0 | H0 is not true)
The problem here is that even if we knew that H0 was not true (in this case, that `p ne 0.500`), we wouldn't know the value of `p`, so we couldn't construct an appropriate model in order to compute the desired probability.
Suppose for the moment that `p = 0.57` (in other words 57% of all such athletes wearing red uniforms defeat their opponents). Then:
P(Type II error) = P(retaining H0 | `p = 0.570`)
We retain H0 when `hat p le 0.539` so:
P(`hat p le 0.539` | `p = 0.570`) = normalcdf(-1E99,0.539,0.57,√(0.57*0.43/441)) ≈ 0.094
Thus β = P(Type II error) ≈ 0.094, and hence the power of the test is `1 - beta approx 1 - 0.094 = 0.906`.
We should reiterate that this is the power of the test if `p = 0.57`, but we have no way of knowing what `p` really is. In general, we don't know "the truth" and hence can't attach a specific value to the power.
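Although we can't know the true `p`, we can see how the power would change across several supposed values. The sketch below wraps the calculation above in a function; the name `power` and the list of trial values are our own choices for illustration. It uses the unrounded cutoff rather than 0.539, so its result for `p = 0.57` may differ from 0.906 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def power(p_true: float, n: int = 441, p0: float = 0.5, alpha: float = 0.05) -> float:
    """Power of the one-sided test of H0: p = p0 vs. HA: p > p0 if p = p_true."""
    null_model = NormalDist(p0, sqrt(p0 * (1 - p0) / n))
    critical = null_model.inv_cdf(1 - alpha)   # reject H0 when p_hat is above this
    alt_model = NormalDist(p_true, sqrt(p_true * (1 - p_true) / n))
    beta = alt_model.cdf(critical)             # P(retain H0 | p = p_true)
    return 1 - beta

# Hypothetical true proportions, chosen only for illustration
for p_true in (0.52, 0.55, 0.57, 0.60):
    print(p_true, round(power(p_true), 3))
```

The farther the supposed `p` is from 0.500, the greater the power of the test.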
Homework
Work the following exercises in Chapter 21: 1–7 odd, 11, 17, 19, 21, 25, 31.
Errata
In the W's margin note on page 531, the What should be "helmet status" (not "% wearing helmets").
In the For Example on page 532, note that the researchers are using more advanced techniques to analyze the data; if you compute the P-value for this test using the information provided here and the techniques we have learned in this chapter, you will get a much different value.
On page 534, the Success/Failure Condition should read np0 and nq0 (rather than np and nq).
Near the middle of the "How guilty is the suspect?" box on page 535, it should read "the defendant's height" (not "dependant's").
Exercise 5 should read "same decision at a α = 0.10" and "How about at a α = 0.01?"
ActivStats
Work through the lessons on pages 21-1 and 21-2 in the ActivStats lesson book, as time permits.