
Confidence Intervals
Earlier we looked at the results of the 2004 presidential election in Washington. In that situation we knew the population proportion, `p`, and we imagined what would happen if we drew random samples from that population and computed the proportion of people in the sample who voted for Bush. We denote such a sample proportion `hat p` and note that this proportion will vary from sample to sample. In fact, if we consider all possible samples of size `n`, we expect the possible values of `hat p` to follow a Normal model with mean `E(hat p) = p` and standard deviation `SD(hat p) = sqrt((pq)/(n))`; this Normal model will approximate the underlying binomial model as long as `np ge 10` and `nq ge 10`.
In the case of the 2004 Bush voters in Washington state, with samples of size 100, we expected the model to be N(0.4564, 0.0498), since `sqrt(((0.4564)(0.5436))/(100)) approx 0.0498`.
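The model parameters above can be reproduced with a short Python sketch. The function name `sampling_model` is hypothetical, chosen only for this illustration; the values `p = 0.4564` and `n = 100` come from the text.

```python
from math import sqrt

def sampling_model(p, n):
    """Return (mean, sd) of the Normal model for sample proportions."""
    q = 1 - p
    # The Normal approximation needs at least 10 expected successes and failures
    if n * p < 10 or n * q < 10:
        raise ValueError("need np >= 10 and nq >= 10")
    return p, sqrt(p * q / n)

mean, sd = sampling_model(0.4564, 100)
print(round(mean, 4), round(sd, 4))
```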
We note (from the 68-95-99.7 rule of thumb) that about 95% of all sample proportions should be within 2 SDs of the mean; in this case we would expect about 95% of all samples of size 100 to yield sample proportions between 0.357 and 0.556.
We can be more precise here, however, since the 68-95-99.7 rule is just a rule of thumb. If we want to find the cutoff values for the "middle 95%" of the sample proportions, we note that this excludes the "most extreme 5%," in other words the 2.5% in the lower tail and the 2.5% in the upper tail. The cutoff values for these tails are given by invNorm(0.025) ≈ -1.96 and invNorm(0.975) ≈ 1.96, so we can now say that we expect 95% of the sample proportions to be within 1.96 SDs of the mean.
In the case of our Washington state Bush voters, we would expect 95% of all samples of size 100 to yield sample proportions between `0.4564-1.96 times 0.0498 approx 0.359` and `0.4564+1.96 times 0.0498 approx 0.554`.
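If no calculator is at hand, the invNorm step can be sketched in Python. This assumes Python 3.8 or later, where the standard library's `statistics.NormalDist` provides `inv_cdf`; the variable names are illustrative only.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Cutoffs for the lower 2.5% and upper 2.5% tails
z_low = standard_normal.inv_cdf(0.025)
z_high = standard_normal.inv_cdf(0.975)

# Middle 95% of sample proportions for the Washington model N(0.4564, 0.0498)
mean, sd = 0.4564, 0.0498
lower = mean + z_low * sd
upper = mean + z_high * sd
print(round(z_low, 2), round(z_high, 2))
print(round(lower, 3), round(upper, 3))
```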
Now we consider a situation where we do not know `p`, but only a single `hat p`.
Exit Polls
Exit polling from the 2004 general election showed that among 1242 female voters in Washington state, 708 voted for John Kerry. We know (more or less) what percentage of all Washington voters voted for Kerry (53%), but ballots are not counted according to gender, so if we want to know something about the proportion of female voters who voted for Kerry we need to rely on exit polls. In this case we have
`n = 1242`, `hat p_1 = (708)/(1242) approx 0.57` and `hat q_1 approx 0.43`
I use the subscript 1 here since this `hat p_1` is the one sample proportion we do know out of many different possible values of `hat p`.
We'd like to know `p` but since we don't, we want to say as much as we can about what `p` might be.
Does a binomial model apply here? Can we use a Normal model to approximate that binomial model?
Plausible Independence Condition: A professional exit polling organization sampled the 1242 female voters, so we have reason to believe that they are representative of all female voters in Washington. It is plausible that these 1242 women are independent of one another, although we really don't know for sure.
Two outcomes: Either a female voter cast her ballot for Kerry, or she cast it for someone else. (We're only looking at whether or not someone voted for Kerry here, not whether they voted for Kerry or Bush or a third-party candidate.)
Constant probability of success: The 1242 women in our sample are a very small percentage of the more than 1,000,000 female Washington voters who participated in the 2004 general election, so it's reasonable to treat the probability of success as constant.
Independent trials: We would hope that a professional exit polling organization would employ randomization in their sampling procedure, although we don't know for sure.
Normal approximation: We need to check that `np ge 10` and `nq ge 10` but here our problem is that we don't know `p` and `q`. We do, however, expect that `hat p_1` is reasonably close to `p` and thus `hat q_1` should be reasonably close to `q`, so we can instead check that:
`n hat p_1 = (1242)(0.57) approx 708 ge 10` and `n hat q_1 = (1242)(0.43) approx 534 ge 10`
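Because `n hat p_1` is just the number of successes in the sample and `n hat q_1` the number of failures, this check can be run directly on the counts. A minimal sketch, with variable names chosen only for illustration:

```python
n = 1242          # female voters in the exit poll
successes = 708   # voted for Kerry
failures = n - successes

# Success/failure condition for using the Normal approximation
normal_ok = successes >= 10 and failures >= 10
print(successes, failures, normal_ok)
```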
We expect about 95% of all samples to yield sample proportions `hat p` that are within 1.96 SDs of `p`. The problem here is that we don't know `SD(hat p)` since we don't know `p` or `q`. We expect, however, that `hat p_1` (the one sample proportion we do know) is reasonably close to `p` and likewise that `hat q_1` is reasonably close to `q`, so in place of the standard deviation `SD(hat p)` we can use a reasonable estimate, which we call the standard error:
`SE(hat p) = sqrt((hat p_1 hat q_1)/(n)) = sqrt(((0.57)(0.43))/(1242)) approx 0.014`
We can now compute what we call the margin of error:
`ME = 1.96 times SE(hat p) = 1.96 times 0.014 approx 0.028`
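The standard error and margin of error can be recomputed in a few lines of Python. This sketch works from the raw counts rather than the rounded 0.57 and 0.43, so the intermediate values carry a few more digits than the hand computation.

```python
from math import sqrt

n = 1242
p_hat = 708 / n      # sample proportion, about 0.57
q_hat = 1 - p_hat    # about 0.43

se = sqrt(p_hat * q_hat / n)  # standard error of the sample proportion
me = 1.96 * se                # margin of error at 95% confidence
print(round(se, 3), round(me, 3))
```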
Here we use 1.96 because we want to trap 95% of all possible sample proportions; we call 95% the confidence level and call 1.96 the critical value for this confidence level, which in general we denote by z*. If in future problems we wish to use a different confidence level, we would need to recompute z*. Typical confidence levels are 90%, 95% and 99%, but polls of the type we're discussing here nearly always use a 95% confidence level.
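Recomputing z* for another confidence level means finding the cutoff that leaves half of the excluded percentage in each tail. A sketch in Python, assuming version 3.8 or later for `statistics.NormalDist`; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence_level):
    """Return z* such that the middle `confidence_level` of N(0, 1) lies within z* SDs."""
    tail = (1 - confidence_level) / 2  # area excluded in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_value(level), 3))
```

A higher confidence level gives a larger z*, and therefore a wider interval for the same sample.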
We still don't know what `p` is, but we now know that 95% of all sample proportions from random samples of size 1242 should be within 0.028 of `p`. In particular, we are 95% confident that the one sample proportion we do know (`hat p_1 = 0.57`) is within 0.028 of `p`. In general this means that we are 95% confident that:
`hat p_1 - ME < p < hat p_1 + ME`
and in particular in this case we are 95% confident that
`0.57 - 0.028 < p < 0.57 + 0.028`
or:
`0.542 < p < 0.598`
We call (0.542,0.598) the 95% confidence interval for the true proportion of female Washington voters who voted for John Kerry in 2004.
We can conclude that we are 95% confident that the true proportion of female Washington voters who voted for John Kerry in 2004 is between 54% and 60%.
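One way to see what "95% confident" means is to simulate the situation with a population proportion that we pretend to know. The sketch below assumes, purely for illustration, that the true proportion is 0.57; it draws many random samples of size 1242, builds an interval from each, and counts how often the interval traps the assumed `p`. The function name `covers`, the seed, and the number of trials are all arbitrary choices.

```python
import random
from math import sqrt

def covers(p_true, n, z_star=1.96, rng=random):
    """Draw one random sample of size n and report whether its interval contains p_true."""
    x = sum(rng.random() < p_true for _ in range(n))  # number of successes
    p_hat = x / n
    me = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me < p_true < p_hat + me

rng = random.Random(2004)  # fixed seed so the run is repeatable
trials = 1000
hits = sum(covers(0.57, 1242, rng=rng) for _ in range(trials))
coverage = hits / trials
print(coverage)
```

The printed fraction should land near 0.95: roughly 95% of the intervals built this way contain the true proportion, even though any single interval either does or does not.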
On the calculator
The TI-84 offers a shortcut to find a confidence interval: press STAT, move the cursor right to TESTS, then down to 1-PropZInt... and press ENTER. For x, use 708 (in general, the number of successes), for n use 1242 (in general, the sample size) and for C-Level use 0.95, then move the cursor to Calculate and press ENTER. You should see the confidence interval displayed as approximately (0.5425, 0.5976), which matches the interval (0.542, 0.598) found above up to rounding.
This is a good way to check your answers, but note that the TI-84 doesn't give you the margin of error directly, nor does it check assumptions and conditions, or interpret the meaning of the interval. (Warning: x must always be an integer, so be sure to round to the nearest integer if you need to estimate x when using the calculator.)
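For readers without a TI-84, the same computation can be sketched in Python. This is not the calculator's own implementation, only a function that follows the steps of this section; the name `one_prop_z_interval` is hypothetical. Unlike the calculator, it also reports the margin of error and refuses to run when the success/failure condition fails.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(x, n, c_level=0.95):
    """Return (lower, upper, p_hat, margin_of_error) for a one-proportion z-interval."""
    if x != int(x):
        raise ValueError("x must be an integer count of successes")
    if x < 10 or n - x < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    p_hat = x / n
    z_star = NormalDist().inv_cdf((1 + c_level) / 2)  # critical value
    me = z_star * sqrt(p_hat * (1 - p_hat) / n)       # z* times standard error
    return p_hat - me, p_hat + me, p_hat, me

lower, upper, p_hat, me = one_prop_z_interval(708, 1242, 0.95)
print(round(lower, 4), round(upper, 4), round(me, 4))
```

If a poll reports only a percentage, estimate x by multiplying the percentage by n and rounding to the nearest integer before calling the function, just as with the calculator.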
Exercises
1. The Democratic polling firm Greenberg Quinlan Rosner and the Republican polling firm American Viewpoint jointly conducted a poll during October 15−21, 2012, on behalf of the USC Dornsife College of Letters, Arts and Sciences and the Los Angeles Times, interviewing 1,504 randomly selected registered voters. Among those interviewed, 56% reported that they planned to vote for Barack Obama in the upcoming presidential election. Construct a 95% confidence interval for the proportion of all California voters who plan to vote for Pres. Obama on November 6.
2. A survey conducted by Key Research during October 9−12, 2012, interviewed 500 likely voters in the state of Utah, finding that 71% of those surveyed planned to vote for Mitt Romney in the upcoming presidential election. Construct a 95% confidence interval for the proportion of all Utah voters who plan to vote for Romney on November 6.