Sample Proportions

In the 2004 presidential election, 46% of voters in Washington state voted for President George W. Bush. More precisely, as can be seen from the official vote totals from the Washington Secretary of State, Bush received 45.64% of the vote in Washington state.

Suppose that we had randomly selected 100 voters on election day and asked each voter whether or not he or she voted for Bush. A binomial model is appropriate here: each voter gives one of two possible outcomes; the probability of "success" is (more or less) fixed at `p = 0.4564`, since we are only selecting 100 voters out of the 2,859,084 Washington voters who cast a vote for president in 2004; and the trials are assumed to be independent of one another, since we are randomly selecting the 100 voters. So we have a fixed number of Bernoulli trials (`n = 100`), and thus a binomial model applies to this situation.

If we want to approximate the binomial model using a Normal model, we just need to check that `np = (100)(0.4564) = 45.64 ge 10` and `nq = (100)(0.5436) = 54.36 ge 10`.
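As a rough sketch of this check in Python, we can compute both expected counts and compare them with 10. The helper name `normal_approx_ok` and its `threshold` parameter are our own illustrative choices, not part of any library.

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True when both expected counts, np and nq, are at least `threshold`."""
    q = 1 - p
    return n * p >= threshold and n * q >= threshold

n, p = 100, 0.4564
expected_successes = n * p        # np
expected_failures = n * (1 - p)   # nq

print(round(expected_successes, 2), round(expected_failures, 2))
print(normal_approx_ok(n, p))
```

With a much smaller sample (say 15 voters) the same helper would report that the condition fails, since `np` would fall below 10.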

On average, we expect `mu = np = 100(0.4564) = 45.64` voters to say they voted for Bush, with a standard deviation of `sigma = sqrt(npq) = sqrt(100(0.4564)(0.5436)) = sqrt(24.809904) approx 4.98` voters. Thus the Normal model N(45.64,4.98) is appropriate here:

[Figure: Normal model N(45.64,4.98), shaded above 50]
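The mean and standard deviation of the count can be reproduced with a few lines of Python. This is only a sketch; the variable names are our own.

```python
import math

n, p = 100, 0.4564
q = 1 - p

mu = n * p                     # expected number of Bush voters in the sample
sigma = math.sqrt(n * p * q)   # standard deviation of that count

print(round(mu, 2), round(sigma, 2))
```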

Suppose we wanted to know the probability that the majority of our 100-person sample (i.e. more than 50) are Bush voters. This is approximately normalcdf(50,1E99,45.64,4.98) ≈ 19.1%.
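For readers without a calculator, here is a minimal Python sketch of the same computation. It assumes that `normalcdf(lower, upper, mu, sigma)` returns the area under the Normal curve between `lower` and `upper`; the Python function below is our own stand-in built on the standard library's `statistics.NormalDist`, not the calculator's implementation.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu, sigma):
    """Area under N(mu, sigma) between lower and upper."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Probability that more than 50 of the 100 sampled voters are Bush voters,
# using 1E99 as a stand-in for "no upper limit", as in the text.
prob_majority = normalcdf(50, 1e99, 45.64, 4.98)
print(f"{prob_majority:.1%}")
```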

Any number between 50.5 and 51.5 would be interpreted as 51 people, while any number from 49.5 to 50.5 would be interpreted as 50 people, so we could get a better approximation by starting the shaded region at 50.5: normalcdf(50.5,1E99,45.64,4.98) ≈ 16.5%. The exact answer is given by 1-binomcdf(100,0.4564,50) ≈ 16.5%.
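The comparison between the adjusted Normal approximation and the exact binomial answer can be sketched in Python as follows. The function `binomcdf` here is our own hypothetical stand-in for the calculator function of the same name, assuming it returns the probability of at most `k` successes; it simply adds up the binomial probabilities.

```python
from math import comb
from statistics import NormalDist

n, p = 100, 0.4564

def binomcdf(n, p, k):
    """P(X <= k) for a binomial count X with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal approximation with the shaded region starting at 50.5 instead of 50.
corrected = 1 - NormalDist(45.64, 4.98).cdf(50.5)

# Exact binomial probability of 51 or more Bush voters.
exact = 1 - binomcdf(n, p, 50)

print(f"Normal approximation from 50.5: {corrected:.1%}")
print(f"Exact binomial:                 {exact:.1%}")
```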

If we were to report this information in a newspaper article, however, we wouldn't say "46 out of 100 people voted for Bush" or "51 out of 100 people voted for Bush," we would say that "about 46% of voters voted for Bush." So let's convert these numbers to percents or proportions (which is fairly easy in this case since `n = 100`); the mean proportion would be given by:

`(np)/(n) = p = 0.4564`

while the standard deviation would be given by:

`(sqrt(npq))/(n) = sqrt((npq)/(n^2)) = sqrt((pq)/(n))`

Note that we have simply converted from a mean count of 45.64 people (with a standard deviation of 4.98 people) to a mean proportion of 45.64% (with a standard deviation of 4.98%), so we have N(0.4564,0.0498):

[Figure: Normal model N(0.4564,0.0498), shaded above 0.50]
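To see that dividing the count's standard deviation by `n` agrees with the formula `sqrt((pq)/(n))`, here is a short Python sketch (the variable names are ours):

```python
import math

n, p = 100, 0.4564
q = 1 - p

sd_count = math.sqrt(n * p * q)            # SD of the number of Bush voters
sd_prop_from_count = sd_count / n          # rescale the count to a proportion
sd_prop_direct = math.sqrt(p * q / n)      # SD of the sample proportion

print(round(sd_prop_from_count, 4), round(sd_prop_direct, 4))
```

Both routes give the same value, which is the point of the algebra above.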

The probability that a majority of our 100 randomly selected Washington voters voted for Bush is about normalcdf(0.50,1E99,0.4564,0.0498) ≈ 19.1%. This is exactly the same answer that we got when working with the number of voters rather than proportion of voters.
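One way to see why the two answers agree is that the cutoff sits the same number of standard deviations above the mean on either scale. The sketch below, using Python's `statistics.NormalDist` and variable names of our own choosing, compares the two z-scores and the two tail probabilities.

```python
from statistics import NormalDist

# Count scale: N(45.64, 4.98), cutoff at 50 voters.
z_count = (50 - 45.64) / 4.98
prob_count = 1 - NormalDist(45.64, 4.98).cdf(50)

# Proportion scale: N(0.4564, 0.0498), cutoff at 0.50.
z_prop = (0.50 - 0.4564) / 0.0498
prob_prop = 1 - NormalDist(0.4564, 0.0498).cdf(0.50)

print(round(z_count, 3), round(z_prop, 3))
print(f"{prob_count:.1%}", f"{prob_prop:.1%}")
```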

More voters

What is the probability of randomly selecting a 500-person sample and finding that a majority voted for Bush? Here:

`mu = p = 0.4564`

as before but

`sigma = sqrt((pq)/(n)) = sqrt(((0.4564)(0.5436))/(500)) approx 0.0223`

so we use the Normal model N(0.4564,0.0223):

[Figure: Normal model N(0.4564,0.0223), shaded above 0.50]

Notice that the center of the model is in the same place but the SD is much smaller (less than half as large), so the chance of getting a 500-person sample with a majority of Bush voters is much smaller than the chance of getting a 100-person sample with a majority of Bush voters.
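The effect of the sample size can be explored with a small Python sketch. The helper `sample_proportion_model` is a hypothetical name of ours; it builds the Normal model for the sample proportion from `n` and `p` without rounding the SD first.

```python
import math
from statistics import NormalDist

p = 0.4564

def sample_proportion_model(n, p):
    """Normal model for the sample proportion: mean p, SD sqrt(pq/n)."""
    return NormalDist(p, math.sqrt(p * (1 - p) / n))

results = {}
for n in (100, 500):
    model = sample_proportion_model(n, p)
    tail = 1 - model.cdf(0.50)          # chance the sample shows a Bush majority
    results[n] = (model.stdev, tail)
    print(n, round(model.stdev, 4), f"{tail:.1%}")
```

Trying larger values of `n` in the loop shows the SD, and with it the tail probability, continuing to shrink.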

More precisely, the probability of randomly selecting a sample of 500 Washington voters and getting a sample with a majority of Bush voters is approximately normalcdf(0.50,1E99,0.4564,0.0223) ≈ 2.5%. As before, this is just an approximation (we're still using the Normal model to approximate the binomial model); the exact answer is given by 1-binomcdf(500,0.4564,250) ≈ 2.3%, which is close to the value from our Normal approximation.
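The same comparison for the 500-person sample can be sketched in Python. As an assumption, `binomcdf` below is our own stand-in for the calculator function, taken to return the probability of at most `k` successes.

```python
from math import comb
from statistics import NormalDist

n, p = 500, 0.4564

def binomcdf(n, p, k):
    """P(X <= k) for a binomial count X with n trials and success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal approximation on the proportion scale, using the rounded SD from the text.
normal_approx = 1 - NormalDist(0.4564, 0.0223).cdf(0.50)

# Exact binomial probability of more than 250 Bush voters out of 500.
exact = 1 - binomcdf(n, p, 250)

print(f"Normal approximation: {normal_approx:.1%}")
print(f"Exact binomial:       {exact:.1%}")
```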

Keep in mind that in this example we know the population proportion, `p`, since we have a record of each of the 2,859,084 voters who participated in the election. In most cases where we would want to take a sample of 100 voters or 500 voters we wouldn't know `p`. We'll discuss what to do in that situation soon.

Exercises

1. In the 2008 U.S. presidential election, 57.7% of Washington voters cast their ballot for Barack Obama and 40.5% voted for John McCain. We want to randomly select 150 Washington voters who participated in the 2008 election to ask about their plans for the 2012 election.

a) Compute the probability that a majority of the 150 voters selected voted for Obama in 2008.

b) Compute the probability that more than 60% of the 150 voters selected voted for Obama in 2008.

c) Compute the probability that more than 2% of the 150 voters selected voted for someone other than Obama or McCain in 2008.

2. In the 2008 U.S. presidential election, 61.3% of Idaho voters cast their ballot for John McCain and 36.0% voted for Barack Obama. We want to randomly select 1,200 Idaho voters who participated in the 2008 election to ask about their plans for the 2012 election.

a) Compute the probability that a majority of the 1,200 voters selected voted for McCain in 2008.

b) Compute the probability that less than 40% of the 1,200 voters selected voted for Obama in 2008.

c) Compute the probability that more than 2% of the 1,200 voters selected voted for someone other than Obama or McCain in 2008.