Ch. 18 Resources

Chapter 18: Sampling Distribution Models

This chapter contains two types of problems: the first involves models of sample proportions and the second involves models of sample means. We'll look at examples of each. Be aware, however, that we'll use the sample proportion models exclusively from Chapters 19 through 22; we won't see the sample mean models again until Chapter 23. For this reason (especially if you're pressed for time) you may wish to cover the material about sample proportions now, and come back to the material about sample means and the Central Limit Theorem right before Chapter 23.

2004 presidential election

In the 2004 presidential election, 46% of voters in Washington state voted for President George W. Bush. More precisely, as can be seen from the official vote totals from the Washington Secretary of State, Bush received 45.64% of the vote in Washington state.

Suppose that we had randomly selected 100 voters on election day and asked each voter whether or not he or she voted for Bush. A binomial model is appropriate here: there are two possible outcomes; the probability of "success" is fixed (`p = 0.4564`) because we have taken a random sample; the trials are assumed to be independent of one another; and the number of trials is fixed (`n = 100`). We have to assume independence here, but we can check that this assumption is reasonable by checking the

10% Condition: There are a finite number of voters, but 100 voters account for far less than 10% of the 2,859,084 Washington voters who cast a vote for president in 2004.

Using a Normal model for our computation is easier than using a binomial model. Is it reasonable in this case to use a Normal model to approximate the binomial model? We check the

Success/Failure Condition: Since `np = (100)(0.4564) = 45.64 ge 10` and `nq = (100)(0.5436) = 54.36 ge 10` this condition is satisfied; a Normal model is appropriate.

On average, we expect `mu = np = 100(0.4564) = 45.64` voters in the sample to say they voted for Bush, with a standard deviation of `sigma = sqrt(npq) = sqrt(100(0.4564)(0.5436)) = sqrt(24.809904) approx 4.98` voters. Thus the Normal model N(45.64,4.98) is appropriate here:

[Figure: Normal model N(45.64,4.98), shaded above 50]
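The two condition checks and the parameters of the Normal model can be reproduced in a few lines of Python. This is only a sketch: the variable names are our own, and the values of `n`, `p`, and the number of voters are the ones quoted above.

```python
import math

n = 100                  # sample size
p = 0.4564               # proportion of Washington voters who chose Bush
q = 1 - p                # proportion who did not
population = 2_859_084   # Washington voters who cast a vote for president

# 10% Condition: the sample is less than 10% of the population.
ten_percent_ok = n < 0.10 * population

# Success/Failure Condition: expect at least 10 successes and 10 failures.
success_failure_ok = (n * p >= 10) and (n * q >= 10)

mu = n * p                     # expected number of Bush voters in the sample
sigma = math.sqrt(n * p * q)   # standard deviation of that count

print("10% Condition met:", ten_percent_ok)
print("Success/Failure Condition met:", success_failure_ok)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```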

Suppose we wanted to know the probability that the majority of our 100-person sample (i.e. more than 50) are Bush voters. This is approximately normalcdf(50,1E99,45.64,4.98) ≈ 19.1%.

Since any number between 50.5 and 51.5 would be interpreted as 51 people, while any number from 49.5 to 50.5 would be interpreted as 50 people, we could get a better approximation by using a continuity correction: normalcdf(50.5,1E99,45.64,4.98) ≈ 16.5%; the exact answer is given by 1-binomcdf(100,0.4564,50) ≈ 16.5%.
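For readers without a TI-84, the three calculations above (plain Normal approximation, Normal approximation with continuity correction, and exact binomial) can be sketched in Python using only the standard library. The helper `binomcdf` below is our own stand-in for the calculator function of the same name, written to take its arguments in the same order; it is not part of any library.

```python
import math
from statistics import NormalDist

n, p = 100, 0.4564
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
model = NormalDist(mu, sigma)          # Normal model for the count of Bush voters

def binomcdf(n, p, k):
    """P(X <= k) for a binomial count X (hypothetical helper mirroring the TI-84)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

plain = 1 - model.cdf(50)              # Normal approximation, no correction
corrected = 1 - model.cdf(50.5)        # Normal approximation, continuity correction
exact = 1 - binomcdf(n, p, 50)         # exact binomial: P(X >= 51)

print(f"plain:     {plain:.1%}")
print(f"corrected: {corrected:.1%}")
print(f"exact:     {exact:.1%}")
```

Because this sketch uses the unrounded standard deviation rather than 4.98, the later digits may differ very slightly from the calculator results.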

If we were to report this information in a newspaper article, however, we wouldn't say "46 out of 100 people voted for Bush" or "51 out of 100 people voted for Bush"; we would say that "about 46% of voters voted for Bush." So let's convert these numbers to percents or proportions (which is fairly easy in this case since `n = 100`): the mean proportion would be given by:

`(np)/(n) = p = 0.4564`

while the standard deviation would be given by:

`(sqrt(npq))/(n) = sqrt((npq)/(n^2)) = sqrt((pq)/(n))`

Note that we have simply converted from a mean count of 45.64 people (with a standard deviation of 4.98 people) to a mean proportion of 45.64% (with a standard deviation of 4.98%), so we have N(0.4564,0.0498):

[Figure: Normal model N(0.4564,0.0498), shaded above 0.50]

The probability that a majority of our 100 randomly selected Washington voters voted for Bush is about normalcdf(0.50,1E99,0.4564,0.0498) ≈ 19.1%. This is exactly the same answer that we got when working with the number of voters rather than proportion of voters.
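The reason the two answers agree is that dividing the count by `n` rescales the cutoff, the mean, and the standard deviation by the same factor, so the z-score does not change. A short Python sketch (the variable names are ours) makes this concrete:

```python
import math
from statistics import NormalDist

n, p = 100, 0.4564
q = 1 - p

count_model = NormalDist(n * p, math.sqrt(n * p * q))   # model for the number of voters
prop_model = NormalDist(p, math.sqrt(p * q / n))        # model for the proportion of voters

# z-score of the cutoff on each scale
z_count = (50 - count_model.mean) / count_model.stdev
z_prop = (0.50 - prop_model.mean) / prop_model.stdev

# upper-tail probability on each scale
p_count = 1 - count_model.cdf(50)
p_prop = 1 - prop_model.cdf(0.50)

print(f"z (counts) = {z_count:.4f}, z (proportions) = {z_prop:.4f}")
print(f"P (counts) = {p_count:.4f}, P (proportions) = {p_prop:.4f}")
```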

More voters

What is the probability of randomly selecting a 500-person sample and finding that a majority voted for Bush? Here:

`mu = p = 0.4564`

as before but

`sigma = sqrt((pq)/(n)) = sqrt(((0.4564)(0.5436))/(500)) approx 0.0223`

so we use the Normal model N(0.4564,0.0223):

[Figure: Normal model N(0.4564,0.0223), shaded above 0.50]

Notice that the center of the model is in the same place but the SD is much smaller: the chance of getting a 500-person sample with a majority of Bush voters is much smaller than the chance of getting a 100-person sample with a majority of Bush voters.

More precisely, the probability of randomly selecting a sample of 500 Washington voters and getting a sample with a majority of Bush voters is given by normalcdf(0.50,1E99,0.4564,0.0223) ≈ 2.5%. As before, this is just an approximation (we're still using the Normal model to approximate the binomial model) but in this case the exact answer is given by 1-binomcdf(500,0.4564,250) ≈ 2.3%, which is close to the value from our Normal approximation.
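The effect of the sample size can be explored with a small Python function. This is a sketch under the same assumptions as the text (a random sample from a population with `p = 0.4564`, and an even sample size so that "majority" means more than `n/2`); the function name `majority_probability` is our own.

```python
import math
from statistics import NormalDist

def majority_probability(n, p):
    """Return (Normal approximation, exact binomial) for P(more than half are successes).

    Hypothetical helper; assumes n is even.
    """
    q = 1 - p
    approx = 1 - NormalDist(p, math.sqrt(p * q / n)).cdf(0.50)
    at_most_half = sum(math.comb(n, i) * p**i * q**(n - i) for i in range(n // 2 + 1))
    return approx, 1 - at_most_half

for n in (100, 500):
    approx, exact = majority_probability(n, 0.4564)
    print(f"n = {n}: Normal approximation {approx:.1%}, exact {exact:.1%}")
```

Trying larger values of `n` shows the probability of a misleading majority shrinking further as the standard deviation `sqrt((pq)/(n))` decreases.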

Keep in mind that in this example we know the population proportion, `p`, since we have a record of each of the 2,859,084 voters who participated in the election. In most cases where we would want to take a sample of 100 voters or 500 voters we wouldn't know `p`. We'll discuss what to do in that case in the next chapter.

Apples

In the Bush voter examples we were collecting survey data about a categorical variable (voter preference, with possible values of "Bush" or "not Bush") and we then converted the counts of the categorical data to proportions. But what if we collected data about a quantitative variable?

Recently many supermarkets have introduced self-scan kiosks where shoppers scan their grocery items themselves. To help prevent mistakes (and to prevent shoplifting) once a shopper has scanned an item and placed it in the bag, a scale records the weight of the item and compares it to the weight of the item stored in the store database; if the item weighs too little or too much, an alarm sounds and a store clerk is summoned to correct the problem. Of course, not all loaves of bread (for example) weigh exactly the same amount; thus, there must be a tolerance for variations in the weights of each item so that the alarm is not sounding constantly.

Most often produce is sold by the pound, but suppose one grocery store has a special sale in which Red Delicious apples are sold at the price of 4 for $1.00 (or $0.25 each). The store samples a large number of apples and finds that the mean mass of its Red Delicious apples is `mu = 235` grams, with a standard deviation of `sigma = 15` grams. (In the metric system, grams are a unit of mass, not weight, so we use the term "mass" from here on out.)

Suppose a customer purchases 10 apples and let the mass of each of the 10 apples be represented by a different random variable, which we will denote by `X_1, X_2, ldots, X_{10}`. Then:

`E(X_1) = E(X_2) = cdots = E(X_{10}) =235` g

and:

`SD(X_1) = SD(X_2) = cdots = SD(X_{10}) = 15` g

The expected combined mass of 10 apples is then given by:

`E(X_1 + X_2 + cdots + X_{10}) = E(X_1) + E(X_2) + cdots + E(X_{10}) = 235 + 235 + cdots + 235 = 235 times 10 = 2350` g

with a variance (assuming the masses of the apples are independent of one another) given by:

`mbox{Var}(X_1 + X_2 + cdots + X_{10}) = mbox{Var}(X_1) + mbox{Var}(X_2) + cdots + mbox{Var}(X_{10}) = 15^2 + 15^2 + cdots + 15^2 = 15^2 times 10 = 2250`

so that the standard deviation is:

`SD(X_1 + X_2 + cdots + X_{10}) = sqrt(2250) approx 47.4` g

If a customer at the checkout scanner bought 10 apples with a combined mass of 2500 g, should this set off the alarm? Assuming that the combined mass of 10 randomly selected apples follows a Normal model (we don't know this to be the case, so we have to be careful) we would use the Normal model N(2350,47.4):

[Figure: Normal model N(2350,47.4), shaded above 2500]

The probability of getting 10 randomly selected apples with a combined mass of 2500 g (or more) would be given by normalcdf(2500,1E99,2350,47.4) ≈ 0.08%. A randomly selected batch of 10 apples would have a mass in excess of 2500 g far less than 1% of the time, so we might consider this unusual and program the scanner to sound an alarm under these circumstances.
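The calculation for the combined mass can be sketched in Python as follows. It rests on the same assumptions as the text: the 10 masses are independent, each has mean 235 g and SD 15 g, and their sum is treated as Normal. The variable names are our own.

```python
import math
from statistics import NormalDist

mu, sigma, n = 235, 15, 10          # per-apple mean and SD (grams), number of apples

mean_total = n * mu                 # expected values add
var_total = n * sigma**2            # variances add for independent masses
sd_total = math.sqrt(var_total)

# Probability that the combined mass exceeds 2500 g under the Normal model
prob_over_2500 = 1 - NormalDist(mean_total, sd_total).cdf(2500)

print(f"mean = {mean_total} g, SD = {sd_total:.1f} g")
print(f"P(total > 2500 g) = {prob_over_2500:.2%}")
```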

More Apples

Now suppose we're interested in the average mass of these 10 apples rather than the combined mass. The mean mass of these 10 apples is given by:

`bar x = (X_1 + X_2 + cdots + X_{10})/(10)`

and the expected mean mass of 10 randomly selected apples is given by:

`E( bar x) = E( (X_1 + X_2 + cdots + X_{10})/(10) ) = (1)/(10) [E(X_1) + E(X_2) + cdots + E(X_{10})] = (2350)/(10) = 235` g

so that (as we might expect) the expected mean mass of 10 apples is just the mass of an average apple.

The variance of the sample mean is given by:

` mbox{Var}( bar x ) = mbox{Var}((X_1 + X_2 + cdots + X_{10})/(10) ) = (1)/(10^2) [mbox{Var}(X_1) + mbox{Var}(X_2) + cdots + mbox{Var}(X_{10})] = (15^2 times 10)/(10^2) = (15^2)/(10)`

so `SD( bar x) = (15)/(sqrt(10)) approx 4.74` g.
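A quick simulation offers an informal check on these two results. The sketch below assumes, purely for illustration, that individual apple masses follow a Normal model with mean 235 g and SD 15 g (the text does not claim this), and that apples are chosen independently. The seed and the number of trials are arbitrary choices of ours.

```python
import random
import statistics

random.seed(18)              # fixed seed so the run is repeatable

MU, SIGMA, N = 235, 15, 10   # per-apple mean and SD (grams), apples per purchase
TRIALS = 20_000              # number of simulated 10-apple purchases (our choice)

# Mean mass of each simulated purchase of N apples
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(TRIALS)
]

print(f"mean of sample means: {statistics.fmean(sample_means):.2f} g")
print(f"SD of sample means:   {statistics.stdev(sample_means):.2f} g")
print(f"sigma / sqrt(n):      {SIGMA / N ** 0.5:.2f} g")
```

The simulated mean and SD should land close to 235 g and 4.74 g, though the exact digits depend on the seed.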

In general, for a random sample of n apples drawn from a population with mean μ and SD σ, `E(bar x) = mu` and `SD(bar x) = (sigma)/(sqrt(n))`. These results are a fundamental part of the Central Limit Theorem, and from now on we can use these formulas directly rather than going through the lengthier calculations above. In order to use the CLT we also need to check three things:

Random Sampling Condition: This may not be satisfied, as shoppers may choose the largest apples they can find.

Independence Assumption: Again, the masses may not be entirely independent, since shoppers may tend to choose the largest apples that they can find; therefore, we should proceed with caution.

10% Condition: 10 apples surely constitute less than 10% of all apples the store has in its inventory.

Since we're not sure that all of the conditions and assumptions have been satisfied, we should be careful about how we use the results of our analysis.

If the CLT does apply, then the mean masses of all randomly selected samples of 10 apples would be expected to follow the Normal model N(235,4.74):

[Figure: Normal model N(235,4.74), shaded above 240]

What's the probability that a randomly selected sample of 10 apples has a mean mass greater than 240 g? It is approximately normalcdf(240,1E99,235,4.74) ≈ 14.6%. We could also enter this into the TI-84 more directly as normalcdf(240,1E99,235,15/√(10)) ≈ 14.6%.
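The same probability can be computed in Python, again assuming the CLT applies. The sketch also checks a point worth noticing: a mean mass above 240 g for 10 apples is the same event as a combined mass above 2400 g, so the model for the mean and the model for the sum must give the same answer. The variable names are our own.

```python
import math
from statistics import NormalDist

mu, sigma, n = 235, 15, 10

# Model for the mean mass of n apples: N(mu, sigma / sqrt(n))
mean_model = NormalDist(mu, sigma / math.sqrt(n))
prob_mean = 1 - mean_model.cdf(240)

# Model for the combined mass of n apples: N(n * mu, sigma * sqrt(n))
total_model = NormalDist(n * mu, sigma * math.sqrt(n))
prob_total = 1 - total_model.cdf(n * 240)

print(f"SD of the mean:      {mean_model.stdev:.2f} g")
print(f"P(mean > 240 g):     {prob_mean:.1%}")
print(f"P(total > 2400 g):   {prob_total:.1%}")
```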

Homework

Work the following exercises in Chapter 18 now: 1-11 odd, 21 and 23; work these exercises once you've covered the material about sample means: 29, 31, 35, 37, 39 and 47.

Errata

The lowermost figure at the bottom of page 475 should be labeled `mu-3(sigma)/(sqrt(n))`, `mu-2(sigma)/(sqrt(n))`, etc. rather than just `-3(sigma)/(sqrt(n))`, etc.

Part d of Exercise 6 should read "of the sample proportions" (not "of the proportion").

Part a of Exercise 15 should read "groups of this size" (not "this group").

ActivStats

Work through the lessons on pages 18-1 through 18-3 in the ActivStats lesson book, as time permits.