Central Limit Theorem

We have already seen that the distribution of sample proportions (the proportions of successes in a binomial categorical variable, computed from repeated samples from a population) follows a Normal model with mean `E(hat p) = p` and standard deviation `SD(hat p) = sqrt((pq)/n)`, where `p` is the population proportion, `q = 1 - p`, and `n` is the sample size, as long as certain conditions are satisfied. We used this sampling distribution to compute P-values for one-proportion hypothesis tests and margins of error for one-proportion confidence intervals.

We turn our attention now to sample means. When analyzing data related to a quantitative variable we no longer compute a proportion. We might instead compute a sample mean, for example the mean age of voters interviewed in an exit poll, or the mean amount a household spends on cell phone service during a particular month.

It is a fact (which we won't prove, but which we can observe in simulations) that the distribution of sample means derived from a population will follow a Normal model if the original population is Normally distributed. Furthermore, the sample means will have an approximately Normal distribution even when the original population is skewed, as long as the sample size is "big enough" (n > 15 or so if the original population is mildly skewed, n > 40 or so if the original population is extremely skewed).
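The claim about skewed populations can be checked with a short simulation. The sketch below is only an illustration: the exponential population (mean 1, standard deviation 1), the number of repetitions, the seed, and the helper names `simulate_sample_means` and `skewness` are all our own assumptions rather than part of the course materials.

```python
import random
import statistics

def simulate_sample_means(n, num_samples=10_000, seed=0):
    """Draw num_samples samples of size n from a right-skewed population
    (exponential with mean 1 and SD 1) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

def skewness(values):
    """Skewness of a list of values: close to 0 for a symmetric,
    Normal-looking distribution, positive for a right-skewed one."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

# Compare the sample means for a few sample sizes.
for n in (1, 5, 40):
    means = simulate_sample_means(n)
    print(f"n={n:>2}  mean={statistics.fmean(means):.3f}  "
          f"SD={statistics.stdev(means):.3f}  skewness={skewness(means):.2f}")
```

As n grows, the skewness of the simulated sample means should move toward 0, while their center stays near the population mean of 1.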

Apples
Recently many supermarkets have introduced self-scan kiosks where shoppers scan their grocery items without the aid of a checker. To help prevent mistakes (and to prevent shoplifting), once a shopper has scanned an item and placed it in the bag, a scale records the weight of the item and compares it to the weight of the item stored in the store database; if the item weighs too little or too much, an alarm sounds and a store clerk is summoned to correct the problem. Of course, not all loaves of bread (for example) weigh exactly the same amount; thus, there must be a tolerance for variations in the weight of each item so that the alarm is not sounding constantly.

Most stores sell produce by the pound, but some grocery stores set a single price for each item. Suppose a store sells Red Delicious apples at the price of 4 for $1.00 (or $0.25 each). The store samples a large number of apples and finds that the mean mass of its Red Delicious apples is `mu = 235` grams, with a standard deviation of `sigma = 15` grams. (In the metric system, grams are a unit of mass, not weight, so we use the term "mass" from here on.)

Suppose a customer purchases 10 apples, and let the mass of each of the 10 apples be represented by a different random variable, which we will denote by `X_1, X_2, ..., X_{10}`. Then:

`E(X_1) = E(X_2) = cdots = E(X_{10}) =235` g

and:

`SD(X_1) = SD(X_2) = cdots = SD(X_{10}) = 15` g

The expected combined mass of 10 apples is then given by:

`E(X_1 + X_2 + cdots + X_{10}) = E(X_1) + E(X_2) + cdots + E(X_{10}) = 235 + 235 + cdots + 235 = 235 times 10 = 2350` g

with a variance (assuming the masses of the apples are independent of one another) given by:

`mbox{Var}(X_1 + X_2 + cdots + X_{10}) = mbox{Var}(X_1) + mbox{Var}(X_2) + cdots + mbox{Var}(X_{10}) = 15^2 + 15^2 + cdots + 15^2 = 15^2 times 10 = 2250` g²

so that the standard deviation is:

`SD(X_1 + X_2 + cdots + X_{10}) = sqrt(2250) approx 47.4` g

If a customer at the checkout scanner bought 10 apples with a combined mass of 2500 g, should this set off the alarm? Assuming that the combined mass of 10 randomly selected apples follows a Normal model (which will be true if the masses of individual apples are Normally distributed), we would use the Normal model N(2350,47.4):

[Figure: Normal model N(2350,47.4), shaded above 2500 g]

The probability of getting 10 randomly selected apples with a combined mass of 2500 g (or more) would be given by normalcdf(2500,1E99,2350,47.4) ≈ 0.08%. A randomly selected batch of 10 apples would have a mass in excess of 2500 g far less than 1% of the time, so we might consider this unusual and program the scanner to sound an alarm under these circumstances.
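For readers without a TI-84, the calculation for the combined mass can be reproduced with Python's standard library. This is a minimal sketch using the numbers from the text; the variable names are our own, and it relies on the same assumptions as above (independent apple masses and a Normal model for the total).

```python
from math import sqrt
from statistics import NormalDist

MU = 235.0     # mean mass of one apple, in grams (from the text)
SIGMA = 15.0   # standard deviation of one apple's mass, in grams
N = 10         # number of apples purchased

# Means add; variances add only if the apple masses are independent.
total_mean = N * MU              # 2350 g
total_sd = sqrt(N * SIGMA**2)    # sqrt(2250), about 47.4 g

# Upper-tail area of N(2350, 47.4) above 2500 g,
# the counterpart of normalcdf(2500, 1E99, 2350, 47.4).
total_model = NormalDist(mu=total_mean, sigma=total_sd)
p_over_2500 = 1 - total_model.cdf(2500)

print(f"E(total) = {total_mean:.0f} g, SD(total) = {total_sd:.1f} g")
print(f"P(total > 2500 g) = {p_over_2500:.4%}")
```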

Apple averages
Now suppose we're interested in the average (mean) mass of these 10 apples rather than the combined mass. The mean mass of these 10 apples is given by:

`bar x = (X_1 + X_2 + cdots + X_{10})/(10)`

and the expected mean mass of 10 randomly selected apples is given by:

`E( bar x) = E( (X_1 + X_2 + cdots + X_{10})/(10) ) = (1)/(10) [E(X_1) + E(X_2) + cdots + E(X_{10})] = (2350)/(10) = 235` g

so that (as we might expect) the expected mean mass of 10 apples is just the mass of an average apple.

The variance of the sample mean is given by:

` mbox{Var}( bar x ) = mbox{Var}((X_1 + X_2 + cdots + X_{10})/(10) ) = (1)/(10^2) [mbox{Var}(X_1) + mbox{Var}(X_2) + cdots + mbox{Var}(X_{10})] = (15^2 times 10)/(10^2) = (15^2)/(10)`

so `SD( bar x) = (15)/(sqrt(10)) approx 4.74` g.

In general, for a sample of n apples drawn from a population with mean μ and standard deviation σ, `E(bar x) = mu` and `SD(bar x) = (sigma)/(sqrt(n))`. This is a fundamental part of the Central Limit Theorem, and from now on we can simply use these formulas directly rather than going through the lengthier calculations as we did above. In order to use the CLT we also need to check two things:

Independent Trials: The CLT assumes the apple masses are independent. This may well not be true, but if we randomly sample the apples (and the sample constitutes only a small percentage of the population) it would be reasonable to assume independence. In this situation, shoppers may actually tend to choose the largest apples that they can find; therefore, we should proceed with caution.

Normality: We need to check that the masses of the individual apples follow a Normal model, or that the sample size is relatively large (n > 15 if the masses are slightly skewed, otherwise n > 40). Here the sample is quite small (n = 10) and we don't know what the distribution of the individual apple masses looks like, so we should proceed with caution.
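The general formulas for the sample mean can be wrapped in a small helper to show how the standard deviation shrinks as the sample size grows. This is only a sketch: the function name `sample_mean_model` is hypothetical, and it assumes the independence condition above holds.

```python
from math import sqrt

def sample_mean_model(mu, sigma, n):
    """Return (E(x-bar), SD(x-bar)) for samples of size n drawn
    independently from a population with mean mu and SD sigma."""
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    return mu, sigma / sqrt(n)

# Apple population from the text: mu = 235 g, sigma = 15 g.
for n in (1, 10, 40, 100):
    mean, sd = sample_mean_model(235, 15, n)
    print(f"n={n:>3}: E(x-bar) = {mean} g, SD(x-bar) = {sd:.2f} g")
```

The expected value stays at 235 g for every n, while the standard deviation falls from 15 g for a single apple to about 4.74 g for 10 apples and 1.5 g for 100 apples.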

If the CLT does apply, then the mean masses of all randomly selected samples of 10 apples would be expected to follow the Normal model N(235,4.74):

[Figure: Normal model N(235,4.74), shaded above 240 g]

What's the probability that a randomly selected sample of 10 apples has a mean mass greater than 240 g? normalcdf(240,1E99,235,4.74) ≈ 14.6%. We could also enter this into the TI-84 more directly as normalcdf(240,1E99,235,15/√(10)) ≈ 14.6%.
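The same tail probability can be computed in Python as a cross-check on the calculator result. This is a minimal sketch with variable names of our own choosing; it also restates the question in terms of the combined mass, since a mean above 240 g for 10 apples is the same event as a total above 2400 g.

```python
from math import sqrt
from statistics import NormalDist

# Sampling model for the mean mass of 10 apples: N(235, 15/sqrt(10)).
mean_model = NormalDist(mu=235, sigma=15 / sqrt(10))

# Upper-tail area above 240 g, the counterpart of
# normalcdf(240, 1E99, 235, 15/sqrt(10)) on the TI-84.
p_mean_over_240 = 1 - mean_model.cdf(240)

# The same question phrased for the total: a mean above 240 g is a
# combined mass above 2400 g, so the two probabilities should agree.
total_model = NormalDist(mu=2350, sigma=sqrt(10) * 15)
p_total_over_2400 = 1 - total_model.cdf(2400)

print(f"P(mean > 240 g)   = {p_mean_over_240:.1%}")
print(f"P(total > 2400 g) = {p_total_over_2400:.1%}")
```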

Exercises

1. Cumulative SAT scores generally follow a Normal model with μ = 1500 and σ = 300.

a) If you randomly select a student who took the SAT, compute the probability that his or her cumulative score exceeds 1600.

b) If you randomly select 10 students who took the SAT, compute the probability that their mean cumulative score exceeds 1600.

c) If you randomly select 100 students who took the SAT, compute the probability that their mean cumulative score exceeds 1600.

2. The distribution of heights of male adults between the ages 20 and 62 in the US is approximately Normal with mean 70.0″ and standard deviation 3.3″.

a) If you randomly select one adult male, compute the probability that his height is below 68 inches.

b) If you randomly select 25 adult males, compute the probability that their mean height is below 68 inches.

c) If you randomly select 250 adult males, compute the probability that their mean height is below 68 inches.
