Summarizing Quantitative Data: Mean and Median

So far we have learned how to create graphical displays of a quantitative variable (histograms and stem-and-leaf displays) and describe the modality, skewness and outliers we might observe in those graphs. Now we will attempt to summarize a quantitative variable using summary statistics.

In the previous section we created a display of the exam scores for 45 calculus students:

5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2

Key: 5|0 = 50 points 

How could we summarize this data set using a single number? An instructor might report the class "average" on an exam like this one by adding all of the scores and dividing by 45:

`(12+21+23+25+31+31+ cdots +48+48+49+50+50)/(45) = 1677/45 approx 37.27`

Because the term "average" is often used more generically to mean "typical," we call 37.27 the mean score. If we let x represent the score variable, then we write `bar(x) = 37.27` (where we read `bar(x)` out loud as "x-bar"). In general, a bar over a variable name indicates a sample mean (that is, a mean computed from sample data). For the mean of a population, we use μ (the Greek letter mu, pronounced "myoo").
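The "add the scores and divide by 45" computation can also be checked with a short Python sketch. The names `rows`, `scores`, and `x_bar` are illustrative choices, not notation from this text; the data are read directly from the stem-and-leaf display above.

```python
# Rows of the stem-and-leaf display: (stem, leaves), where the stem is the tens digit.
rows = [
    (5, "00"),
    (4, "667889"),
    (4, "00022233"),
    (3, "55556667777788889"),
    (3, "11112244"),
    (2, "5"),
    (2, "13"),
    (1, ""),
    (1, "2"),
]

# Expand each leaf into a full score, e.g. stem 5 with leaf 0 is 50 points.
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

n = len(scores)      # number of students
total = sum(scores)  # sum of all scores
x_bar = total / n    # sample mean

print(n, total, round(x_bar, 2))  # 45 1677 37.27
```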

The mean has one drawback as a summary statistic. If, for some reason, we determined that the outlier of 12 didn't belong in this data set (perhaps the student in question wandered into the wrong classroom and took the exam anyway), we could recompute the mean to get:

`(21+23+25+31+31+ cdots +48+48+49+50+50)/(44) = 1665/44 approx 37.84`

Eliminating one data value can result in a fairly significant change to the summary statistic (here we'd round the first mean to 37 but the second mean to 38). Because of this, we avoid using the mean when a data set contains a significant outlier or is highly skewed. In those cases, we need another summary statistic that is less sensitive to outliers and skewness.
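To see how much a single score moves the mean, here is a minimal Python sketch that computes the mean with and without the outlier of 12. The variable names (`mean_all`, `mean_without`) are assumptions made for illustration.

```python
# Rows of the stem-and-leaf display: (stem, leaves), where the stem is the tens digit.
rows = [
    (5, "00"),
    (4, "667889"),
    (4, "00022233"),
    (3, "55556667777788889"),
    (3, "11112244"),
    (2, "5"),
    (2, "13"),
    (1, ""),
    (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

# Mean of all 45 scores.
mean_all = sum(scores) / len(scores)

# Mean after dropping the single outlier of 12 (44 scores remain).
without_outlier = [s for s in scores if s != 12]
mean_without = sum(without_outlier) / len(without_outlier)

print(round(mean_all, 2), round(mean_without, 2))  # 37.27 37.84
print(round(mean_all), round(mean_without))        # 37 38
```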

One option when looking for a "typical" value of a data set would be to locate the middle value. We can separate the 45 students who took this exam into 22 at the top and 22 at the bottom, leaving one in the middle (the 23rd score counting from either end, which falls among the 7s in the upper 3| row):

5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2

We call this middle value, 37, the median score on the exam. ("Median" comes from the Latin word medius, which means "middle.") If, for some reason, we omitted the outlier of 12 from the data set and recomputed the median, we would now have 44 values (22 in the top half and 22 in the bottom half):

5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13

Because there is no single middle value, we split the difference between the two values closest to the middle. In this example, both are 37, so the median is still 37 after omitting the outlier.
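The two cases above (an odd number of values with a single middle value, and an even number where we split the difference) can be sketched in Python as follows. The function name `median_by_hand` is a hypothetical helper written for this illustration.

```python
def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[mid]
    # Even count: split the difference between the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Rows of the stem-and-leaf display: (stem, leaves), where the stem is the tens digit.
rows = [
    (5, "00"),
    (4, "667889"),
    (4, "00022233"),
    (3, "55556667777788889"),
    (3, "11112244"),
    (2, "5"),
    (2, "13"),
    (1, ""),
    (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]
without_outlier = [s for s in scores if s != 12]

median_all = median_by_hand(scores)               # 45 values: the 23rd score
median_without = median_by_hand(without_outlier)  # 44 values: average of 22nd and 23rd

print(median_all, median_without)  # 37 37.0
```

Unlike the mean, the median does not change when the outlier is removed.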

Here is the number of attempts made by each student in an online class on a Web-based quiz during Fall Quarter 2006:

0|6
0|55
0|44
0|3333
0|22222222
0|11111111
0|000

Key: 0|6 = 6 attempts

We previously determined this distribution to be unimodal and positively skewed with no significant outliers. In the exercises below, you're asked to compute the mean (2.14) and median (2) of this data set. Note that the mean is slightly higher than the median, which will generally be the case for positively skewed data; for negatively skewed data, the mean will generally be lower than the median. For unimodal, symmetric data the mean and median will typically be quite close.
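As a check on the stated values, the sketch below uses Python's standard `statistics` module, which is one possible choice of technology and not one prescribed by this text. The list `leaves` is copied from the display above; the other names are illustrative.

```python
import statistics

# Leaves of the quiz-attempts display; every stem is 0, so each leaf is the value itself.
leaves = ["6", "55", "44", "3333", "22222222", "11111111", "000"]
attempts = [int(leaf) for row in leaves for leaf in row]

mean_attempts = statistics.mean(attempts)
median_attempts = statistics.median(attempts)

print(len(attempts), sum(attempts))               # 28 60
print(round(mean_attempts, 2), median_attempts)   # 2.14 2.0

# For this positively skewed data set the mean sits above the median.
print(mean_attempts > median_attempts)            # True
```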

For the exercises below, you should be able to compute the mean and median "by hand," but before long we will rely on technology to compute these (and other) summary statistics.

Exercises

1. Here are the midterm scores for the students enrolled in a third-quarter calculus class during Winter Quarter 2012:

34 24 40 43 32 47 46 30 33 50 36 48 37 36 36

a) Compute the mean.

b) Compute the median.

c) Refer to the graphical display of this data you created in the previous section. Looking at the display, would you expect the mean to be higher or lower than the median?

d) Does your answer to part c) agree with what you computed in parts a) and b)?

2. [OIS 1.36] Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The (relative frequency) histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. (CIA Factbook, Country Comparison: Infant Mortality Rate, 2012)

a) Would the mean or median be a more appropriate summary statistic for this data set?

b) If we omitted Afghanistan (with an estimated 121.63 deaths per 1,000 live births) from the data set, what would you expect to happen to the mean?

c) If we omitted Afghanistan, what would you expect to happen to the median?

3. [OIS 1.43] Commuting times. The histogram below shows the distribution of mean commuting times (in minutes) for 3,143 U.S. counties during 2010.

a) Would the mean or median be a more appropriate summary statistic for this data set? Explain.

b) Estimate (as best you can just by looking at the histogram) the median travel time.

c) Estimate the mean travel time.

4. The Ford Focus is a compact car introduced to North America in 1999 for model year 2000. The table below shows the model year, mileage (in miles) and asking price (in dollars) for all 14 used Ford Focus automobiles advertised for sale on the Web site of the Seattle Times on January 31, 2010.

year mileage  price
2007 25426 14595
2008 49223 13991
2008 49028 13991
2008 27690 11994
2008 36216 11980
2002 71646 10991
2007 41107 9671
2002 83454 8991
2007 49443 7988
2007 34179 7499
2002 63439 7475
2005 43012 5400
2001 86681 4494
2002 113000 2000


a) Refer to the graphical display of the mileage variable you created in the previous section. Would you expect the mean mileage to be higher than the median mileage, lower, or about the same?

b) Compute the mean mileage for these cars.

c) Compute the median mileage for these cars.

d) Are the results of b) and c) consistent with your response to part a)?

5. Compute the mean and median for the number of attempts made by each student in an online class on a Web-based quiz during Fall Quarter 2006:

0|6
0|55
0|44
0|3333
0|22222222
0|11111111
0|000

Key: 0|6 = 6 attempts