
Summarizing Quantitative Data: Standard Deviation
In the previous section we computed the range and IQR for the exam scores for 45 calculus students:
5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2
Key: 5|0 = 50 points
Both the range and IQR are summary statistics that measure how much the data varies (how "spread out" it is), but the IQR is superior to the range because it is not sensitive to outliers. Yet the IQR is based on only two of the data values (albeit ones that hold special positions in the data set), much like the median is based on a single data value (albeit one that holds a special position in the data set). Might we be able to define a summary statistic that measures the variation in a data set while utilizing all of the data values, much like the formula for the mean uses all of the data values?
If, after the calculus exam in the example above, the instructor decided that one exam problem was too difficult and awarded an extra 10 points to each student, a stem-and-leaf display of the updated scores would look like this:
6|00
5|667889
5|00022233
4|55556667777788889
4|11112244
3|5
3|13
2|
2|2
Key: 6|0 = 60 points
Notice that the mode has shifted (up 10 points), and you can easily check that the median and mean have also increased by 10 points. But the data values are just as spread out as they were before: the range is 60 − 22 = 38 points, the same as before, and you can verify that the IQR remains the same as well. So when devising a summary statistic that measures variation, we want it to measure how much the data varies from some central value (like the mean) irrespective of what the actual values in the data set are.
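If you would like to check these claims with technology, here is a minimal Python sketch. The names `rows`, `scores`, `shifted` and `summary` are our own, chosen for illustration, and the quartiles come from Python's `statistics.quantiles` with its default method, which may use a slightly different quartile convention than the one in the previous section.

```python
import statistics

# Exam scores read off the stem-and-leaf display (stem = tens, leaf = ones).
rows = [
    (5, "00"), (4, "667889"), (4, "00022233"),
    (3, "55556667777788889"), (3, "11112244"),
    (2, "5"), (2, "13"), (1, ""), (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

def summary(data):
    """Return (mean, median, range, IQR) for a list of numbers."""
    # Assumption: quartiles use the default method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (statistics.mean(data), statistics.median(data),
            max(data) - min(data), q3 - q1)

# Award every student an extra 10 points.
shifted = [x + 10 for x in scores]

print("original:", summary(scores))
print("shifted: ", summary(shifted))
```

The first two entries (mean and median) should each go up by 10, while the last two (range and IQR) should not change.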
Returning to the original exam scores, let's compute how much each of those scores deviates from the mean score (37.27 points):
Because we're interested in the "average deviation" we could find the mean of the deviations, but you can check that this mean will be 0. (In fact, it's not hard to show that this mean will always be 0, no matter what the data values are, as the positive and negative deviations will cancel each other out.) How can we get around this? We could take the absolute value of each deviation, but it turns out that squaring the deviations will also solve our problem, with the added benefit of behaving "nicer" mathematically in later computations:
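The deviations and squared deviations can be computed with a few lines of Python. This is only a sketch: the variable names are our own, and the scores are listed in the order they appear in the stem-and-leaf display (top row first).

```python
# Exam scores read off the stem-and-leaf display (stem = tens, leaf = ones).
rows = [
    (5, "00"), (4, "667889"), (4, "00022233"),
    (3, "55556667777788889"), (3, "11112244"),
    (2, "5"), (2, "13"), (1, ""), (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

mean = sum(scores) / len(scores)

# How far each score lies from the mean (positive = above, negative = below).
deviations = [x - mean for x in scores]

# Squaring makes every deviation non-negative, so they can no longer cancel.
squared = [d ** 2 for d in deviations]

print("mean score:", round(mean, 2))
print("sum of deviations:", sum(deviations))  # zero, up to rounding error
print("first three squared deviations:", [round(s, 2) for s in squared[:3]])
print("last three squared deviations: ", [round(s, 2) for s in squared[-3:]])
```

Because computers store decimals approximately, the sum of the deviations may print as a tiny number such as 1e-14 rather than exactly 0.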
To find the mean of the squared deviations, we add up all of the squared deviations and get 162.14 + 162.14 + 76.27 + ··· + 264.60 + 203.54 + 638.40 = 2616.8, then divide by 45. Except we're going to divide by 44 instead (for technical reasons we'll try to explain later) to get 2616.8/44 = 59.47. We call this the variance of the exam scores. The only problem is that the units for the variance are "points squared," which doesn't make too much sense, so we take the square root to get √59.47 ≈ 7.7 points. We call this value the standard deviation of the exam scores.
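Here is one way the whole computation might look in Python, first step by step and then with the built-in functions `statistics.variance` and `statistics.stdev`, which also divide by one less than the number of data values. The variable names are our own.

```python
import math
import statistics

# Exam scores read off the stem-and-leaf display (stem = tens, leaf = ones).
rows = [
    (5, "00"), (4, "667889"), (4, "00022233"),
    (3, "55556667777788889"), (3, "11112244"),
    (2, "5"), (2, "13"), (1, ""), (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

n = len(scores)                     # 45 students
mean = sum(scores) / n

# Step 1: add up the squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in scores)

# Step 2: divide by n - 1 (44 here), not n, to get the variance.
variance = sum_sq / (n - 1)

# Step 3: take the square root to return to the original units (points).
sd = math.sqrt(variance)

print("sum of squared deviations:", round(sum_sq, 1))
print("variance:", round(variance, 2), "points squared")
print("standard deviation:", round(sd, 1), "points")

# Cross-check against the standard library.
print("statistics.variance:", round(statistics.variance(scores), 2))
print("statistics.stdev:   ", round(statistics.stdev(scores), 1))
```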
Computing standard deviation only involves arithmetic, but it is time-consuming, tedious arithmetic, so henceforth we will always use technology (a calculator or computer) to compute standard deviation.
If we had a valid reason to omit the outlier (the student who scored 12 points), the standard deviation would then be 6.8 points, which is noticeably smaller. Like the mean, the standard deviation is sensitive to outliers (and skewness), so we only use the mean and standard deviation to summarize the center and spread of data that is unimodal, roughly symmetric and devoid of significant outliers. For other data sets, we report the median and IQR instead.
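To see the effect of the outlier for yourself, you could compare the standard deviation with and without the 12-point score. The sketch below does this with Python's `statistics.stdev`; the name `without_outlier` is our own.

```python
import statistics

# Exam scores read off the stem-and-leaf display (stem = tens, leaf = ones).
rows = [
    (5, "00"), (4, "667889"), (4, "00022233"),
    (3, "55556667777788889"), (3, "11112244"),
    (2, "5"), (2, "13"), (1, ""), (1, "2"),
]
scores = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

# Drop the single outlier (the student who scored 12 points).
without_outlier = [x for x in scores if x != 12]

print("with outlier:   ", round(statistics.stdev(scores), 1), "points")
print("without outlier:", round(statistics.stdev(without_outlier), 1), "points")
```

Removing just one of the 45 scores lowers the standard deviation by almost a full point.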
Exercises
1. [OIS 1.36] Infant Mortality The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The (relative frequency) histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. (CIA Factbook, Country Comparison: Infant Mortality Rate, 2012)
a) Should we use the range, IQR or standard deviation to measure the spread of the mortality rates?
b) If, for some reason, the estimated mortality rate for Afghanistan was revised from 121.63 down to 110, would the standard deviation increase or decrease?
2. [OIS 1.43] Commuting times The histogram below shows the distribution of mean commuting times (in minutes) for 3,143 U.S. counties during 2010.
a) Should we use the range, IQR or standard deviation to measure the variability of commute times?
b) Would you expect the standard deviation of the commute times to be closer to 8 minutes, 80 minutes or 800 minutes?