
Transforming Data
We previously looked at the exam scores for 45 calculus students:
5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2
Key: 5|0 = 50 points
and considered what would happen if the instructor awarded an extra 10 points to each student in the class:
6|00
5|667889
5|00022233
4|55556667777788889
4|11112244
3|5
3|13
2|
2|2
Key: 6|0 = 60 points
This is an example of shifting a data set: adding the same amount to (or subtracting the same amount from) each data value. Notice that the shape of the data remains unchanged.
Before the shift the median (the middle score, highlighted yellow in the top stem-and-leaf display) was 37; this score, like all of the others, was shifted up by 10 points, but it's still the middle score, so the new median (highlighted yellow in the bottom stem-and-leaf display) is 47. In any shifted data set, not just this one, the median will be shifted by the same amount as each data value.
The same holds true for the mean. The mean of the original exam scores (37.27 points) is found by adding up all of the scores (to get 1677 total points) and dividing by 45. To get the mean of the new scores, we add them all up, but since each score is 10 points higher than before, our grand total will be 10×45 = 450 points bigger than before, so (1677+450)/45 = 1677/45 + 450/45 = 37.27 + 10 = 47.27, which is exactly 10 points higher than the original mean.
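The same argument can be written in symbols. This is only a sketch: the names x_i (the individual scores), n (the number of scores), a (the amount of the shift) and x̄ (the mean) are notation introduced here, not taken from the text above.

```latex
\bar{x}_{\text{new}}
  = \frac{1}{n}\sum_{i=1}^{n} (x_i + a)
  = \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{n\,a}{n}
  = \bar{x} + a
```

With n = 45 and a = 10 this is the computation above: the total grows by 45×10 = 450 points, and the mean grows by 10.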
What happens to the range? The original range was 50−12 = 38 points and the new range is 60−22 = 38 points, the same as before.
What about the IQR? The original quartiles were 34 and 42 (shown in red), so the original IQR was 42−34 = 8 points; the new quartiles, 44 and 52 (also shown in red), are both 10 points higher than before, so their difference is still 52−44 = 8 points.
Finally, the standard deviation was computed by first computing the deviation from the mean (37.27) for each of the 45 exam scores; for the top score, for instance, the deviation is 50−37.27 = 12.73 points.
The new scores will all be 10 points higher, but the new mean (47.27) is also 10 points higher, so the deviations will be the same; for example: 60−47.27 = 12.73 points. From this stage on, the computation of the standard deviation is exactly the same, so the standard deviation remains unchanged when data are shifted.
To summarize: when shifting data, the measures of center (mean, median) are also shifted, but the measures of variation (range, IQR, SD) are not.
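The summary can be checked numerically. The following is a minimal Python sketch, not part of the original lesson: the list `scores` is read off the first stem-and-leaf display, the helper names `iqr` and `summary` are made up for illustration, the quartiles are assumed to be the medians of the lower and upper halves of the data (a convention that reproduces the 34 and 42 quoted above), and `stdev` is the sample standard deviation (dividing by n−1), since the text does not say which divisor it uses.

```python
from statistics import mean, median, stdev

# Exam scores read off the stem-and-leaf display (45 students).
scores = (
    [50] * 2
    + [46, 46, 47, 48, 48, 49]
    + [40, 40, 40, 42, 42, 42, 43, 43]
    + [35] * 4 + [36] * 3 + [37] * 5 + [38] * 4 + [39]
    + [31] * 4 + [32] * 2 + [34] * 2
    + [25, 21, 23, 12]
)

def iqr(data):
    """IQR with quartiles taken as medians of the lower and upper halves
    (assumed convention; the middle value is left out when n is odd)."""
    s = sorted(data)
    half = len(s) // 2
    return median(s[-half:]) - median(s[:half])

def summary(data):
    """Collect the five statistics discussed in the text."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
        "IQR": iqr(data),
        "SD": stdev(data),  # sample SD (n - 1 divisor), an assumption
    }

# Shift every score up by 10 points and compare the two summaries.
shifted = [x + 10 for x in scores]
before = summary(scores)
after = summary(shifted)
for name in before:
    print(f"{name:>6}: {before[name]:6.2f} -> {after[name]:6.2f}")
```

The mean and median lines should show a change of 10 points, while the range, IQR and SD lines should show the same value on both sides of the arrow.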
Scaling data
What if, instead of awarding the students an extra 10 points, the instructor realized he should have made the exam worth 100 points? In that situation he would need to double all of the scores:
10|00
9|224668
8|00044466
7|00002224444466668
6|22224488
5|0
4|26
3|
2|4
Key: 10|0 = 100 points
This is an example of scaling a data set by multiplying (or dividing) each data value by the same amount. Notice that the shape of the data remains unchanged here as well.
Notice that the median has doubled (to 74 points), as has the mean: (100+100+98+...+42+24)/45 = 2(50+50+49+...+21+12)/45 = 2(1677)/45 = 3354/45 ≈ 74.53 points, twice the original mean of 37.27 (up to rounding).
In this situation, though, the range has doubled as well: 100−24 = 2(50)−2(12) = 2(50−12) = 2(38) = 76 points. So has the IQR: 84−68 = 2(42)−2(34) = 2(42−34) = 2(8) = 16 points. The standard deviation doubles, too: each data value is doubled, as is the mean, so each deviation is doubled and each squared deviation is quadrupled, as is their sum; taking the square root turns that quadrupling back into a doubling.
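The reasoning for the standard deviation can be sketched in symbols. The notation is introduced here for illustration: x_i are the scores, x̄ is their mean, s is their standard deviation, and c is the scaling factor. The divisor n−1 is an assumption, since the text does not say which divisor it uses; the conclusion is the same with a divisor of n.

```latex
s_{\text{new}}
  = \sqrt{\frac{\sum_{i=1}^{n} (c\,x_i - c\,\bar{x})^2}{n-1}}
  = \sqrt{\frac{c^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
  = |c|\, s
```

With c = 2, the sum of squared deviations is multiplied by 4 and the standard deviation by 2.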
To summarize: when scaling data, the measures of center (mean, median) are also scaled, and the measures of variation (range, IQR, SD) are too.
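A quick numerical check of this summary is possible, too. The Python sketch below is an illustration under stated assumptions rather than part of the original lesson: it rebuilds the scores from the text of the original stem-and-leaf display, the helper name `spread` is hypothetical, the quartiles are assumed to be the medians of the lower and upper halves of the data, and `stdev` is the sample standard deviation (n−1 divisor).

```python
from statistics import mean, median, stdev

# The original stem-and-leaf display, one row per line (key: 5|0 = 50 points).
display = """5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2"""

# Each leaf digit becomes one score: stem * 10 + leaf.
scores = [
    10 * int(stem) + int(leaf)
    for stem, leaves in (row.split("|") for row in display.splitlines())
    for leaf in leaves
]

# Rescale the exam from 50 points to 100 points.
doubled = [2 * x for x in scores]

def spread(data):
    """Return (range, IQR, SD); quartiles are the medians of the lower
    and upper halves of the sorted data (assumed convention)."""
    s = sorted(data)
    half = len(s) // 2
    iqr = median(s[-half:]) - median(s[:half])
    return max(s) - min(s), iqr, stdev(s)

# Measures of center, before and after doubling.
for label, f in [("mean", mean), ("median", median)]:
    print(f"{label:>6}: {f(scores):6.2f} -> {f(doubled):6.2f}")

# Measures of variation, before and after doubling.
for label, old, new in zip(("range", "IQR", "SD"), spread(scores), spread(doubled)):
    print(f"{label:>6}: {old:6.2f} -> {new:6.2f}")
```

Every line of output should show the second value equal to twice the first, in contrast with the shifted data, where only the mean and median changed.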
Exercises
1. [OIS 1.36] Infant Mortality The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The (relative frequency) histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. (CIA Factbook, Country Comparison: Infant Mortality Rate, 2012)
a) If the infant mortality rate were instead reported in deaths per 100 births, how would the shape of the histogram differ from what you see above?
b) The 5-number summary for deaths per 1000 births is: 1.8, 6.5, 15.6, 42.1, 121.6. Compute the following summary statistics for deaths per 100 live births:
i) median
ii) IQR
iii) range
iv) Q3
c) The mean number of deaths per 1000 live births is 26.7 and the standard deviation is 25.9. Compute the following summary statistics for deaths per 100 live births:
i) mean
ii) standard deviation
d) It's possible to perform transformations on data values other than scaling and shifting. Here is a histogram of the base-10 logarithm of the mortality rates:
(You may have learned that log(10) = 1, log(100) = 2, log(1000) = 3, etc.) Compare this histogram with the original histogram of the untransformed data values.
2. [OIS 1.43] Commuting times The histogram below shows the distribution of mean commuting times (in minutes) for 3,143 U.S. counties during 2010.
a) What would the distribution of mean commute times look like if we converted the units from minutes to hours?
b) How would each of the following summary statistics change after converting from minutes to hours?
i) mean
ii) standard deviation
iii) median
iv) IQR
3. Textbooks The following information about 18 statistics textbooks was retrieved on September 21, 2012 from Amazon.com: author, title, edition, ISBN, number of pages, shipping weight (in pounds), Amazon price (in dollars) and list price (in dollars).
author | title | ed. | ISBN | pages | weight (lb) | price ($) | list ($) |
De Veaux | Intro Stats | 3u | 0321500458 | 864 | 4.4 | 121.88 | 170.00 |
De Veaux | Stats: Data and Models | 3 | 0321692551 | 976 | 4.4 | 123.26 | 170.00 |
Agresti | Statistics: The Art and Science… | 2 | 0135131995 | 848 | 4.2 | 130.99 | 170.00 |
Triola | Elementary Statistics | 11u | 0321694503 | 888 | 4.3 | 155.50 | 170.00 |
McClave | Statistics | 11 | 0132069512 | 864 | 4.4 | 140.74 | 170.00 |
Moore | Intro. to the Practice of Statistics | 7 | 1429240326 | 709 | 3.8 | 132.82 | ------ |
Moore | Basic Practice of Statistics | 6 | 1464102546 | 745 | 3.2 | 129.49 | ------ |
Freund | Modern Elementary Statistics | 12 | 013187439X | 576 | 2.4 | 130.99 | 170.00 |
Bluman | Elementary Statistics | 8 | 0077460391 | 896 | 4.4 | 131.75 | ------ |
Utts | Mind on Statistics | 4 | 0538733489 | 752 | 3.4 | 144.72 | 202.95 |
Johnson | Elementary Statistics | 11 | 0538733500 | 832 | 4.2 | 193.94 | 241.95 |
Freedman | Statistics | 4 | 0393929728 | 720 | 3.0 | 120.62 | 241.95 |
Mendenhall | Intro. to Probability and Statistics | 14 | 1133103758 | 744 | 3.0 | 198.90 | 234.95 |
Larson | Elementary Statistics | 5 | 0321891872 | 352 | 4.0 | 170.90 | 188.33 |
Sullivan | Statistics: Informed Decisions… | 4 | 0321757270 | 960 | 4.7 | 157.08 | 170.00 |
Gould | Introductory Statistics | 1 | 0321322150 | 736 | 3.6 | 124.20 | 170.00 |
Peck | Statistics: The Exploration & Analysis… | 7 | 0840058012 | 816 | 4.0 | 196.35 | 252.95 |
Diez | OpenIntro Statistics | 2 | 1478217200 | 426 | 2.3 | 9.94 | 9.94 |
a) Compute the following summary statistics for the shipping weights of these books:
i) mean
ii) median
iii) standard deviation
iv) IQR
b) If packaging material weighing 0.2 pounds were added to each of the shipping weights, how would that affect the summary statistics you just computed?