Transforming Data

We previously looked at the exam scores for 45 calculus students:

5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2

Key: 5|0 = 50 points 

and considered what would happen if the instructor awarded an extra 10 points to each student in the class:

6|00
5|667889
5|00022233
4|55556667777788889
4|11112244
3|5
3|13
2|
2|2

Key: 6|0 = 60 points 

This is an example of shifting a data set: adding the same amount to (or subtracting it from) each data value. Notice that the shape of the data remains unchanged.

Before the shift, the median (the middle score, the 23rd of the 45 ordered scores) was 37. This score, like all of the others, was shifted up by 10 points, but it is still the middle score, so the new median is 47. In any shifted data set, not just this one, the median is shifted by the same amount as each data value.

The same holds true for the mean. The mean of the original exam scores (37.27 points) is found by adding up all of the scores (to get 1677 total points) and dividing by 45. To get the mean of the new scores, we add them all up, but since each score is 10 points higher than before, our grand total will be 10×45 = 450 points bigger than before, so (1677+450)/45 = 1677/45 + 450/45 = 37.27 + 10 = 47.27, which is exactly 10 points higher than the original mean.
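The same argument works for any data set and any shift. As a sketch, write x_1, ..., x_n for the data values, \bar{x} for their mean, and c for the amount added to each value (these symbols are introduced here for illustration and do not appear in the text above):

```latex
\[
\frac{1}{n}\sum_{i=1}^{n}(x_i + c)
  = \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{nc}{n}
  = \bar{x} + c
\]
```

With n = 45, a total of 1677 points, and c = 10, this is the computation above: 1677/45 + 450/45 = 37.27 + 10 = 47.27.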

What happens to the range? The original range was 50−12 = 38 points and the new range is 60−22 = 38 points, the same as before.

What about the IQR? The original quartiles were 34 and 42, so the original IQR was 42−34 = 8 points. The new quartiles, 44 and 52, are both 10 points higher than before, so their difference is still 52−44 = 8 points.

Finally, the standard deviation was computed by first finding the deviation from the mean (37.27) for each of the 45 exam scores; for example, 50−37.27 = 12.73 points.

The new scores will all be 10 points higher, but the new mean (47.27) is also 10 points higher, so the deviations will be the same; for example, 60−47.27 = 12.73 points. From this stage on, the computation of the standard deviation is exactly the same, hence the standard deviation remains unchanged when data are shifted.
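In symbols, the shift cancels inside every deviation. The sketch below uses x_i for a data value, \bar{x} for the mean, and c for the shift (notation assumed here, not taken from the text), and it assumes the sample standard deviation with divisor n − 1; the same cancellation occurs if the divisor is n.

```latex
\[
(x_i + c) - (\bar{x} + c) = x_i - \bar{x}
\quad\Longrightarrow\quad
s_{\text{shifted}}
  = \sqrt{\frac{\sum_{i=1}^{n}\bigl((x_i + c) - (\bar{x} + c)\bigr)^2}{n-1}}
  = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
  = s
\]
```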

To summarize: when shifting data, the measures of center (mean, median) are also shifted, but the measures of variation (range, IQR, SD) are not.
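The summary can be checked numerically. The following is a minimal Python sketch, not part of the original lesson. The helper names `quartiles` and `summary` are hypothetical, the quartiles are taken as medians of the lower and upper halves (a convention that reproduces the quartiles 34 and 42 used above), and the standard deviation is assumed to be the sample version with divisor n − 1.

```python
from statistics import mean, median, stdev

# The 45 exam scores, read off the first stem-and-leaf display.
scores = (
    [12, 21, 23, 25]
    + [31] * 4 + [32] * 2 + [34] * 2
    + [35] * 4 + [36] * 3 + [37] * 5 + [38] * 4 + [39]
    + [40] * 3 + [42] * 3 + [43] * 2
    + [46] * 2 + [47] + [48] * 2 + [49]
    + [50] * 2
)


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: the middle value is left out of both halves when the
    number of values is odd. Other quartile conventions exist.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


def summary(values):
    """Collect the five statistics discussed in this section."""
    q1, q3 = quartiles(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "IQR": q3 - q1,
        "SD": stdev(values),  # sample SD (divisor n - 1); an assumption
    }


shifted = [x + 10 for x in scores]  # every student gets 10 extra points

before = summary(scores)
after = summary(shifted)
for name in before:
    print(f"{name:>6}: {before[name]:6.2f} -> {after[name]:6.2f}")
```

Running it should show the mean and median each rising by 10 while the range, IQR, and SD stay put.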

Scaling data

What if, instead of awarding the students an extra 10 points, the instructor realized he should have made the exam worth 100 points? In that situation he would need to double all of the scores:

10|00
 9|224668
 8|00044466
 7|00002224444466668
 6|22224488
 5|0
 4|26
 3|
 2|4

Key: 10|0 = 100 points 

This is an example of scaling a data set by multiplying (or dividing) each data value by the same amount. Notice that the shape of the data remains unchanged here as well.

Notice that the median has doubled (to 74 points), as has the mean: (100+100+98+...+42+24)/45 = 2(50+50+49+...+21+12)/45 = 2(1677)/45 ≈ 74.53 points, twice the original mean of 37.27 (up to rounding).

But in this situation, the range has doubled as well: 100−24 = 2(50)−2(12) = 2(50−12) = 2(38) = 76 points. So has the IQR: 84−68 = 2(42)−2(34) = 2(42−34) = 2(8) = 16 points. And so has the standard deviation: each data value is doubled, as is the mean, so each deviation is doubled. The squared deviations are therefore quadrupled, as is their sum, and taking the square root turns that quadrupling into a doubling.
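The standard deviation argument can be written out for a general scale factor. In this sketch k is the factor (k = 2 in the exam example), x_i a data value, \bar{x} the mean, and s the original standard deviation; the notation and the divisor n − 1 are assumptions made here, and the result is the same with divisor n.

```latex
\[
s_{\text{scaled}}
  = \sqrt{\frac{\sum_{i=1}^{n}(k x_i - k\bar{x})^2}{n-1}}
  = \sqrt{\frac{k^2 \sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
  = |k|\, s
\]
```

For k = 2 the factor k^2 = 4 is the quadrupling of the squared deviations, and the square root brings it back to a doubling.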

To summarize: when scaling data, the measures of center (mean, median) are also scaled, and the measures of variation (range, IQR, SD) are too.
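A short Python sketch, added here for illustration, makes the same point by computing the ratio of each statistic after and before doubling. The helper name `quartiles` is hypothetical, quartiles are taken as medians of the lower and upper halves, and the sample standard deviation (divisor n − 1) is assumed.

```python
from statistics import mean, median, stdev

# The 45 original exam scores from the first stem-and-leaf display.
scores = (
    [12, 21, 23, 25]
    + [31] * 4 + [32] * 2 + [34] * 2
    + [35] * 4 + [36] * 3 + [37] * 5 + [38] * 4 + [39]
    + [40] * 3 + [42] * 3 + [43] * 2
    + [46] * 2 + [47] + [48] * 2 + [49]
    + [50] * 2
)


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    Assumption: the middle value is excluded from both halves when the
    number of values is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


doubled = [2 * x for x in scores]  # rescale the exam to 100 points

q1, q3 = quartiles(scores)
dq1, dq3 = quartiles(doubled)

# Ratio of each statistic after scaling to its value before scaling.
ratios = {
    "mean": mean(doubled) / mean(scores),
    "median": median(doubled) / median(scores),
    "range": (max(doubled) - min(doubled)) / (max(scores) - min(scores)),
    "IQR": (dq3 - dq1) / (q3 - q1),
    "SD": stdev(doubled) / stdev(scores),  # sample SD; an assumption
}
for name, ratio in ratios.items():
    print(f"{name:>6}: new / old = {ratio:.2f}")
```

Every ratio comes out as 2.00, in contrast to shifting, where only the measures of center change.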

Exercises

1. [OIS 1.36] Infant Mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The (relative frequency) histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. (CIA Factbook, Country Comparison: Infant Mortality Rate, 2012)

a) If the infant mortality rate were instead reported in deaths per 100 births, how would the shape of the histogram differ from what you see above?

b) The 5-number summary for deaths per 1000 births is: 1.8, 6.5, 15.6, 42.1, 121.6. Compute the following summary statistics for deaths per 100 live births:

i) median

ii) IQR

iii) range

iv) Q3

c) The mean number of deaths per 1000 live births is 26.7 and the standard deviation is 25.9. Compute the following summary statistics for deaths per 100 live births:

i) mean

ii) standard deviation

d) It's possible to perform transformations on data values other than scaling and shifting. Here is a histogram of the base-10 logarithm of the mortality rates:

(You may have learned that log(10) = 1, log(100) = 2, log(1000) = 3, etc.) Compare this histogram with the original histogram of the untransformed data values.

2. [OIS 1.43] Commuting times. The histogram below shows the distribution of mean commuting times (in minutes) for 3,143 U.S. counties during 2010.

a) What would the distribution of mean commute times look like if we converted the units from minutes to hours?

b) How would each of the following summary statistics change after converting from minutes to hours?

i) mean

ii) standard deviation

iii) median

iv) IQR

3. Textbooks. The following information about 18 statistics textbooks was retrieved on September 21, 2012 from Amazon.com: author, title, edition, ISBN, number of pages, shipping weight (in pounds), Amazon price (in dollars) and list price (in dollars).

| author | title | ed. | ISBN | pages | weight (lb) | price ($) | list ($) |
|---|---|---|---|---|---|---|---|
| De Veaux | Intro Stats | 3u | 0321500458 | 864 | 4.4 | 121.88 | 170.00 |
| De Veaux | Stats: Data and Models | | 0321692551 | 976 | 4.4 | 123.26 | 170.00 |
| Agresti | Statistics: The Art and Science… | | 0135131995 | 848 | 4.2 | 130.99 | 170.00 |
| Triola | Elementary Statistics | 11u | 0321694503 | 888 | 4.3 | 155.50 | 170.00 |
| McClave | Statistics | 11 | 0132069512 | 864 | 4.4 | 140.74 | 170.00 |
| Moore | Intro. to the Practice of Statistics | | 1429240326 | 709 | 3.8 | 132.82 | ------ |
| Moore | Basic Practice of Statistics | | 1464102546 | 745 | 3.2 | 129.49 | ------ |
| Freund | Modern Elementary Statistics | 12 | 013187439X | 576 | 2.4 | 130.99 | 170.00 |
| Bluman | Elementary Statistics | | 0077460391 | 896 | 4.4 | 131.75 | ------ |
| Utts | Mind on Statistics | | 0538733489 | 752 | 3.4 | 144.72 | 202.95 |
| Johnson | Elementary Statistics | 11 | 0538733500 | 832 | 4.2 | 193.94 | 241.95 |
| Freedman | Statistics | | 0393929728 | 720 | 3.0 | 120.62 | 241.95 |
| Mendenhall | Intro. to Probability and Statistics | 14 | 1133103758 | 744 | 3.0 | 198.90 | 234.95 |
| Larson | Elementary Statistics | | 0321891872 | 352 | 4.0 | 170.90 | 188.33 |
| Sullivan | Statistics: Informed Decisions… | | 0321757270 | 960 | 4.7 | 157.08 | 170.00 |
| Gould | Introductory Statistics | | 0321322150 | 736 | 3.6 | 124.20 | 170.00 |
| Peck | Statistics: The Exploration & Analysis… | | 0840058012 | 816 | 4.0 | 196.35 | 252.95 |
| Diez | OpenIntro Statistics | | 1478217200 | 426 | 2.3 | 9.94 | 9.94 |

a) Compute the following summary statistics for the shipping weights of these books:

i) mean

ii) median

iii) standard deviation

iv) IQR

b) If packaging material weighing 0.2 pounds were added to each of the shipping weights, how would that affect the summary statistics you just computed?