Transforming Data

We previously looked at the exam scores for 45 calculus students:

5|00
4|667889
4|00022233
3|55556667777788889
3|11112244
2|5
2|13
1|
1|2

Key: 5|0 = 50 points

and considered what would happen if the instructor awarded an extra 10 points to each student in the class:

6|00
5|667889
5|00022233
4|55556667777788889
4|11112244
3|5
3|13
2|
2|2

Key: 6|0 = 60 points

This is an example of shifting a data set: adding the same amount to (or subtracting it from) each data value. Notice that the shape of the data remains unchanged.

Before the shift the median (the middle score, highlighted yellow in the top stem-and-leaf display) was 37; this score, like all of the others, was shifted up by 10 points, but it's still the middle score, so the new median (highlighted yellow in the bottom stem-and-leaf display) is 47. In any shifted data set, not just this one, the median will be shifted by the same amount as each data value.

The same holds true for the mean. The mean of the original exam scores (37.27 points) is found by adding up all of the scores (to get 1677 total points) and dividing by 45. To get the mean of the new scores, we add them all up, but since each score is 10 points higher than before, our grand total will be 10×45 = 450 points bigger than before, so (1677+450)/45 = 1677/45 + 450/45 = 37.27 + 10 = 47.27, which is exactly 10 points higher than the original mean.
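As a quick check of this arithmetic, here is a minimal Python sketch that reads the scores back out of the first stem-and-leaf display and compares the two means. The names `STEM_AND_LEAF` and `read_stem_and_leaf` are made up for this illustration.

```python
# Rows of the first stem-and-leaf display: (stem, string of leaf digits).
STEM_AND_LEAF = [
    (5, "00"),
    (4, "667889"),
    (4, "00022233"),
    (3, "55556667777788889"),
    (3, "11112244"),
    (2, "5"),
    (2, "13"),
    (1, ""),
    (1, "2"),
]

def read_stem_and_leaf(rows):
    """Turn (stem, leaves) rows into scores, e.g. (5, "00") -> [50, 50]."""
    return [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

scores = read_stem_and_leaf(STEM_AND_LEAF)
shifted = [x + 10 for x in scores]  # every student gets 10 extra points

n = len(scores)
total = sum(scores)
old_mean = total / n
new_mean = sum(shifted) / n

print(n, total)                        # 45 1677
print(round(old_mean, 2))              # 37.27
print(round(new_mean, 2))              # 47.27
print(round(new_mean - old_mean, 10))  # 10.0
```

The rounding in the last line only hides floating-point noise; the difference between the two means is exactly the shift of 10 points.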

What happens to the range? The original range was 50−12 = 38 points and the new range is 60−22 = 38 points, the same as before.

What about the IQR? The original quartiles were 34 and 42 (shown in red), so the original IQR was 42−34 = 8 points. The new quartiles, 44 and 52 (also shown in red), are both 10 points higher than before, so their difference is still 52−44 = 8 points.
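The range and IQR claims can be checked with a short Python sketch. It assumes one common quartile convention (the quartiles are the medians of the lower and upper halves of the sorted data, leaving out the middle score when the count is odd); that convention reproduces the quartiles quoted above, but other conventions exist and can give slightly different values. The helper names `quartiles` and `spread` are hypothetical.

```python
from statistics import median

# The 45 exam scores from the first stem-and-leaf display, in ascending order.
scores = (
    [12, 21, 23, 25] + [31] * 4 + [32] * 2 + [34] * 2 + [35] * 4
    + [36] * 3 + [37] * 5 + [38] * 4 + [39] + [40] * 3 + [42] * 3
    + [43] * 2 + [46] * 2 + [47] + [48] * 2 + [49] + [50] * 2
)

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + len(ordered) % 2:]  # skip the middle value if n is odd
    return median(lower), median(upper)

def spread(data):
    """Return (range, IQR) for a list of numbers."""
    q1, q3 = quartiles(data)
    return max(data) - min(data), q3 - q1

shifted = [x + 10 for x in scores]

print(quartiles(scores))   # (34.0, 42.0)
print(quartiles(shifted))  # (44.0, 52.0)
print(spread(scores))      # (38, 8.0)
print(spread(shifted))     # (38, 8.0)
```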

Finally, the standard deviation was computed by first finding the deviation from the mean (37.27) for each of the 45 exam scores; for the top score of 50, for instance, the deviation was 50−37.27 = 12.73 points.

The new scores will all be 10 points higher, but the new mean (47.27) is also 10 points higher, so the deviations will be the same; for example, 60−47.27 = 12.73 points. From this stage on, the computation of the standard deviation is exactly the same as before, so the standard deviation remains unchanged when data is shifted.
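The argument about deviations can be sketched in Python as follows. The text does not say which divisor its standard deviation uses, so the sketch assumes the sample version (dividing by n − 1); the conclusion is the same with divisor n. The function names are illustrative only.

```python
from math import isclose, sqrt

# The 45 exam scores from the first stem-and-leaf display, in ascending order.
scores = (
    [12, 21, 23, 25] + [31] * 4 + [32] * 2 + [34] * 2 + [35] * 4
    + [36] * 3 + [37] * 5 + [38] * 4 + [39] + [40] * 3 + [42] * 3
    + [43] * 2 + [46] * 2 + [47] + [48] * 2 + [49] + [50] * 2
)

def deviations(data):
    """Deviation of each value from the mean of the data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

def standard_deviation(data):
    """Sample SD. Assumption: divisor n - 1, which the text does not specify."""
    squared = [d * d for d in deviations(data)]
    return sqrt(sum(squared) / (len(data) - 1))

shifted = [x + 10 for x in scores]

# Deviation of the top score before and after the shift.
print(round(max(deviations(scores)), 2))   # 12.73
print(round(max(deviations(shifted)), 2))  # 12.73

# Every deviation matches (up to floating-point noise), so the SDs match too.
same_deviations = all(
    isclose(a, b, abs_tol=1e-9)
    for a, b in zip(deviations(scores), deviations(shifted))
)
print(same_deviations)                                                   # True
print(isclose(standard_deviation(scores), standard_deviation(shifted)))  # True
```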

To summarize: when shifting data, the measures of center (mean, median) are also shifted, but the measures of variation (range, IQR, SD) are not.

Scaling data
What if, instead of awarding the students an extra 10 points, the instructor realized he should have made the exam worth 100 points? In that situation he would need to double all of the scores:

10|00
9|224668
8|00044466
7|00002224444466668
6|22224488
5|0
4|26
3|
2|4

Key: 10|0 = 100 points

This is an example of scaling a data set by multiplying (or dividing) each data value by the same amount. Notice that the shape of the data remains unchanged here as well.

Notice that the median has doubled (to 74 points), as has the mean: (100+100+98+...+42+24)/45 = 2(50+50+49+...+21+12)/45 = 2(1677)/45 ≈ 74.53 points, which is twice the original mean of 37.27 (up to rounding).

But in this situation, the range has doubled as well: 100−24 = 2(50)−2(12) = 2(50−12) = 2(38) = 76 points. As has the IQR: 84−68 = 2(42)−2(34) = 2(42−34) = 2(8) = 16 points. And the standard deviation, too: each data value gets doubled, as does the mean, so the deviations are doubled, resulting in the squared deviations being quadrupled, as is their sum, which becomes a doubling once we take the square root.
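To see all five statistics double at once, here is a hedged Python sketch. It assumes the same quartile convention as the quartiles quoted earlier (medians of the lower and upper halves) and uses the sample standard deviation from Python's `statistics.stdev`, since the text does not specify a divisor. The `summary` helper is a hypothetical name.

```python
from math import isclose
from statistics import mean, median, stdev

# The 45 exam scores from the first stem-and-leaf display, in ascending order.
scores = (
    [12, 21, 23, 25] + [31] * 4 + [32] * 2 + [34] * 2 + [35] * 4
    + [36] * 3 + [37] * 5 + [38] * 4 + [39] + [40] * 3 + [42] * 3
    + [43] * 2 + [46] * 2 + [47] + [48] * 2 + [49] + [50] * 2
)

def summary(data):
    """Mean, median, range, IQR and SD, using the assumptions stated above."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + len(ordered) % 2:])
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "range": ordered[-1] - ordered[0],
        "IQR": q3 - q1,
        "SD": stdev(ordered),
    }

before = summary(scores)
after = summary([2 * x for x in scores])  # rescale the exam to 100 points

for name in before:
    ratio = after[name] / before[name]
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f} (ratio {ratio:.2f})")
```

Each printed ratio should come out as 2.00, for the measures of center and the measures of variation alike.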

To summarize: when scaling data, the measures of center (mean, median) are also scaled, and the measures of variation (range, IQR, SD) are too.
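The doubling argument for the standard deviation can also be written out for a general positive scale factor. The symbols below (x_i for the original values, y_i for the scaled values, k for the scale factor, n for the number of values, s for the standard deviation) are notation introduced here, and the divisor n − 1 is an assumption; the same steps work with divisor n.

```latex
\begin{align*}
y_i &= k x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} k x_i = k\bar{x}, \\
s_y &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (k x_i - k\bar{x})^2}
     = \sqrt{\frac{k^2}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = k\, s_x \qquad (k > 0).
\end{align*}
```

With k = 2 this is the exam example: the squared deviations pick up a factor of 4, and the square root turns that back into a factor of 2.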

Exercises

1. [OIS 1.36] Infant Mortality The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The (relative frequency) histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. (CIA Factbook, Country Comparison: Infant Mortality Rate, 2012)

a) If the infant mortality rate were instead reported in deaths per 100 births, how would the shape of the histogram differ from what you see above?

b) The 5-number summary for deaths per 1000 births is: 1.8, 6.5, 15.6, 42.1, 121.6. Compute the following summary statistics for deaths per 100 live births:

i) median

ii) IQR

iii) range

iv) Q₃

c) The mean number of deaths per 1000 live births is 26.7 and the standard deviation is 25.9. Compute the following summary statistics for deaths per 100 live births:

i) mean

ii) standard deviation

d) It's possible to perform transformations on data values other than scaling and shifting. Here is a histogram of the base-10 logarithm of the mortality rates:

(You may have learned that log(10) = 1, log(100) = 2, log(1000) = 3, etc.) Compare this histogram with the original histogram of the untransformed data values.

2. [OIS 1.43] Commuting times The histogram below shows the distribution of mean commuting times (in minutes) for 3,143 U.S. counties during 2010.

a) What would the distribution of mean commute times look like if we converted the units from minutes to hours?

b) How would each of the following summary statistics change after converting from minutes to hours?

i) mean

ii) standard deviation

iii) median

iv) IQR

3. Textbooks The following information about 18 statistics textbooks was retrieved on September 21, 2012 from Amazon.com: author, title, edition, ISBN, number of pages, shipping weight (in pounds), Amazon price (in dollars) and list price (in dollars).

author	title	ed.	ISBN	pages	weight	price	list
De Veaux	Intro Stats	3u	0321500458	864	4.4	121.88	170.00
De Veaux	Stats: Data and Models	3	0321692551	976	4.4	123.26	170.00
Agresti	Statistics: The Art and Science…	2	0135131995	848	4.2	130.99	170.00
Triola	Elementary Statistics	11u	0321694503	888	4.3	155.50	170.00
McClave	Statistics	11	0132069512	864	4.4	140.74	170.00
Moore	Intro. to the Practice of Statistics	7	1429240326	709	3.8	132.82	------
Moore	Basic Practice of Statistics	6	1464102546	745	3.2	129.49	------
Freund	Modern Elementary Statistics	12	013187439X	576	2.4	130.99	170.00
Bluman	Elementary Statistics	8	0077460391	896	4.4	131.75	------
Utts	Mind on Statistics	4	0538733489	752	3.4	144.72	202.95
Johnson	Elementary Statistics	11	0538733500	832	4.2	193.94	241.95
Freedman	Statistics	4	0393929728	720	3.0	120.62	241.95
Mendenhall	Intro. to Probability and Statistics	14	1133103758	744	3.0	198.90	234.95
Larson	Elementary Statistics	5	0321891872	352	4.0	170.90	188.33
Sullivan	Statistics: Informed Decisions…	4	0321757270	960	4.7	157.08	170.00
Gould	Introductory Statistics	1	0321322150	736	3.6	124.20	170.00
Peck	Statistics: The Exploration & Analysis…	7	0840058012	816	4.0	196.35	252.95
Diez	OpenIntro Statistics	2	1478217200	426	2.3	9.94	9.94

a) Compute the following summary statistics for the shipping weights of these books:

i) mean

ii) median

iii) standard deviation

iv) IQR

b) If packaging material weighing 0.2 pounds were added to each of the shipping weights, how would that affect the summary statistics you just computed?
