
Ch. 5 Resources
Chapter 5: Understanding and Comparing Distributions
Below I'll concentrate on using the TI-84 and Data Desk to draw boxplots and other graphical displays for comparing two groups.
Boxplots with the TI-84
For our first example, let's work with the house data set previously encountered in the Chapter 2 and 4 Resources:
house | size | assess | lot | taxes | stories |
20911 | 1561 | 304 | 0.2 | 2604 | 1 |
20912 | 1038 | 297.6 | 0.2 | 280 | 1 |
20918 | 1224 | 289.5 | 0.17 | 2353 | 1 |
20921 | 1232 | 292.8 | 0.17 | 756 | 1 |
20924 | 1995 | 314.6 | 0.17 | 2620 | 2 |
20927 | 1714 | 322.7 | 0.18 | 2632 | 1 |
20930 | 1832 | 336.1 | 0.18 | 2779 | 2 |
21003 | 1095 | 279 | 0.18 | 2321 | 1 |
21006 | 2011 | 319.5 | 0.18 | 2663 | 2 |
21015 | 1366 | 289.3 | 0.18 | 2415 | 1 |
21018 | 1292 | 301.4 | 0.18 | 2477 | 1 |
21023 | 1458 | 314.3 | 0.18 | 1386 | 1 |
21028 | 2031 | 320.9 | 0.18 | 2676 | 2 |
21105 | 1366 | 304 | 0.18 | 2473 | 1 |
If you haven't already done so, enter the assessed value data into a list, say L1. To draw a boxplot of the assessed value data, follow the instructions in the Chapter 4 Resources for making a histogram, but choose the boxplot (or modified boxplot) icon instead of the histogram icon:
Then use ZoomStat to get the boxplot:
We can see a bit more clearly from the boxplot that the data is skewed positively (but notice that we can't tell if the data set is unimodal or bimodal from the boxplot, so we should look at both a histogram and a boxplot whenever possible). Note again that the axis isn't labeled and no scale is indicated, so this would not be a satisfactory graph on a HW solution, exam or project.
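If you'd like to check the numbers behind the boxplot by hand, here is a minimal Python sketch that computes the five-number summary for the assessed values in the table above. Quartile conventions differ between tools; the median-of-halves rule used here is my assumption about what the TI-84 does, and Data Desk may report slightly different quartiles. The function name `five_number_summary` is just an illustrative choice.

```python
from statistics import median

# Assessed values copied from the "assess" column of the table above.
assess = [304, 297.6, 289.5, 292.8, 314.6, 322.7, 336.1,
          279, 319.5, 289.3, 301.4, 314.3, 320.9, 304]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the middle position
    upper = data[len(data) - half:]  # values above the middle position
    return data[0], median(lower), median(data), median(upper), data[-1]


summary = five_number_summary(assess)
print(summary)
```

The five values correspond to the left whisker end, the left edge of the box, the line inside the box, the right edge of the box and the right whisker end (when there are no outliers).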
Boxplots with Data Desk
To use a computer to make a boxplot, use Data Desk. Import the houses.txt data file (from the preceding link or from the Data Sets folder in the online classroom) into Data Desk, as we did in the Chapter 4 Resources. Click on the assess variable so that the variable's icon has a Y over it:
then click on Plot and select BoxPlot Side by Side:
You should see something like this:
You can adjust the plot options by clicking on the HyperView menu (the triangle in the upper-left corner of the boxplot window) and selecting BoxPlot Options:
If you see some strange shading on your boxplot, I would recommend selecting Do not display 95% C.I.'s for comparing medians:
since you have no idea what this means yet; you can also select Set Defaults to make this the default display option.
As with the histogram in Chapter 4, you can make the boxplot window larger by clicking on the lower right corner of the window and dragging it across the screen. The variable name in our Data Desk boxplot is labeled and a scale is indicated on the axis, which is better than the TI-84, but the units are still missing. This would be better:
although I again had to use an image editor to add the label.
More boxplots
A boxplot of the 2006 property tax data for these homes reveals three outliers:
so we should report the median and IQR for the property tax variable, not the mean and standard deviation. If you do see a major outlier, you should investigate it: if it was the result of a data-entry error, you should correct it; if it was something that never should have been included in the data set in the first place (such as the age of the teacher in a data set consisting of the ages of students in a second-grade class), you can remove it; if it was reported in the wrong units (e.g. someone reporting their height in feet rather than inches) you can convert to the proper units. But you should never remove a data point just because it's an outlier.
You might, however, decide to report the summary statistics both with the outlier included and with it omitted. In the property tax data set, three of the homes are owned by senior citizens who participate in a program that freezes their property taxes (although they or their estate have to pay all of the deferred taxes when the home is sold). This explains the outliers, so we might choose to analyze the remaining 12 homes; if the remaining data is roughly unimodal and symmetric, then we could report the mean and standard deviation for the property taxes of a homeowner in this neighborhood not involved in the deferred-tax program.
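The outlier rule and the "with and without" comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the 14 rows shown in the table above (so the counts may not match the full houses.txt file), the usual 1.5×IQR fences, and a median-of-halves quartile rule that may differ slightly from what your software uses. The helper name `quartiles` is hypothetical.

```python
from statistics import mean, median, stdev

# Property taxes copied from the "taxes" column of the table above.
taxes = [2604, 280, 2353, 756, 2620, 2632, 2779,
         2321, 2663, 2415, 2477, 1386, 2676, 2473]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])


q1, q3 = quartiles(taxes)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged for investigation, not deleted.
outliers = sorted(t for t in taxes if t < lower_fence or t > upper_fence)
kept = [t for t in taxes if lower_fence <= t <= upper_fence]

print("median and IQR, all homes:", median(taxes), iqr)
print("flagged as outliers:", outliers)
print("mean and SD, outliers set aside:",
      round(mean(kept), 1), round(stdev(kept), 1))
```

Whether the mean and standard deviation in the last line are appropriate still depends on the shape of the remaining data, so look at a histogram of the values in `kept` before reporting them.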
Comparing groups
Use Data Desk to create a histogram of the size data from the houses.txt data set. You should get something like this:
which appears bimodal. We certainly shouldn't report the mean and standard deviation for a variable like this. In fact, there may be two separate groups here.
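To see the two clusters without Data Desk, here is a rough text histogram of the size values from the table above. The bin start (900) and bin width (300) are my own assumptions chosen for illustration; Data Desk picks its own bins, so its picture may look somewhat different.

```python
from collections import Counter

# House sizes copied from the "size" column of the table above.
sizes = [1561, 1038, 1224, 1232, 1995, 1714, 1832,
         1095, 2011, 1366, 1292, 1458, 2031, 1366]

BIN_START = 900   # assumed left edge of the first bin
BIN_WIDTH = 300   # assumed bin width

# Count how many houses fall into each bin (bin 0, bin 1, ...).
counts = Counter((s - BIN_START) // BIN_WIDTH for s in sizes)

for index in range(max(counts) + 1):
    left = BIN_START + index * BIN_WIDTH
    print(f"{left:4d}-{left + BIN_WIDTH - 1:4d} | {'*' * counts[index]}")
```

With these bins the counts rise, fall and rise again, which is the pattern that suggests two separate groups.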
With the histogram still open, double-click on the stories variable to open up the variable that lists the number of stories in each house.
Now click on Modify and then Palettes to open up the Data Desk palettes (if some things disappear instead of appear, then click this again to make them reappear).
Click on the knife symbol to select it:
Next hold down the SHIFT key and click on the rightmost bar of the histogram:
You should see that all the houses in this upper group correspond to the 2-story houses in the data set. Perhaps it would be wise to investigate the 1-story and 2-story houses separately.
Click on the size variable to select it as Y, then hold down the SHIFT key and click on the stories variable to select it as X:
Now click on Plot and select Boxplot y by x:
You should see side-by-side boxplots, like this:
Clearly the 2-story houses are bigger than the 1-story houses, which is not terribly surprising! You can make side-by-side boxplots on the TI-84 as well, but you'll need to enter the data by hand: put the 1-story house sizes into one list and set up a boxplot of it as Plot1 (as described above), then put the 2-story house sizes into another list and set up a second boxplot as Plot2. When you press GRAPH you should see both boxplots.
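Splitting one variable by another, which is what Boxplot y by x does for you (and what you do by hand on the TI-84 with two lists), can be sketched in Python as follows. The (size, stories) pairs come from the table above; the variable names are illustrative only.

```python
from statistics import median

# (size, stories) pairs copied from the table above.
houses = [(1561, 1), (1038, 1), (1224, 1), (1232, 1), (1995, 2),
          (1714, 1), (1832, 2), (1095, 1), (2011, 2), (1366, 1),
          (1292, 1), (1458, 1), (2031, 2), (1366, 1)]

# Build one list of sizes per number of stories.
groups = {}
for size, stories in houses:
    groups.setdefault(stories, []).append(size)

for stories, sizes in sorted(groups.items()):
    print(f"{stories}-story: n={len(sizes)}, min={min(sizes)}, "
          f"median={median(sizes)}, max={max(sizes)}")
```

In this table the smallest 2-story house is larger than the largest 1-story house, so the two boxplots do not overlap at all.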
Homework
Work the following problems in Chapter 5: 11, 13, 15, 23, 27, 29, 33, 43 and 47. (As usual, you are encouraged to work additional problems.)
Errata
Although the text doesn't mention it, the wind speed data set introduced on page 88 is on the DVD.
Likewise, the roller coaster data set introduced on page 90 is also on the DVD...
...as is the coffee mug data set introduced on page 93...
...and the late arrival data on page 95...
...as well as the CEO data on page 101...
...and the cotinine data on page 103.
On page 90, the lower fence computation should read:
Lower fence = Q1 − 1.5×IQR = 1.15 − 1.5×1.78 = −1.52 mph
On page 111, part b of Exercise 11 should have a question mark (not a period) at the end of the sentence.
The music library data set mentioned in Exercise 50 is on the DVD, even though the orange T symbol is missing.
ActivStats
Work the activities on pages 5-1 through 5-2 in the ActivStats lesson book, as time permits.
Additional Resources
- Describing Distributions
  - Episode 3 from Against All Odds includes a discussion of boxplots.
- Carnegie Mellon: Introduction to Statistics
  - This open source course has a lesson about boxplots.
- Boxplot tool
  - A Java applet for creating boxplots.
- TI-83/84 Troubleshooting
  - Guide to some common errors encountered when using the TI-84.