Ch. 3 Resources

Chapter 3: Displaying and Describing Categorical Data

Categorical vs. Quantitative

This chapter deals exclusively with graphing categorical data, while the next chapter will deal with graphing quantitative data. Don't assume, however, that all of the data in the exercises for this chapter is of the appropriate type. You should always check that data is categorical before constructing a bar chart or pie chart.

Pepsi vs. Coke (revisited)

The Chapter 2 Resources contain an example about a survey administered to 41 Statistics students. For each student, his or her gender was recorded, along with whether the student preferred Coke-brand beverages, Pepsi-brand beverages, or neither. For reference, here is the data set one more time:

gender	beverage
female	coke
female	coke
female	coke
female	coke
female	coke
female	coke
female	coke
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	pepsi
male	pepsi
male	pepsi
male	pepsi
male	neither
male	neither
male	neither

Let's investigate the beverage preference of these 41 students. We can count that 16 prefer Coke and 14 prefer Pepsi, while 11 prefer "neither." We can summarize these counts using a frequency table:

beverage	count
Coke	16
Pepsi	14
neither	11
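
If you would rather let a computer do the counting, here is a minimal sketch in Python (an assumption on my part; the course itself uses the TI-84 and Data Desk). The list `beverages` is simply rebuilt from the counts above, and the variable names are only illustrative.

```python
from collections import Counter

# Rebuild the 41 beverage responses from the data set above.
beverages = ["coke"] * 16 + ["pepsi"] * 14 + ["neither"] * 11

# Counter tallies how many times each category appears.
counts = Counter(beverages)

# Print a tab-separated frequency table, largest count first.
print("beverage\tcount")
for beverage, count in counts.most_common():
    print(f"{beverage}\t{count}")
```

The counts should always add up to the number of cases (41 here), which is a quick way to catch a tallying mistake.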

We are analyzing one categorical variable (beverage) here, so the Categorical Data Condition is satisfied and we can display this data using a bar chart or a pie chart. It's not difficult to make a bar chart by hand, especially with the aid of some graph paper:

[Figure: bar chart of the beverage data]

Notice that we label each bar to indicate the category it represents.

Using technology we can also make a pie chart of this data:

[Figure: pie chart of the beverage data]

but notice that in the pie chart it is very difficult to tell whether more students prefer Coke or Pepsi, while in the bar chart we can easily see that Coke is preferred slightly more than Pepsi among these 41 students. For this reason (and because they're easier to draw) Statisticians prefer bar charts over pie charts.

Your TI-84 calculator can't make either one of these charts, but you can use your computer keyboard to create a decent-looking bar chart:

Coke	||||||||||||||||
Pepsi	||||||||||||||
neither	|||||||||||

Here I've just used the | symbol (you can find it above the ENTER key on most keyboards), with one instance for each person. You can use another symbol but in most fonts | takes up the least space.
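
The same keyboard bar chart can be generated rather than typed. This is a rough sketch in Python, assuming the counts from the frequency table; the function name `text_bar_chart` is hypothetical and chosen just for this example.

```python
# Counts taken from the frequency table for the beverage variable.
counts = {"Coke": 16, "Pepsi": 14, "neither": 11}

def text_bar_chart(counts, symbol="|"):
    """Return one line per category: label, a tab, then one symbol per person."""
    return [f"{label}\t{symbol * n}" for label, n in counts.items()]

chart_lines = text_bar_chart(counts)
print("\n".join(chart_lines))
```

Passing a different `symbol` (say `"#"`) swaps the character used for each person.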

Gender vs. beverage

Now let's investigate both of the categorical variables (gender and beverage) in our original data set simultaneously. We might ask the question, "Are the beverage preferences the same for males and females?" Or, in other words, "Is beverage preference independent of gender?"

One way to begin investigating this question is to create a two-way table, which we can do quite easily in this case by hand:

	female	male
Coke	7	9
Pepsi	10	4
neither	8	3
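
For larger data sets, counting by hand gets tedious. Here is one possible way to build the same two-way table in Python (my choice of tool, not something the textbook requires). The list `students` is rebuilt from the data set above, and all names are illustrative.

```python
from collections import Counter

# (gender, beverage) pairs rebuilt from the data set above.
students = (
    [("female", "coke")] * 7 + [("female", "pepsi")] * 10 + [("female", "neither")] * 8
    + [("male", "coke")] * 9 + [("male", "pepsi")] * 4 + [("male", "neither")] * 3
)

# Tally each (gender, beverage) combination.
cells = Counter(students)

genders = ["female", "male"]
beverages = ["coke", "pepsi", "neither"]

# One row per beverage category, one column per gender category.
two_way = {b: {g: cells[(g, b)] for g in genders} for b in beverages}

print("\t" + "\t".join(genders))
for b in beverages:
    print(b + "\t" + "\t".join(str(two_way[b][g]) for g in genders))
```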

Note that each category of the beverage variable gets its own row, and each category of the gender variable gets its own column. This two-way table is a special type of two-way table called a contingency table: it is the result of surveying a single group (the 41 Statistics students) and classifying them according to two variables (gender and beverage). If we had considered two or more groups (say students in a Statistics class, students in a Calculus class and students in a Differential Equations class) and classified them according to a single variable (beverage preference) then we would have a two-way table but not a contingency table. (This difference is a subtle one, to be sure, but will become important in Chapter 26.)

What does the data tell us in contingency table form that we couldn't see before? It appears that males are more likely to prefer Coke and females more likely to prefer Pepsi or "neither." But is this difference significant? In other words, are the differences between males and females big enough to convince us that there really is a difference in beverage preferences between males and females? Or could male and female beverage preferences be about the same, but we just happened to get a class with fewer male Pepsi drinkers than we would otherwise expect? This is a key question that we will spend most of this course developing techniques to answer systematically. Yet even now we can do a pretty good job of answering this question by following the Three Rules of Data Analysis: Make a picture, make a picture, make a picture!

Before we get to making an appropriate graphical display, however, let's spend a bit more time looking at this two-way table. Often it is helpful to include subtotals for each row and column, as follows:

	female	male	total
Coke	7	9	16
Pepsi	10	4	14
neither	8	3	11
total	25	16	41

Note that the subtotals in the rightmost column are the same as in our original frequency table for the beverage variable; this is called a marginal distribution of the beverage variable (since it occurs in the margin of the table). The marginal distribution of the gender variable can be found in the bottom row of the table. Note that the subtotals for the beverage categories sum to 41 (the total number of cases), as do the subtotals for the gender variable. The subtotals allow us to compute relevant percentages more easily. For example:

What percentage of males drink Pepsi? 4/16 = 25%

What percentage of Pepsi drinkers are male? 4/14 ≈ 28.6%
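
The two percentages above differ only in which subtotal goes in the denominator. The following sketch in Python (an assumed tool; variable names are illustrative) computes the subtotals from the two-way table and then both percentages.

```python
# Counts from the two-way table.
two_way = {
    "Coke": {"female": 7, "male": 9},
    "Pepsi": {"female": 10, "male": 4},
    "neither": {"female": 8, "male": 3},
}

# Row subtotals: the marginal distribution of the beverage variable.
row_totals = {b: sum(row.values()) for b, row in two_way.items()}
# Column subtotals: the marginal distribution of the gender variable.
col_totals = {g: sum(row[g] for row in two_way.values()) for g in ("female", "male")}

# Same numerator (male Pepsi drinkers), different denominators.
pct_males_pepsi = 100 * two_way["Pepsi"]["male"] / col_totals["male"]
pct_pepsi_male = 100 * two_way["Pepsi"]["male"] / row_totals["Pepsi"]

print(f"{pct_males_pepsi:.1f}% of males prefer Pepsi")
print(f"{pct_pepsi_male:.1f}% of Pepsi drinkers are male")
```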

We could also isolate the Pepsi drinkers:

	female	male	total
Pepsi	10	4	14

or the males:

	male
Coke	9
Pepsi	4
neither	3
total	16

In each of these cases we call the isolated row or column a conditional distribution. The first of these (with the Pepsi drinkers isolated) is the distribution of the gender variable under the condition that we only look at the Pepsi drinkers; the second is the distribution of beverage preference under the condition that the students are male.

Counts vs. percentages

Now look back at our original two-way table:

	female	male	total
Coke	7	9	16
Pepsi	10	4	14
neither	8	3	11
total	25	16	41

One thing preventing us from easily comparing males to females is the fact that there are 25 females in the class but only 16 males. We can see there are more female students who prefer Pepsi than male students, but is that merely due to the fact that there are more females in the class? To better compare males to females we can look at percentages instead of counts. Noticing that there are 25 females in the class, we can compute that 7/25 = 28% prefer Coke, while 10/25 = 40% prefer Pepsi and 8/25 = 32% prefer neither; likewise, among the males, 9/16 ≈ 56% prefer Coke, while 4/16 = 25% prefer Pepsi and 3/16 ≈ 19% prefer neither. Replacing the counts in the original two-way table with these percentages, we have:

	female	male
Coke	28%	56%
Pepsi	40%	25%
neither	32%	19%
total	100%	100%
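
Here is a short sketch of the same calculation in Python (an assumption; any spreadsheet would do as well). Each count is divided by its column total, so each column becomes the conditional distribution of beverage preference for that gender. Names are illustrative.

```python
# Counts from the two-way table.
two_way = {
    "Coke": {"female": 7, "male": 9},
    "Pepsi": {"female": 10, "male": 4},
    "neither": {"female": 8, "male": 3},
}
genders = ("female", "male")

# Column totals: 25 females and 16 males.
col_totals = {g: sum(row[g] for row in two_way.values()) for g in genders}

# Divide each count by its column total and round to the nearest percent.
col_pct = {
    b: {g: round(100 * row[g] / col_totals[g]) for g in genders}
    for b, row in two_way.items()
}

print("\t" + "\t".join(genders))
for b, row in col_pct.items():
    print(b + "\t" + "\t".join(f"{row[g]}%" for g in genders))
```

Rounded percentages do not always add to exactly 100% in a column, although they happen to do so for this data.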

The female percentages are quite different from the male percentages. Because of this, we say that there is evidence that gender and beverage preference are not independent. Notice the phrase "There is evidence…" in this statement; we aren't absolutely sure this is true (and other groups of students may yield different results) but the evidence we have indicates there might be some association between gender and beverage preference. Note also that we say these two variables are "not independent" rather than saying they are "dependent"; the word "dependent" implies that one variable depends on the other, or in other words that there is a cause-and-effect relationship between the two variables. This may not be the case. In fact, it is quite unlikely that being female causes you to prefer Pepsi (it's more likely that certain soft drink companies target males or females in their advertising) and it's extremely unlikely that drinking Coke or Pepsi causes you to become male or female! We can say "there is an association" between the two variables or that the variables are "not independent" but we should not use the word "dependent" here.

Mosaic plots

Now it's time to draw a picture. The textbook discusses a couple of different types of plots you could use in this situation (segmented bar charts or side-by-side bar charts) but here we'll look at a newfangled plot that few Statistics books mention, but which is quite easy to draw and very useful. We begin with a square:

[Figure: first step in mosaic plot]

For the purposes of this discussion, let's assume that the dimensions of the square are 100 mm by 100 mm. Now let's split the square into males and females. Because 25/41 ≈ 61% of the class is female (so that 39% is male), we'll split the square into pieces that are 61 mm wide and 39 mm wide:

[Figure: second step in mosaic plot]

Now, let's split the female rectangle on the left into pieces that are 28 mm high, 40 mm high and 32 mm high to represent the percentages of females who prefer Coke, Pepsi or neither, respectively. We'll color-code these using red for Coke, blue for Pepsi and gray for neither:

[Figure: third step in mosaic plot]

Likewise, let's split the male rectangle on the right into pieces that are 56 mm high, 25 mm high and 19 mm high, using the same color-coding:

[Figure: fourth step in mosaic plot]

Finally, we can add some labels:

[Figure: completed mosaic plot]

We don't expect the male and female rectangles to be the same width, because there's no reason to think that there would be an equal number of males and females in the class. But if gender and beverage preference were independent (in other words, if there were no association between gender and beverage preference) we would expect the heights of the Coke, Pepsi and neither rectangles to be roughly the same for the males as they are for the females. This is obviously not the case, so we conclude from the mosaic plot that there is evidence that gender and beverage preference are not independent. (We could also say that there is evidence of an association between gender and beverage preference.)
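
The measurements used in the mosaic plot can be worked out for any two-way table. What follows is a minimal sketch in Python (an assumed tool), using the 100 mm square from the discussion above. The function name `mosaic_dimensions` is hypothetical; it only computes the rectangle sizes and does not draw anything.

```python
# Counts from the two-way table.
two_way = {
    "Coke": {"female": 7, "male": 9},
    "Pepsi": {"female": 10, "male": 4},
    "neither": {"female": 8, "male": 3},
}

def mosaic_dimensions(two_way, side_mm=100):
    """Return {gender: (width_mm, {beverage: height_mm})} for a square mosaic plot."""
    genders = list(next(iter(two_way.values())))
    col_totals = {g: sum(row[g] for row in two_way.values()) for g in genders}
    total = sum(col_totals.values())
    dims = {}
    for g in genders:
        # Width is proportional to the share of the whole class in this gender.
        width = side_mm * col_totals[g] / total
        # Heights are proportional to beverage shares within this gender.
        heights = {b: side_mm * row[g] / col_totals[g] for b, row in two_way.items()}
        dims[g] = (width, heights)
    return dims

dims = mosaic_dimensions(two_way)
for gender, (width, heights) in dims.items():
    parts = ", ".join(f"{b} {h:.0f} mm" for b, h in heights.items())
    print(f"{gender}: {width:.0f} mm wide; heights: {parts}")
```

Rounded to the nearest millimeter, this reproduces the 61 mm and 39 mm widths and the heights used in the steps above. If the two variables were independent, the heights for the two genders would be roughly equal.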

ActivStats

The ActivStats DVD offers guidance in using Data Desk with categorical variables. If time permits, view the lessons on pages 3-1 and 3-2.

Exercises

Work exercises 5, 7, 13, 19, 25, 27 and 35 in Chapter 3. (You are of course encouraged to work additional problems.)

Errata

The Titanic data set first mentioned on page 20 is on the DVD, but not among the Chapter 3 data sets; instead, it is listed as Ch26_Titanic.txt and may be found among the Chapter 26 data sets.

The fish diet data set first mentioned on page 31 is on the DVD, but not among the Chapter 3 data sets; instead, it is listed as Ch26_Fish_diet.txt and may be found among the Chapter 26 data sets.

The final bullet point on page 36 should read "there is evidence that the variables are independent."

In the definition of Frequency table on page 36, a right parenthesis is missing after the word "percentage" at the end of the first line.

Exercise 1 should read: "Find a bar chart of categorical data..." (not "bar graph").

The orange T symbol is missing next to Exercise 18 (the data set is in fact on the DVD).

The answer in the back of the book for part d i) of Exercise 23 should be 35.7% (not 60%).

The data set for Exercise 29 is on the DVD, but only in the Intro Stats 2e folder. You can also access it in the Data Sets folder here on WAMAP.

Exercise 30 repeats parts c and d of Exercise 28.

Part c of exercise 35 should read: "Compare these distributions with a segmented bar chart." (not "segmented bar graph").

Part d in exercise #43 should read "Explain." (Not "Explain?")

Exercise #44 should read "sales (in millions of dollars) by region" (not "percentages by region").

Additional Resources

"The Question of Causation": Episode 11 from Against All Odds explores some of the ideas of Chapter 3, including using segmented bar charts to do the same sort of analysis we did with the mosaic plot above.
Introduction to Statistics: Carnegie Mellon's open source course has a lesson called "One Categorical Variable" that may also be of interest (see Unit 2, Module 1).
Bar chart tool: A Java applet for creating bar charts.
Pie chart tool: A Java applet for creating pie charts.
Create a Graph: Useful site for creating various types of graphs online.
Simpson's paradox: Another example of Simpson's paradox.
"Americans and the Super Bowl Phenomenon": Results of the Gallup Poll first mentioned on page 26.
"Cigarette Brand Preferences Among Adolescents" (PDF, 176K): Lloyd D. Johnston, Patrick M. O'Malley, Jerald G. Bachman, John E. Schulenberg.
Monitoring the Future survey referenced in Exercises 9 and 21.
"Polls show paranormal beliefs on the rise, evolution belief on the decline" (PDF, 21K): Skeptic, Vol. 9 No. 1, page 10
Article about the Gallup poll referenced in Exercise 15.
"ATF report renews calls for gun control" (PDF, 18K): Gary Fields
USA Today, June 22, 2000
Article referenced in Exercise 16.
International Tanker Owners Pollution Federation Limited: Web site with data referenced in Exercise 17.
"Global Warming: A Divide on Causes and Solutions": Results of a Pew Center survey mentioned in Exercise 19.
"Complications from therapeutic modalities: results of a national survey of athletic trainers": S. Nadler, et al.
Archives of Physical Medicine and Rehabilitation, Volume 84, Issue 6, Pages 849–853
Abstract of article mentioned in Exercise 20.
"Reasons for Effects": Paul R. Rosenbaum.
Chance, Winter 2005, pp. 5–10
Article mentioned in Exercise 22.
"Trends in Twin Birth Outcomes and Prenatal Care Utilization in the United States, 1981–1997": JAMA, 2000;284:335–341
Abstract of article mentioned in Exercise 34.
"Fluoxetine After Weight Restoration in Anorexia Nervosa: A Randomized Controlled Trial": JAMA, 2006;295:2605–2612.
Article about study mentioned in Exercise 37.
"Effect of Selective Serotonin Reuptake Inhibitors on the Risk of Fracture": Arch Intern Med. 2007;167(2):188–194.
Article about study mentioned in Exercise 38.
The Gallup Poll: Public Opinion: George Gallup Jr. 2001. pp. 111–112.
Book containing April 2001 Gallup Poll results mentioned in Exercise 41.
"Commercial tattooing as a potentially important source of hepatitis C infection": RW Haley, RP Fischer.
Medicine. 2001;80(2):134–151.
Abstract of article mentioned in Exercise 42.
"Sex Bias in Graduate Admissions: Data from Berkeley": PJ Bickel, EA Hammel, JW O'Connell.
Science. 1975;187:398–404.
Abstract of article mentioned in Exercise 47.
