Ch. 2 Resources

Chapter 2: Data

The Six W's

Most of the problems in this chapter involve identifying what the authors of the textbook call The W's: who, what, where, when, why and how. These are the same fundamental questions a journalist might ask when investigating a story for a newspaper article. More specifically, we need to be able to answer the following questions:

Who was studied? (As opposed to who did the studying.)
- Cases are the individual elements of the "Who"
What was measured or recorded about the "Who"? (Where appropriate, include the units!)
- Variables are the individual elements of the "What" (Include the type!)
Where was the data collected?
When was the data collected?
Why was the data collected?
How was the data collected?

The Who and What are even more important than the others—if you don't know "who" or "what" you probably don't have any context in which to perform a statistical analysis. But the other W's are important, too—we won't always be able to answer all six of these questions, but knowing that we don't know the answer to a question often tells us something very valuable and leads us to ask other important questions.

Houses

To use as an example for this chapter, I collected some data about the single-family residences on the street where I live in Edmonds. I collected this data on October 3, 2006 from the Web site of the Snohomish County Assessor. For each house I recorded the house number, the size (in square feet), the 2007 assessed value (in thousands of dollars), the lot size (in acres), the 2006 taxes (in dollars) and the number of stories. Here is the complete data set:

house	size	assess	lot	taxes	stories
20911	1561	304	0.2	2604	1
20912	1038	297.6	0.2	280	1
20918	1224	289.5	0.17	2353	1
20921	1232	292.8	0.17	756	1
20924	1995	314.6	0.17	2620	2
20927	1714	322.7	0.18	2632	1
20930	1832	336.1	0.18	2779	2
21003	1095	279	0.18	2321	1
21006	2011	319.5	0.18	2663	2
21015	1366	289.3	0.18	2415	1
21018	1292	301.4	0.18	2477	1
21023	1458	314.3	0.18	1386	1
21028	2031	320.9	0.18	2676	2
21105	1366	304	0.18	2473	1

Let's identify the W's for this data set:

Who: 14 houses
- Cases: each house is a case
What: house number, size, 2007 assessed value, lot size, 2006 taxes, number of stories
- Variable: house number; Type: categorical (identifier?)
- Variable: size; Type: quantitative; Units: square feet
- Variable: 2007 assessed value; Type: quantitative; Units: thousands of dollars
- Variable: lot size; Type: quantitative; Units: acres
- Variable: 2006 taxes; Type: quantitative; Units: dollars
- Variable: stories; Type: ordinal
When: data was collected on October 3, 2006
Where: Edmonds, WA
Why: to use as an example for this class
How: data was found on the Web site of the Snohomish County Assessor

Notice that there are 14 rows in the data table given above (not including the header row with the variable names) and that each row in the table corresponds to a case (that is, the "Who" corresponds to the rows). There are six columns and each column contains information about one variable (that is, the "What" corresponds to the columns).
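The correspondence between rows and cases, and between columns and variables, can be checked with a few lines of code. The following is a minimal sketch in Python (an assumption on my part; this course uses Data Desk, not Python). The names HOUSES, variables and cases are made up for illustration, and the data is copied from the table above.

```python
# House data copied from the table above (whitespace-separated).
HOUSES = """\
house size assess lot taxes stories
20911 1561 304 0.2 2604 1
20912 1038 297.6 0.2 280 1
20918 1224 289.5 0.17 2353 1
20921 1232 292.8 0.17 756 1
20924 1995 314.6 0.17 2620 2
20927 1714 322.7 0.18 2632 1
20930 1832 336.1 0.18 2779 2
21003 1095 279 0.18 2321 1
21006 2011 319.5 0.18 2663 2
21015 1366 289.3 0.18 2415 1
21018 1292 301.4 0.18 2477 1
21023 1458 314.3 0.18 1386 1
21028 2031 320.9 0.18 2676 2
21105 1366 304 0.18 2473 1
"""

lines = HOUSES.strip().splitlines()

# The header row names the variables (the "What").
variables = lines[0].split()

# Every remaining row is one case (the "Who").
cases = [line.split() for line in lines[1:]]

print(len(cases), "cases")                     # 14 cases
print(len(variables), "variables:", variables)
```

The header row is set aside before counting, which is why the number of cases is 14 rather than 15.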

Notice also that the "Who" isn't a group of people in this case; it's a group of houses. The only person mentioned in the information given with the data set is the instructor for this course, but although I gathered the data, I'm not who is being studied here; the houses are.

The house numbers, although they consist of numbers, are categorical, not quantitative. (We might consider them to be an identifier since each house on this street has a unique number, but only if we were interested in houses on this street and nowhere else in the city of Edmonds; if we looked at houses on the adjacent street we might find a house with the same number as one of these 14.)
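One quick way to test whether a variable can serve as an identifier is to check that no value repeats. Here is a minimal Python sketch of that check (Python is an assumption; the names house_numbers and is_unique are made up for illustration), using the house numbers from the table above.

```python
# House numbers copied from the "house" column of the table above.
house_numbers = [
    20911, 20912, 20918, 20921, 20924, 20927, 20930,
    21003, 21006, 21015, 21018, 21023, 21028, 21105,
]

# An identifier must be unique: no two cases may share a value.
# A set drops duplicates, so the lengths match only if nothing repeats.
is_unique = len(set(house_numbers)) == len(house_numbers)

print("unique within this data set:", is_unique)  # True
```

Passing this check only shows that the numbers are unique among these 14 houses; it says nothing about houses on other streets.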

The next four variables are quantitative; notice that we specify the units for each one. Without units a quantitative variable is meaningless. (If you don't believe me, just ask NASA.)

The final variable, number of stories, could be considered quantitative, but since it only takes on two values in this data set, we essentially have two groups here (one-story houses and two-story houses) so it's really functioning as a categorical variable (albeit an ordinal one) in this context. If we had collected data about buildings in downtown Seattle, which might be as short as a single story or as tall as the 76-story Columbia Center, we would probably consider the number of stories to be a quantitative variable.
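To see why the number of stories behaves like a categorical variable here, we can list the distinct values it takes and how many houses fall into each group. This is a minimal Python sketch (Python is an assumption, and the names stories and group_sizes are made up for illustration); the values are copied from the last column of the table above, in row order.

```python
from collections import Counter

# "stories" column from the table above, in row order.
stories = [1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1]

# Count how many houses fall into each distinct value.
group_sizes = Counter(stories)

print(sorted(group_sizes.items()))  # [(1, 10), (2, 4)]
```

With only two distinct values, the variable simply splits the 14 houses into a group of 10 and a group of 4.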

Although the "Who" and "What" are the most vital W's, the others are important as well. If we didn't know "Where" these houses were located the information wouldn't be very useful. It might also be helpful to know exactly where in Edmonds these houses are located; given the assessed values, they're certainly not near Puget Sound—if they had views, the values would be at least double what they are here!

The "When" is important, too: if we want to use this information to learn about houses and taxes in Edmonds, it wouldn't do us much good to have tax data from 1960, when many of these houses were originally built. The "Why" may not be terribly interesting in this case, but if the data was collected to argue against a property-tax increase, or to argue that such an increase would not negatively impact homeowners in Edmonds, we might wonder if these houses were representative of all houses subject to the proposed tax. The "How" allows someone to check the numbers by visiting the Web site and accessing the data at the original source.

We'll revisit this data set in later chapters when we examine quantitative variables.

Coke vs. Pepsi

Let's try one more example. On September 18, 2006, I administered a survey to a Statistics class. The survey asked several questions, among them the gender of the student (male or female) and whether students preferred Coke-brand beverages, Pepsi-brand beverages, or neither. Here is the data set:

gender	beverage
female	coke
female	coke
female	coke
female	coke
female	coke
female	coke
female	coke
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	pepsi
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
female	neither
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	coke
male	pepsi
male	pepsi
male	pepsi
male	pepsi
male	neither
male	neither
male	neither

Let's identify the W's for this data set:

Who: 41 Statistics students
- Cases: each student is a case
What: gender and beverage preference
- Variable: gender; Type: categorical
- Variable: beverage preference; Type: categorical
When: September 18, 2006
Where: Edmonds Community College
Why: not specified
How: in-class survey

Notice that there are 41 cases (hence 41 rows in the data table, not including the header row) and 2 variables (hence 2 columns in the table). Note that the first variable can take on two possible values ("male" or "female") while the second variable can take on one of three possible values: "Coke," "Pepsi" or "neither." When identifying the "What" be sure not to confuse the variables with the values that the variables can take on. We'll revisit this data set in the next chapter.
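The difference between variables and their values may be easier to see in code. The following is a minimal Python sketch (Python is an assumption; the names tallies, rows, variables and values are made up for illustration). To keep it short, the 41 rows are rebuilt from tallies I counted in the table above rather than typed out one by one.

```python
# Tallies counted from the 41 rows of the table above.
tallies = {
    ("female", "coke"): 7,
    ("female", "pepsi"): 10,
    ("female", "neither"): 8,
    ("male", "coke"): 9,
    ("male", "pepsi"): 4,
    ("male", "neither"): 3,
}

# Rebuild one (gender, beverage) row per student: these are the cases.
rows = [pair for pair, count in tallies.items() for _ in range(count)]

# The variables are the column names, not the entries in the columns.
variables = ["gender", "beverage"]

# The values are the distinct entries that appear under each variable.
values = {
    "gender": sorted({gender for gender, _ in rows}),
    "beverage": sorted({beverage for _, beverage in rows}),
}

print(len(rows), "cases;", len(variables), "variables")
for name in variables:
    print(name, "takes the values", values[name])
```

There are two variables no matter how many distinct values each one takes: "coke," "pepsi" and "neither" are three values of the single variable beverage, not three variables.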

Exercises

Work exercises 1, 3, 9, 17, 27 and 29 in Chapter 2. (You are of course encouraged to work many more problems in addition to these.)

Errata

Note that the term ordinal variable is defined on page 11, even though this term is not bold-faced like categorical variable and quantitative variable are on page 10.

Likewise, note that the term identifier variable is defined on page 12.

The Just Checking exercises on page 14 indicate that the Tour de France data is on the DVD; it is not, although you can find it listed with the Chapter 9 data sets.

The T icon indicates that the data sets for Exercises 29 and 30 are on the DVD; they are not.

On page 19, the answer for Just Checking Exercise 1 should read "1903 to 2007" (not 2006).

ActivStats

You may wish to work through pages 2-1 through 2-3 in ActivStats, as time permits. These activities shouldn't take very long, but they will offer you a chance to collect some data by playing a computer game and introduce you to working with Data Desk, the statistical analysis program included on the DVD. (I'll include instructions for Data Desk here beginning with Chapter 3, when we'll actually start using it.)

Additional Resources

"Democrats Have Significant Identification, Image Advantage": Results of Gallup poll referenced in Exercise 1 of Chapter 2.
"Americans More Negative Than Positive About Economy": Results of Gallup poll referenced in Exercise 2.
"Bicycle Helmets Put You at Risk": Clive Thompson.
The New York Times. December 10, 2006.
Article referenced in Exercise 7.
"The Eyes of Honesty": Clive Thompson.
The New York Times. December 10, 2006.
Article referenced in Exercise 9.
"Cardiorespiratory fitness and smoking-related and total cancer mortality in men": Do Lee, Chong; Blair, Steven N.
Medicine & Science in Sports & Exercise. 34(5): 735–739, May 2002.
Abstract of article mentioned in Exercise 11.
"Rapid changes in flowering time in British plants": Fitter, A.H. and Fitter, R.S.R.
Science. 296 (5573). May 31, 2002. pp. 1689–1691
Abstract (with a link to a PDF of the full article) of the article referenced in Exercise 18.
"Plants found blooming earlier in the spring": USA Today. May 30, 2002.
Associated Press report about the previously cited Science article.
"Refrigerators" (PDF, 715K): Consumer Reports. Vol. 67 No. 8. August 2002. p. 25.
Article referenced in Exercise 27.
"Do lost people really walk in circles?": Kim Y. Masibay
Science World. September 22, 2003.
Article about experiment mentioned in Exercise 28.
