Organizing Data

Once someone provides us with data (or we begin collecting it ourselves), our first task will often be to organize that data in a useful manner before attempting to visualize or summarize aspects of the data set. Typically, we can format data in a data table (sometimes called a data matrix) using rows to represent cases (individual people or things about whom we have collected information) and columns to represent distinct variables (the specific information we record about each person or thing, which often varies from case to case).

As a first example, consider the following information about 18 statistics textbooks, retrieved from Amazon.com on September 21, 2012: author, title, edition, ISBN, number of pages, shipping weight (in pounds), Amazon price (in dollars) and list price (in dollars).

 

author      title                                    ed.  ISBN        pages  weight   price    list
De Veaux    Intro Stats                              3    0321500458    864     4.4  121.88  170.00
De Veaux    Stats: Data and Models                        0321692551    976     4.4  123.26  170.00
Agresti     Statistics: The Art and Science…              0135131995    848     4.2  130.99  170.00
Triola      Elementary Statistics                    11   0321694503    888     4.3  155.50  170.00
McClave     Statistics                               11   0132069512    864     4.4  140.74  170.00
Moore       Intro. to the Practice of Statistics          1429240326    709     3.8  132.82  ------
Moore       Basic Practice of Statistics                  1464102546    745     3.2  129.49  ------
Freund      Modern Elementary Statistics             12   013187439X    576     2.4  130.99  170.00
Bluman      Elementary Statistics                         0077460391    896     4.4  131.75  ------
Utts        Mind on Statistics                            0538733489    752     3.4  144.72  202.95
Johnson     Elementary Statistics                    11   0538733500    832     4.2  193.94  241.95
Freedman    Statistics                                    0393929728    720     3.0  120.62  241.95
Mendenhall  Intro. to Probability and Statistics     14   1133103758    744     3.0  198.90  234.95
Larson      Elementary Statistics                         0321891872    352     4.0  170.90  188.33
Sullivan    Statistics: Informed Decisions…               0321757270    960     4.7  157.08  170.00
Gould       Introductory Statistics                       0321322150    736     3.6  124.20  170.00
Peck        Statistics: The Exploration & Analysis…       0840058012    816     4.0  196.35  252.95
Diez        OpenIntro Statistics                          1478217200    426     2.3    9.94    9.94

This data set contains 18 cases (where each case is a different textbook) and 8 variables.

You may be familiar with such a data structure from using a spreadsheet program like Microsoft Excel or Google Docs.

Notice that we reserve the first row for the variable names (so that a data file with 18 cases, like this one, actually uses 19 rows). Most spreadsheets and statistical analysis programs can store and retrieve files such as this one in tab-delimited text format, in which the entries in each row are separated by a (typically invisible) tab character.

You can download the tab-delimited text file for the textbook data found above.
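
As a rough sketch of how a program might read such a file, the Python snippet below parses a three-book excerpt of the textbook data using the standard csv module. The excerpt is typed inline rather than loaded from the downloadable file, and the name read_table is a hypothetical helper chosen for illustration.

```python
import csv
import io

# A small excerpt of the textbook data in tab-delimited form ("\t" is a tab).
# The first line holds the variable names; each later line is one case.
TEXT = (
    "author\ttitle\ted.\tISBN\tpages\tweight\tprice\tlist\n"
    "De Veaux\tIntro Stats\t3\t0321500458\t864\t4.4\t121.88\t170.00\n"
    "Freund\tModern Elementary Statistics\t12\t013187439X\t576\t2.4\t130.99\t170.00\n"
    "Diez\tOpenIntro Statistics\t\t1478217200\t426\t2.3\t9.94\t9.94\n"
)


def read_table(text):
    """Return (variable_names, cases) parsed from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    cases = list(reader)  # one dictionary per case, keyed by variable name
    return reader.fieldnames, cases


variables, cases = read_table(TEXT)
print(len(cases), "cases and", len(variables), "variables")
print(cases[0]["title"], "has", int(cases[0]["pages"]), "pages")
```

Every value arrives as text, so a quantitative variable such as pages must be converted (here with int) before doing arithmetic on it. A blank entry, like the edition for the last book in the excerpt, comes through as an empty string.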

Data sets should always be accompanied by some explanatory information establishing the context and source of the data (as done above for the textbook data). This information can include (but is not limited to): the date(s) when the data was collected; the procedure(s) used to select sample data from the population; detailed descriptions of each variable; the location(s) of the people or things about which the data was collected; the identity and affiliation(s) of the person(s) who recorded the data; the purpose for collecting the data.

Variable types
In algebra and other math classes, variables are usually things like x or y that only take on numbers as values. In statistics, variables can be much more interesting.

We typically classify variables into two basic types: categorical and quantitative. Categorical variables take on distinct categories as values. For example, in the textbook data above, one variable not recorded is the format for each book (hardcover, paperback, looseleaf, e-book); because we can separate the textbooks into distinct categories, the format variable would be categorical. Meanwhile, the price variable is quantitative, as it measures a quantity (the amount of money required to purchase the book from Amazon.com). 

Quantitative variables must take on only numbers as values, but categorical variables can do so as well: the ZIP code to which Amazon might ship a textbook (98036, say) only involves numbers, but these numbers do not measure a quantity and could easily be replaced with letters or other symbols (as is the case with Canadian postal codes, such as "V5T 4V5" for Vancouver Community College). Deciding whether a variable is quantitative can sometimes be a bit challenging, but one thing to look for is whether the variable has units (such as dollars for the price variable); we should also mention units for any quantitative variable (a price of $170 is not the same as €170, which in turn is vastly different from a price of ¥170). Another clue is whether it makes sense to compute an average: we might very well report the average price for a collection of textbooks, but it wouldn't make sense to report the "average ZIP code" to which they were shipped.
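
The "does an average make sense?" test can be illustrated with a short Python sketch. The three prices come from the textbook table; the list of ZIP codes is invented for this illustration (only 98036 appears in the text above), and storing them as text is one common convention, not a requirement.

```python
from collections import Counter
from statistics import mean

# Amazon prices (in dollars) for the first three books in the table.
prices = [121.88, 123.26, 130.99]

# ZIP codes label places rather than measure a quantity, so we store them
# as text. The list below is made up for illustration.
zip_codes = ["98036", "02139", "98036"]

# An average is meaningful for a quantitative variable...
average_price = round(mean(prices), 2)
print("average price in dollars:", average_price)

# ...while a categorical variable is more naturally summarized by counts.
zip_counts = Counter(zip_codes)
print("shipments per ZIP code:", dict(zip_counts))

# Treating a ZIP code as a number can even damage it: the leading zero is lost.
print(int("02139"))
```

Nothing stops a program from averaging ZIP codes once they are converted to numbers, which is why classifying each variable is a judgment we make before choosing a summary.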

Some variables fall in a gray area between categorical and quantitative. Consider, for example, the edition variable in the textbook data set. The values are (almost all) numbers, but we wouldn't lose any information if editions were labeled A, B, C, ... instead of 1, 2, 3, ... and it might not make sense to compute the "average edition" for a bunch of textbooks. In this situation, edition represents an example of an ordinal variable, as the categories involved possess a natural order but do not measure any sort of quantity.

Another special variable type is an identifier, where each category corresponds uniquely to a different case. For the textbook data, the ISBN (International Standard Book Number) is an identifier, but the title is not (more than one book has the title Elementary Statistics, for example).
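
One way to check whether a variable could serve as an identifier is to test whether any value repeats. The sketch below does this for four cases from the textbook table (Triola, Bluman, Johnson and Larson); the function name is_identifier is a hypothetical helper, not part of any library.

```python
def is_identifier(values):
    """Return True when every case has a different value."""
    return len(set(values)) == len(values)


# Title and ISBN for four cases from the textbook table.
titles = ["Elementary Statistics"] * 4
isbns = ["0321694503", "0077460391", "0538733500", "0321891872"]

print("title is an identifier:", is_identifier(titles))
print("ISBN is an identifier:", is_identifier(isbns))
```

A check like this can only rule a variable out. Values that happen to be distinct in one data set might still repeat in a larger one, so we also rely on knowing how the variable was assigned (an ISBN is designed to be unique to a book, a title is not).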

Discrete vs. continuous
Some quantitative variables (like the pages variable in the textbook data set) are discrete: a book can have 851 pages or 852 pages, but not 851.7 pages. Other variables (like weight) are continuous: given sufficiently sensitive measuring equipment, we can (theoretically) measure the quantity to finer and finer degrees of precision (3.8948716 lbs). In practice, however, we often round off continuous variables to some reasonable number of decimal places (as with the shipping weights in the textbook data). Other variables are technically discrete (the smallest possible difference between two prices is $0.01) but in practice we treat price as a continuous variable because $0.01 is a rather tiny amount in most contexts.

Databases
A database is a collection of data tables that includes relationships between cases in different tables. Amazon.com utilizes a massive database that includes records of customer information, product details, purchase records and much, much more. On a somewhat smaller (but still rather large) scale, professional symphony orchestras maintain databases with tables for patrons, performances, tickets, ticket purchases, seat locations in the concert hall and donations. Each case in the donation table would contain an identifier for the patron who made the donation, while each ticket would include links to information in other tables about the patron who purchased the ticket, the location of the seat in the concert hall and the music being performed at a particular concert. Databases play an important role in creating dynamic Web sites, from simple blogs to massive social media platforms like Facebook, Twitter and Google+.
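
To make the idea of linked tables concrete, here is a minimal sketch of the orchestra example using Python's built-in sqlite3 module. The table names, column names, patron names and donation amounts are all invented for illustration; a real orchestra's database would be far larger and organized differently.

```python
import sqlite3

# Build a tiny in-memory database with two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patrons (
        patron_id INTEGER PRIMARY KEY,   -- identifier for each patron
        name      TEXT
    );
    CREATE TABLE donations (
        donation_id INTEGER PRIMARY KEY, -- identifier for each donation
        patron_id   INTEGER REFERENCES patrons(patron_id),
        amount      REAL                 -- in dollars
    );
    INSERT INTO patrons VALUES (1, 'Patron A'), (2, 'Patron B');
    INSERT INTO donations VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 100.0);
""")

# Follow the patron_id link from each donation back to the patron's name,
# then total the donations for each patron.
rows = conn.execute("""
    SELECT p.name, SUM(d.amount)
    FROM donations AS d
    JOIN patrons AS p ON p.patron_id = d.patron_id
    GROUP BY p.patron_id
    ORDER BY p.patron_id
""").fetchall()
conn.close()

for name, total in rows:
    print(name, "donated a total of", total, "dollars")
```

Each case in the donations table stores only the patron's identifier, so a patron's name is recorded once, in the patrons table, no matter how many donations that patron makes.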

Exercises

1. Refer to the textbook data provided above.

a) Classify each of the following variables as categorical, quantitative, ordinal or identifier: author, pages, weight, list.

b) For any quantitative variables you listed in part a), specify the units (if possible).

c) Is the author variable an identifier? Explain.

d) Explain why information about the date this data was recorded might be important.

e) The data set does not include a list price for three of the textbooks (this is an example of missing data—deciding how to treat cases with missing data is an important topic among statisticians). Attempt to locate this information online and fill in the blanks. Be sure to include the source(s) you use.

f) Variability is a concept central to the study of statistics. Just by looking at the price and list variables, which appears to vary less?

g) Do you notice any unusual feature(s) of this data set?

2. Amazon.com provides customer ratings for many of the books it sells. Here is a summary of the ratings from reviews for the sixth edition of Introduction to the Practice of Statistics:

a) Thinking of just the reviews for this book, how many cases are there?

b) List the variables and classify each as best you can, based on the information provided above.

3. Mason-Dixon, a nonpartisan polling firm based in Jacksonville, Florida, conducted a phone survey of 800 registered Florida voters (whom the company deemed likely to vote in the November 2012 election) on behalf of the Tampa Bay Times, Miami Herald, El Nuevo Herald, Bay News 9 and Central Florida News 13. The poll, conducted September 17–19, 2012, found that: 48% of those surveyed plan to vote to reelect President Barack Obama; 47% plan to vote for his Republican rival, former Massachusetts governor Mitt Romney; and 1% for the Libertarian candidate, former New Mexico governor Gary Johnson, with the rest undecided.

a) How many cases are included in this data set? 

b) Based only on the information provided above, how many variables are included in the data set?

c) Classify the variable(s) as categorical, quantitative, ordinal or identifier.

d) Follow the link provided to the Tampa Bay Times article about the poll. List any other variables that must have been included in the data set for the September 2012 poll. 

e) Classify each of the variables you listed in part d). 

4. Consider the population of all students enrolled in your college. Classify each of the following as categorical, quantitative, ordinal or identifier. For the quantitative variables, classify each as discrete or continuous and specify units.

a) student identification number

b) telephone number

c) number of credits for which the student has enrolled this quarter

d) distance from the student's home to campus