Introduction to Data Desk

These notes redo some of the examples in the Chapter 3 Resources (and introduce some others) in the context of working with Data Desk.

Bar charts and pie charts with Data Desk

Let's use technology to make a similar bar chart. Throughout most of the text we'll use the TI-83 or TI-84 calculator as well, but these graphing calculators can't handle categorical data, so for now we'll use computer software exclusively. In the past you may have used a spreadsheet program, such as Microsoft Excel, to make graphs such as this. Excel actually does a decent job of sorting and displaying categorical data, so we could use it in this chapter. We will see in future chapters, however, that Excel does not make appropriate displays of quantitative data (and it has many other problems as well). Thus, even though it is handy for entering and sorting data, we should use other tools for most statistical calculations. In place of Excel, we'll use Data Desk, a program that is included on the DVD that came with your textbook.

Note: Some students find Data Desk a bit daunting at first. In part, that's because working with categorical variables in Data Desk is more challenging than working with quantitative variables. I encourage you to follow along with the instructions below, but if you're short on time you can do the exercises in Chapter 3 without Data Desk (drawing graphs by hand if necessary), so just skim through what follows and come back to it when you have time.

If you have worked through the first lessons in ActivStats you should know how to start Data Desk from within the ActivStats program by clicking the magnifying glass icon:

accessing Data Desk from within ActivStats

To access the program directly, make sure the Intro Stats DVD is in your computer's DVD drive, then right-click on My Computer (on your desktop or in the Start menu) and choose Open. Right-click on the drive containing the ActivStats DVD and select Open.

in My computer, right-click on the ActivStats DVD drive and select Open

Double-click on the Course folder:

double-click on the Course folder

and then double-click on the Data Desk AS.exe program to run Data Desk from the DVD:

double-click on Data Desk AS.exe to run Data Desk

If you want, you can click once on this file, and hold down the mouse button while dragging this program to your desktop or a folder (or even a USB flash drive); then you can run it directly from your computer or USB drive so that you don't always need the DVD to use Data Desk. I recommend this option, as it's much more convenient than running the program from the DVD.

Now we need to get the data into Data Desk. We'll learn how to create data files from scratch later, but for now we'll use a ready-made data file, called beverage.txt, which you can find in the Data Sets folder here on WAMAP, or access directly by right-clicking on the following link:

beverage.txt

and choosing "Save Link As..." or "Save Target As..." from the drop-down menu. Save the file to your computer or USB drive to use in the current example. (If you're using a Mac and the following instructions don't work, use the Mac version of this file, called beverage.dsk, in the Data Sets folder.)

If you open this file in a text editor (such as Notepad), you'll see something that looks very much like the data set displayed above:

beverage.txt viewed in Notepad

Note that (in addition to the variable names, gender and beverage, in the first row) there are 41 rows, one for each case (i.e. each student), and two columns, one for each variable. This is the format that Data Desk expects to see when we work with a categorical variable.

To open the text file in Data Desk, start the Data Desk program, click on File and Import...,

click File then Import...

then navigate to the file beverage.txt that you saved on your computer's hard drive or USB drive, click on the filename and click Open.

single click on filename, then click Open

Click Use these variable names:

click Use these variable names

The data file is now open in Data Desk. You should see two variables here:

variables in Data Desk

but we only want to work with the beverage variable at the moment, so click on it to select it. A Y should appear over the variable's icon.

click on variable name to select as Y

Click Calc and Frequency Breakdowns:

click Calc and then Frequency Breakdowns

You should then see a frequency table (with relative frequencies included).

frequency table
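If you'd like to check a frequency table outside Data Desk, the same tally can be sketched in a few lines of Python. This is only an illustration, not part of Data Desk or the textbook: the function name frequency_table and the four sample rows are made up, and the sample only imitates the layout of beverage.txt (a header row, then one tab-delimited row per case).

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a tab-delimited file laid out like beverage.txt:
# a header row, then one row per case. These four rows are made up.
SAMPLE = (
    "gender\tbeverage\n"
    "female\tcoffee\n"
    "male\ttea\n"
    "female\tcoffee\n"
    "male\twater\n"
)


def frequency_table(rows, variable):
    """Return {category: (count, percent)} for one categorical variable."""
    counts = Counter(row[variable] for row in rows)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}


# Read the raw data: each row becomes a dict keyed by the variable names.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
table = frequency_table(rows, "beverage")

# Print counts alongside relative frequencies (as percentages).
for category, (count, percent) in sorted(table.items()):
    print(f"{category:10s}{count:5d}{percent:8.1f}%")
```

To try this on a real file, you could replace io.StringIO(SAMPLE) with an opened file, assuming the file really is tab-delimited with the variable names in the first row.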

With the beverage variable still selected as Y, you can click on Plot and then Bar Charts:

click Plot and then Bar Charts

to create a bar chart from the beverage data:

bar chart of beverage data

Note that each bar is labeled with its category and that the variable ("beverage") is labeled along the horizontal axis; the scale for the counts is indicated on the vertical axis. In some cases you may need to resize the bar chart window (by clicking on the symbol in the lower right corner of the window and dragging the mouse as you hold down the left mouse button) to see the full category names.

You can easily make a pie chart for the same data set (make sure the beverage variable is still selected as Y) by clicking on Plot and then Pie Charts:

pie chart of the beverage data

Note that the legend to the right of the pie chart tells us which slice of the pie corresponds with which category. If you're printing a report in black and white, you can click on the hyperview menu (the little triangle-shaped symbol in the upper-left corner of the pie chart window in Data Desk) and select Use Patterns:

click hyperview triangle, then click Use Patterns

You should then get a chart like this:

pie chart for beverage data using patterns instead of colors

A particularly useful feature of Data Desk is the ability to copy and paste graphs into other applications, such as Microsoft Word. To do this, click on the graph you want to copy (so that its window is the active window in Data Desk) then click on Edit and Copy Window:

click Edit then Copy Window

The graph is now on your computer's clipboard and you can paste it into a Microsoft Word file (for example) to create a report. You can also paste it into an image editor such as Paint.NET (or Photoshop or Paint) and create an image in *.jpg or *.png format to upload to the Web and use as part of a HW solution.

Working with summary counts

When many statistical analysis programs (including Data Desk) work with categorical variables, they expect the data to be in "raw" form: in other words, a long table of entries with one row for each case and one column for each variable, as we saw with the file beverage.txt.

In practice, however, information like this is often given in summary form as a frequency table (as in the initial way we summarized our beverage data) or a two-way table (when working with two categorical variables at once, as in Exercise 29 of Chapter 3, or as below where we consider beverage preference and gender simultaneously). In the textbook, Exercise 23 (for example) has an orange circle with a T inside next to the exercise number, indicating that the data set for that exercise is included on the DVD that comes with the book; in many such cases the data in these files is given in summary format rather than as "raw data." Unfortunately, Data Desk expects to see the raw data, so what do we do?

Let's begin with an example from the second edition of the textbook. It's not included in the third edition, so you can get the file by right-clicking on the following link:

Ch03_Auditing_reform.txt

and saving it to your desktop. If you open the text file in Notepad you'll see something like this:

tab-delimited text file for Ch. 3 #15, viewed in Notepad

Note that this data file is in summary format (like a frequency table) rather than in "raw data" format with one row for each case.

Close Notepad (if it's open) and open Data Desk, then import the file (you'll want to close any other data files that you have open in Data Desk first) just as we did above with the beverage data. Click on the Response variable so that a Y appears over the variable's icon, then hold down the SHIFT key and click on the Percent variable so that an X appears over that variable's icon:

click to select a variable as Y, shift-click to select as X

Now click on Manip and Replicate Y by X:

click Manip then Replicate Y by X

a new variable, Response:Percent, will appear:

derived variable

Click on this new variable (to select it as Y) and then click Plot and Pie Charts to create a pie chart:

pie chart for Ch. 3 #15

Note that we didn't really have counts in the original data file, but rather percentages. This worked fine, though, since the percentages were all rounded off to the nearest integer and we only wanted to create a pie chart, which is a graphical display of percentages, or relative frequencies. If the percentages had been of the form 39.2% this procedure would not have worked, since a value can't be replicated a fractional number of times. The best practice in such a case would be to convert each percentage to a proportion (divide by 100) and multiply it by the total number of cases to get the original counts, then use Replicate Y by X to create a "raw data" file with one row for each case.
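The idea behind turning a summary table into "raw" data, and the percentage-to-count arithmetic just described, can be sketched in Python. This is a rough illustration of the idea, not a description of how Data Desk implements Replicate Y by X; the function names, the category labels, the percentages, and the total of 500 cases are all hypothetical.

```python
def replicate(values, counts):
    """Repeat each value count times, giving one entry per case."""
    raw = []
    for value, count in zip(values, counts):
        if count < 0 or count != int(count):
            raise ValueError(
                f"count for {value!r} must be a non-negative whole number"
            )
        raw.extend([value] * int(count))
    return raw


def percents_to_counts(percents, total):
    """Convert percentages to whole-number counts for a known total."""
    return [round(p / 100 * total) for p in percents]


# Hypothetical categories and percentages, not taken from the exercise file.
responses = ["Agree", "Disagree", "Unsure"]
percents = [39.2, 50.4, 10.4]

# Assumed total number of cases; in practice this comes from the study itself.
counts = percents_to_counts(percents, total=500)
raw = replicate(responses, counts)

print(counts)    # one whole-number count per category
print(len(raw))  # one entry per case
```

Because published percentages are usually rounded, the recovered counts may not add up exactly to the total, so it's worth checking the sum before going further.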

Two categorical variables in Data Desk

To create a two-way table from our gender and beverage data in Data Desk, open the beverage.txt file in Data Desk as before, then click on beverage to select it as Y, then hold down the SHIFT key and click on gender to select it as X:

click on beverage to select as Y, then shift-click on gender to select as X

Now click on Calc and Contingency Tables.

click Calc and then Contingency Tables

You should see a table with the same summary counts as in our original two-way table:

contingency table of gender and beverage

Click on the hyperview menu of the active window (the small arrow in the upper-left corner), then select Table Options:

click the hyperview icon, then click Table Options

to access options to display row percentages, column percentages and/or table percentages in addition to (or instead of) counts.

Table Options dialog box

If we select "Percent of column total" (and deselect "Count") we get:

contingency table for beverage and gender (column percentages)

and we can see more easily by examining the percentages that there does seem to be a difference between the distribution of beverage preferences for males and females.
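For readers who want to see what "Percent of column total" computes, here is a minimal Python sketch. It assumes gender is the column variable and beverage the row variable, as in the table above; the function name column_percentages and the eight cases are made up for illustration and are not the contents of beverage.txt.

```python
from collections import Counter


def column_percentages(pairs):
    """pairs: (row_category, column_category) tuples, one per case.

    Returns {column: {row: percent of that column's total}}.
    """
    pairs = list(pairs)
    cell_counts = Counter(pairs)
    column_totals = Counter(column for _, column in pairs)
    table = {}
    for (row, column), n in cell_counts.items():
        table.setdefault(column, {})[row] = 100 * n / column_totals[column]
    return table


# Made-up (beverage, gender) cases for illustration only.
cases = [
    ("coffee", "female"), ("coffee", "female"),
    ("tea", "female"), ("tea", "female"),
    ("coffee", "male"), ("water", "male"),
    ("water", "male"), ("water", "male"),
]

table = column_percentages(cases)
for column in sorted(table):
    print(column, table[column])
```

Each column's percentages add up to 100, which is what lets us compare the two groups even when they contain different numbers of students.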

Now let's create a visual display that will allow us to compare males and females. With beverage again selected as Y, and gender again selected as X, click on Manip and Split into Variables by Group....

click Manip then Split into Variables by Group...

Data Desk will open up a new window called gender with two data sets called female and male:

beverage data split into two groups by gender
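Conceptually, splitting by group just sorts each case's Y value into a list named after that case's X category. A small Python sketch of that idea follows; the function name split_by_group and the five cases are hypothetical, and this is not a claim about how Data Desk does it internally.

```python
from collections import defaultdict


def split_by_group(y_values, x_values):
    """Group the Y values by the X category recorded for the same case."""
    groups = defaultdict(list)
    for y, x in zip(y_values, x_values):
        groups[x].append(y)
    return dict(groups)


# Made-up cases for illustration only; not the contents of beverage.txt.
beverage = ["coffee", "tea", "water", "coffee", "water"]
gender = ["female", "female", "male", "male", "male"]

groups = split_by_group(beverage, gender)
for name, values in sorted(groups.items()):
    print(name, values)
```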

Now click on female to select it as Y, then hold down the CTRL key and click on male to also select it as Y:

click female then control-click male so both are selected as Y

Now click on Plot and Pie Charts to see side-by-side pie charts comparing the beverage preferences of males and females:

side-by-side pie charts comparing beverage preferences for males and females

The two pie charts look quite different, which gives us evidence that beverage preference and gender are not independent.

Summary counts (again)

Now let's look at the contingency table in Exercise 29. This is a contingency table because we have one group (students who applied to magnet schools) classified according to two variables (Ethnicity and Admission Decision). Get the data file (in text format or Data Desk format) from the DVD and save it to your desktop. Note that the data files on the DVD and Web do not include exercise numbers, just the chapter number and the topic of the problem; in this case the file is called Ch03_Magnet_schools_revisite.TXT or Ch03_Magnet_schools_revisite.ise. There is a further problem in this case: the publisher left this file out of the folder of Chapter 3 data sets on the DVD. You can, however, find the data set in the folder of data sets for the second edition, also on the DVD, and you can also get it in the Data Sets folder here on WAMAP or by right-clicking on this link.

Save the file Ch03_Magnet_schools_revisite.TXT or Ch03_Magnet_schools_revisite.ise to your desktop or USB drive. If you examine this file using a text editor, you'll note that it looks like this:

exercise #19 data set viewed as a text file

Note that there is one column for each variable and one row for each possible combination of the values of these two variables, with summary counts in the third column. In order to analyze this data using Data Desk, we need to turn these summary counts into a "raw" data file, as we did before, although in this case it is slightly more complicated.

Import the file into Data Desk (after closing any previously open data files) then click on the Ethnicity variable so that a Y appears over its icon, then hold down the CTRL key and click on the Admission Decision variable so that another Y appears over that variable's icon. Finally, hold down the SHIFT key and click on the Counts variable so that an X appears:

control-click for Y and shift-click for X

Now click on Manip and Replicate Y by X: two new variables, Ethnicity:Counts and Admission Decision:Counts, will appear.

Ethnicity:Counts and Admission Decision:Counts
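With two Y variables, the replication step repeats each (Ethnicity, Admission Decision) combination as many times as its count, so the two new variables stay lined up case by case. Here is a hedged Python sketch of that idea. The function name replicate_rows is made up, and so are the decision labels and counts in the sample rows; only the variable names and the Asian and White groups come from the exercise as described here.

```python
def replicate_rows(summary_rows, count_key="Counts"):
    """Expand summary rows into one dict per case, dropping the count column."""
    raw = []
    for row in summary_rows:
        case = {key: value for key, value in row.items() if key != count_key}
        # Append a separate copy of the case for each unit of the count.
        raw.extend(dict(case) for _ in range(row[count_key]))
    return raw


# Hypothetical summary rows: the decision labels and counts are made up.
summary = [
    {"Ethnicity": "Asian", "Admission Decision": "Accepted", "Counts": 3},
    {"Ethnicity": "Asian", "Admission Decision": "Turned away", "Counts": 1},
    {"Ethnicity": "White", "Admission Decision": "Accepted", "Counts": 2},
]

raw = replicate_rows(summary)
print(len(raw))  # total number of cases across all summary rows
for case in raw:
    print(case["Ethnicity"], "|", case["Admission Decision"])
```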

Click on Admission Decision:Counts so that a Y appears and then hold down SHIFT and click on Ethnicity:Counts so that an X appears:

click Admission Decision:Counts and shift-click Ethnicity:Counts

Now click Manip and Split into Variables by Group... to split the data into three groups, Asian, Black/Hispanic and White:

Admission Decision split into three groups by Ethnicity

We can now create a pie chart for each of these groups in order to compare them. Hold down the CTRL key and click each of the groups so that all three are selected as Y:

control-click each group to select as Y

and then click on Plot and Pie Charts to see side-by-side pie charts comparing the admission decisions for each ethnic group:

side-by-side pie charts comparing admission decisions across ethnicities

If the variables Ethnicity and Admission Decision were independent, we would expect each of these pie charts to look roughly the same; they don't, so we conclude that there is evidence that Ethnicity and Admission Decision are not independent.