Ch. 10 Resources

Chapter 10: Re-expressing Data

Chapter 10 is optional, for two reasons: first, we have a great deal of material to cover and this is the easiest thing to leave out; second, some of the problems involve logarithms, which you may not have seen before unless you've previously taken EdCC Math 131 or Math 140.

At the very least, however, I encourage you to skim through the chapter and look at the ActivStats activities to get the flavor of what is being discussed. If you have the time, I encourage you to read through the text more carefully, work through the ActivStats activities, and work several problems. If you do this, you may post a solution to one of the even-numbered exercises for HW credit, as with the other chapters.

Chapter 10 addresses what to do in certain cases where linear regression is not appropriate, but there nevertheless appears to be a significant non-linear association in a scatterplot. Briefly, if the scatterplot curves up and then heads back down (or vice versa), we're stuck and we really can't do anything that will allow us to perform regression. In other cases, however, it is possible to transform the data so that the scatterplot looks linear, at which point we may be able to perform linear regression computations and analysis; we then re-transform the regression equation back to the original units in order to draw conclusions.

What types of transformations are we able to use? We might take the square root of one of the variables, or perhaps the logarithm, or perhaps transform both variables. This is fairly simple to do with Data Desk or the TI-84. There are brief instructions for both in the textbook, and the ActivStats DVD contains detailed instructions for Data Desk. We'll work one example here.

Census Data

The table below shows the population of Snohomish County according to the U.S. Census Bureau for each of the past nine decennial censuses.

year population
1920 67690
1930 78861
1940 88754
1950 111580
1960 172199
1970 265236
1980 337720
1990 465642
2000 606024

A text file with this data for use in Data Desk (snoco.txt) is in the Data Sets folder; save the file to your computer. If we create a scatterplot of population vs. year we get:

scatterplot of population vs. year for Snohomish County

Clearly the Straight Enough Condition is violated here. There does appear to be a very strong association between year and population, but it's not linear. What can we do? We can try various transformations of the data that might "straighten out" this curve. The graph above might remind you of an exponential function from a precalculus or calculus class; if the association is exponential, then to "straighten it out" we might try the inverse of the exponential function, the logarithm.

In Data Desk, click on the population variable to designate it as Y, then click on Manip, Transform and Log( y ):

click on Manip, Transform and Log( y )

A new (derived) variable, Lppn (the logarithm of the population data), will appear:

click on lppn to select as Y

Now create a scatterplot of log(population) vs. year:

scatterplot of log(population) vs. year for Snohomish County
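For readers without Data Desk, the same transformation can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and we assume a base-10 logarithm, consistent with the 10^y back-transformation used later on this page.

```python
import math

# Census data copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
population = [67690, 78861, 88754, 111580, 172199,
              265236, 337720, 465642, 606024]

# Base-10 logarithm of each population value (assumed to correspond
# to the derived variable that Data Desk creates with Log( y )).
log_population = [math.log10(p) for p in population]

# Print the transformed data; plotting these pairs gives the
# log(population) vs. year scatterplot.
for year, log_p in zip(years, log_population):
    print(f"{year}  {log_p:.4f}")
```

Any plotting tool can then be used to draw the transformed values against the year.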

This isn't perfectly straight, but it's straighter. Is it straight enough? This may be the best we can do. Certainly the last four censuses look to be relatively straight in the transformed plot, so the population growth may be exponential since about 1970. From 1920 to 1950, the transformed plot also looks straight, indicating exponential growth during that time. It's the years from 1950 to 1970 where the transformed plot does not appear linear, and where the plot of the original data indicates the population was growing even faster. We may not be able to find a perfect mathematical model for the population of Snohomish County, but if we were historians studying the history of the county, we could further investigate the possible reasons for the (relatively) dramatic population growth during these two decades. (Certainly the Baby Boom might be a contributing factor.)
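One way to check this reading of the transformed plot numerically: if growth were exactly exponential, log(population) would rise by the same amount every decade. The sketch below (our own illustration, with hypothetical variable names and a base-10 logarithm assumed) lists the decade-to-decade changes so that we can see which periods stand out.

```python
import math

# Census data copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
population = [67690, 78861, 88754, 111580, 172199,
              265236, 337720, 465642, 606024]

log_population = [math.log10(p) for p in population]

# Change in log10(population) from each census to the next.
# Roughly equal changes suggest roughly exponential growth.
log_changes = [later - earlier
               for earlier, later in zip(log_population, log_population[1:])]

for start, change in zip(years, log_changes):
    print(f"{start}-{start + 10}: {change:.3f}")
```

The two largest changes belong to the 1950-1960 and 1960-1970 decades, which matches the bend we see in the transformed scatterplot.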

While the Quantitative Variables Condition and the Outlier Condition are now satisfied for the transformed data, the Straight Enough Condition is still questionable, so we might want to check a residuals plot. To do this we must first perform the regression analysis:

Data Desk regression output for log(population) on year

and then create the residuals plot:

residuals vs. predicted values for transformed data

Depending on how we look at this, the residuals either bend down, then up, or there is much more scatter on the left than on the right. In either case, we should conclude that regression is not appropriate here (despite the high R² value).
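The regression and residuals can also be computed by hand from the usual least-squares formulas. The following is a minimal sketch, not a reproduction of the Data Desk output: the names are our own, a base-10 logarithm is assumed, and the results may differ slightly from Data Desk's in the last digits.

```python
import math

# Census data copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
population = [67690, 78861, 88754, 111580, 172199,
              265236, 337720, 465642, 606024]

log_population = [math.log10(p) for p in population]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(log_population) / n

# Least-squares slope and intercept for log10(population) on year.
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(years, log_population))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed log10(population) minus predicted value.
predicted = [intercept + slope * x for x in years]
residuals = [y - y_hat for y, y_hat in zip(log_population, predicted)]

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in log_population)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.5f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}")
for year, r in zip(years, residuals):
    print(f"{year}  {r:+.4f}")
```

The signs of the residuals (positive at the start, negative in the middle, positive again at the end) show the same bend as the residuals plot, even though R² is high.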

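If for some reason we had decided to use the regression equation, it would have the following form, where b0 (the intercept) and b1 (the slope) are read from the regression output:

log(population) = b0 + b1 × year

In order to convert this back to the original units, we would need to remember that log(y) is the inverse function for 10^y and proceed as follows:

population = 10^(b0 + b1 × year)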

If we want to predict the population of Snohomish County in 2010, we would plug year = 2010 into the above equation to get a predicted population of approximately 757,944. We should keep in mind, however, that extrapolation into the future is always dangerous, and that the model itself may not be entirely appropriate.
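The back-transformation and prediction can be sketched in Python as well. The function name predict_population is hypothetical, and a base-10 logarithm is assumed. Because this sketch refits the line directly from the census table at full precision, its 2010 value need not agree with the figure quoted above, which depends on the coefficients in the Data Desk output.

```python
import math

# Census data copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
population = [67690, 78861, 88754, 111580, 172199,
              265236, 337720, 465642, 606024]

log_population = [math.log10(p) for p in population]

# Least-squares fit of log10(population) on year.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(log_population) / n
sxx = sum((x - mean_x) ** 2 for x in years)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(years, log_population))
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict_population(year):
    """Back-transform the fitted line: 10 ** (intercept + slope * year)."""
    return 10 ** (intercept + slope * year)


# Extrapolating beyond 2000 carries the usual risks noted in the text.
print(round(predict_population(2010)))
```

Raising 10 to the fitted value undoes the logarithm, so the prediction comes out in people rather than in log units.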

Homework

If you choose to work through this chapter, the following (optional) exercises in Chapter 10 are suggested: 1, 9, 19 and 29.

Errata

Although not mentioned in the text, the cars data set used on page 253 is on the DVD.

Although not mentioned in the text, the assets data set used on page 255 is on the DVD.

At the end of the first line of the fourth paragraph on page 263, the equation should read “-1÷0.036” (not -1−0.036).

The last line of Exercise 33 should read “1990s” (not “last decade”).

ActivStats

Work the (optional) activities on page 10-1 in the ActivStats lesson book, as time permits. I encourage everyone to view the brief discussion of Exploratory Data Analysis on page 10-2; the other activities on that page look forward to Chapter 28 (which we won't cover, and which can be found on the DVD, not in the printed textbook).