
Regression Odds and Ends
This section collects some useful facts and techniques related to linear regression models.
Graphing the regression line on the TI-84
Once we have the regression equation it is useful to plot this equation along with the scatterplot of the data. Before we do this, it is advisable to clear out any other equations that might be stored on the calculator. So first press the Y= key and then use the CLEAR key to clear out anything you see in the Y= menu:
Now run LinReg(a+bx) again, but this time, after typing LinReg(a+bx) L1,L2 and BEFORE pressing ENTER, type one more comma:
Then press VARS and move the cursor right so that Y-VARS is highlighted:
Next press ENTER to get to the FUNCTION menu:
Now press ENTER again so that LinReg(a+bx) L1,L2,Y1 is displayed on the screen:
Press ENTER one more time to run LinReg(a+bx). The output will be the same as before, but if you check the Y= menu you should now see the regression equation after Y1=:
To see the regression line graphed with the scatterplot, use ZoomStat:
Computing predicted values
Once we've stored the regression equation in Y1, we can use the Y1 we found in the VARS menu to easily compute predicted values. For example, if we want to predict the assessed value of a 1750 square foot house, we can enter Y1(1750) into the calculator, press ENTER and get the predicted value:
The calculator displays 314.973, which we interpret as $314,973 because the assessed values are recorded in thousands of dollars. We would predict that a 1750 square foot house in this neighborhood would have an assessed value of $314,973.
Computing residuals
Look at the first house in the data set, which has a size of 1,561 square feet and an actual assessed value of $304,000.
To compute the residual (the difference between the actual and predicted values) for this house, we first compute the predicted value by entering Y1(1561) into the calculator to get 307.843:
and then we subtract this value from the actual assessed value, which can be done most easily by typing 304, then the subtraction key, then 2ND and (-), to get ANS, which recalls the calculator's previous answer:
This yields a residual of -3.843, or -$3,843, meaning that the actual assessed value is nearly $4,000 less than the predicted value. (In other words, this homeowner shouldn't complain about his property taxes.)
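The same arithmetic can be sketched in Python. The two numbers are the ones quoted above (in thousands of dollars); the variable names are illustrative only and are not part of the calculator workflow.

```python
# Residual = actual value - predicted value (both in thousands of dollars).
actual = 304.0       # actual assessed value of the first house
predicted = 307.843  # Y1(1561), as reported by the calculator above

residual = actual - predicted

# A negative residual means the actual value lies below the regression line.
print(round(residual, 3))
```

A positive result would instead mean the house is assessed above what the model predicts for its size.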
Residual plots on the TI-84
The calculator actually computes all of the residuals automatically every time you run LinReg; we just have to uncover where these values are stored. First press STAT, then ENTER to get to the list editor. Move the cursor up to the name of the first list, then move it to the right past L6, where there should be an empty space for a new list name:
Now press 2ND and STAT (in other words, LIST) and scroll down (if necessary) until you see RESID:
Then press ENTER twice. The residuals should appear in this new list:
Now whenever you run LinReg(a+bx) the residuals for the new data set will automatically be stored in this list.
For reasons that will soon become apparent, let's now create a scatterplot of the residuals. Set up a plot in the STAT PLOT menu (you might want to turn off Plot1 and use Plot2 for the residual plot) with L1 (or whatever list contains your explanatory variable) for the Xlist and RESID for Ylist. (Follow the steps above to get to the name of the RESID list in the LIST menu.)
Finally, use ZoomStat to see the residual plot:
The plot looks reasonably boring, with no apparent pattern. This is precisely what you want to see.
A problematic residuals plot
Let's now try to fit a linear model to the following data, the population of Snohomish County as reported by the U.S. Census from 1920 through 2010:
year    pop.
1920    67690
1930    78861
1940    88754
1950    111580
1960    172199
1970    265236
1980    337720
1990    465642
2000    606024
2010    722400
A scatterplot of this data:
reveals that, while we have two quantitative variables and no significant outliers, the association is obviously not linear, and hence we should not use a linear regression equation to model this data.
If, however, we proceeded to do so anyway, we would get a regression line that falls below the data at the far ends and above the data in the middle:
This results in the following residuals plot:
which reveals an obvious pattern (the residuals are positive on the far ends, where the line falls below the data, and negative toward the middle, where the line falls above the data). This helps explain why we want to see no pattern whatsoever in a residuals plot: ideally, the data should be scattered above and below the regression line without any sort of apparent regularity.
If a residuals plot reveals a pattern (even one not so striking as this), we should re-evaluate our assumption that the association between the variables is linear.
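For readers who want to check the pattern away from the calculator, here is a minimal Python sketch that fits a least-squares line to the Snohomish County data and prints the sign of each residual. The helper name least_squares is an illustrative choice, not something the TI-84 provides; the formulas are the standard least-squares slope and intercept.

```python
# Snohomish County census populations, copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
pops = [67690, 78861, 88754, 111580, 172199,
        265236, 337720, 465642, 606024, 722400]


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx           # slope
    a = y_bar - b * x_bar   # intercept
    return a, b


a, b = least_squares(years, pops)

# Residual = actual - predicted, one per census year.
residuals = [y - (a + b * x) for x, y in zip(years, pops)]

for year, res in zip(years, residuals):
    sign = "+" if res > 0 else "-"
    print(year, sign, round(res))
```

The signs come out positive for the first two and last two census years and negative for the six years in between, which is the curved pattern described above.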
A caution about correlation
Note also that the calculator reports a correlation of r = 0.95, which seems quite high. But this does not tell us anything about the linearity of the association. Correlation only measures the strength of a linear association. If the association is not linear, the value of the correlation is meaningless. (Anscombe's data sets help drive home this point.)
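As a rough check on that figure, the correlation for the Snohomish County data can be computed directly from its definition. This is a sketch in plain Python; the function name correlation is an illustrative choice, and the data are the census values listed earlier.

```python
from math import sqrt

# Snohomish County census populations, copied from the table above.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
pops = [67690, 78861, 88754, 111580, 172199,
        265236, 337720, 465642, 606024, 722400]


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(years, pops)
print(round(r, 2))  # rounds to 0.95, the value the calculator reports
```

A value this close to 1 comes from data that clearly curve, which is why r on its own cannot be used to judge whether a linear model is appropriate.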
Exercises
1. [OIS 7.2] Patterns in the residuals Two plots of residuals remaining after fitting a linear model to two different sets of data appear below. Describe important features and determine if a linear model would be appropriate for each data set. Explain your reasoning.
2. [OIS 7.1] Visualize the residuals The scatterplots shown below each have a superimposed regression line. If we were to construct a residual plot (residuals versus x) for each, describe what those plots would look like.
3. [OIS 7.6] Husbands and wives The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife's age plotted against her husband's age, and the plot on the right shows wife's height plotted against husband's height.
a) Describe the relationship between husbands' and wives' ages.
b) Describe the relationship between husbands' and wives' heights.
c) Which plot shows a stronger correlation? Explain your reasoning.
d) Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands' and wives' heights?