
Ch. 8 Resources
Chapter 8: Linear Regression
In the previous chapter we learned how to make a scatterplot of two quantitative variables, to check the conditions for correlation, and to compute the correlation. Now we go a few steps further, to find and interpret the equation of a linear model that will best fit the data, and examine the residuals. As before, we'll concentrate here on using technology to perform the computations.
Regression on the TI-84
Let's continue working with the size and 2007 assessed value data from the property tax data set (in the houses.txt file) that we considered in the Chapter 7 Resources. Follow the previous instructions to use LinReg(a+bx) to compute the correlation and the calculator will also give you the regression equation:
The calculator displays the equation we want as y = a + bx but it should be written as `hat y = a + bx`. The calculator can't display `hat y` so it uses y instead; you should always write `hat y` when working with pencil and paper, or type "y-hat" (or "assess-hat" or whatever your response variable happens to be called, followed by a "-hat") when typing online. In the case of our house data, our equation should be written as:
`hat y = 249 + 0.038x`
Also note that in the text the regression equation is not written as `hat y = a + bx` but rather as `hat y = b_0 + b_1 x`. We should do this too, so just remember to mentally convert from TI-84 notation to Statistics notation when you are working with the calculator to solve a problem.
Even better, we can write the equation as:
`hat{assess} = 249 + 0.038 times size`
This way we won't forget what the variables x and y refer to. (If you decide to use x and y, instead of more intuitive variable names, you always need to clearly state what they represent.)
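For readers curious about what LinReg(a+bx) computes behind the scenes, here is a minimal Python sketch of the least-squares calculation. The five houses below are hypothetical (they are not taken from houses.txt) and the function name `linreg` is our own invention. Sizes are in square feet and assessed values are in thousands of dollars.

```python
from statistics import mean, stdev

def linreg(x, y):
    """Return (b0, b1, r) for the least-squares line y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x-bar, y-bar)
    r = sxy / (sxx * syy) ** 0.5 # correlation
    return b0, b1, r

# Hypothetical data, for illustration only
size = [1200, 1500, 1800, 2100, 2400]   # square feet
assess = [296, 304, 321, 329, 340]      # thousands of dollars

b0, b1, r = linreg(size, assess)
print(f"assess-hat = {b0:.1f} + {b1:.4f} * size   (r = {r:.3f})")

# The slope can also be written as r * s_y / s_x; the two values should agree
print(b1, r * stdev(assess) / stdev(size))
```

Using named lists such as `size` and `assess`, rather than `x` and `y`, serves the same purpose as writing the equation with variable names.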
What does the regression equation for our house data tell us? The slope is 0.038 with units of $1000/ft², or put more simply, $38 per square foot. We expect that for a house in this neighborhood, each additional square foot in size will be associated with an increase of about $38 in the assessed value, on average. If my neighbor's house is 100 square feet bigger than mine, I would predict that the assessed value of his home would be about $3800 more than the assessed value of my home. It's important to note, however, that this is just a prediction: his house will almost certainly not be valued exactly $3800 more than mine.
The intercept of 249 tells us that we would expect a house in this neighborhood of size 0 square feet to have an assessed value of about $249,000, but of course no house is 0 square feet! (You could argue that the $249,000 would represent the average value of a plot of land with no house, but there are no such properties in our data set, so this would involve extrapolation.)
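The interpretation of the slope and intercept can be checked with a few lines of Python. This is only a sketch: the function name `predict_assess` and the two house sizes are made up for illustration, while the coefficients 249 and 0.038 come from the regression equation above.

```python
def predict_assess(size_sqft):
    """Predicted assessed value (in $1000s) for a house of the given size (in ft^2)."""
    return 249 + 0.038 * size_sqft

# Two hypothetical houses that differ by 100 square feet
mine = predict_assess(2000)
neighbor = predict_assess(2100)
diff = neighbor - mine   # difference in thousands of dollars

print(f"Predicted difference: about ${diff * 1000:,.0f}")

# The intercept is the prediction at size 0, which is an extrapolation
print(predict_assess(0))
```

Any two sizes that differ by 100 square feet give the same predicted difference, because the slope of a line is constant.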
Graphing the Regression Line on the TI-84
Once we have the regression equation it is useful to plot this equation along with the scatterplot of the data. Before we do this, it is advisable to clear out any other equations that might be stored on the calculator. So first press the Y= key and then use the CLEAR key to clear out anything you see in the Y= menu:
Now run LinReg(a+bx) again, but after typing LinReg(a+bx) L1,L2 (but BEFORE you press ENTER) type , (another comma):
Then press VARS and move the cursor right so that Y-VARS is highlighted:
Next press ENTER to get to the FUNCTION menu:
Now press ENTER again so that LinReg(a+bx) L1,L2,Y1 is displayed on the screen:
Press ENTER one more time to run LinReg(a+bx). The output will be the same as before, but if you check the Y= menu you should now see the regression equation after Y1=:
To see the regression line graphed with the scatterplot, use ZoomStat:
Plotting the Residuals with the TI-84
To graph the residuals, do the following: First press STAT, then ENTER to get to the list editor. Move the cursor up to the name of the first list, then move it to the right past L6, where there should be an empty space for a new list name:
Now press 2ND and STAT (in other words, LIST) and scroll down (if necessary) until you see RESID:
Then press ENTER twice. The residuals should appear in this new list:
Now whenever you run LinReg(a+bx) the residuals for the new data set will automatically be stored in this list.
To create a scatterplot of the residuals, set up a plot in the STAT PLOT menu (you might want to turn off Plot1 and use Plot2 for the residual plot) with L1 (or whatever list contains your explanatory variable) for the Xlist and RESID for Ylist. (Follow the steps above to get to the name of the RESID list in the LIST menu.)
Finally, use ZoomStat to see the residual plot:
The plot looks reasonably boring, with no apparent pattern, which confirms our belief that the Straight Enough Condition is satisfied.
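The RESID list simply holds observed minus predicted values. As a rough sketch of that calculation in Python (the data are hypothetical and the helper names `fit_line` and `residuals` are our own):

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def residuals(x, y):
    """Residual = observed y minus predicted y, one per data point."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data, for illustration only
size = [1200, 1500, 1800, 2100, 2400]   # square feet
assess = [296, 304, 321, 329, 340]      # thousands of dollars

resid = residuals(size, assess)
for s, e in zip(size, resid):
    print(f"{s:5d}  {e:+.2f}")

# Least-squares residuals always sum to (essentially) zero
print(sum(resid))
```

Plotting `resid` against `size` with any plotting tool would give the same kind of residual plot as the one on the calculator.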
Regression with Data Desk
Regression equations and residuals with Data Desk are even easier. Once you have created a scatterplot of the data (as in the Chapter 7 Resources), click the hyperview menu of the scatterplot window and select Regression of assess vs size:
You should then see the Data Desk regression output:
Most of this information will remain a mystery to us until later in the course, but notice the numbers in the lower-left corner:
These are the intercept and slope of the regression line, giving us the same regression equation we got in the TI-84:
`hat{assess} = 249 + 0.038 times size`
There are two other quantities that we will use at the present time. We see that `R^2` = 67.2%:
This tells us that about 67% of the variability in the assessed values of the houses can be explained by the differences in the sizes of the houses. Notice that `R^2` = 67.2% = 0.672 ≈ `(0.820)^2` = `r^2`, where r is the correlation that we computed previously. By tradition, we use a lowercase r for the correlation and an uppercase R for `R^2`, which is called the coefficient of determination in some texts (we'll usually just call it `R^2` or "R-squared").
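The link between R-squared and the correlation can be verified numerically. The sketch below computes R-squared as the fraction of variation explained (one minus the ratio of residual variation to total variation) and compares it with the square of r. The data are hypothetical and the function name `r_and_r_squared` is our own.

```python
from statistics import mean

def r_and_r_squared(x, y):
    """Return the correlation r and R-squared computed as 1 - SSE/SST."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)   # total variation in y (SST)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Variation left over in the residuals (SSE)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5
    return r, 1 - sse / syy

# Hypothetical data, for illustration only
size = [1200, 1500, 1800, 2100, 2400]   # square feet
assess = [296, 304, 321, 329, 340]      # thousands of dollars

r, r_squared = r_and_r_squared(size, assess)
print(r ** 2, r_squared)     # the two values should agree

# The numbers quoted for the house data
print(round(0.820 ** 2, 3))  # 0.672
```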
We can also see from the Data Desk regression output that `s_e` = 9.481, or $9,481. This latter quantity (called just s on the calculator and computer) is the standard deviation of the residuals; the closer the residuals are to 0, the better the model will fit the data, and the smaller `s_e` will be. Since the residuals have the same units as the response (or y) variable (in this case the assessed value), we can compare this number `s_e` to `s_y`, the standard deviation of the assessed values. Using the TI-84 (1-VarStats L2) or Data Desk you can compute `s_y` ≈ 15.9, or $15,900. Since the standard deviation of the residuals is not that big when compared with the standard deviation of the assessed values (it is about 60% as large), this indicates that there should be only a moderate amount of scatter of the data on either side of the regression line (of course we can also see this just by looking at the scatterplot!) and hence the regression equation should do a decent job of predicting the assessed value of these houses based on their size.
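Here is a sketch of how the standard deviation of the residuals might be computed and compared with the standard deviation of the response. It assumes the usual convention of dividing the sum of squared residuals by n - 2; the data are hypothetical and the function name `residual_sd` is our own.

```python
from math import sqrt
from statistics import mean, stdev

def residual_sd(resid):
    """Standard deviation of the residuals, dividing by n - 2."""
    return sqrt(sum(e ** 2 for e in resid) / (len(resid) - 2))

# Hypothetical data, for illustration only
size = [1200, 1500, 1800, 2100, 2400]   # square feet
assess = [296, 304, 321, 329, 340]      # thousands of dollars

# Least-squares slope and intercept
x_bar, y_bar = mean(size), mean(assess)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(size, assess))
      / sum((x - x_bar) ** 2 for x in size))
b0 = y_bar - b1 * x_bar

resid = [y - (b0 + b1 * x) for x, y in zip(size, assess)]
s_e = residual_sd(resid)
s_y = stdev(assess)

print(f"s_e = {s_e:.2f}, s_y = {s_y:.2f}, ratio = {s_e / s_y:.2f}")

# The same ratio for the values quoted for the house data
print(9.481 / 15.9)
```

A ratio well below 1 means the regression line leaves much less scatter than simply predicting the mean assessed value for every house.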
Speaking of the residuals, they are just a click away in Data Desk. Click the hyperview menu of the regression window and select Scatterplot residual vs predicted:
This gives us a slightly different scatterplot than the TI-84 (where we graphed the residuals vs. the explanatory data values) but it reveals the same thing:
As before, this is a reasonably boring plot, with no apparent pattern, which confirms our belief that the Straight Enough Condition is satisfied.
Homework
Work the following exercises in Chapter 8: 1–9 odd, 15, 19, 21, 23, 27, 39, 41, 45, 47, 55 and 59.
Errata
The data set for the Burger King menu items introduced on page 194 can be found on the DVD (look for the file called Ch08_BK_menu_items.txt).
The W's in the margin on page 194 are missing the When.
On page 196, the equation `hat{Fat} = 6.8 + 0.97 times Protein` is computed via technology, although this is not indicated in the text.
On page 197, it's not clear where the formula `b_1 = r(s_y)/(s_x)` comes from; this can be derived using algebra.
The wildfire data set introduced on page 199 is on the DVD, even though this isn't mentioned in the text.
On page 200, even though the authors work out the slope and intercept using the formulas, we will almost always compute the slope and intercept directly using technology.
The first equation on page 201 should read `hat z_y = r z_x ` (not `z_y`).
Exercise 7 on page 216 should read "variation in fiber" (not "amount of fiber").
Exercise 8 on page 216 should read "by the regression on horsepower" (not "by the horsepower").
The answer in the back of the book for part b of Exercises 15 and 16 should read: "The units of slope are " (not "Slope is").
Part d of Exercise 35 on page 218 should read "far off is the prediction based on the model in part b from" (rather than "prediction in part b").
In part d of Exercise 39, the standard deviation of the math scores should be 98.1 (not 96.1).
The orange T symbol is missing from Exercise 41; the data set is on the DVD.
The answer in the back of the book for part e of Exercise 41 should be 559.6 (not 359.6).
At the end of the first paragraph in Exercise 55, "°C" should appear in parentheses (not just "C"); in part g, it's not clear which year "this year" refers to.
ActivStats
Work the activities on pages 8-1 through 8-3 in the ActivStats lesson book, as time permits.
Additional Resources
- Describing Relationships
- Episode 8 from Against All Odds features a discussion of the linear regression model, while Episode 9 discusses the meaning of `R^2`.
- Carnegie Mellon: Introduction to Statistics
- Carnegie Mellon's open source statistics course includes a lesson called "Examining Relations" that includes a discussion of linear regression.
- Sofia: Elementary Statistics
- Lesson 12.3 of the Sofia Open Content Initiative's Elementary Statistics course includes a discussion of the regression equation and Lesson 12.5 discusses making predictions. (Some of the terminology may be unfamiliar here since this course covers regression far later in the game than we do.)
- TI-83 Resource: Linear Regression
- Instructions for using the TI-83 for regression analysis.
- Scatterplot, correlation and regression on the TI-83/84
- Instructions for using LinReg on the TI-84.
- LinReg tutorial
- A Flash tutorial on using the TI-83 for regression analysis, using data about the Seattle Mariners. (Ignore the discussion of "critical values in Table A-6.")
- Least Squares
- A Java applet that helps visualize the meaning of least squares regression.
- Least Squares Down and Dirty
- An exposition of the algebra behind the least squares regression formulas.