Review Problems: Week 3

1. Real Estate Craigslist is a Web site that allows users to post online classified advertisements at no charge (with the exception of job postings in certain metropolitan areas). The table below shows the street address, size (in square feet), asking price (in dollars) and number of bedrooms (abbreviated BR) for nine houses located in the city of Lynnwood listed for sale on Craigslist on October 9, 2011.

address            size  price   BR
18712 57th Ave W   1805  349950  3
19014 24th Ave W   2404  329900  5
3203A 204th St SW  1912  250000  4
17112 6th Ave W    3200  509797  4
19631 9th Pl W     2369  339950  4
17402 62nd Ave W   1200  185000  3
21011 54th Ave W   1660  136950  3
14018 20th Pl W    2244  254950  4
14517 40 Ave W     2450  335000  3

a) If appropriate, compute the regression equation for a linear model to predict the price of a house based on its size. (Explain why such a model would, or would not, be appropriate.)

b) If appropriate, predict the price of a 1560 sq. ft. house in the city of Lynnwood.

c) If appropriate, predict the price of a 560 sq. ft. house in the city of Lynnwood.

d) What are the units for the slope of the regression line?

e) Interpret the value of the slope in terms of Lynnwood houses.

f) What are the units for the intercept of the regression line?

g) If appropriate, interpret the meaning of the intercept.

h) Compute the correlation.

i) Compute r² and explain what this number represents.

j) Compute the residual for the house at 17402 62nd Ave W.

k) Examine a residuals plot and describe any pattern you see. What does the pattern (or lack thereof) tell you?
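
Note: the computations in parts a and h through k can be carried out by hand, with a calculator, or in software. The sketch below shows one possible way to do them in plain Python using the data from the table above. The function name least_squares and the variable names are illustrative choices, not part of the assignment, and the code does not decide whether a linear model is appropriate; that judgment still has to come from a scatterplot and a residuals plot.

```python
# Sizes (sq. ft.) and asking prices (dollars) from the Problem 1 table.
sizes = [1805, 2404, 1912, 3200, 2369, 1200, 1660, 2244, 2450]
prices = [349950, 329900, 250000, 509797, 339950, 185000, 136950, 254950, 335000]


def least_squares(x, y):
    """Return (slope, intercept, r) for the line y-hat = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = least_squares(sizes, prices)
r_squared = r ** 2

# Residual = observed price - predicted price, here for the 1200 sq. ft.
# house at 17402 62nd Ave W (the sixth row of the table, index 5).
residual = prices[5] - (intercept + slope * sizes[5])

# Residuals for every house; plot these against size to look for a pattern.
residuals = [p - (intercept + slope * s) for s, p in zip(sizes, prices)]

print(f"price-hat = {intercept:.2f} + {slope:.2f} * size")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, residual = {residual:.2f}")
```

The same function can be reused for the other data sets in this review by swapping in different lists.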

2. Toyota Prius The Prius is a popular gas-electric hybrid car manufactured by Toyota. The table below shows the VIN (vehicle identification number), color, age (in years), mileage (in miles) and asking price (in dollars) for 13 used Toyota Prius automobiles advertised for sale on the Web site of The Seattle Times on January 23, 2011.

VIN               age  color  mileage  price
JTDKN3DU9A0056349   1  black     9277  28995
JTDKN3DU8A0165157   1  black     4180  28995
JTDKN3DU1A0057303   1  blue     32105  25995
JTDKN3DU9A0147198   1  gray      8129  24995
JTDKB20U197821193   2  pewter   28434  23995
JTDKB20U683348798   3  green    40762  22995
JTDKB20U187716331   3  green    24531  22995
JTDKB20U387727363   3  blue     16262  22995
JTDKN3DU8A0059050   1  silver   32830  21995
JTDKB20U383347267   3  gray     32604  21995
JTDKB20U583417996   3  gray     43827  18995
JTDKB20U697840628   2  white    24632  20995
JTDKB20U297880205   2  white    33651  18680

a) If appropriate, compute the regression equation for a linear model to predict the price of a Prius based on its mileage. (Explain why such a model would, or would not, be appropriate.)

b) If appropriate, predict the price of a Prius with 120,000 miles.

c) If appropriate, predict the price of a Prius with 12,000 miles.

d) What are the units for the slope of the regression line?

e) Interpret the value of the slope in terms of Prius prices.

f) What are the units for the intercept of the regression line?

g) If appropriate, interpret the meaning of the intercept.

h) Compute the correlation.

i) Compute r² and explain what this number represents.

j) Compute the residual for the Prius with VIN JTDKB20U683348798.

k) Examine a residuals plot and describe any pattern you see. What does the pattern (or lack thereof) tell you?

l) In addition to the 13 automobiles in the data set provided above, The Seattle Times also listed two significantly older Toyota Prius automobiles:

VIN               age  color  mileage  price
JT2BK12U630070267   8  green    83996  10995
JT2BK18U720060613   9   blue   110919   8995

If we included these two cars in the data set, would it be appropriate to construct a linear model to predict price based on mileage?

3. More Priuses On October 17, 2007, the classified ads on the Web site of The Seattle Times listed the following 14 used Toyota Prius automobiles for sale; the data set below shows the year, color, mileage (in miles) and asking price (in U.S. dollars) for each car.

year  color  mileage  price
2006  green    17043  25995
2007  gray     12628  24980
2005  maroon   24039  24885
2005  silver   48226  23995
2006  black    10522  22995
2004  silver   66345  21995
2007  white     5611  21995
2005  gold     24479  21595
2004  white    14618  20995
2005  silver   53699  20980
2001  unknown 171700   8300
2004  silver   47649  17995
2003  white    39600  17500
2005  black   103126  16995

a) If appropriate, compute the regression equation for a linear model to predict the price of a Prius based on its mileage. (Explain why such a model would, or would not, be appropriate.)

b) If appropriate, predict the price of a Prius with 120,000 miles.

c) If appropriate, predict the price of a Prius with 310,000 miles.

d) What are the units for the slope of the regression line?

e) Interpret the value of the slope in terms of Prius prices.

f) What are the units for the intercept of the regression line?

g) If appropriate, interpret the meaning of the intercept.

h) Compute the correlation.

i) Compute r² and explain what this number represents.

j) Compute the residual for the maroon Prius.

k) Examine a residuals plot and describe any pattern you see. What does the pattern (or lack thereof) tell you?

l) If we converted the mileage numbers from miles to kilometers and the prices from dollars to euros, how would this affect the correlation?
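
Note: the idea behind part l can be checked numerically. The sketch below computes the correlation for the data in the table above before and after converting units. The exchange rate EUROS_PER_DOLLAR is an assumed value chosen only for illustration, and the helper name correlation is not part of the assignment. The code says nothing about whether computing a correlation is appropriate for this data set; it only compares the two numbers.

```python
# Mileage (miles) and asking price (U.S. dollars) from the Problem 3 table.
mileage_mi = [17043, 12628, 24039, 48226, 10522, 66345, 5611,
              24479, 14618, 53699, 171700, 47649, 39600, 103126]
price_usd = [25995, 24980, 24885, 23995, 22995, 21995, 21995,
             21595, 20995, 20980, 8300, 17995, 17500, 16995]

KM_PER_MILE = 1.609344     # exact by definition
EUROS_PER_DOLLAR = 0.70    # assumed exchange rate, for illustration only


def correlation(x, y):
    """Return the correlation coefficient r of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


# Rescale both variables by positive constants.
mileage_km = [m * KM_PER_MILE for m in mileage_mi]
price_eur = [p * EUROS_PER_DOLLAR for p in price_usd]

r_original = correlation(mileage_mi, price_usd)
r_converted = correlation(mileage_km, price_eur)
print(r_original, r_converted)
```

Any other positive exchange rate could be substituted for the assumed one to repeat the comparison.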

4. Inkjet printers For their May 2005 issue, the editors of Consumer Reports compared the cost and effectiveness of a variety of inkjet printers. The following table lists the model, retail price (in dollars) and the text speed (in pages per minute, or ppm) for the 13 top-ranked models.

model                             price  speed
HP Deskjet 6540                     130   11.0
Canon Pixma iP4000                  140   10.0
HP PhotoSmart 7760                  150    6.0
HP Deskjet 5850                     235    6.0
HP PhotoSmart 7960                  230    6.0
HP PhotoSmart 8450                  245    7.0
Canon Pixma iP5000                  190    9.0
Canon Pixma iP2000                   80   10.0
Canon Pixma iP8500                  345    4.5
HP Deskjet 6127                     250    7.0
Lexmark P915 Photo                  135    9.0
Epson Stylus Photo R800             375    2.5
Lexmark Color Jetprinter Z816        90    9.5

a) Before computing the correlation between price and speed, we should check three things. List these conditions and in each case indicate whether or not the condition has been satisfied for this data.

b) Regardless of your answer to part a, compute the correlation between price and speed.

c) Compute r² and explain the meaning of this number in the context of this problem using a complete sentence or two.

For the remainder of this problem, assume that a linear model is appropriate.

d) Find the equation of the regression line for this data. Use appropriate notation.

e) Explain the meaning of the slope of the line, or state why it is meaningless in this context.

f) Explain the meaning of the intercept of the line, or state why it is meaningless in this context.

g) Compute the residual for the printer that costs $135.

h) If appropriate, predict the text speed for an inkjet printer that costs $800; if it is not appropriate, use a complete sentence to explain your answer.

i) If appropriate, predict the text speed for an inkjet printer that costs $200; if it is not appropriate, use a complete sentence to explain your answer.

j) Examine a residuals plot and describe any pattern you see. What does the pattern (or lack thereof) tell you?

5. [OIS 7.6] Husbands and wives The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the ages (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife's age plotted against her husband's age, and the plot on the right shows wife's height plotted against husband's height.

a) Describe the relationship between husbands' and wives' ages.

b) Describe the relationship between husbands' and wives' heights.

c) Which plot shows a stronger correlation? Explain your reasoning.

d) Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands' and wives' heights?

e) What would be the correlation between the ages of husbands and wives if men always married women who were

i) 3 years younger than themselves?

ii) 2 years older than themselves?

iii) half as old as themselves?

f) Would it be appropriate to construct a linear model for the association between husbands' ages and wives' ages?

g) Assuming that such a model would be appropriate, computer output from a regression analysis reports the intercept of the regression line as 1.5740 and the slope as 0.9112. Write down an equation for the regression line.

h) If appropriate, predict the age of the wife of a 55-year-old man in Great Britain. (If this is not appropriate, explain.)

i) If appropriate, predict the age of the wife of an 85-year-old man in Great Britain. (If this is not appropriate, explain.)

j) If appropriate, predict the age of the wife of a 45-year-old man in Canada. (If this is not appropriate, explain.)

k) The regression analysis also reports that r² = 0.88 for this data set. Explain the meaning of this number.

l) What is the correlation of ages in this data set?
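
Note: parts g through l work from reported regression output rather than raw data. The sketch below shows one way to turn the reported intercept, slope and r² into a prediction function and a correlation. It assumes the husband's age is the explanatory variable and the wife's age is the response, as parts h through j suggest; the function name predicted_wife_age is an illustrative choice. The code cannot tell you whether a given prediction is appropriate, so that question still needs a written answer.

```python
import math

INTERCEPT = 1.5740   # reported intercept, in years
SLOPE = 0.9112       # reported slope, in years of wife's age per year of husband's age
R_SQUARED = 0.88     # reported r^2


def predicted_wife_age(husband_age):
    """Predicted wife's age (years) from the reported regression line."""
    return INTERCEPT + SLOPE * husband_age


# r is a square root of r^2 and always has the same sign as the slope.
r = math.copysign(math.sqrt(R_SQUARED), SLOPE)

print(f"Predicted wife's age for a 55-year-old husband: {predicted_wife_age(55):.2f}")
print(f"r = {r:.3f}")
```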

6. Brendan Ryan The following graphical display, which shows the April 2011 batting average vs. the May 2011 batting average for American League hitters with more than 50 at-bats during each month, comes from The Signal and the Noise: Why So Many Predictions Fail—But Some Don't by Nate Silver (Penguin, 2012)

 

a) What type of graphical display is this?

b) Describe any association you see in this display.

c) Would it be appropriate to compute the correlation between these two variables? Explain.

d) If you did compute the correlation, would it be positive or negative?

e) If you did compute the correlation, would it be close to 0 or close to 1?

f) Would it be appropriate to construct a linear model for the relationship between these two variables? Explain.

g) Would a linear model for the relationship between these two variables accurately predict a player's batting average in May based on their batting average in April? Explain.

h) Brendan Ryan (a shortstop for the Seattle Mariners) batted .184 in April and .384 in May. What term best applies to Brendan Ryan in this graphical display?

i) If we were to construct a linear model for this relationship, would Brendan Ryan have:

i) high leverage?

ii) a positive residual or a negative residual?

iii) a large residual or a small residual?

j) Are there any other players as unusual as Brendan Ryan?

7. James Bond The table below lists the following information about the 23 James Bond films produced by Eon Productions over the past 50 years: production order; year of release; total box office gross (in millions of U.S. dollars); budget (in millions of U.S. dollars); adjusted box office gross (in millions of U.S. dollars, adjusted to the 2008 Consumer Price Index); and the duration (in seconds) of the title song for each film. The information comes from Wikipedia and Amazon.com (for the song duration).

no  title                           year  gross budget adj BO song
 1  Dr. No                          1962   59.6    1.2  425.5  107
 2  From Russia With Love           1963   78.9    2.5  555.9  153
 3  Goldfinger                      1964  124.9    3.5  868.7  168
 4  Thunderball                     1965  141.2   11.0  966.4  182
 5  You Only Live Twice             1967  111.6    9.5  720.4  165
 6  On Her Majesty's Secret Service 1969   87.4    7.0  513.4  193
 7  Diamonds Are Forever            1971  116.0    7.2  617.5  161
 8  Live and Let Die                1973  161.8   12.0  785.7  193
 9  The Man With the Golden Gun     1974   97.6   13.0  426.8  154
10  The Spy Who Loved Me            1977  187.3   28.0  666.4  208
11  Moonraker                       1979  210.3   34.0  624.5  188
12  For Your Eyes Only              1981  202.8   28.0  481.0  183
13  Octopussy                       1983  187.5   27.5  405.9  182
14  A View to a Kill                1985  157.8   30.0  316.2  214
15  The Living Daylights            1987  191.2   40.0  362.9  283
16  Licence to Kill                 1989  156.2   32.0  271.6  251
17  GoldenEye                       1995  353.4   60.0  500.0  209
18  Tomorrow Never Dies             1997  346.6  110.0  465.6  292
19  The World Is Not Enough         1999  390.0  135.0  504.7  237
20  Die Another Day                 2002  456.0  142.0  546.5  276
21  Casino Royale                   2006  599.2  150.0  640.8  241
22  Quantum of Solace               2008  586.1  230.0  586.1  263
23  Skyfall                         2012      ?  150.0      ?  286

a) How many cases are included in this data set?

b) How many variables?

c) List each variable and classify it according to type (categorical, ordinal, quantitative, identifier).

d) Create an appropriate graphical display to investigate a possible relationship between year of release and total box office gross. (You'll need to omit Skyfall, which will not premiere until October 23, 2012.)

e) Which variable was the explanatory variable in your display?

f) Which was the response variable?

g) Describe any association you see in this display.

h) Convert the years to "years since Dr. No" (so that Skyfall, for example, would be 50 years after Dr. No) and create a similar graphical display. Does the strength of the association change?

i) If appropriate, compute the correlation between year and total box office gross. (If not appropriate, explain.)

j) If you computed the correlation between the years since Dr. No (instead of year) and total box office in dollars (instead of millions of dollars) would the correlation change?

k) If appropriate, find a linear regression model for the association between year and total box office gross. (If not appropriate, explain.)

8. James Bond will return Refer to the data set from the previous problem.

a) Create an appropriate graphical display to investigate total box office gross and budget. (You'll need to omit Skyfall, which will not premiere until October 23, 2012.)

b) Which variable was the explanatory variable in your display?

c) Which was the response variable?

d) Describe any association you see in this display.

e) If appropriate, compute the correlation between total box office gross and budget. (If not appropriate, explain.)

f) If you computed the correlation, compute r² and use a complete sentence to explain its meaning.

g) If appropriate, find a linear regression model for the association between total box office gross and budget. (If not appropriate, explain.)

h) If appropriate, predict the total box office gross for Skyfall, which has a reported budget of $150,000,000. (If not appropriate, explain.)

i) If a linear model is appropriate, compute the residual for GoldenEye.

9. Connery, Lazenby, Moore, Dalton, Brosnan, Craig Over the past 50 years, six actors have portrayed Agent 007 in the 23 Eon James Bond films: Sean Connery, George Lazenby, Roger Moore, Timothy Dalton, Pierce Brosnan and Daniel Craig. The longest gap between Bond films was between 1989 (Dalton's second—and final—appearance as Bond in Licence to Kill) and 1995 (Pierce Brosnan's debut as 007 in GoldenEye). Refer to the data set used in the previous two exercises.

a) Create a scatterplot to investigate a possible association between year of release and adjusted box office revenue.

b) Describe any association you observe in the scatterplot.

c) Is a linear model appropriate for this association?

d) Would a linear model be appropriate for the pre-GoldenEye Bond films? If so, find the equation of the regression line.

e) If appropriate, use your model from part d to predict the adjusted box office revenue for a Bond film released in 1976, had one been made that year.

f) If appropriate, use your model from part d to predict the adjusted box office revenue for a Bond film released in 1991, had one been made that year.

g) If appropriate, use your model from part d to predict the adjusted box office revenue for Skyfall.

h) Would a linear model be appropriate for the post-Licence to Kill Bond films? If so, find the equation of the regression line.

i) If appropriate, use your model from part h to predict the adjusted box office revenue for Skyfall.

10. This is the end At 00:07 BST on October 5, 2012, British singer Adele released "Skyfall," the title song to the newest James Bond film, on her Web site. Within 10 hours, it became the number one iTunes download, and it later debuted at No. 3 on the Billboard Hot 100 chart. Refer to the data set employed in the previous three problems.

a) Create a scatterplot to investigate a possible association between year of release and duration of title song.

b) Which variable was the explanatory variable in your display?

c) Which was the response variable?

d) Describe any association you see in this display.

e) If appropriate, compute the correlation between these variables. (If not appropriate, explain.)

f) Dr. No, the first James Bond film, did not feature a title song (the data set lists the duration for "The James Bond Theme," an instrumental credited to Monty Norman and arranged by John Barry, as performed by The John Barry Orchestra) and On Her Majesty's Secret Service featured a John Barry instrumental track for the opening credits (the data set lists the duration for "We Have All the Time in the World," written by Barry and Hal David, which Louis Armstrong sings over the film's end titles). Does either of these songs appear to be an outlier? If so, is there a legitimate reason for omitting it from the data set?

g) If you identified any songs as outliers in part f, and had a legitimate reason to omit them, do so and construct another scatterplot.

h) If appropriate, compute the correlation now. (If not appropriate, explain.)

i) If appropriate, find a linear regression model for the association between year and song duration. (If not appropriate, explain.)

j) If appropriate, predict the duration for a James Bond title song released in 1993, had a 007 film been released that year. (If not appropriate, explain.)

k) If a linear model is appropriate, compute the residual for "The Living Daylights," the title song to the film of the same name, written by Paul Waaktaar-Savoy and John Barry, and performed by A-ha.

11. Incorrect statements Sometimes statistics students write down inaccurate or incorrect interpretations of regression analyses and correlation computations. For each statement below, explain what is inaccurate or incorrect and then rewrite the statement to fix the problem.

a) A correlation of r = 0.96 between two variables tells us that a linear model is appropriate for those variables.

b) An r² value of 0.81 means that 81% of the response variable is explained by the explanatory variable.

c) A regression equation of `hat(y) = 56.8 + 3.59x` means that each increase of 1 unit in the explanatory variable causes the response variable to increase by 3.59 units.

d) A correlation of r = -0.73 is weaker than a correlation of r = 0.56.

e) A correlation of r = 0.12 between two variables tells us that a linear model is not appropriate for those variables.
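
Note: when examining statement a, it can help to see a concrete case. The sketch below uses made-up data (not taken from any problem in this review) in which y is exactly the square of x, so the relationship is curved by construction. It fits the least-squares line, computes r, and lists the residuals so that their pattern can be inspected.

```python
# Hypothetical data: a perfectly curved (quadratic) relationship.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares line and correlation.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / (sxx * syy) ** 0.5

# Residual = observed y - predicted y for each point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"r = {r:.3f}")
for xi, res in zip(x, residuals):
    print(xi, round(res, 2))
```

For this made-up data r comes out near 0.97, yet the residuals are positive at both ends of the range and negative in the middle, a curved pattern that the correlation alone does not reveal.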

12. Woodway real estate The Northwest Multiple Listing Service (NWMLS) operates a database of homes and property for sale throughout Washington State and provides this information to realtors and real estate Web sites. The table below includes the street address, year built, number of bedrooms (BR), number of bathrooms (BA), size (in square feet) and asking price (in thousands of dollars) for the ten houses located in the city of Woodway listed for sale on NWMLS on October 16, 2012.

address                year  BR  BA  size  price
24323 Timber Lane      1923   4   3  3000    729
22109 Woodway Park Rd  1950   4   2  2430    769
23407 Woodway Park Rd  1921   3   2  4625    995
11402 239th Pl SW      1962   5   3  2868    800
23503 Timber Lane      1973   4   3  4231   1100
23920 115th Pl W       2000   4   5  4577   1025
11312 S Dogwood Lane   1940   5   6  6527   1650
22714 106th Ave W      2003   6   7  7746   1750
22505 Woodway Park Rd  1960   4   4  4339   1840
24120 114th Ave W      1964   5   3  3468    800

a) Create an appropriate graphical display to investigate a possible association between size of these houses and their asking prices.

b) Describe the association you see in the display, noting any unusual features.

c) Is it appropriate to compute the correlation for size and price? Explain.

d) Is the house at 22505 Woodway Park Road an outlier?

e) If you answered yes to the previous question, does that house have high leverage?

f) If you answered yes to part d, is that house an influential point?

g) Create an appropriate graphical display to investigate a possible association between the year each house was built and its size.

h) Which variable is most likely to be the explanatory variable and which the response variable in this relationship? Explain.

i) Describe any association you see in the display, noting any unusual features.

j) Is it appropriate to compute the correlation for year and size? Explain.

13. Return to Woodway In the data set from the previous problem, the property at 22505 Woodway Park Road also includes a guest house, whose size may not be included in the 4,339 square feet listed for this property.

a) Does the information provided above constitute a legitimate reason for omitting this house from further analysis of this data set? Explain.

b) If we omit this house from the data set, would it be appropriate to compute the correlation between size and price?

c) Compute the correlation between size and price with the property at 22505 Woodway Park Road omitted.

d) Compute r² and use a complete sentence to explain the meaning of this number in the context of the remaining nine houses.

e) If we omit the house at 22505 Woodway Park Road, would it be appropriate to construct a linear model for the association between size and price? Explain.

f) Find the regression equation relating size and price. Use proper notation.

g) If appropriate, predict the asking price for a 5,300 sq. ft. home in Woodway.

h) If appropriate, predict the asking price for a 1,300 sq. ft. home in Woodway.

i) Compute the residual for the house at 24323 Timber Lane.
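
Note: one way to see how much a single point matters is to compute the correlation twice, once with every house and once with the house at 22505 Woodway Park Rd left out. The sketch below does this for the Woodway table from the previous problem. The helper name correlation and the variable names are illustrative choices, and the comparison by itself does not justify omitting the house; that still needs the reasoning asked for in part a.

```python
# (address, size in sq. ft., asking price in thousands of dollars)
houses = [
    ("24323 Timber Lane", 3000, 729),
    ("22109 Woodway Park Rd", 2430, 769),
    ("23407 Woodway Park Rd", 4625, 995),
    ("11402 239th Pl SW", 2868, 800),
    ("23503 Timber Lane", 4231, 1100),
    ("23920 115th Pl W", 4577, 1025),
    ("11312 S Dogwood Lane", 6527, 1650),
    ("22714 106th Ave W", 7746, 1750),
    ("22505 Woodway Park Rd", 4339, 1840),
    ("24120 114th Ave W", 3468, 800),
]
OMITTED = "22505 Woodway Park Rd"


def correlation(x, y):
    """Return the correlation coefficient r of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


kept = [h for h in houses if h[0] != OMITTED]

r_all = correlation([h[1] for h in houses], [h[2] for h in houses])
r_kept = correlation([h[1] for h in kept], [h[2] for h in kept])

print(f"r with all 10 houses:       {r_all:.3f}")
print(f"r with the house omitted:   {r_kept:.3f}")
```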

14. [EESEE] BAC A study conducted during February 1986 in a student dormitory at Ohio State University had 16 student volunteers blow into a breathalyzer to verify their blood alcohol content (BAC) was 0. They each then drew a number (from 1 through 9) from a bowl and drank that number of beers. Thirty minutes after drinking their last beer, an OSU police officer measured their BAC (in g/dl). The data for these 16 students appears below.

student  1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
beers    5     2     9     8     3     7     3     5     3     5     4     6     5     7     1     4
BAC   0.10  0.03  0.19  0.12  0.04 0.095  0.07  0.06  0.02  0.05  0.07  0.10 0.085  0.09  0.01  0.05  

a) How many cases are included in this data set?

b) How many variables are included in this data set?

c) Which variable is the explanatory variable in this study?

d) Which variable is the response variable in this study?

e) Should you compute the correlation between student and beers? Explain.

f) Construct an appropriate graphical display to investigate the association being studied.

g) Describe the association apparent in your display.

h) Should you compute the correlation between the two variables used in your display? Explain.

i) If appropriate, compute the correlation.

j) If you computed the correlation, compute r² and explain its meaning in the context of this study.

k) If a student drank a number of beers 1.5 SDs above average, what would you predict about his or her BAC?
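
Note: part k can be approached in standardized units, using the fact that the least-squares line predicts a z-score for the response equal to r times the z-score of the explanatory variable. The sketch below applies that relationship to the data in the table above, assuming a linear model is judged appropriate. Variable names such as z_bac_predicted are illustrative choices, not part of the assignment.

```python
import statistics

# Beers consumed and measured BAC (g/dl) for the 16 students.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

n = len(beers)
mean_x, sd_x = statistics.mean(beers), statistics.stdev(beers)
mean_y, sd_y = statistics.mean(bac), statistics.stdev(bac)

# Correlation as the sum of products of z-scores divided by n - 1.
r = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(beers, bac)) / (n - 1)

# A student 1.5 SDs above the mean number of beers.
z_beers = 1.5
z_bac_predicted = r * z_beers                    # predicted BAC, in SDs above the mean
bac_predicted = mean_y + z_bac_predicted * sd_y  # the same prediction, in g/dl

print(f"r = {r:.3f}")
print(f"Predicted BAC: {z_bac_predicted:.2f} SDs above the mean ({bac_predicted:.3f} g/dl)")
```

Because r is less than 1 in absolute value, the predicted BAC is fewer than 1.5 SDs above the mean.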

15. Another round Refer to the data set from the previous problem.

a) Create a graphical display to investigate the association between number of beers consumed and BAC.

b) Is it appropriate to use a linear regression equation to model the association apparent in the scatterplot? Explain. (If not appropriate, stop here.)

c) Find the regression equation for this data.

d) If appropriate, use your equation to predict the BAC of a student who consumed 14.5 beers. (If not appropriate, explain.)

e) If appropriate, use your equation to predict the BAC of a student who consumed 4.5 beers. (If not appropriate, explain.)

f) Compute the residual for the student who consumed 6 beers.

g) Explain the meaning of the slope in the context of this study, or explain why the slope is meaningless.

h) Explain the meaning of the intercept in the context of this study, or explain why the intercept is meaningless.

i) Construct a residuals plot for this data.

j) Does the residuals plot confirm the appropriateness of a linear model for this data? Explain.

16. Florida 2000 The U.S. presidential election in the year 2000 resulted in a disputed outcome in the state of Florida, which led to a case before the U.S. Supreme Court (Bush v. Gore) that ultimately decided the winner of the presidency. This tab-delimited text file (which you should be able to open in any text editor or spreadsheet program such as Excel or Google Docs) contains vote totals for George W. Bush (the Republican candidate), Al Gore (the Democratic candidate), Ralph Nader (the Green Party candidate) and Pat Buchanan (the Reform Party candidate).

a) How many cases are in this data set?

b) How many variables are in this data set?

c) Create an appropriate graphical display to investigate a possible association between the number of votes received by Bush and the number received by Gore.

d) Describe any association you see in the display.

e) Does this association mean that increased turnout by Bush voters caused a corresponding increase in turnout by Gore voters?

f) Should you compute the correlation between Bush votes and Gore votes? Explain.

g) Create an appropriate graphical display to investigate a possible association between the number of votes received by Nader and the number received by Buchanan.

h) Describe any association you see in the display, taking care to mention any unusual features.

i) Should you compute the correlation between Nader votes and Buchanan votes? Explain.

17. Facebook A statistics student who reported she used Facebook "almost every day" wanted to know if use of a social network such as Facebook had an effect on college students' GPAs. On June 8, 2012, she asked two people seated at each table in the student lounge to provide answers to three questions: how many hours they spent on Facebook each week, their college GPA and whether or not they felt use of social media had an effect on their grades. The data she collected appears below.

hours  GPA  effect
    7  3.5     yes
   14  3.8     yes
   28  3.0     yes
   16  3.0     yes
   21  2.8     yes
   40  3.6     yes
    9  3.8     yes
   15  3.9      no
   40  2.0      no
   21  3.4     yes
   20  3.7     yes
   20  3.2     yes
   20  2.9     yes
   11  3.8      no
   15  3.8     yes
    8  3.5     yes
    3  3.1     yes
   13  3.9      no
    2  3.8     yes
    5  3.94    yes
    1  3.8      no
   20  3.0     yes
   14  3.1     yes
    7  3.3     yes
   14  3.2     yes
   18  3.0     yes
   20  3.0     yes
   25  3.6     yes
   15  3.2     yes
   20  3.1     yes
    1  3.2      no
    2  1.9      no
    5  2.5      no
   28  3.0     yes
   24  2.14     no
    1  2.5      no
   41  3.3     yes
   16  2.5     yes
    8  2.8     yes
   21  3.8      no

a) How many cases are included in this data set?

b) How many variables are included in this data set? Classify each as categorical, quantitative, ordinal or identifier.

c) Which of these variables might be considered explanatory?

d) Which might be considered response variables?

e) Create an appropriate graphical display to investigate a possible association between use of social media and GPA.

f) Describe any association evident in your display. Be sure to mention any unusual features.

g) Would it be appropriate to compute the correlation between the two variables in your display? Explain.

h) Construct graphical displays using only those who answered "yes" to the third question and only those who answered "no." Are any new associations evident? Would it be appropriate to compute the correlation for the variables in each of these new displays?

18. Track and field A statistics student who participated in the triple jump collected data about the top 15 female athletes (including herself), comparing their national qualifying marks (measured in meters) to the marks they jumped at the national meet. Some of this data, from the 2011-2012 indoor season, appears below.

surname     pre   meet
Ouedraogo 12.92  12.84
Zweifel   12.51  12.22
Yingling  12.41  12.03
Danville  12.29  12.10
Hewett    12.24  12.13
Segbor    12.14  12.13
Potter    12.06  12.40
Wyatt     12.05  11.75
Boyd      12.04  11.78
McDaniel  12.03  12.24
Bowens    12.01  11.91
Yates     12.00  11.87
Bemis     11.99  12.20
Bourne    11.95  11.62
Schmidt   11.95  11.91

a) Construct an appropriate graphical display to investigate an association between the pre-meet qualifying distances and the distances jumped at the national meet.

b) Describe any association evident in your display, and mention any unusual features.

c) Would it be appropriate to compute the correlation between the variables in your display? Explain.

d) Is there a significant outlier in the display? If so, do you have a legitimate reason to omit this outlier from your analysis?

e) If you did omit the outlier, would you expect the correlation to be weaker or stronger after omitting the outlier?

f) Compute the correlation with and without the outlier and compare your results to what you expected to find in part e.

g) Compute the differences between qualifying distance and meet distance for each athlete.

h) Construct a graphical display of those differences.

i) Is there a significant outlier in this display?

19. Subaru Outbacks The Web site cars.com contains listings for used cars from throughout the United States. The table below contains information for the 16 Subaru Outback automobiles listed for sale under $15,000 within 10 miles of Lynnwood, Washington, as of October 29, 2012. This information includes the Vehicle Identification Number (VIN), color, model year, mileage (in miles) and price (in dollars).

VIN                color         year mileage  price
4S4BP67C264323292  red           2006   96756  14995
4S4BP61C976312854  silver        2007   91920  13991
4S4BP61C657387115  white         2005  129118  11991
4S4BP61CX67355012  silver        2006   92355  10995
4S3BH686747631189  white         2004  106000  10988
4S4BP62C257301801  silver        2005  150596   9995
4S3BH675537622627  silver        2003  107779   9991
4S3BH686227664422  black cherry  2002  102603   9400
4S3BH806627663958  white         2002  140937   8995
4S3BH6865Y6672979  white         2000  144088   8787
4S3BE686147200243  green         2004  173993   8398
4S3BH6651Y6659124  green         2000  126872   7999
4S3BH665826621154  red           2002  142082   7998
4S3BH806827621324  white         2002  107627   7991
4S3BH675327658539  burgundy      2002  115234   7990
4S3BH686417636362  black         2001  142432   6900

a) Create a scatterplot to investigate a possible association between mileage and price.

b) Which variable did you select as the explanatory variable when creating your graphical display?

c) Which variable did you select as the response variable?

d) Is it appropriate to compute the correlation between mileage and price? Explain.

e) Compute the correlation.

f) Compute R² and explain what it means in the context of these automobiles.

g) If you switched the explanatory and response variables, would the correlation change? Explain.
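
One way to check a hand or calculator answer for parts e through g is sketched below in Python. It is only a sketch, not a required method: the function name pearson_r is an illustrative choice, and the lists simply copy the mileage and price columns from the table above. It computes the correlation both ways so that the effect of switching the explanatory and response variables can be seen directly.

```python
from math import sqrt

# Mileage (miles) and asking price (dollars) for the 16 Outbacks,
# copied in table order from problem 19.
mileage = [96756, 91920, 129118, 92355, 106000, 150596, 107779, 102603,
           140937, 144088, 173993, 126872, 142082, 107627, 115234, 142432]
price = [14995, 13991, 11991, 10995, 10988, 9995, 9991, 9400,
         8995, 8787, 8398, 7999, 7998, 7991, 7990, 6900]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(mileage, price)          # mileage as explanatory variable
r_swapped = pearson_r(price, mileage)  # roles of the variables switched

print(f"r = {r:.3f}")
print(f"R^2 = {r * r:.3f}")
print(f"r with variables switched = {r_swapped:.3f}")
```

Running the sketch does not replace looking at the scatterplot first: the correlation is only meaningful if the conditions in part d are satisfied.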

20. Outbacks again Refer to the data set from the previous problem.

a) Create a scatterplot to investigate an association between the mileage and price of these 16 automobiles.

b) Is it appropriate to construct a linear regression model for the association between mileage and price? Explain.

c) Find the linear regression equation that you could use to predict price based upon mileage for Subaru Outbacks.

d) What is the slope of the regression equation?

e) What are the units for the slope?

f) Use a complete sentence or two to explain the meaning of the slope in the context of Subaru Outbacks.

g) What is the intercept of the regression equation?

h) What are the units for the intercept?

i) Use a complete sentence or two to explain the meaning of the intercept in the context of Subaru Outbacks, or explain why such an interpretation of the intercept is meaningless.

j) If appropriate, use the model you constructed to predict the price of a Subaru Outback with 112,000 miles.

k) If appropriate, use the model you constructed to predict the price of a Subaru Outback with 12,000 miles.
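
The least-squares calculations in parts c through k can be checked with a short Python sketch such as the one below. The names fit_line, predict and within_observed_range are hypothetical helpers introduced here for illustration. The range check is one simple way to flag extrapolation and is an assumption of this sketch, not a rule stated in the problem.

```python
# Mileage (miles) and asking price (dollars) for the 16 Outbacks,
# copied in table order from problem 19.
mileage = [96756, 91920, 129118, 92355, 106000, 150596, 107779, 102603,
           140937, 144088, 173993, 126872, 142082, 107627, 115234, 142432]
price = [14995, 13991, 11991, 10995, 10988, 9995, 9991, 9400,
         8995, 8787, 8398, 7999, 7998, 7991, 7990, 6900]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(mileage, price)


def predict(miles):
    """Predicted price (dollars) for a car with the given mileage (miles)."""
    return intercept + slope * miles


def within_observed_range(miles):
    """True if the mileage lies between the smallest and largest observed mileage."""
    return min(mileage) <= miles <= max(mileage)


print(f"slope = {slope:.4f} dollars per mile")
print(f"intercept = {intercept:.2f} dollars")
for miles in (112_000, 12_000):
    note = "within the data" if within_observed_range(miles) else "extrapolation"
    print(f"{miles} miles: predicted price {predict(miles):.0f} dollars ({note})")
```

The sketch prints a prediction for both mileages. Whether each prediction is appropriate is still a judgment for parts j and k.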

21. Outback outliers Refer to the data set used in the previous two exercises. The Web site actually listed one more Subaru Outback for sale in Lynnwood:

VIN                color        year mileage  price
4S3BH6656Y7635735  red          2000   42712   9995

a) Construct a graphical display of mileage vs. price that includes this Outback with the other 16 cars.

b) If you included this Outback in the graphical display with the other 16, would it still be appropriate to compute the correlation? Explain.

c) If you included this Outback in the graphical display with the other 16, would it still be appropriate to construct a linear regression model to predict price based upon mileage? Explain.

d) Would you consider this car to be an outlier? Explain.

e) Would you consider this car to be a high-leverage point? Explain.

f) Would you consider this car to be influential? Explain.

g) This car was the only one among the 17 Outbacks listed for sale in Lynnwood that spent most of its time in Hawaii before being sold and shipped to the mainland for resale. Would this information about the car be sufficient to omit it from the full data set before constructing a linear regression model? Explain.
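
Parts e and f can be explored numerically with the Python sketch below. It uses the standard leverage formula for simple linear regression, h = 1/n + (x - mean of x)² / Sxx, which may go beyond what your course covers, and it compares the fitted line with and without the extra car. The helper names fit_line and leverages are hypothetical and chosen only for illustration.

```python
# The 16 Outbacks from problem 19 (mileage in miles, price in dollars).
mileage16 = [96756, 91920, 129118, 92355, 106000, 150596, 107779, 102603,
             140937, 144088, 173993, 126872, 142082, 107627, 115234, 142432]
price16 = [14995, 13991, 11991, 10995, 10988, 9995, 9991, 9400,
           8995, 8787, 8398, 7999, 7998, 7991, 7990, 6900]

# The same data with the extra red 2000 Outback from problem 21 appended last.
mileage17 = mileage16 + [42712]
price17 = price16 + [9995]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leverages(xs):
    """Leverage of each point in a simple linear regression on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


h = leverages(mileage17)
slope16, intercept16 = fit_line(mileage16, price16)
slope17, intercept17 = fit_line(mileage17, price17)

print(f"leverage of the extra car: {h[-1]:.3f}")
print(f"largest leverage among the other 16: {max(h[:-1]):.3f}")
print(f"slope without the extra car: {slope16:.4f} dollars per mile")
print(f"slope with the extra car:    {slope17:.4f} dollars per mile")
```

A large leverage shows only that the car's mileage is far from the others. Whether the car is influential depends on how much the fitted line changes when it is included.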

22. Outback residuals Refer to the data set from problem 19.

a) Construct a linear regression model to predict price based upon mileage.

b) Compute the residual for the black cherry Outback.

c) If you were selling an Outback, would you prefer that your car have a positive residual or a negative residual? Explain.

d) If you were buying an Outback, would you prefer that your car have a positive residual or a negative residual? Explain.

e) Construct a residuals plot for this data set.

f) What does the residuals plot tell you about the appropriateness of a linear model?
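
A residual is the observed price minus the price predicted by the model. The Python sketch below, offered as a check rather than a required method, computes all 16 residuals for the problem 19 data and prints them in order of mileage as a rough text stand-in for a residuals plot. The helper name fit_line and the colors list are illustrative assumptions; the colors are copied from the table so the black cherry Outback can be looked up by name.

```python
# Color, mileage (miles) and asking price (dollars) for the 16 Outbacks,
# copied in table order from problem 19.
colors = ["red", "silver", "white", "silver", "white", "silver", "silver",
          "black cherry", "white", "white", "green", "green", "red",
          "white", "burgundy", "black"]
mileage = [96756, 91920, 129118, 92355, 106000, 150596, 107779, 102603,
           140937, 144088, 173993, 126872, 142082, 107627, 115234, 142432]
price = [14995, 13991, 11991, 10995, 10988, 9995, 9991, 9400,
         8995, 8787, 8398, 7999, 7998, 7991, 7990, 6900]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(mileage, price)

# residual = observed price - predicted price
residuals = [y - (intercept + slope * x) for x, y in zip(mileage, price)]

i = colors.index("black cherry")
print(f"black cherry Outback residual: {residuals[i]:.0f} dollars")

# Residuals listed by increasing mileage, as a text version of a residuals plot.
for miles, resid in sorted(zip(mileage, residuals)):
    print(f"{miles:>7}  {resid:>8.0f}")
```

A positive residual means the asking price is above what the model predicts for that mileage; a negative residual means it is below.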

23. Ford Focus The Ford Focus is a compact car introduced to North America in 1999 for model year 2000. The table below shows the model year, mileage and asking price for all 14 used Ford Focus automobiles advertised for sale on the Web site of The Seattle Times on January 31, 2010.

year  mileage  price
2007    25426  14595
2008    49223  13991
2008    49028  13991
2008    27690  11994
2008    36216  11980
2002    71646  10991
2007    41107   9671
2002    83454   8991
2007    49443   7988
2007    34179   7499
2002    63439   7475
2005    43012   5400
2001    86681   4494
2002   113000   2000

a) Create a scatterplot to investigate a possible association between mileage and price.

b) Which variable did you select as the explanatory variable when creating your graphical display?

c) Which variable did you select as the response variable?

d) Is it appropriate to compute the correlation between mileage and price? Explain.

e) Compute the correlation.

f) Compute R² and explain what it means in the context of these automobiles.

g) If you switched the explanatory and response variables, would the correlation change? Explain.

h) Compute the correlation between age and price.

i) Which association is stronger: mileage and price, or age and price? Explain.

j) What should you have done before computing the correlation between age and price in part h?

24. Focuses (Foci?) again Refer to the data set from the previous problem.

a) Create a scatterplot to investigate an association between the mileage and price of these automobiles.

b) Is it appropriate to construct a linear regression model for the association between mileage and price? Explain.

c) Find the linear regression equation that you could use to predict price based upon mileage.

d) What is the slope of the regression equation?

e) What are the units for the slope?

f) Use a complete sentence or two to explain the meaning of the slope in the context of these automobiles.

g) What is the intercept of the regression equation?

h) What are the units for the intercept?

i) Use a complete sentence or two to explain the meaning of the intercept in the context of these automobiles, or explain why such an interpretation of the intercept is meaningless.

j) If appropriate, use the model you constructed to predict the price of a Ford Focus with 135,000 miles.

k) If appropriate, use the model you constructed to predict the price of a Ford Focus with 35,000 miles.

25. Focus residuals Refer to the data set from the previous two problems.

a) Construct a linear regression model to predict price based upon mileage.

b) Compute the residual for the 2005 Ford Focus.

c) If you were buying a Ford Focus, would you prefer that your car have a positive residual or a negative residual? Explain.

d) If you were selling a Ford Focus, would you prefer that your car have a positive residual or a negative residual? Explain.

e) Construct a residuals plot for this data set.

f) What does the residuals plot tell you about the appropriateness of a linear model?

26. Honda Odyssey The table below shows the year, make, model, mileage (in thousands of miles) and asking price (in US dollars) for each of 13 used Honda Odyssey minivans advertised for sale on the Web site of the Seattle P-I on April 25, 2005.

year  make   model        mileage  price
2004  Honda  Odyssey EXL       20  26900
2004  Honda  Odyssey EX        21  23000
2002  Honda  Odyssey           33  17500
2002  Honda  Odyssey           41  18999
2001  Honda  Odyssey EX        43  17200
2001  Honda  Odyssey EX        67  18995
2000  Honda  Odyssey LX        46  13900
2000  Honda  Odyssey EX        72  15250
2000  Honda  Odyssey EX        82  13200
2000  Honda  Odyssey           99  11000
1999  Honda  Odyssey           71  13900
1998  Honda  Odyssey           85   8350
1995  Honda  Odyssey EX       100   5800

a) Create a scatterplot to investigate a possible association between mileage and price.

b) Which variable did you select as the explanatory variable when creating your graphical display?

c) Which variable did you select as the response variable?

d) Is it appropriate to compute the correlation between mileage and price? Explain.

e) Compute the correlation.

f) Compute R² and explain what it means in the context of these minivans.

g) If you switched the explanatory and response variables, would the correlation change? Explain.

h) Compute the correlation between age and price.

i) Which association is stronger: mileage and price, or age and price? Explain.

j) What should you have done before computing the correlation between age and price in part h?

27. Another Odyssey Refer to the data set from the previous problem.

a) Create a scatterplot to investigate an association between the mileage and price of these minivans.

b) Is it appropriate to construct a linear regression model for the association between mileage and price? Explain.

c) Find the linear regression equation that you could use to predict price based upon mileage.

d) What is the slope of the regression equation?

e) What are the units for the slope?

f) Use a complete sentence or two to explain the meaning of the slope in the context of these minivans.

g) What is the intercept of the regression equation?

h) What are the units for the intercept?

i) Use a complete sentence or two to explain the meaning of the intercept in the context of these minivans, or explain why such an interpretation of the intercept is meaningless.

j) If appropriate, use the model you constructed to predict the price of a Honda Odyssey with 53,000 miles.

k) If appropriate, use the model you constructed to predict the price of a Honda Odyssey with 153,000 miles.
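
A units pitfall is worth noting for parts j and k: the Odyssey table records mileage in thousands of miles, so a van with 53,000 miles enters the model as x = 53, not x = 53000. The Python sketch below illustrates one way to keep the units straight. The helper names fit_line and predict_from_miles are hypothetical, and the sketch is meant only as a check on your own work.

```python
# Mileage (thousands of miles) and asking price (dollars) for the 13 Odysseys,
# copied in table order from problem 26.
mileage_k = [20, 21, 33, 41, 43, 67, 46, 72, 82, 99, 71, 85, 100]
price = [26900, 23000, 17500, 18999, 17200, 18995, 13900,
         15250, 13200, 11000, 13900, 8350, 5800]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The slope is in dollars per thousand miles because of the table's units.
slope, intercept = fit_line(mileage_k, price)


def predict_from_miles(miles):
    """Predicted price (dollars), converting miles to thousands of miles first."""
    return intercept + slope * (miles / 1000)


for miles in (53_000, 153_000):
    inside = min(mileage_k) <= miles / 1000 <= max(mileage_k)
    note = "within the data" if inside else "extrapolation"
    print(f"{miles} miles: predicted price {predict_from_miles(miles):.0f} dollars ({note})")
```

The largest mileage in the table is 100 thousand miles, which is why the sketch labels the 153,000-mile prediction as an extrapolation.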

28. Odyssey residuals Refer to the data set from the previous two problems.

a) Construct a linear regression model to predict price based upon mileage.

b) Compute the residual for the 1995 Honda Odyssey.

c) If you were buying a Honda Odyssey, would you prefer that your car have a positive residual or a negative residual? Explain.

d) If you were selling a Honda Odyssey, would you prefer that your car have a positive residual or a negative residual? Explain.

e) Construct a residuals plot for this data set.

f) What does the residuals plot tell you about the appropriateness of a linear model?

29. Odyssey ages Refer to the data set used in the previous three problems.

a) Create a scatterplot to investigate an association between the age and price of these minivans.

b) Is it appropriate to construct a linear regression model for the association between age and price? Explain.

c) Find the linear regression equation that you could use to predict price based upon age.

d) What is the slope of the regression equation?

e) What are the units for the slope?

f) Use a complete sentence or two to explain the meaning of the slope in the context of these minivans.

g) What is the intercept of the regression equation?

h) What are the units for the intercept?

i) Use a complete sentence or two to explain the meaning of the intercept in the context of these minivans, or explain why such an interpretation of the intercept is meaningless.

j) If appropriate, use the model you constructed to predict the price of a 2003 Honda Odyssey.

k) If appropriate, use the model you constructed to predict the price of a 2008 Honda Odyssey.