Two Quantitative Variables
So far we've examined relationships between two categorical variables and between a quantitative variable and a categorical variable, which leaves us with the situation involving two quantitative variables.
Property taxes
Recall the data about the single-family residences on a street in Edmonds, Washington, that we examined previously. The data set includes 14 cases (each house is a case), and the variables are: house number, the size (in square feet), the 2007 assessed value (in thousands of dollars), the lot size (in acres), the 2006 taxes (in dollars) and the number of stories. Here is the complete data set for reference:
house | size | assess | lot | taxes | stories |
20911 | 1561 | 304 | 0.2 | 2604 | 1 |
20912 | 1038 | 297.6 | 0.2 | 280 | 1 |
20918 | 1224 | 289.5 | 0.17 | 2353 | 1 |
20921 | 1232 | 292.8 | 0.17 | 756 | 1 |
20924 | 1995 | 314.6 | 0.17 | 2620 | 2 |
20927 | 1714 | 322.7 | 0.18 | 2632 | 1 |
20930 | 1832 | 336.1 | 0.18 | 2779 | 2 |
21003 | 1095 | 279 | 0.18 | 2321 | 1 |
21006 | 2011 | 319.5 | 0.18 | 2663 | 2 |
21015 | 1366 | 289.3 | 0.18 | 2415 | 1 |
21018 | 1292 | 301.4 | 0.18 | 2477 | 1 |
21023 | 1458 | 314.3 | 0.18 | 1386 | 1 |
21028 | 2031 | 320.9 | 0.18 | 2676 | 2 |
21105 | 1366 | 304 | 0.18 | 2473 | 1 |
The house number is an identifier, but the next two variables are quantitative: size and assessed value. Let's graph this data to investigate a possible relationship between them. Because we have pairs of numbers (for example: size = 1561 and assess = 304 for the first house) we can plot these points on a Cartesian coordinate plane (which should be familiar from algebra). But which number should we plot first? In other words, which variable belongs on the x-axis and which variable belongs on the y-axis? We don't know if there is a cause-and-effect relationship (we don't even know yet whether there is a relationship) between these two variables, although it would be reasonable to think that the assessed value of a house is related to its size. And it's possible that a change in the size of the house (for example building an addition, or demolishing an attached garage) would somehow cause the assessed value of the house to change, but it's not at all likely that a fluctuation in the assessed value of a home would cause a change in its size. Furthermore, the size of the house is known before the assessed value. For these reasons we plot the size first (along the x-axis) and then the assessed value (along the y-axis).
Plotting the point (1561,304) for the first house, the point (1038,297.6) for the second house, and so on, we get the following graph:
We call this graphical display a scatterplot.
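As a minimal sketch of how the pairs become points, the Python below lists (size, assess) for each house and, assuming the third-party matplotlib library is installed, draws them with size on the x-axis. The names `houses` and `plot_house_scatter` are illustrative and do not come from the text.

```python
# (size in square feet, 2007 assessed value in thousands of dollars),
# copied from the house table, one pair per house.
houses = [
    (1561, 304.0), (1038, 297.6), (1224, 289.5), (1232, 292.8),
    (1995, 314.6), (1714, 322.7), (1832, 336.1), (1095, 279.0),
    (2011, 319.5), (1366, 289.3), (1292, 301.4), (1458, 314.3),
    (2031, 320.9), (1366, 304.0),
]


def plot_house_scatter(points):
    """Draw a scatterplot: size on the x-axis, assessed value on the y-axis."""
    # Assumption: matplotlib is available. It is imported inside the
    # function so the data above can be used without it.
    import matplotlib.pyplot as plt

    sizes = [size for size, _ in points]
    values = [value for _, value in points]

    fig, ax = plt.subplots()
    ax.scatter(sizes, values)
    ax.set_xlabel("size (square feet)")
    ax.set_ylabel("assessed value (thousands of dollars)")
    plt.show()
```

Calling `plot_house_scatter(houses)` should display one dot per house, with the first house at (1561, 304).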
We call the variable along the horizontal (x) axis the explanatory variable and the variable along the vertical (y) axis the response variable. A change in the size of a house appears to explain (in part), but not necessarily cause, a corresponding response in the assessed value. We've seen this at work before, but did not use these terms. For example in our mosaic plot of two categorical variables (gender and beverage preference):
we placed gender along the horizontal axis as the explanatory variable and beverage along the vertical axis as the response variable. A person's gender might explain their preference for Coke or Pepsi (and there's a remote possibility that their gender could cause their preference for one of these beverages through some sort of chromosomal influence), but it would be ridiculous to assert that drinking Coke or Pepsi somehow causes you to be male or female.
Similarly, for the checker data:
we positioned angle as the explanatory variable and distance as the response variable: here it's reasonable that changing the angle of inclination when launching the checker somehow affects the distance the checker travels (plus, we set the angle first and then launch the checker to record the distance).
Returning to the scatterplot of the house data, what pattern (if any) do you see in the display? If there were no relationship at all between the size and assessed value of the houses, we would expect to see no pattern whatsoever—for example, something like this:
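One way to produce data with no built-in relationship, offered as an illustration and not as the method behind any figure in this text, is to shuffle the assessed values so that each size is paired with a randomly chosen value. The function name `shuffled_pairs` and the `seed` parameter are assumptions made for this sketch.

```python
import random

# Sizes (square feet) and assessed values (thousands of dollars)
# from the house table, in the same order.
sizes = [1561, 1038, 1224, 1232, 1995, 1714, 1832,
         1095, 2011, 1366, 1292, 1458, 2031, 1366]
assess = [304.0, 297.6, 289.5, 292.8, 314.6, 322.7, 336.1,
          279.0, 319.5, 289.3, 301.4, 314.3, 320.9, 304.0]


def shuffled_pairs(xs, ys, seed=0):
    """Pair each x with a randomly reassigned y, breaking any real link."""
    rng = random.Random(seed)   # fixed seed so the result is repeatable
    shuffled = list(ys)         # copy, so the original list is untouched
    rng.shuffle(shuffled)
    return list(zip(xs, shuffled))


unrelated = shuffled_pairs(sizes, assess)
```

Plotting `unrelated` instead of the real pairs should usually show a shapeless cloud, although with only 14 points a single shuffle can show some apparent pattern by chance.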
But we do see a pattern in the house data, so there does appear to be a relationship (or association) between size and assessed value for these houses. How can we describe that association?
First, it appears that the bigger the size of the house, the bigger its assessed value, so we call this a positive association. (A relationship where bigger values of the explanatory variable are associated with smaller values of the response variable would be a negative association.) In the examples below, all of the scatterplots in the top row have a positive association while all of those in the bottom row have a negative association.
You might also notice that in both rows the scatterplot at the far left exhibits a very weak association, while the next one appears to have a moderate association, the third one a fairly strong association, and the last one a perfect association. Referring back to the house scatterplot, we might describe the association between size and assessed value as being "moderately strong" (certainly not "weak" but neither would it be "very strong").
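The direction and strength judged by eye can also be checked numerically. The sketch below does this in two ways, neither of which is introduced in this section: it counts pairs of houses in which the bigger house has the bigger (or smaller) assessed value, and it computes the standard correlation coefficient r, whose sign indicates direction and whose closeness to 1 or -1 indicates the strength of a linear association. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt

# Sizes (square feet) and assessed values (thousands of dollars)
# from the house table, in the same order.
sizes = [1561, 1038, 1224, 1232, 1995, 1714, 1832,
         1095, 2011, 1366, 1292, 1458, 2031, 1366]
assess = [304.0, 297.6, 289.5, 292.8, 314.6, 322.7, 336.1,
          279.0, 319.5, 289.3, 301.4, 314.3, 320.9, 304.0]


def count_pairs(xs, ys):
    """Count pairs of cases that agree or disagree with 'bigger x, bigger y'."""
    agree = disagree = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            agree += 1       # the bigger house has the bigger value
        elif product < 0:
            disagree += 1    # the bigger house has the smaller value
        # product == 0 is a tie in size or value; such pairs are skipped
    return agree, disagree


def correlation(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


agree, disagree = count_pairs(sizes, assess)
r = correlation(sizes, assess)
print(agree, disagree, round(r, 2))
```

More agreeing than disagreeing pairs, together with a positive r, points to a positive association. How large r must be to count as "moderate" or "strong" is a matter of convention, so the number supplements the scatterplot and does not replace it.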
In the eight examples above, you may notice a common feature: all of these associations appear to be linear. We use "linear" here not to mean that all of the points fall along a perfectly straight line (as in the case of the rightmost scatterplots in each row) but rather to indicate an absence of bending or curving in the association, as we might see here:
Finally, we look for any unusual features, such as an outlier that strays far from the general pattern visible in the rest of the data. In the house scatterplot there is certainly some scatter present but no obvious outliers, as would be the case if someone built a small but expensive house in the neighborhood:
Exercises
1. The Ford Focus is a compact car introduced to North America in 1999 for model year 2000. The table below shows the model year, mileage (in miles) and asking price (in dollars) for all 14 used Ford Focus automobiles advertised for sale on the Web site of the Seattle Times on January 31, 2010.
year | mileage | price |
2007 | 25426 | 14595 |
2008 | 49223 | 13991 |
2008 | 49028 | 13991 |
2008 | 27690 | 11994 |
2008 | 36216 | 11980 |
2002 | 71646 | 10991 |
2007 | 41107 | 9671 |
2002 | 83454 | 8991 |
2007 | 49443 | 7988 |
2007 | 34179 | 7499 |
2002 | 63439 | 7475 |
2005 | 43012 | 5400 |
2001 | 86681 | 4494 |
2002 | 113000 | 2000 |
a) What type of association would you expect to see between mileage and price?
b) Which would be the explanatory variable?
c) Which would be the response variable?
d) Create an appropriate graphical display to investigate this association.
e) Describe the association visible in your graph.
f) Does the graph confirm what you expected to see in part a?
g) What type of association would you expect to see between year and mileage?
h) Create another graphical display to investigate that association.
i) Does the graph confirm what you expected to see in part g?
2. [OIS 7.3] Describing relationships. For each of the six scatterplots below, describe the strength of the relationship (e.g. weak, moderate, or strong), the form of the relationship (linear, non-linear) and the direction of the relationship (positive, negative, something else), and make note of any unusual features.
3. [OIS 7.4] More relationships. For each of the six scatterplots below, describe the strength of the relationship (e.g. weak, moderate, or strong), the form of the relationship (linear, non-linear) and the direction of the relationship (positive, negative, something else), and make note of any unusual features.