
Beware of Outliers
This material is adapted from Section 7.3 of OpenIntro Statistics, second edition.
Outliers in scatterplots are observations that fall a significant distance away from the rest of the data (the "cloud" of points we often see in scatterplots for which a linear model might be appropriate). These points are especially important because they can strongly influence the slope (and intercept) of the regression line.
The figure below displays six scatterplots, each of which includes the regression line for the data (whether or not a linear model is appropriate), along with a residual plot for each.
In (1), notice the one outlier sitting a significant distance below the other points; it appears to influence the regression line only slightly.
In (2), one outlier sits at the far right; the regression line would look much the same with this point omitted, which suggests it is not very influential.
In (3), one point again sits at the far right, but this outlier appears to pull the regression line up on the right side, increasing its slope; notice how the line doesn't appear to fit the primary "cloud" very well. (Notice the negative association among the main "cloud" of points in the residual plot.)
In (4), a small cluster of four outliers sits at the far right. This cluster appears to be influencing the slope of the regression line significantly, making the line a poor fit for the data almost everywhere. (Notice how the residuals are all positive at the left and all negative toward the center, then all positive for the small cluster.) An investigation into the data values in this cluster might reveal an interesting explanation for the dual "clouds."
In (5), no obvious pattern (linear or otherwise) is apparent in the main "cloud" of points; the lone outlier at the far right appears to largely control the slope of the regression line. This point is highly influential.
In (6), a single outlier appears at the far left, but falls quite close to the regression line and does not appear to be very influential.
Leverage
Points that sit far away from the center of the "cloud" in the horizontal direction can potentially exert more influence on the slope of the regression line, so we call them points with high leverage.
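To make "far away in the horizontal direction" concrete, here is a minimal sketch that computes a leverage value for each point. It assumes the standard hat-value formula for simple linear regression, h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which the text above does not state explicitly; the function name and the data are hypothetical and chosen only for illustration.

```python
import numpy as np

def leverage(x):
    """Leverage (hat value) of each point in a simple linear regression.

    Assumes the standard formula h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2).
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / len(x) + centered**2 / np.sum(centered**2)

# Hypothetical data: nine points in a "cloud" and one far to the right.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
h = leverage(x)

# The point with the largest leverage is the one farthest from the mean of x.
print(np.round(h, 3))
print("highest-leverage index:", int(h.argmax()))
```

Note that leverage depends only on the horizontal positions; it says nothing yet about whether the point actually changes the fitted line.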
If one of these high-leverage points does appear to actually influence the slope of the regression line [as in cases (3), (4), and (5) above], then we call it an influential point. Usually we can say a point is influential if, had we omitted the point and found the new regression line, the outlier would sit unusually far from the new line.
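The omit-and-refit check described above can be sketched in a few lines. This is an illustration under stated assumptions, not a formal test: the helper leave_one_out and both data sets are hypothetical, and the line is fit by ordinary least squares with numpy.polyfit.

```python
import numpy as np

def leave_one_out(x, y, i):
    """Refit the line without point i.

    Returns (slope with all points, slope without point i,
    residual of point i measured from the refitted line).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.arange(len(x)) != i
    full_slope, _ = np.polyfit(x, y, deg=1)
    new_slope, new_intercept = np.polyfit(x[keep], y[keep], deg=1)
    residual = y[i] - (new_intercept + new_slope * x[i])
    return full_slope, new_slope, residual

# Hypothetical cloud with a roughly linear trend, plus one point at x = 30.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
cloud_y = np.array([1.2, 1.9, 3.1, 3.8, 5.0, 6.2, 6.9, 8.1, 8.8])

# Case A: the far-right point follows the trend (high leverage, not influential).
y_on_trend = np.append(cloud_y, 30.0)
# Case B: the far-right point sits well below the trend (high leverage, influential).
y_off_trend = np.append(cloud_y, 5.0)

for label, y in [("on trend", y_on_trend), ("off trend", y_off_trend)]:
    full, new, resid = leave_one_out(x, y, i=9)
    print(f"{label}: slope with point = {full:.3f}, "
          f"without = {new:.3f}, residual from new line = {resid:.2f}")
```

In the first case the slope barely moves and the omitted point sits close to the refitted line; in the second the slope changes substantially and the point lies far from the new line, which is the pattern the text describes for an influential point.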
Omitting outliers
It is tempting to remove outliers. Don't do this without a very good reason. Models that ignore exceptional (and interesting) cases often perform poorly. For instance, if a financial firm ignored the largest market swings (the "outliers"), it would soon go bankrupt by making poorly chosen investments.
Caution: Don't ignore outliers when fitting a model. If outliers are present in the data, they should not be removed or ignored without a good reason. Any model fit to the data would not be very helpful if it ignores the most exceptional cases.
Exercises
1. [OIS 7.24] Outliers. Identify the outliers in the scatterplots below, and determine their type (e.g. influential, high leverage). Explain your reasoning.
2. [OIS 7.25] More outliers. Identify the outliers in the scatterplots below, and determine their type (e.g. influential, high leverage). Explain your reasoning.
3. [OIS 7.26] Crawling babies. A study conducted at the University of Denver investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. Infants born during the study year were split into twelve groups, one for each birth month. We consider the average crawling age of babies in each group against the average temperature when the babies are six months old (that's when babies often begin trying to crawl). Temperature is measured in degrees Fahrenheit (°F) and age is measured in weeks. A scatterplot of these two variables reveals a potential outlying month when the average temperature is about 53°F and average crawling age is about 28.5 weeks.
a) Does this point have high leverage?
b) Is it an influential point?