Influential Points in Regression

A random sample of size n is generated from a bivariate normal distribution with unit variances and a given mean and correlation coefficient. The sample correlation is shown, together with the Cook's distance of the locator point. Several methods of fitting the regression line are available.
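As a rough illustration of the sampling step, the sketch below draws such a sample in Python and computes its sample correlation. It assumes a zero mean vector (the mean is not recoverable from the text above), and the function names bivariate_normal_sample and sample_correlation are hypothetical helpers, not part of the Demonstration.

```python
import math
import random

def bivariate_normal_sample(n, rho, seed=0):
    """Draw n points from a bivariate normal with unit variances and
    correlation rho. A zero mean is ASSUMED here for illustration."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # y is built from two independent normals so that corr(x, y) = rho
        points.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return points

def sample_correlation(points):
    """Pearson sample correlation of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

sample = bivariate_normal_sample(n=50, rho=0.8, seed=1)
r = sample_correlation(sample)
print(f"sample correlation: {r:.3f}")
```

Changing the seed while holding n and rho fixed gives a feel for how much the sample correlation varies from one sample to the next.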
The LS (least-squares) method uses Mathematica's built-in function LinearModelFit; the L1 (least absolute deviation) and RLINE (resistant line) methods are described in the Details section. Cook's distances indicate which points have a large influence on the LS regression fit. As a rough rule, points whose Cook's distance exceeds 4/(n-2), where n is the sample size, may be influential. The recommended practice, however, is to look at a plot of all the Cook's distances, and two such plots are available. The Cook's distances are computed from the LS regression fitted with LinearModelFit. See Details for more information.
Use the zoom slider to zoom out so that the locator can be moved some distance away from the rest of the data, and explore its influence on the regression, the correlation, and the Cook's distance. The effect of sample size and correlation may also be explored. By varying the random seed, you can explore the stochastic variation for a fixed initial data configuration.
Contributed by: Ian McLeod (University of Western Ontario)





Details

For the definition of Cook's distance, see [1]. For discussion of its use in detecting influential points in regression, see [2, 3].
Pages 67–68 of [2] suggest that observations whose Cook's distances exceed 4/(n-2) may be influential, but that it is better to look at a plot of the Cook's distances versus x with a benchmark line at 4/(n-2).
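To make the rule concrete, here is a minimal Python sketch of Cook's distances for a simple linear regression, using the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2 with p = 2 coefficients. The cut-off 4/(n-2) is taken as an assumption about the rule in [2], the data are made up for illustration, and the Demonstration itself obtains these values from LinearModelFit rather than from code like this.

```python
def cooks_distances(xs, ys):
    """Cook's distances for the least-squares line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    p = 2                                          # intercept and slope
    s2 = sum(e * e for e in residuals) / (n - p)   # residual variance
    distances = []
    for x, e in zip(xs, residuals):
        h = 1.0 / n + (x - mx) ** 2 / sxx          # leverage of this point
        distances.append(e * e / (p * s2) * h / (1.0 - h) ** 2)
    return distances

# Illustrative data: the last point lies far from the others in x.
xs = [1, 2, 3, 4, 5, 6, 7, 20]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 8.0]

d = cooks_distances(xs, ys)
cutoff = 4 / (len(xs) - 2)   # ASSUMED rough cut-off, 4/(n-2)
flagged = [i for i, di in enumerate(d) if di > cutoff]
print("cut-off:", round(cutoff, 3))
print("flagged indices:", flagged)
```

Plotting d against xs with a horizontal line at cutoff would reproduce the kind of diagnostic plot described above.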
Page 70 of [3] suggests looking at the half-normal plot of the Cook's distances to see those that are relatively large compared with the rest.
L1 regression: minimizes the sum of the absolute errors. It is computed using linear programming; see eqn. (3) in [4]. L1 regression is more robust than LS when moderate outliers are present, but it is still sensitive to extreme outliers.
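The source computes the L1 fit by linear programming. As a dependency-free illustration, the Python sketch below swaps in a different technique: it relies on the fact that, for a straight-line fit, some optimal L1 line passes through at least two data points, and simply tries every pair. This costs O(n^3) and is only sensible for small samples; the function name l1_line and the data are hypothetical.

```python
from itertools import combinations

def l1_line(xs, ys):
    """Least-absolute-deviation line by exhaustive search over pairs of
    points. NOT the linear-programming method of [4]; a brute-force
    stand-in that returns (intercept, slope, total absolute error)."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue                      # vertical line, not a valid fit
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2], best[0]

# Five points on y = 1 + 2x plus one gross outlier in y.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 100]
a, b, loss = l1_line(xs, ys)
print(f"intercept={a}, slope={b}, total absolute error={loss}")
```

On these data the L1 line follows the five collinear points and ignores the outlier in y, whereas a least-squares line would be pulled toward it.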
RLINE: the resistant regression line, discussed in §5 of [5], is based on medians.
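To show what "based on medians" can mean in practice, the Python sketch below implements a simplified three-group resistant line: sort by x, split into thirds, and take the slope from the medians of the outer thirds. This is an assumption-laden simplification, not the RLINE procedure of [5], which may differ in details such as iterative refinement of the slope and the handling of tied x values.

```python
from statistics import median

def resistant_line(xs, ys):
    """Simplified three-group resistant line; returns (intercept, slope).
    An illustrative sketch, NOT the exact RLINE algorithm of [5]."""
    pts = sorted(zip(xs, ys))
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")
    k, rem = divmod(n, 3)
    n_outer = k + 1 if rem == 2 else k   # size of the left and right thirds
    left, right = pts[:n_outer], pts[n - n_outer:]
    x_left = median(x for x, _ in left)
    y_left = median(y for _, y in left)
    x_right = median(x for x, _ in right)
    y_right = median(y for _, y in right)
    slope = (y_right - y_left) / (x_right - x_left)
    # Intercept: median of the residuals after removing the slope
    intercept = median(y - slope * x for x, y in pts)
    return intercept, slope

# Eight points on y = 1 + 2x plus one gross outlier in y.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 3, 5, 7, 9, 11, 13, 15, 100]
intercept, slope = resistant_line(xs, ys)
print(f"intercept={intercept}, slope={slope}")
```

Because only medians enter the calculation, a single wild value in any third leaves the fitted line unchanged here.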
[1] Cook's distance, Wikipedia.
[2] S. J. Sheather, A Modern Approach to Regression with R, New York: Springer, 2009.
[3] J. J. Faraway, Linear Models with R, Boca Raton: Chapman & Hall/CRC, 2005.
[4] S. C. Narula and J. F. Wellington, "The Minimum Sum of Absolute Errors Regression: A State of the Art Survey," International Statistical Review, 50(2), 1982 pp. 317–326.
[5] P. F. Velleman and D. C. Hoaglin, Applications, Basics and Computing of Exploratory Data Analysis, Boston: Duxbury Press, 1981.