On the Interpretation of Regression Plots
Cook, R. Dennis, Journal of the American Statistical Association
A framework is developed for the interpretation of regression plots, including plots of the response against selected covariates, residual plots, added-variable plots, and detrended added-variable plots. It is shown that many of the common interpretations associated with these plots can be misleading. The framework also allows for the generalization of standard plots and the development of new plotting paradigms. A paradigm for graphical exploration of regression problems is sketched.
KEY WORDS: Added-variable plots; Elliptically contoured distributions; Nonlinear models; Regression graphics; Residual plots.
The ability to interpret correctly the various plots that arise in a regression analysis is always useful, particularly in situations where a parsimoniously parameterized true model is unavailable. This ability has become increasingly important in recent years with the rapid development of software that provides easy access to modern graphical methods such as spinning, brushing, and animation.
A full interpretation of any regression plot requires two distinct but related tasks. Out of necessity the first task is to characterize the plot itself. Characterizing a two-dimensional scatterplot is relatively easy. Accurately characterizing a three-dimensional scatterplot is possible on a routine basis, albeit with more effort. It is also possible to obtain useful insights into higher-dimensional plots, but for the most part their interpretation must rely on lower-dimensional constructions. Having characterized the plot at hand, the second task is to use this information to form conclusions that can guide the remaining analysis. A key point here is that the first task is data analytic in nature, whereas the second is inferential. Together they comprise the interpretation of the plot.
Detecting a fan-shaped pattern in the usual plot of residuals versus fitted values from a first-order ordinary least squares (OLS) regression, for example, leads to the data analytic observation that the residual variance changes monotonically in the direction of the fitted values. The second step comes when this observation is used to infer heteroscedastic errors and to justify a weighted regression. As a second example, Cook and Weisberg (1989) suggested that a saddle-shaped pattern in certain detrended added-variable plots can be used to infer the need for interaction terms. Although the regression literature is replete with this sort of advice, bridging the gap between the data analytic characterization of a plot and the subsequent inference often requires a leap of faith regarding the behavior of the true regression.
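The data analytic half of the fan-shaped example can be sketched numerically. The following is a minimal illustration, not taken from the article: it assumes a simulated data set in which the linear mean function is correct and the error standard deviation is proportional to a single covariate, fits a first-order OLS regression, and summarizes the plot of residuals versus fitted values by the residual standard deviation within bins ordered by fitted value. The helper names `residuals_vs_fitted` and `spread_by_fitted`, the simulated model, and the choice of four bins are all assumptions made for illustration.

```python
import numpy as np


def residuals_vs_fitted(x, y):
    """Fit the first-order OLS regression y ~ 1 + x; return (fitted, residuals)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted


def spread_by_fitted(fitted, resid, n_bins=4):
    """Residual standard deviation within equal-count bins ordered by fitted value."""
    order = np.argsort(fitted)
    return np.array([resid[idx].std() for idx in np.array_split(order, n_bins)])


# Assumed data-generating model (hypothetical, for illustration only):
# linear mean function, error standard deviation proportional to x.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=n)

fitted, resid = residuals_vs_fitted(x, y)
spread = spread_by_fitted(fitted, resid)

# A fan shape corresponds to the binned spread increasing with the fitted values.
print("residual s.d. by fitted-value bin:", np.round(spread, 2))
```

Here the increasing spread reflects heteroscedastic errors only because the simulation was built that way. The sketch covers the first task, characterizing the plot; it does not show that the same pattern licenses the inference of heteroscedasticity when the true regression is unknown.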
In this article I study the interpretation of regression plots, concentrating on aspects of both what to look for and how to infer from what is found. The requirements of the first task can often be met easily in practice, although this is certainly not the case at the level of generality of the main results that follow. The interpretation of any plot depends on the range of possibilities that is allowed for the true regression. Inferring heteroscedastic errors from a fan-shaped pattern in a plot of residuals versus fitted values, for example, is appropriate only under certain restrictions (Sec. 7). In Section 3 I describe an essentially nonrestrictive regression model that will be used to guide plot interpretation.
It turns out that the behavior of the covariates is critical to the interpretation of regression plots in a largely nonparametric setting. This aspect of the problem is studied in Sections 4 and 5. Plots of the response versus selected covariates, added-variable plots, detrended added-variable plots, and some extensions are studied in Section 6. A discussion of residual plots is presented in Section 7. A paradigm for graphical exploration of a regression problem is sketched in Section 8, and concluding comments are given in Section 9. …