On Asymmetric Properties of the Correlation Coefficient in the Regression Setting
Yadolah Dodge and Valentin Rousson, The American Statistician
Although correlation is a symmetric concept of two variables, this is not the case for regression, where we distinguish a response from an explanatory variable. This article presents several ways of expressing the correlation coefficient as an asymmetric formula of the two variables involved in the regression setting. Contrary to some well-known results, these expressions are not necessarily preserved in the sample when the model is wrong. As a consequence, they may be used for model checking or model selection. In particular, we propose a criterion for choosing the response variable in a simple linear regression problem. An example from the domain of finance illustrates our purpose. We find evidence under our model that the U.S. dollar influenced other currencies during the period under consideration.
KEY WORDS: Causality; Cumulants; Higher order correlations; Linear regression; Model selection; Normality; Skewness.
1. INTRODUCTION

Concepts like regression and correlation are as old as statistics. While the Galton-Pearson correlation coefficient is perhaps the most widely used formula in data analysis, regression is certainly the most important concept in all of statistics, one that has led to many generalizations and new techniques. Though intimately related, these two concepts are different in nature. Both involve (at least) two variables, but while the two variables play the same role in correlation, we distinguish the explanatory variable from the response variable in regression. Thus, a correlation coefficient is defined by a symmetric formula in two variables such that $\rho_{XY} = \rho_{YX}$. Such formulas are, in a sense, not well adapted to the regression setting, where the two variables have different interpretations. In Section 2 we present simple ways of expressing the Galton-Pearson correlation coefficient as asymmetric formulas of the response and the explanatory variable. In Section 3 we propose a criterion that uses these asymmetric properties of the correlation coefficient for deciding which variable should be the response in a simple linear regression problem. In Section 4 we illustrate our procedure on a dataset from the domain of finance.
2. CORRELATION AND REGRESSION
The correlation coefficient between two random variables $X$ and $Y$ is defined as
$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad (1)$$
where $\operatorname{cov}(X, Y)$ is the covariance between $X$ and $Y$, and $\sigma_X^2$ and $\sigma_Y^2$ are the variances of $X$ and $Y$, respectively. If we assume a linear regression model
$$Y = \alpha + \beta X + \epsilon, \qquad (2)$$
where $\alpha$ and $\beta$ are constants and $\epsilon$ is an error term independent of $X$, we have
$$\rho_{XY} = \beta \frac{\sigma_X}{\sigma_Y}. \qquad (3)$$
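Identity (3) is easy to check by simulation. The following sketch compares the correlation computed from definition (1) with the value $\beta\sigma_X/\sigma_Y$ from (3); the parameter values and distributions are illustrative choices, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulate model (2): Y = alpha + beta*X + eps, with eps independent of X.
alpha, beta = 1.0, 2.0
x = rng.exponential(scale=1.0, size=n)   # any X with finite variance
eps = rng.normal(scale=0.5, size=n)      # error term, independent of X
y = alpha + beta * x + eps

# Definition (1): rho = cov(X, Y) / (sigma_X * sigma_Y)
rho_def = np.cov(x, y)[0, 1] / (x.std() * y.std())

# Identity (3): rho = beta * sigma_X / sigma_Y
rho_reg = beta * x.std() / y.std()

print(rho_def, rho_reg)  # the two agree up to sampling noise
```

With a large sample the two quantities coincide to within simulation error, as (3) predicts.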
Many other formulas and interpretations for the correlation coefficient may be found, for example, in Rodgers and Nicewander (1988) or in Rovine and Von Eye (1997).
Contrary to the definition (1), result (3) expresses the correlation coefficient as an asymmetric formula in $X$ and $Y$ under (2). In this section, we point out other simple asymmetric expressions for $\rho_{XY}$ that are valid in the regression setting. The first one uses the concept of cumulant (for a formal definition see, e.g., Kendall and Stuart 1963, p. 71). Using fundamental properties of cumulants, we have for each $r \ge 3$
$$\kappa_r(Y) = \beta^r \kappa_r(X) + \kappa_r(\epsilon), \qquad (4)$$
where $\kappa_r(X)$, $\kappa_r(Y)$, and $\kappa_r(\epsilon)$ denote the $r$th cumulants of $X$, $Y$, and $\epsilon$, respectively. Therefore, if $\kappa_3(\epsilon) = 0$ (which is achieved if the error term is symmetric), we get by (3) and (4) (as long as $\kappa_3(X) \neq 0$)
$$\rho_{XY}^3 = \frac{\gamma_Y}{\gamma_X}, \qquad (5)$$
where $\gamma_X = \kappa_3(X)/\sigma_X^3$ and $\gamma_Y = \kappa_3(Y)/\sigma_Y^3$ denote the skewness coefficients of $X$ and $Y$, respectively.
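Both the cumulant relation (4) with $r = 3$ and the skewness ratio (5) can be verified numerically. In this sketch the third cumulant is estimated by the third central moment; the skewed $X$ (exponential, so $\kappa_3(X) \neq 0$) and the symmetric normal error (so $\kappa_3(\epsilon) = 0$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

def k3(v):
    """Third cumulant, estimated as the third central moment."""
    return np.mean((v - v.mean()) ** 3)

beta = 2.0
x = rng.exponential(scale=1.0, size=n)   # skewed X: kappa_3(X) = 2 for Exp(1)
eps = rng.normal(scale=1.0, size=n)      # symmetric error: kappa_3(eps) = 0
y = beta * x + eps

# Relation (4) with r = 3 and kappa_3(eps) = 0: kappa_3(Y) = beta^3 * kappa_3(X)
print(k3(y), beta ** 3 * k3(x))

# Skewness gamma = kappa_3 / sigma^3, and the ratio (5): rho^3 = gamma_Y / gamma_X
gx = k3(x) / x.std() ** 3
gy = k3(y) / y.std() ** 3
rho = np.corrcoef(x, y)[0, 1]
print(rho ** 3, gy / gx)
```

Up to sampling noise, the two printed pairs agree, illustrating that under model (2) with a symmetric error the cube of the correlation equals the ratio of the skewnesses.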