# New Directions in Econometric Practice: General to Specific Modelling, Cointegration, and Vector Autoregression

## Chapter 2: Data Mining

### 2.1 Model Selection through 'Data Mining'
'Data mining' in its various forms reflects the general problem of being unable to conduct controlled experiments. This may lead to procedures that use a fixed data sample sequentially to arrive at the final model specification. Suppose that we define a 'good' model as a specification that exhibits a high coefficient of determination, 'significant' Student-t ratios and possibly a Durbin-Watson statistic close to 2. A tempting and quite common practice is to take the widest possible set of variables that might eventually enter the model (called here the 'candidates'), run numerous regressions using subsets of the candidates as regressors, and then select the 'best' regression, that is, the one with the highest coefficient of determination or the highest Student-t ratios.

Imagine that the aim is to model the consumption function. The investigator might prepare a set of six potential regressors: lagged consumption, total personal disposable income, expected income (measured, for instance, as the one-step-ahead prediction of income), income adjusted for the stock of liquid assets, total wealth of the last period, and total wealth averaged over the last two periods. Each of these variables can in turn be deflated by at least two different price indices: the retail price index and the cost-of-living index. Six variables and two deflators give a set of 12 potential explanatory variables, from which some sensible-looking consumption functions can be formulated for a particular choice of deflator, such as:
1. consumption explained by its own lagged value and current unadjusted income;
2. consumption explained by current and expected income;
3. consumption explained by its own lagged value and current and adjusted income;
4. consumption explained by current unadjusted income and lagged wealth.
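
The counting above can be made concrete with a short sketch. All names here (`BASE_VARIABLES`, `DEFLATORS`, the labels `rpi` and `col`, and the helper `equations_for`) are hypothetical labels introduced for illustration, and the regressors assigned to equation 3 reflect one reading of 'current and adjusted income' as two separate income series.

```python
from itertools import product

# Hypothetical labels for the six base series described in the text.
BASE_VARIABLES = [
    "lagged_consumption",
    "income",            # total personal disposable income (unadjusted)
    "expected_income",   # e.g. a one-step-ahead prediction of income
    "adjusted_income",   # income adjusted for the stock of liquid assets
    "lagged_wealth",     # total wealth of the last period
    "avg_wealth",        # total wealth averaged over the last two periods
]

# Hypothetical labels: retail price index and cost-of-living index.
DEFLATORS = ["rpi", "col"]

# Every base series deflated by every price index: 6 x 2 = 12 candidates.
candidates = [f"{var}/{defl}" for var, defl in product(BASE_VARIABLES, DEFLATORS)]

# The four specifications listed above, written in terms of base series.
# Equation 3 assumes 'current and adjusted income' means both income series.
SPECIFICATIONS = {
    1: ["lagged_consumption", "income"],
    2: ["income", "expected_income"],
    3: ["lagged_consumption", "income", "adjusted_income"],
    4: ["income", "lagged_wealth"],
}


def equations_for(deflator):
    """Return the four specifications with regressors deflated by `deflator`."""
    return {
        number: [f"{var}/{deflator}" for var in regressors]
        for number, regressors in SPECIFICATIONS.items()
    }


rpi_equations = equations_for("rpi")

print(len(candidates))  # 12
for number, regressors in rpi_equations.items():
    print(number, regressors)
```

Writing the specifications in terms of undeflated series keeps the choice of deflator as a separate argument, so the same four equations can be regenerated for any price index.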

By changing the deflator, another four equations can be formulated. Why not estimate them all and choose the one that has good Student-t ratios and a high $R^2$? If, on these criteria, the 'best' is equation 4 and the 'worst' is equation 2, we may even conclude that the analysis shows the significant impact of current income on consumption and that expected income does not influence current consumption. Would such a conclusion be justified?
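
One way to probe this question is to run the 'estimate them all' search on data in which, by construction, no regressor has any influence. The sketch below is a minimal illustration under loud assumptions: every series is synthetic independent noise (not real consumption data), the sample size `T`, the seed, the variable labels and the helper `ols` are all hypothetical, and equation 3 is read as including both current and adjusted income.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
T = 40                          # hypothetical sample size

DEFLATORS = ["rpi", "col"]      # hypothetical labels for the two price indices
SPECIFICATIONS = {
    1: ["lagged_consumption", "income"],
    2: ["income", "expected_income"],
    3: ["lagged_consumption", "income", "adjusted_income"],
    4: ["income", "lagged_wealth"],
}

# Synthetic data: every series is independent noise, so no regressor
# truly explains 'consumption'.
consumption, data = {}, {}
for d in DEFLATORS:
    c = rng.standard_normal(T + 1)
    consumption[d] = c[1:]
    data[("lagged_consumption", d)] = c[:-1]
    for name in ["income", "expected_income", "adjusted_income", "lagged_wealth"]:
        data[(name, d)] = rng.standard_normal(T)


def ols(y, X):
    """OLS with an intercept; returns R^2 and the slope t-ratios."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_ratios = beta / np.sqrt(np.diag(cov))
    centred = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centred @ centred)
    return r2, t_ratios[1:]  # drop the intercept's t-ratio


# Estimate all eight equations (four specifications, two deflators).
results = []
for d in DEFLATORS:
    for number, regressors in SPECIFICATIONS.items():
        X = np.column_stack([data[(name, d)] for name in regressors])
        r2, t = ols(consumption[d], X)
        results.append({"equation": number, "deflator": d, "r2": r2, "t": t})

# Rank by R^2, as the search procedure in the text would.
best = max(results, key=lambda r: r["r2"])
worst = min(results, key=lambda r: r["r2"])
for label, r in [("best", best), ("worst", worst)]:
    print(label, r["equation"], r["deflator"], round(r["r2"], 3), np.round(r["t"], 2))
```

Whatever the data, the search returns some 'best' and some 'worst' equation, because the ranking is produced by the procedure itself. Rerunning the sketch with different seeds shows how much the identity of the winner can vary when nothing real is being measured.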

