On Measuring and Correcting the Effects of Data Mining and Model Selection

Journal article by Jianming Ye; Journal of the American Statistical Association, Vol. 93, 1998

1. INTRODUCTION

In the theory of linear models, the concept of degrees of freedom plays an important role. This concept has several different interpretations. The degrees of freedom in regression are the number of variables in the model. Accordingly, degrees of freedom are often used as a model complexity measure in various model selection criteria, such as the Akaike information criterion (AIC) (Akaike 1973), C_p (Mallows 1973), the Bayesian information criterion (BIC) (Schwarz 1978), generalized cross-validation (GCV) (Craven and Wahba 1979), and the risk inflation criterion (RIC) (George and Foster 1994). Degrees of freedom can also be interpreted as the cost of the estimation process and thus can be used for obtaining an unbiased estimate of the error variance. Finally, the degrees of freedom in regression are the trace of the so-called "hat" matrix; that is, the sum of the sensitivity of each fitted value with respect to the corresponding observed value.
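The three interpretations above coincide for ordinary least squares, and this can be checked numerically. The following is a minimal sketch, not taken from the article: the simulated design matrix, the coefficient values, and the names `X`, `H`, `df`, and `sigma2_hat` are all illustrative assumptions.

```python
import numpy as np

# Simulated data (an assumption for illustration): n observations, p variables.
rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta + rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X', so that the fitted values are H @ y.
H = X @ np.linalg.solve(X.T @ X, X.T)
fitted = H @ y

# Interpretation 1 and 3: the trace of H equals the number of variables.
# Each diagonal entry H[i, i] is the sensitivity of fitted[i] to y[i].
df = float(np.trace(H))

# Interpretation 2: df is the "cost" subtracted from n when estimating
# the error variance from the residual sum of squares.
rss = float(np.sum((y - fitted) ** 2))
sigma2_hat = rss / (n - df)

print(f"trace(H) = {df:.6f}, number of variables = {p}")
```

Here the trace is computed from an explicit n-by-n matrix for clarity; for large n one would normally avoid forming `H` at all.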

An extension of degrees of freedom to general model structures is useful both practically and theoretically. The last two decades have brought rapid progress in modeling high-dimensional data by means of complex statistical procedures. These procedures typically require minimum assumptions about the structures of the underlying models and try to capture the structures through adaptive fitting. But their flexibility often leads to substantial overfitting. The complex nature of these procedures makes it difficult to study their statistical behavior and to assess their performance objectively.

Traditionally, for general modeling problems, statisticians tend to define degrees of freedom based on the number of parameters in the final model, because of the coincidence of these two quantities in linear models. The focus is often on how to find the correct number of parameters. As an example, in tree-based regression models, the S-PLUS software uses the number of terminal nodes as the degrees of freedom for calculating the error variance (Venables and Ripley 1994). Friedman (1991) and many of the discussants of his paper proposed various definitions of ad hoc degrees of freedom. It has been suggested that the degrees of freedom may depend on more subtle aspects of the modeling procedures. Owen (1991) suggested that searching for a knot cost roughly 3 df.

The major difficulty in handling complex modeling procedures is that the fitted values are often complex, nondifferentiable, or even discontinuous functions of the observed values. For example, in fitting a linear model with variable selection, a small change in Y may lead to a different selected model, resulting in a discontinuity in the fitted values. Another example is tree-based models (Breiman, Friedman, Olshen, and Stone 1984), which have discontinuous boundaries.
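The discontinuity caused by variable selection can be seen in a toy case. The sketch below is an illustrative assumption, not an example from the article: it uses a hypothetical helper `best_single_predictor_fit` that fits least squares on whichever single column of a two-column design gives the smallest residual sum of squares.

```python
import numpy as np

def best_single_predictor_fit(X, y):
    """Fit least squares on the one column of X with the smallest RSS."""
    best_rss, best_fit = np.inf, None
    for j in range(X.shape[1]):
        x = X[:, j]
        coef = (x @ y) / (x @ x)          # one-variable least squares slope
        fit = coef * x
        rss = float(np.sum((y - fit) ** 2))
        if rss < best_rss:
            best_rss, best_fit = rss, fit
    return best_fit

# Two orthogonal candidate predictors and two nearly identical responses.
X = np.eye(2)
fit_a = best_single_predictor_fit(X, np.array([1.0, 0.999]))
fit_b = best_single_predictor_fit(X, np.array([1.0, 1.001]))

print(fit_a)  # first predictor selected
print(fit_b)  # second predictor selected
```

A change of 0.002 in one observed value switches the selected variable, so the vector of fitted values jumps from roughly (1, 0) to roughly (0, 1) rather than moving smoothly.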

Consider a general modeling process involving application of a sequence of statistical and nonstatistical tools to a dataset in an attempt to obtain a final model. This process is sometimes also called data mining. The main goal of this article is to develop a concept of generalized degrees of freedom (GDF) that is applicable for evaluation of the final model or fits produced by such a process. The GDF is defined as the sum of the sensitivity of each fitted value to perturbations in the corresponding observed value. It is nonasymptotic in nature and thus is free of the sample-size constraint. I show that the GDF can be used as a measure of the complexity of a general modeling procedure, so the procedure's goodness of fit can be measured with an extended AIC. One can also view the GDF as the cost of the modeling process, so that under suitable conditions, an unbiased estimate of the error variance can be obtained. This concept allows one to establish a unified theory under which complex modeling procedures can be analyzed in the same way as classical linear models. It allows one to understand modeling procedures in terms of their complexity and tendency to overfit, rather than merely in terms of how well they fit a specific dataset.
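The definition of GDF as a sum of sensitivities suggests a perturbation-based approximation. This excerpt does not include the article's computational procedure, so the following is only a sketch of one plausible Monte Carlo approach: add small normal perturbations to the observed values, refit, regress each fitted value on its own perturbation, and sum the slopes. The function name `estimate_gdf` and the parameters `tau`, `n_reps`, and `seed` are assumptions introduced for illustration.

```python
import numpy as np

def estimate_gdf(fit, y, tau=0.5, n_reps=500, seed=0):
    """Monte Carlo sketch of the sum of sensitivities of fitted values.

    fit    : callable mapping a response vector to a vector of fitted values
    tau    : standard deviation of the perturbations (a tuning assumption)
    n_reps : number of perturbed refits
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    deltas = rng.normal(scale=tau, size=(n_reps, y.size))
    fits = np.array([fit(y + d) for d in deltas])

    # For each observation i, the least squares slope of fitted value i
    # on perturbation i estimates the sensitivity of that fitted value.
    d_centered = deltas - deltas.mean(axis=0)
    f_centered = fits - fits.mean(axis=0)
    slopes = (d_centered * f_centered).sum(axis=0) / (d_centered ** 2).sum(axis=0)
    return float(slopes.sum())

# Sanity check on ordinary least squares, where the answer should be
# close to the number of variables (3 here), up to Monte Carlo error.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=30)
H = X @ np.linalg.solve(X.T @ X, X.T)
gdf_ols = estimate_gdf(lambda v: H @ v, y)

print(f"estimated GDF for OLS with 3 variables: {gdf_ols:.2f}")
```

Because `fit` is treated as a black box, the same function can be applied to a procedure that includes variable selection or tree fitting, which is where the estimate may differ from the number of parameters in the final model.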

There are several differences between GDF and the traditional degrees of freedom. I show that the GDF of a parameter may be substantially larger than 1; there is no longer an exact correspondence between the degrees of freedom and the number of parameters. In general, GDF depends on both the modeling procedure and the underlying true model.

In Section 2 I briefly summarize the concept of degrees of freedom in linear models and motivate its extension to general modeling procedures. A computational procedure is also given. ...