Model Selection and the Principle of Minimum Description Length

Journal article by Mark H. Hansen, Bin Yu; Journal of the American Statistical Association, Vol. 96, 2001

Journal Article Excerpt


Model Selection and the Principle of Minimum Description Length.

by MARK H. HANSEN , BIN YU

This article reviews the principle of minimum description length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed attention within the statistics community. Here we review both the practical and the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines we find many interesting interpretations of popular frequentist and Bayesian procedures. As we show, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate the MDL principle by considering problems in regression, nonparamet ric curve estimation, cluster analysis, and time series analysis. Because model selection in linear regression is an extremely common problem that arises in many applications, we present detailed derivations of several MDL criteria in this context and discuss their properties through a number of examples. Our emphasis is on the practical application of MDL, and hence we make extensive use of real datasets. In writing this review, we tried to make the descriptive philosophy of MDL natural to a statistics audience by examining classical problems in model selection. In the engineering literature, however, MDL is being applied to ever more exotic modeling situations. As a principle for statistical modeling in general, one strength of MDL is that it can be intuitively extended to provide useful tools for new problems.

KEY WORDS: AIC; Bayesian methods; Bayes information criterion; Cluster analysis; Code length; Coding redundancy; Information theory; Model selection; Pointwise and minimax lower bounds; Regression; time series.

1. OVERVIEW

The principle of parsimony, or Occam's razor, implicitly motivates the process of data analysis and statistical modeling and is the soul of model selection. Formally, the need for model selection arises when investigators must decide among model classes based on data. These classes might be indistinguishable from the standpoint of existing subject knowledge or scientific theory, and the selection of a particular model class implies the confirmation or revision of a given theory. To implement the parsimony principle, one must quantify "parsimony" of a model relative to the available data. Applying this measure to a number of candidates, we search for a concise model that provides a good fit to the data. Rissanen (1978) distilled such thinking in his principle of minimum description length (MDL): Choose the model that gives the shortest description of data. In this framework a concise model is one that is easy to describe, whereas a good fit implies that the model captures or describes the important features e vident in the data.

MDL has its intellectual roots in the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff (cf. Li and Vitanyi 1996). Kolmogorov, the founder of axiomatic probability theory, examined the relationship between mathematical formulations of randomness and their application to real-world phenomena. He ultimately turned to algorithmic complexity as an alternative means of expressing random events. A new characterization of probability emerged based on the length of the shortest binary computer program that describes an object or event. (A program can "describe" an object by "printing" or in some way exhibiting the object. Typically, an object is a binary string, and exhibiting the string is nothing more than printing the individual 0's and 1's in order and stopping in finite time.) We refer to this quantity as the descriptive complexity of the object. Up to a constant, it can be defined independent of any specific computing device, making it a universal quantity Kolmogorov 1965, 196 8 (cf., Cover and Thomas 1991). Because this descriptive complexity is universal, it provides a useful way to think about probability and other problems that build on fundamental notions of probability. In theory, it can also be used to define inductive inference in general (or statistical inference in particular) as the search for the shortest program for data.

Unfortunately, the descriptive complexity of Kolmogorov is not computable (cf. Cover and Thomas 1991) and thus cannot be used as a basis for inference given real data. Rissanen modified this concept when proposing MDL, sidestepping computability issues. First, he restricted attention to only those descriptions that correspond to probability models or ...


















































































End of free preview...

 To continue reading this publication, you must have a Questia Subscription.

Try Us Today! Click Here

Questia provides the world's largest online library of scholarly books and journal articles, with integrated footnote and bibliography tools, highlighting, note taking and book marking. With a Questia subscription, you'll have access to the full text of more than 67,000 books and 1.5 million articles.

Already a subscriber? Login:

Sponsored Links
Read more than 5,000 classic books FREE!
Free Newsletter
Get helpful how-to's, writing tips, search strategies, quizzes & more!
Search the Library

Customize your search: Search within the topic


Search in:
Books Journals Magazines
Newspapers Encyclopedia Research Topics
  • Type your specific word or phrase in the box above after the word and, then click Search.
  • Put exact phrases in double quotation marks. Do not put single words in quotation marks.
Back to top