Bootstrap Model Selection

by Jun Shao

1. INTRODUCTION

In a regression problem, typically there is a vector x of p explanatory variables to be used to fit a model between x and a response variable y. Because some of the components of x may not be related to y, using all p components of x does not necessarily produce a better model than using part of the components of x. Because the relative performance of each model (corresponding to a set of components of x) is usually unknown, we have to select a set of explanatory variables (components of x) based on a data set {(x_i, y_i), i = 1, ..., n}, where y_i is the response at x = x_i. This variable selection problem is equivalent to a model selection problem in which each model corresponds to a particular set of the p components of x.

There exist many variable/model selection procedures in the case where the relationship between x and y is linear; for example, the Akaike information criterion (AIC) (Akaike 1970); the C_p method (Mallows 1973); the Bayes information criterion (BIC) (Hannan and Quinn 1979; Schwarz 1978); the final prediction error (FPE_λ) method (Shibata 1984); the generalized information criterion (Rao and Wu 1989) and its analogs (Pötscher 1989); the delete-one cross-validation (Allen 1974; Stone 1974); the generalized cross-validation (Craven and Wahba 1979); and the delete-d cross-validation (Burman 1989; Geisser 1975; Shao 1993; Zhang 1993a).

This article introduces some selection methods based on the bootstrap. Besides the theoretical and empirical properties of the bootstrap selection procedures established in this article, there are at least two other reasons to use a bootstrap model selection procedure:

1. In the linear regression context, the bootstrap method provides inference procedures (e.g., confidence sets) that are asymptotically more accurate than those produced by the other methods (Adkins and Hill 1990; Hall 1989).
It may be preferable to use the same method both in model selection and in the subsequent inference based on the selected model. In addition, if we use the bootstrap for both model selection and the subsequent inference, then the bootstrap observations generated for model selection can also be used in the subsequent inference; that is, in terms of generating bootstrap observations, there is no extra cost for using a bootstrap model selection procedure when the bootstrap is also used for inference. If a cross-validation method is used for model selection and the bootstrap is used for the subsequent inference, then the extra computations in generating resamples for cross-validating cannot be avoided.

2. The bootstrap selection procedure developed in the linear regression case can be extended, without any theoretical derivation, to more complicated problems such as the nonlinear regression models, generalized linear models, and autoregression models. The cross-validation method, which is also a data-resampling method, can also be easily extended to nonlinear regression and generalized linear models, but not to autoregression models.

In Section 2 we focus on the case where the relationship between x and y is linear. We consider two different ways of generating bootstrap observations: bootstrapping residuals and bootstrapping pairs (x, y). The main theoretical study of a bootstrap selection procedure is its consistency; that is, whether the probability of selecting a nonoptimal model vanishes as the sample size n increases to infinity. Finite-sample performances of some bootstrap selection procedures are studied by simulation.
We consider more complicated cases in Section 3 and establish some results similar to those in Section 2 in nonlinear regression, generalized linear, and autoregression models.

Our main discovery is that a straightforward application of the bootstrap does not yield a consistent model selection procedure, although some simple modifications can be used to rectify this inconsistency. Consider, for example, the method of bootstrapping pairs. One usually generates n independent and identically distributed (iid) bootstrap observations from the empirical distribution putting mass n^{-1} on each pair (x_i, y_i), i = 1, ..., n (Efron 1982, 1983; Freedman 1981). But our results in Sections 2 and 3 show that this leads to an inconsistent bootstrap selection procedure. A simple modification that results in a consistent bootstrap selection procedure is to generate fewer bootstrap observations from this empirical distribution. More precisely, if m (instead of n) iid bootstrap observations are generated from it, then the bootstrap selection procedure is consistent if and only if m/n → 0 and m → ∞. Changing the bootstrap sample size to rectify the inconsistency of the bootstrap has been shown to be successful in various other problems ...
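
The idea of drawing m rather than n pairs can be illustrated with a minimal sketch. The excerpt above does not state the selection criterion, so the code assumes one for illustration: each candidate subset of columns is fit by least squares on a bootstrap sample of m pairs, its squared prediction error is evaluated on the original n observations, and the subset with the smallest average error is chosen. The function name `bootstrap_pairs_select`, the number of replicates `n_boot`, the choice m = n^0.7, and the simulated data are all hypothetical and are not taken from the article.

```python
import itertools
import numpy as np


def bootstrap_pairs_select(X, y, m, n_boot=200, seed=0):
    """Choose a column subset by an m-out-of-n bootstrap of pairs.

    Assumed criterion (not from the article excerpt): average squared
    prediction error on the original data of least-squares fits
    computed from bootstrap samples of size m.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Draw row indices with replacement once, so that every candidate
    # model is scored on the same bootstrap samples.
    idx = rng.integers(0, n, size=(n_boot, m))
    best_subset, best_err = None, np.inf
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            Xa = X[:, list(subset)]
            err = 0.0
            for rows in idx:
                # Least-squares fit on the m resampled pairs.
                beta, *_ = np.linalg.lstsq(Xa[rows], y[rows], rcond=None)
                # Prediction error measured on all n original pairs.
                err += np.mean((y - Xa @ beta) ** 2)
            err /= n_boot
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err


# Simulated example: only columns 0 and 1 are related to y.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

m = int(n ** 0.7)  # hypothetical choice with m much smaller than n
subset, err = bootstrap_pairs_select(X, y, m=m)
print("selected columns:", subset, "estimated error:", round(err, 3))
```

Setting `m=n` in the call gives the usual bootstrap of pairs, which allows the two choices to be compared on simulated data. The sketch enumerates all nonempty subsets, so it is only practical for small p.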