Bootstrap Methods for Developing Predictive Models


1. INTRODUCTION

Researchers frequently develop regression models to predict dichotomous outcomes. In doing so, investigators must balance including enough prognostic variables against the need for model parsimony (Murtaugh 1998; Wears and Lewis 1999). Omitting important prognostic factors results in systematic misestimation of the regression coefficients and biased predictions, whereas including too many predictors results in a loss of precision both in the estimated regression coefficients and in the predictions of new responses (Murtaugh 1998).

Automated variable selection methods such as backwards elimination are frequently used to identify independent predictors or to develop parsimonious regression models (Miller 1984, 2002; Hocking 1976). Several studies have shown that automated variable selection methods in ordinary least squares regression allow spurious noise variables to be mistakenly identified as independent predictors of the outcome (Derksen and Keselman 1992; Flack and Chang 1987) and that the resulting global measures of goodness of fit are overly optimistic (Flack and Chang 1987; Copas and Long 1991). Similarly, the use of automated variable selection methods with logistic regression results in the identification of nonreproducible models (Austin and Tu in press); a small simulation sketch of this phenomenon follows.
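To make the concern concrete, here is a minimal, hedged sketch of p-value-driven backwards elimination for logistic regression on simulated data in which only one of ten candidate variables carries signal. The 0.05 threshold, the use of statsmodels, and all names and settings are illustrative assumptions, not details from the cited studies; rerunning across seeds shows noise columns occasionally surviving elimination.

```python
# Illustrative only: backwards elimination can retain pure-noise predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))                   # ten candidate variables
logit = 1.5 * X[:, 0]                         # only column 0 carries signal
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

selected = list(range(p))
while selected:
    fit = sm.Logit(y, sm.add_constant(X[:, selected])).fit(disp=0)
    pvals = np.asarray(fit.pvalues)[1:]       # skip the intercept
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:                   # all remaining are "significant"
        break
    del selected[worst]                       # drop the least significant

print("retained columns:", selected)          # noise columns may appear here
```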

The purpose of this article is to propose a method for developing predictive models that combines bootstrap resampling with automated variable selection methods. The article is divided into three sections. First, we describe a model selection method that uses backwards elimination on multiple bootstrap samples. Second, we apply our methods to a clinical dataset to develop a model for predicting mortality within 30 days of a heart attack. Third, we summarize our results.

2. BOOTSTRAP METHODS FOR MODEL SELECTION

The bootstrap is a well-known statistical method for assessing the variability of test statistics (Efron and Tibshirani 1993; Davison and Hinkley 1997). The nonparametric bootstrap treats the observed data as an estimate of the underlying distribution: one draws repeated samples, with replacement, from the observed data and recomputes the statistic of interest in each sample. Bootstrap methods thus allow one to approximate the distribution of test statistics in settings in which analytic calculations are intractable, or in small samples in which large-sample asymptotic results may not hold.
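As a hedged illustration of the idea (not an example from the article), the following sketch uses the nonparametric bootstrap to approximate the standard error of a sample median, a statistic with no simple closed-form standard error; the sample size, number of replicates, and data-generating distribution are arbitrary choices.

```python
# Nonparametric bootstrap: resample the observed data with replacement
# and recompute the statistic to approximate its sampling distribution.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # the "observed" sample

B = 1000                                      # number of bootstrap replicates
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_medians[b] = np.median(resample)

print("bootstrap SE of the sample median:", boot_medians.std(ddof=1))
```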

Earlier studies have described the instability of automated variable selection methods, demonstrating that spurious noise variables are mistakenly identified as independent predictors of the outcome (Derksen and Keselman 1992; Flack and Chang 1987). Furthermore, the number of noise variables included increased with the number of candidate variables, and the probability of correctly identifying the true predictors was inversely related to the number of variables under consideration (Murtaugh 1998).

Our proposed model selection method is based upon drawing repeated bootstrap samples from the original dataset. Within each bootstrap sample, backwards elimination is used to develop a parsimonious predictive model. For each candidate variable, we determine the proportion of bootstrap samples in which it was identified as an independent predictor of the outcome, and the candidate variables are then ranked according to this proportion. A preliminary predictive model would consist of those variables that were identified as significant predictors in all bootstrap samples. Variables could then be added sequentially to this preliminary model in decreasing order of the proportion of bootstrap samples in which they were selected as significant predictors. Each candidate model can then be assessed for its predictive accuracy and a final model identified. Our approach, sketched below, is a simplification of one proposed by Sauerbrei and Schumacher (1992) for identifying strong and weak factors for predicting survival, which is likewise based upon repeated bootstrap sampling. …
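What follows is a minimal sketch of this procedure under stated assumptions: logistic regression fit with statsmodels, Wald p-values with a 0.05 retention threshold, and 500 bootstrap replicates. The function names (backwards_elimination, bootstrap_inclusion) and all defaults are illustrative rather than taken from the article, and a production version would need safeguards for bootstrap samples in which the logistic fit fails to converge.

```python
import numpy as np
import statsmodels.api as sm

def backwards_elimination(X, y, alpha=0.05):
    """Return indices of columns retained by p-value-based backwards
    elimination in a logistic regression of y on X."""
    selected = list(range(X.shape[1]))
    while selected:
        fit = sm.Logit(y, sm.add_constant(X[:, selected])).fit(disp=0)
        pvals = np.asarray(fit.pvalues)[1:]   # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:              # all remaining are significant
            break
        del selected[worst]
    return selected

def bootstrap_inclusion(X, y, B=500, alpha=0.05, seed=0):
    """Proportion of B bootstrap samples in which each candidate
    variable survives backwards elimination."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        counts[backwards_elimination(X[idx], y[idx], alpha)] += 1
    return counts / B

# Rank candidates by inclusion proportion; variables selected in every
# bootstrap sample (proportion 1.0) would form the preliminary model.
# proportions = bootstrap_inclusion(X, y)
# ranking = np.argsort(proportions)[::-1]
```

Sorting the inclusion proportions in decreasing order yields the ranking described above, and candidate models are formed by adding variables to the preliminary model in that order before assessing each for predictive accuracy.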