Academic journal article
By Cohen, Steven B.
The American Statistician, Vol. 51, No. 3
National surveys conducted by government organizations, industry, political organizations, and market research firms often share the same design objective: to minimize the variance of survey estimates subject to fixed cost and time constraints. As a consequence, most large-scale national health care surveys have sample designs of varying complexity, with design features that include clustering, stratification, disproportionate sampling, and multiple stages of sample selection. Most standard statistical software packages, such as SAS, SPSS, SYSTAT, and BMDP, assume that the data were obtained from a simple random sample in which the observations are independent, identically distributed, and selected with equal probability. When the data have been collected from a survey with a complex sample design, variance estimates of survey statistics derived under simple random sampling assumptions generally underestimate the true variance. The result is artificially narrow confidence intervals and anticonservative hypothesis tests, that is, tests that reject a true null hypothesis more frequently than the nominal Type I error level indicates (Carlson, Johnson, and Cohen 1993).
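The understatement of variance under simple random sampling assumptions can be illustrated with a small numerical sketch. The data below are hypothetical (four sampled clusters of three persons each, not NMES data), and the function names are illustrative only. The sketch assumes equal-sized clusters selected with equal probability, so that the variance of the mean can be approximated from the variation among cluster means.

```python
from statistics import mean, variance

# Hypothetical data: 4 sampled clusters of 3 persons each.
# Values within a cluster are similar, as is typical of clustered samples.
clusters = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]


def srs_variance_of_mean(clusters):
    """Variance of the mean, treating all observations as independent (SRS)."""
    values = [y for cluster in clusters for y in cluster]
    return variance(values) / len(values)


def cluster_variance_of_mean(clusters):
    """Variance of the mean based on variation among cluster means.

    Assumes equal-sized clusters sampled with equal probability and
    ignores any finite population correction.
    """
    cluster_means = [mean(cluster) for cluster in clusters]
    return variance(cluster_means) / len(cluster_means)


naive = srs_variance_of_mean(clusters)
clustered = cluster_variance_of_mean(clusters)
design_effect = clustered / naive

print(f"SRS-based variance:      {naive:.3f}")
print(f"Cluster-based variance:  {clustered:.3f}")
print(f"Ratio (design effect):   {design_effect:.2f}")
```

In this constructed example the cluster-based variance is more than three times the variance computed under simple random sampling assumptions, so a confidence interval built on the latter would be too narrow.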
In the past decade a number of statistical software packages have been developed specifically to facilitate the analysis of complex survey data (Cohen, Burt, and Jones 1986). This evaluation covers three such packages designed for a personal computing environment: Stata release 5.0 with the survey design software added, SUDAAN Version 7.0, and WesVarPC Version 2.02. Data from the household component of the 1987 National Medical Expenditure Survey (NMES), which has a multistage complex survey design, were used for the evaluation. The comparisons focus on analytical capacity, programming ease, computer run time, documentation, and data preparation issues.
A number of alternative methods have been developed for approximating the sampling variances of estimates derived from surveys with complex sample designs. Three generally accepted and frequently used techniques are the Taylor Series linearization method, the method of balanced repeated replication (BRR), and the jackknife method (Wolter 1985). Several prior software evaluations focused on packages developed for mainframe computing applications (Cohen et al. 1986; Cohen, Xanthopolous, and Jones 1988). With the enhanced computing capacity now available on the PC, more attention has been given to developing software packages tailored to the analysis of complex survey data in a PC environment. In addition to the packages included in this evaluation, another package developed for the analysis of complex survey data in a personal computing environment, PC CARP, is available. The advantages and disadvantages of PC CARP relative to an earlier version of SUDAAN were considered in a prior evaluation. Because no upgrades to PC CARP had been implemented at the time of this evaluation, readers are directed to the results of that earlier evaluation (Carlson et al. 1993). It should be noted that SAS, SPSS, and SYSTAT were also invited to take part in the evaluation, in order to identify new capabilities for the analysis of complex survey data, but they declined.
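As a rough illustration of one of these techniques, the following sketch applies a delete-one-cluster jackknife to a weighted mean. It is a simplified sketch under stated assumptions and does not reproduce the implementation in any of the packages under review: it assumes a single stratum, ignores finite population corrections, and measures replicate deviations from the full-sample estimate. The data, weights, and function names are hypothetical.

```python
def weighted_mean(psus):
    """Weighted mean over all (weight, value) pairs in the given clusters."""
    total_weight = sum(w for psu in psus for w, _ in psu)
    weighted_sum = sum(w * y for psu in psus for w, y in psu)
    return weighted_sum / total_weight


def jackknife_variance(psus):
    """Delete-one-cluster jackknife variance of the weighted mean.

    Assumes one stratum. Each replicate drops one cluster and recomputes
    the estimate; because the estimator is a ratio, the remaining weights
    need no explicit rescaling.
    """
    k = len(psus)
    if k < 2:
        raise ValueError("at least two clusters are required")
    full_estimate = weighted_mean(psus)
    replicates = [weighted_mean(psus[:i] + psus[i + 1:]) for i in range(k)]
    return (k - 1) / k * sum((r - full_estimate) ** 2 for r in replicates)


# Hypothetical sample: 4 clusters of 3 persons, all with weight 1.0.
raw = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
psus = [[(1.0, y) for y in cluster] for cluster in raw]

estimate = weighted_mean(psus)
jk_var = jackknife_variance(psus)
print(f"Estimate: {estimate:.2f}, jackknife variance: {jk_var:.3f}")
```

With equal weights and equal-sized clusters, this jackknife estimate coincides with the variance computed from the variation among cluster means. BRR follows the same replicate-and-recompute pattern but forms half-sample replicates within strata, while Taylor Series linearization instead approximates a nonlinear estimator by a linear function of estimated totals.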
To conduct the software comparison, data from the household component of the 1987 National Medical Expenditure Survey (NMES) were used. The household component was designed to produce unbiased national and regional estimates of the health care utilization, expenditures, sources of payment, and health insurance coverage of the U.S. civilian noninstitutionalized population. …