Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data

Article excerpt

even when compliance is completely at random and censoring is independent.

The results of Theorems 1 and 2 hold, however, even if the explanatory variables [X.sub.i] are not independent and identically distributed, provided that the vectors [Y.sub.i], i = 1, . . ., n, are independently distributed conditional on the sequence [X.sub.i], i = 1, . . ., n, because in that case {[X.sub.i]: i = 1, . . ., n} will be ancillary for [[Beta].sub.0]. An example of a setting where this occurs would be the randomized trial of Section 2 if treatment assignment were based on a randomized block design.

When (as in the environmental tobacco smoke example of Section 2) data on [] are not available when a subject misses occasion t (i.e.,

1. INTRODUCTION

In both randomized and nonrandomized follow-up studies, it is often of interest to estimate the evolution over time of the mean of an outcome variable [Y.sub.it], t = 1, . . ., T, for the ith subject as a function of explanatory variables [X.sub.i]. In randomized studies [X.sub.i] typically includes subject i's treatment arm indicator and may also include his or her baseline characteristics such as age, sex, and pretreatment clinical status. The goal of this article is to provide methods for estimating the parameters [[Beta].sub.0] of models for the regression of [Y.sub.it] on [X.sub.i] when some [Y.sub.it] are regarded as censored (i.e., missing), either because subjects do not comply with their assigned protocols or because they drop out of the study prior to the end of follow-up.

As discussed further in Section 2, in practice the probability that a subject is missing at the tth occasion may depend on [X.sub.i] and on past values of both the outcome variable [Y.sub.ij] and a vector of time-dependent covariates [V.sub.ij], j = 0, . . . , t - 1. In such cases it is well known that fully parametric likelihood methods can provide valid inferences concerning the parameters [[Beta].sub.0] of the regression of [Y.sub.it], t = 1, . . . , T, on [X.sub.i], if a model for the joint distribution of [Y.sub.it], [X.sub.i], and [V.sub.it], t = 0, . . . , T, is correctly specified and the probability of nonresponse at t does not depend on ([Y.sub.ij], [V.sub.ij]) for j [greater than or equal to] t (Rubin 1976). But with incomplete data, likelihood methods can be sensitive to model misspecification, because they implicitly impute the missing data from their conditional distribution given the observed data (Dempster, Laird, and Rubin 1977). In addition, even with complete data, if the focus is on models for the marginal distribution of the response, then fully parametric models for certain non-Gaussian data that preserve the marginal expectation of [Y.sub.it] given [X.sub.i] can often be cumbersome and computationally difficult when [X.sub.i] is multivariate with continuous components (Prentice 1988).

Liang and Zeger (1986) proposed a class of generalized estimating equations (GEE) whose solutions are consistent for [[Beta].sub.0] provided only that the model for the marginal means of the outcomes at each occasion is correctly specified. Their approach is an extension of quasi-likelihood methods (McCullagh and Nelder 1989) to the multivariate regression setting and results in iteratively reweighted least squares estimators of [[Beta].sub.0]. Similar estimators were also considered by Gourieroux, Monfort, and Trognon (1984). As Liang and Zeger pointed out, inferences with the GEE are valid only under the stronger assumption that the data are missing completely at random; that is, given [X.sub.i], the nonresponse process is independent of both observed and unobserved [Y.sub.it]'s and [V.sub.it]'s.
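To make the GEE idea concrete, the following is a minimal sketch, not the authors' implementation, of solving a generalized estimating equation with complete data. It assumes a logit link for the marginal mean and an independence working correlation, in which case the estimating equation reduces to the sum over subjects of X_i'(Y_i - mu_i(beta)) = 0 and can be solved by iteratively reweighted least squares. The function name gee_independence_logistic, the simulated two-arm design, and the mixture device used to induce within-subject correlation are all illustrative assumptions rather than anything taken from the article.

```python
import numpy as np


def gee_independence_logistic(X, Y, n_iter=25, tol=1e-10):
    """Solve sum_i X_i' (Y_i - mu_i(beta)) = 0 by Fisher scoring (IRLS).

    Hypothetical helper: logit link, independence working correlation.
    X has shape (n, T, p); Y has shape (n, T) with no missing values.
    """
    n, T, p = X.shape
    Xf = X.reshape(n * T, p)   # stack all subject-occasions as rows
    yf = Y.reshape(n * T)
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xf @ beta)))   # marginal means
        w = mu * (1.0 - mu)                       # binomial variance weights
        score = Xf.T @ (yf - mu)                  # estimating function
        info = Xf.T @ (Xf * w[:, None])           # weighted cross-product
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated complete data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, T = 4000, 3
beta_true = np.array([-0.5, 1.0])

arm = rng.binomial(1, 0.5, size=n)                # treatment arm indicator
X = np.stack([np.ones((n, T)), np.repeat(arm[:, None], T, axis=1)], axis=2)
mu_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))  # shape (n, T)

# Induce within-subject correlation while keeping the marginal means exact:
# half of the subjects reuse one uniform draw at every occasion.
shared = rng.uniform(size=(n, 1))
indep = rng.uniform(size=(n, T))
use_shared = rng.binomial(1, 0.5, size=(n, 1)).astype(bool)
U = np.where(use_shared, shared, indep)           # each U[i, t] is Uniform(0, 1)
Y = (U < mu_true).astype(float)

beta_hat = gee_independence_logistic(X, Y)
print("estimate:", beta_hat, "truth:", beta_true)
```

Because only the marginal mean model enters the estimating function, the point estimate remains close to the truth even though the outcomes within a subject are correlated and the working correlation ignores this. Valid standard errors would need a robust (sandwich) variance estimate, which this sketch omits.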

In this article we propose a class of weighted estimating equations that lead to consistent and asymptotically normal estimators of [[Beta].sub.0] provided that the probability of nonresponse at time t, given both the entire past [Mathematical Expression Omitted], . …