P Values for Composite Null Models

Journal article by M. J. Bayarri, James O. Berger; Journal of the American Statistical Association, Vol. 95, 2000

Journal Article Excerpt


P Values for Composite Null Models.

by M. J. BAYARRI , JAMES O. BERGER

The problem of investigating compatibility of an assumed model with the data is investigated in the situation when the assumed model has unknown parameters. The most frequently used measures of compatibility are p values, based on statistics T for which large values are deemed to indicate incompatibility of the data and the model. When the null model has unknown parameters, p values are not uniquely defined. The proposals for computing a p value in such a situation include the plug-in and similar p values on the frequentist side, and the predictive and posterior predictive p values on the Bayesian side. We propose two alternatives, the conditional predictive p value and the partial posterior predictive p value, and indicate their advantages from both Bayesian and frequentist perspectives.

KEY WORDS: Bayes factors; Bayesian p values; Conditioning; Model checking; Predictive distributions.

1. INTRODUCTION

1.1 Background

In parametric statistical analysis of data X, one is frequently working at a given moment with an entertained model or hypothesis [H.sub.0]: X [sim] f(x; [theta]). We will call this the null model or null hypothesis, even though no alternative is explicitly formulated. We assume that f(x; [theta]) is either a discrete density or a continuous density (with respect to Lebesgue measure). A statistic T = t(X) is chosen to investigate compatibility of the model with the observed data, [X.sub.obs]. We assume that T has been expressed in such a way that large values of T indicate less compatibility with the model. The most commonly used measure of compatibility is the p value, defined as

p = Pr(t(X) [greater than or equal to] t([X.sub.obs])). (1)

When [theta] is known, the probability computation in (1) is with respect to f(x; [theta]). The focus in this article is on the choice of the probability distribution used to compute (1) when [theta] is unknown. In Section 2 we present two new types of p values, which we argue are superior to existing choices. The rest of this section describes the most common of the existing choices. We abuse notation by using f(t; [theta]) and f(t; [theta]) to denote the marginal density of t(X) and the conditional density of t(X) given u(X) = u.

The most obvious way to deal with an unknown [theta] in computation of the p value is to replace [theta] in (1) by some estimate, [theta]. In this article we consider only the usual choice for [theta], namely the maximum likelihood estimator (MLE). We call the resulting p value the plug-in p value ([p.sub.plug]). Using a superscript to denote the density with respect to which the p value in (1) is computed, the plug-in p value is thus defined as

[P.sub.plug] =[Pr.sup.f(.;[theta])](t(X) [greater than or equal to] t([X.sub.obs]). (2)

The main strengths of [P.sub.plug] are its simplicity and intuitive appeal. Its main weakness appears to be a failure to account for uncertainty in the estimation of [theta], although as we show, this issue is rather involved.

Another natural device for eliminating the unknown [theta] is to condition on a sufficient statistic, U, for [theta]. Then f(x\[u.sub.obs;][theta]) does not depend on [theta], and computations in (1) can be carried out using the completely specified f(X\[u.sub.obs]). [In fact, U need only be sufficient for [theta] with respect to f(t; [theta]).] We call these p values similar p values, a term borrowed from the related notion of similar tests and confidence regions. A similar p value is thus defined as

[p.sub.sim] = [Pr.sup.f(.\[u.sub.obs])](t(X) t([X.sub.obs])). (3)

The main strength of [p.sub.sim] is that it is based on a proper probability computation, which imbues the end result with various desirable properties (discussed later). Its main weaknesses are that the computation can be burdensome and that a suitable sufficient U typically does not exist.

Bayesians have a natural way to eliminate nuisance parameters: integrate them out. Thus if [pi])[theta]) is a prior distribution for 9, then the marginal or (prior) predictive distribution is

m(x) = [integral of] f(x; [theta])[pi]([theta]) d[theta]. (4)

Because this is free of [theta], it can be used to compute a p value, leading to the prior predictive p value, given by

[p.sub.prior] = [Pr.sup.m](.)(t(X)[greater than or equal to] t([X.sub.obs])). (5)

The main strengths of [p.sub.prior] are that it is also based on a proper probability computation (at least if [pi]([theta]) is proper), and that it suggests a natural and simple T, namely t(x) = 1/m(x). The main weakness Of [p.sub.prio]r for pure model checking is its dependence on the prior [pi]([theta]); in essence, m(x) measures the likelihood of x relative to both the model and the prior, and an excellent model could come under suspicion if a poor prior distribution were used. For this reason, and because model checking is often considered at early stages of an analysis before careful prior elicitation is performed (and/or because a nonsubjective analysis might be desired from the beginning), it is attractive to attempt to use noninformative priors. Unfortunately, noninformative priors are typically improper, in which case the prior predictive m(x) would also be improper, precluding computation of (5). Box (1980) popularized the use of [P.sup.prior].

The concerns mentioned in the preceding paragraph have led many Bayesians, beginning with Guttman (1967) and Rubin (1984), to eliminate [theta] from f(x; [theta]) by integrating with respect to the posterior distribution, [pi]([theta]/[X.sub.obs]), instead of the prior, before computing a p value. The posterior predictive p value is thus defined as

[P.sub.post] = [[[Pr.sup.m].sub.post](/[X.sub.obs](t(X) [greater than or equal to] t([x.sub.obs]])), (6)

where

[m.sub.post](X/[X.sub.obs]) = [integral] f(x; [theta])[pi]([theta]/[X.sub.obs]) d[theta]. (7)

The main strengths of [P.sub.post] are as follows:

a. Improper noninformative priors can readily be used (since [pi]([theta]/[X.sub.obs]) will typically be proper).

b. [m.sub.post](X/[X.sub.obs]) typically will be much more heavily influenced by the model than by the prior; indeed, as the sample size goes to infinity, the posterior distribution will essentially concentrate at [theta], so that [P.sub.post] will (for large n) be very close to [P.sub.plug].

c. It typically is very easy to compute using output from modern Markov chain Monte Carlo (MCMC) Bayesian analyses.

Its main weakness is that there is an apparent "double use" of the data in (6), first to convert the (possibly improper) prior [pi]([theta]) into a proper distribution [pi]([theta]/[X.sub.obs]) for determining the reference distribution [m.sub.post](X/[X.sub.obs]), and then to compute the tail area corresponding to t([X.sub.obs]). This double use of the data can induce unnatural behavior. From a Bayesian perspective, defenders of the prior predictive also point out that the posterior predictive lacks a pure Bayesian interpretation; although this was our original motivation for the developments herein, the arguments in the article are not directly based on such reasoning.

Generalizations of (6) were considered by Meng (1994), Gelman, Carlin, Stern, and Rubin (1995), Gelman, Meng, and Stern (1996), and references therein; in particular, t(X) could be replaced by a function t(X, [theta]), and f(x; [theta]) in (7) could be replaced by f(X/[theta], A), where A is some other statistic. We do not discuss such generalizations in this article.

There are also many other related works. Aitkin (1991) used the posterior distribution to compute actual Bayes factors, instead of p values. Evans (1997) introduced a related concept for model checking based on the ratio of the posterior and prior predictive densities.

Other approaches that have been suggested for dealing with the nuisance parameter, [theta], in computing (1) include those of Tsui and Weerahandi (1989) (primarily for one-sided testing) and Berger and Boos (1994). The latter authors sought to provide a practical implementation of the conservative frequentist approach that deals with unknown 9 by maximization:

[P.sub.sup] = [sup.sub.[theta]] [Pr.sup.f(.;[theta])] (t(X) [greater than or equal to t([X.sub.obs])).

This p value is of rather limited usefulness, because the supremum is often too large to provide useful criticism of the model. For instance, in the examples of Sections 2 and 3, [P.sub.sup] can easily be seen to equal 1. Berger and Boos (1994) overcame this difficulty by restricting the supremum to [theta] in a confidence set for [theta] (with the noncoverage probability being added to the p value). Although potentially useful to frequentists in formal testing situations, in which conservatism is typically deemed desirable, the approach is less appropriate for model checking in which conservatism would mean that one is often not alerted to the fact that the model is inadequate.

1.2 Evaluation of p Values

What do we want in a p value? For a frequentist, one appealing property would be for p, considered as a random variable, to be uniform[0, 1] under the null, f(x; [theta]), for all [theta]. In some sense, being U[0, 1] defines a proper p value, allowing for its common interpretation across problems. Statistical measures that lack a common interpretation across problems are simply not very useful. (For more extensive discussion of this point, see the companion article Robins, van der Vaart, and Ventura 2000, which we henceforth denote by RVV 2000; earlier articles that refer to and/or discuss this "defining" property of a p value include De la Horra and Rodriguez-Bernal 1997, Meng 1994, Robins 1999, Rubin 1996, and Thompson 1997.)

For most problems, exact uniformity under the null for all [theta] cannot be attained for any p value. Thus one must weaken the requirement to some extent. A natural weaker requirement is that a p value be U[0, 1] under the null in an asymptotic sense; this is the subject of RVV (2000). Here we focus on studying the degree to which the various p values deviate from uniformity in finite-sample scenarios.

It is not obvious that Bayesians should be concerned with establishing that a p value is uniform under the null for all [theta]. For instance, the prior predictive p value is U[0, 1] under m(x) (if the prior is proper), which means that it is U[0, 1] in an average sense over [theta]. If the prior distribution is chosen subjectively, then a Bayesian could well argue that this is sufficient; indeed, Meng (1994) suggested that uniformity under m(x) is a useful criterion for the evaluation of any proposed p value. (The more basic issue that a p value is a tail area, and not compatible with true Bayesian measures, is discussed briefly in the next section.)

As mentioned earlier, however, preliminary model checking is most typically done (by Bayesians) with noninformative priors, and if these are improper, there is no "average over [theta]" that can be used. (We later give an example with a proper noninformative prior, in which a Bayesian--or non-Bayesian--might settle for "average" uniformity.) Of course, if a p value is uniform under the null in the frequentist sense, then it has the strong Bayesian property of being marginally U[0, 1] under any proper prior distribution. This explains why Bayesians should, at least, be highly satisfied if the frequentist requirement obtains. Perhaps more to the point, if a proposed p value is always either conservative or anticonservative in a frequentist sense (see RVV 2000 for definitions), then it is likewise guaranteed to be conservative or anticonservative in a Bayesian sense, no matter what the prior. A similar conclusion would hold for large sample sizes (under mild conditions) if a proposed p value were always conserv ative or anticonservative in a frequentist asymptotic sense. (Interesting related discussions concerning the posterior predictive p value have been given by Gelman et al. 1996, Meng 1994, and Rubin 1996.)

Actually, Bayesians might well go further, not only requiring unconditional uniformity for p values, but also seeking reasonable conditional performance. In this article we limit discussion of this issue to presentation of some examples in which it is clear that study of conditional performance is of value in comparing p values; we do not, however, attempt to present general results in this direction.

There is a vast literature on other methods of evaluating p values. Much of the literature is concerned with power comparisons against alternatives. There is also a significant literature concerned with decision-theoretic evaluations of p values (e.g., Blyth and Staudte 1995; Hwang, Casella, Robert, Wells, and Farrell 1992; Hwang and Pemantle 1997; Hwang and Yang 1997; Schaafsma, Tolboom, and Van Der Meulen 1989; Thompson 1997). Neither of these evaluation techniques is within the scope of this article, because we are specifically concerned with the situation in which no alternative is present (see the next section). From a non-Bayesian perspective, however, evaluation of the new p values by these criteria might well prove very illuminating; see RVV (2000) for interesting results in this direction.

1.3 To Be and Not To Be

This article has five sections. In Section 2 we consider two new p values introduced by Bayarri and Berger (1999), the partial posterior predictive p value ([p.sub.ppost]) and the conditional predictive p value ([p.sub.cpred]), and compare them with previous p values in specific examples. Some results are also given in Section 2 concerning equality of various p values; of particular interest is a result (Theorem 2) that allows ready computation of [p.sub.sim] in certain situations. In Section 3 we compare the various p values in the normal linear model, where exact computation is possible; this section was directly motivated by RVV (2000). In Section 4 we discuss the situation of discrete sample spaces, with emphasis on analysis of contingency tables; this has long been a highly problematic area, with the discreteness of the sample space causing many p values to be very conservative. We present conclusions in Section 5.

A number of relevant issues are not considered in this article. First, we do not explicitly discuss the choice of T, in part because this is a contentious issue and is not directly related to our development; one of the strengths of the methodology that we propose is that it can be applied to essentially any choice of T. Second, our primary focus is on model checking at initial, exploratory stages of the statistical analysis, and consideration of a wide variety of intuitive T is often useful at that stage. If one has a clearly formulated alternative to [H.sub.0], then we would not recommend using any p value to perform the test, and would instead use either Bayes factors or conditional frequentist tests (Bayarri and Berger 1999; Berger, Boukai, and Wang 1997; Berger, Brown, and Wolpert 1994; Berger and Delampady 1987; Berger and Sellke 1987; Delampady and Berger 1990; Edwards, Lindman, and Savage 1963). The decision to even formulate an alternative to [H.sub.0], however, is often undertaken on the basis of a n analysis designed to indicate incompatibility of the model with the data, based on intuitive departure statistics T = t(x). If determination of T required hard work, then we would suggest spending the time instead on actual formulation of the alternative.

The other major issue that we mostly avoid is discussion of measures other than p values of data compatibility with the model. (For a review of a number of other measures that have been proposed, see Bayarri and Berger 1997.) The primary reason for considering only p values is their ubiquitous presence in statistics, together with the fact that they do have some desirable properties (such as invariance to transformations of X). Balanced against this is the near ubiquitous misinterpretation of p values as either frequentist error probabilities or (worse) as the probability of [H.sub.0]. Luckily, a rather simple calibration is available that allows p values to be given an intuitive interpretation: compute B(p) = -ep log(p), when p [less than] [e.sup.-1], and interpret this as the odds (or Bayes factor) of [H.sub.0] to [H.sub.1], where [H.sub.1] denotes the (unspecified) alternative to [H.sub.0]. For those who prefer to think in terms of a frequentist error probability [alpha] (in rejecting [H.sub.0]), the cali bration is [alpha](p) = [(1 + [[-ep log(p)].sup.1]).sup.-1]. As an example, p = .05 translates into odds B(.05) = .41 (roughly 1-2.5) of [H.sub.0] to [H.sub.1], and frequentist error probability [alpha](.05) = .29 in rejecting [H.sub.0].

These calibrations were developed and motivated from a various perspectives by Sellke, Bayarri, and Berger (1999). On the Bayesian side, they arise from robust Bayesian arguments, as lower bounds on Bayes factors for testing [H.sub.0]. [B(p) arises exactly in testing against a general nonparametric alternative, and arises approximately in parametric analyses.)] On the frequentist side, [alpha](p) arises as a lower bound on the type I error probability, over a large class of conditional frequentist tests, where one conditions on the "strength of evidence" in the data. It is of interest that the calibrations are based in part on starting with a proper p value; that is, a p value that is U[0, 1] in some sense.

Further discussion of some of the philosophical issues surrounding the use of p values has been given by Bayarri and Berger (1999). From now on, we ignore these issues and simply assume that (possibly calibrated) p values are useful, for whatever reasons, and focus on the issue of which p values are most satisfactory.

2. CONDITIONAL PREDICTIVE [rho] VALUES

Section 2.1 introduces the two new p values that we consider and illustrates their definitions (and those of the other p values) in a standard example; some interesting features of the various p values are also observed. Section 2.2 presents the motivations for the new p values from both Bayesian and frequentist perspectives. Section 2.3 addresses computational issues.

2.1 Methodology

Consider first the partial posterior predictive p value, defined for a prior [pi]([theta]) (typically noninformative) as

[P.sub.ppost] = [Pr.sup.m](./[x.sub.obs]\[t.sub.obs])(T [greater than or equal to] [t.sub.obs]); (8)

here T = t(X), [t.sub.obs] = t([X.sub.obs]), and m(./x.sub.obS] \ [t.sub.obs]) and the (assumed proper) partial posterior [pi]( ./x.sub.obS] \ [t.sub.obs]) are given by

m(t/[x.sub.obs] \ [t.sub.obs]) = [integral] f(t/[theta])[pi]([theta]/[x.sub.obs] \ [t.sub.obs]) d[theta]

and

[pi]([theta]/[x.sub.obs] \ [t.sub.obs]) [varies] f([x.sub.obs]/[t.sub.obs];[theta])[pi]([theta]) [varies] f([x.sub.obs];[theta])[pi]([theta])/f([t.sub.obs];[theta]). (9)

Intuitively, this avoids the double use of the data that occurs in the posterior predictive p value, because the contribution of [t.sub.obS] to the posterior is "removed" before [theta] is eliminated by integration. (The notation [x.sub.obs] \ [t.sub.obs] was chosen to indicate this.)

The second p value that we propose is a specific case of what can be termed a U-conditional predictive p value, defined, for some conditioning statistic U = u(X), as

[P.sub.cpred(u)] = [P[r.sup.m](./[u.sub.obs])](T [greater than or equal to] [t.sub.obs]); (10)

here [u.sub.obs] = u([x.sub.obs]) and (formally)

m(t/u) = [integral] f(t/u;[theta])[pi]([theta]/u)d[theta], (11)

assuming that

[pi]([theta]/u) = f(u;[theta])[pi](theta])/[integral] f(u;[theta])[pi]([theta]) d[theta] (12)

is proper. [Recall that f(t/u;[theta]) and f(u;[theta]) are defined as the conditional and marginal densities of T and U under [H.sub.0].]

The specific proposal that we recommend, for the case of continuous data, is obtained by choosing U in (10) to be the conditional MLE of [theta], given t(x) = t, defined as

[[theta].sub.cMLE](X) = arg max f(x/t,[theta]) = arg max f(x;[theta])/f(t;[theta]). (13)

We suppress [[theta].sub.cMLE] and call the resulting p value simply the conditional predictive p value, denoted by [p.sub.cpred] = [p.sub.cpred([[theta].sub.cMLE])]. Note that m(t/u) is unaffected by one-to-one transformations of u(x), so that any one-to-one transformation of (13) is satisfactory as the choice of [[theta].sub.cMLE].

Note that when T is conditionally independent of [[theta].sub.cMLE] and (T,[[theta].sub.cMLE]) are jointly sufficient, both of the foregoing proposals for p values agree; that is, [P.sub.ppost] = [P.sub.cpred]. This occurs in the following example from Meng (1994), which we use to exhibit the various p values that have been defined so far.

Example 1. Assume that under the null, the [X.sub.i] are iid N(0, [[sigma].sup.2]), with [[sigma].sup.2] unknown. The statistic t(X) = /X/ is chosen to measure departure from the model (which would be natural for detecting a discrepancy in the mean of the model). The various p values are given by

p = Pr{/X/ [greater than] /[X.sub.obs]/}, (14)

with different distributions used to compute the probability. For the Bayesian p values, we utilize the usual non-informative prior for [[sigma].sup.2] : [pi]([[sigma].sup.2]) [varies] 1/[[sigma].sup.2]. Finally, define [s.sup.2] = [sigma][([x.sub.i] - x).sup.2/n].

[P.sub.plug] : Because X [sim] N(0, [[sigma].sup.2]/n) and [[sigma].sup.2] = 1/n [[[sigma].sup.n].sub.i=1] [[x.sup.2].sub.i] = [s.sup.2] + [x.sup.2] is the MLE, it follows from (2) and (14) that

[P.sub.plug] = 2 [1 - [phi] ([square root]n/[x.sub.obs]//[square root][[s.sup.2].sub.obs] + [[x.sup.2].sub.obs])]. (15)

One obvious inadequacy with this p value is that [P.sub.plug] [right arrow] 2[1 - [phi]([square root]n)], a positive constant, as /[x.sub.obs]/[S.sub.obs] [right arrow] [infinity]. Thus, even with arbitrarily strong evidence against the null model, the p value will not go to 0 (for fixed n). For large n, this limiting constant will of course be small, so that it would not pose a practical problem. In practice, however, the number of observations is often not large in comparison to the number of parameters, so that concerns of this type can be relevant. In any case, such behavior is indicative of a fundamental flaw in the procedure. (In this example, one could achieve results that are more satisfactory by plugging in [S.sub.obs] rather than the MLE; indeed, this is related to the "conditional plug-in" p value, which, however, is shown in RVV 2000 to also have deficiencies.)

[P.sub.sim]: A sufficient statistic for [[sigma].sup.2] is V = [[[sigma].sup.n].sub.i=1] [[X.sup.2].sub.i] = [[parallel]X[parallel].sup.2]. The distribution of X, given [v.sub.obs] = [[parallel][x.sub.obs][parallel].sup.2], is uniform on {x = [[parallel]x[parallel].sup.2] = [[parallel][x.sub.obs][parallel].sup.2]}, so that (3) and (14) yield

[P.sub.sim] = Pr(/X//[parallel][x.sub.obs][parallel] [greater than] /[x.sub.obs]//[parallel][x.sub.obs][parallel])

= Pr (/Z/ [greater than] [x.sub.obs]//[parallel][x.sub.obs][parallel]), (16)

where Z has a uniform distribution on {z: [[parallel]z[parallel].sup.2] = 1}. Although this might appear to be difficult to compute, it is shown later (using Theorem 2) that [P.sub.sim] is exactly equal to [P.sub.ppost] and [P.sub.cpred], which in turn are equal to the classical p value for the problem given in (18); we found this result surprising.

[p.sub.prior]: The prior predictive p value cannot be computed for this example, because the prior distribution is improper.

[p.sub.post]: The posterior density, [pi]([[sigma].sup.2]\[x.sub.obs]), is [Ga.sup.-1](n/2, n([s.sup.2] + [x.sup.2])/2), and the posterior predictive distribution of X is [m.sub.post](x\[x.sub.obs]) = [t.sub.n](x\0, 1/n([[s.sup.2].sub.obs] + [[x.sup.2].sub.obs])); here [Ga.sup.-1] and [t.sub.n] denote the inverse gamma distribution and the t distribution with n degrees of freedom. From (6) and (14), it follows that

[p.sub.post] = 2 [1 - [[[Upsilon]].sub.n] ([square root]n[x.sub.obs]/[square root][[s.sup.2].sub.obs] + [[x.sup.2].sub.obs])], (17)

where [[[Upsilon]].sub.n], represents the distribution function of the t distribution with n degrees of freedom. As could be expected, (17) is very similar to [p.sub.plug], given in (15). Indeed, it has the similar inappropriate behavior that [p.sub.post] [right arrow] 2[1 - [[[Upsilon]].sub.n]([square root]n)], a positive constant, as \[x.sub.obs]\[s.sub.obs] [right arrow] [infinity]. For instance, when n = 4 this constant is .12, and the posterior predictive p value never drops below this constant, no matter how many standard deviations [x.sub.obs] is from 0. The inadequacy of [p.sub.post] here (or of [p.sub.plug]) can be directly traced to the double use of the data, in particular to the fact that [x.sub.obs] is involved in computing both the posterior (or the MLE) and the tail area. Interestingly, the problem with [p.sub.plug] is less severe than that with [p.sub.post] in that the limiting constant is smaller (.046 when n = 4, for instance).

[p.sub.cpred]: Computation shows that

f(x\t;[[sigma].sup.2]) [varies] f(x;[[sigma].sup.2])/f(t;[[sigma].sup.2]) [varies] [([[sigma].sup.2]).sup.[(n-1)/2]] exp {-n[s.sup.2]/2[[sigma].sup.2]},

which is maximized at [[[sigma].sup.2].sub.cMLE] = n[s.sup.2]/(n - 1). As observed earlier, it is equivalent to take [S.sup.2] as the conditioning statistic. It is then easy to show that [pi]([[sigma].sup.2]\[s.sup.2]) is [Ga.sup.-1]((n - 1)/2, n[s.sup.2]/2) and that m(x\[[s.sup.2].sub.obs]) = [t.sub.n-1](x\0,[1/(n - 1)][[s.sup.2].sub.obs]). The resulting conditional predictive p value is

[p.sub.cpred] = 2 [1 - [[[Upsilon]].sub.n-1] ([square root]n - 1[x.sub.obs]/[s.sub.obs])]. (18)

This is perfectly satisfactory and indeed equals the usual classical p value for the problem based on the usual one-sample t statistic, which is known to be uniform under the null.

[p.sub.ppost]: In this case T = X is independent of [[[sigma].sup.2].sub.cMLE] [varies] [S.sup.2] (and they are clearly jointly sufficient), so that the partial posterior predictive p value equals the conditional predictive p value, [p.sub.cpred], in (18).

It should be noted that Meng (1994) also considered use of the departure statistic t(x) = \x\/[s.sub.obs] in the foregoing example, and with this statistic, the posterior predictive p value and the plug-in p value perform fine (being then equal to the other p values). Note, however, that in more complex problems, it may be quite difficult to find "appropriate" departure statistics for use with the posterior predictive or plug-in p values (RVV 2000).

2.2 Motivation and Comparison

2.2.1 Bayesian Motivations for [p.sub.ppost] and [p.sub.cpred]. The U-conditional posterior predictive p values appear to combine the positive features of both the prior predictive and the posterior predictive p values. First, they are based on the prior predictive m(x), which has natural Bayesian meaning; indeed, when [pi]([theta]) is proper, m(t\u) is simply the conditional distribution of T given U arising from the prior predictive m(x). Second, with appropriate choice of U, (10) can be made to primarily reflect surprise in the model, with the prior playing only a secondary role. Third, noninformative priors can be used, as long as [pi]([theta]\u) is proper. Finally, there is no double use of the data, because only part of the data ([u.sub.obs]) is used to produce the posterior to eliminate [theta], whereas another part ([t.sub.obs]) is used when computing the tall area.

Of course, the key to the U-conditional predictive p value is a suitable choice of the conditioning statistic U. Different possible choices of U have been explored by Bayarri and Berger (1997). (See also Evans 1997, where the conditional predictive distribution was used to develop alternate measures of surprise, with U and T chosen to be separate subsamples of the data. A rather different possibility was given by the cross-validatory predictive distribution as described in Gelfand, Dey, and Chang 1992; see Carlin 1999 and the rejoinder in Bayarri and Berger 1999 for more discussion.) The intuition behind suitable choice of U is that one wants U to contain as much information about [theta] as possible, so that [pi]([theta]\[u.sub.obs]) will effectively eliminate [theta] (via integration), subject to the constraint that U should not involve T, as this could lead to a reduction in discriminatory power of the procedure. In Example 1, for instance, [sigma] [[x.sup.2].sub.i]/n would contain all information about [[sigma].sup.2] (being a sufficient statistic under the presumed model), but does involve t(x) = \x\. The obvious solution (used in Example 1) is to define u(x) = [s.sup.2] = [sigma][([x.sub.i] - x).sup.2]/n, because this contains the information about [[sigma].sup.2] that is independent of t(X).

Investigations by Bayarri and Berger (1997) also suggest that u(x) should have the same dimension as [theta]. The simplest general algorithm that achieves these various aims, for the case of continuous data, is to define U to be the conditional MLE of [theta], given t(x) = t, as defined in (13). [The situation of discrete data is considerably more difficult; whereas [[theta].sub.cMLE] in (13) is still typically well defined, it will not be suitable as a conditioning statistic if the resulting conditional sample space contains too few values.]

While logically appealing, the conditional predictive p value, with the conditioning statistic [[theta].sub.cMLE] chosen as in (13), can be difficult to compute. An attractive alternative is to directly use f(x\t; [theta]) [see (13)] to integrate out [theta], rather than simply using it to define [[theta].sub.cMLE]. This leads to the partial posterior predictive p value, defined in (8) and (9), which is typically much easier to work with. Furthermore, the parallel with (13) suggests that the partial posterior predictive p value will be very similar to the conditional predictive p value. This is shown in our (continuous) examples and is further reinforced by RVV (2000), who show that [p.sub.cpred] and [p.sub.ppost] are asymptotically equivalent.

The foregoing Bayesian motivations may appear rather "loose," but history in other areas of statistics has shown that when sound Bayesian reasoning and noninformative priors are used to develop procedures, these procedures typically also have very desirable non-Bayesian properties. That this is so for [p.sub.cpred] and [p.sub.ppost] is discussed in the next section.

2.2.2 Frequentist Motivations and Comparisons. RVV (2000) show that [p.sub.cpred] and [p.sub.ppost] are asymptotic frequentist p values; that is, their asymptotic distribution is U[0, 1] for all [theta]. In this section we study whether this is so for small samples; we say that a p value is a frequentist p value for all [theta] if it is U[0, 1] for all [theta]. We present an illustrative example and two relevant theorems, the first of which follows.

Theorem 1. Let p(X) be any U-conditional predictive p value for a proper [pi]([theta]), and consider it as a random variable with respect to the distribution f(x; [theta]). Assume that the distribution of p(X) does not depend on [theta], and that the conditional distribution of T, given U, is absolutely continuous. Then p(X) is a frequentist p value for all [theta]. The conclusion also holds for improper [pi](theta]) under condition (A. 1) in the Appendix, which is in particular satisfied if U has a location or scale-parameter distribution and [pi]([theta]) is the reference prior.

Proof Suppose that [pi]([theta]) is proper. Then both the conditional predictive distribution for T, m(t\u) = [integral] f(t\u; [theta])[pi]([theta]\u) d([theta], and the prior predictive for U, m(u) = [integral] f(u; [theta])[pi]([theta]) d{theta], are proper. Also, because p(X) is by definition a proper p value with respect to m(t\u), it follows that

[Pr.sup.m(.)] (p(X) [less than or equal to] [alpha]) = [E.sup.m(u) [E.sup.m(t\u) [p(X) [less than or equal to] [alpha]]

= [E.sup.m(u) [[alpha]] = [alpha].

But because

[Pr.sup.m(.)] (p(X) [less than or equal to] [alpha]) = [E.sup.[pi]([theta])] ([E.sup.f(x;[theta])] [p(X) [less than or equal to] [alpha]]),

it follows that if p(X) has a distribution that does not depend on [theta], then

[Pr.sup.m(.)] (p(X) [less than or equal to] [alpha]) = [E.sup.[pi]][theta])] [c([alpha])] = c([alpha]),

where c([alpha]) is some function of [alpha]. It is immediate that c([alpha]) = [alpha] and hence that p(X) is an exact p value. The proof for the improper case is given in the Appendix.

An obvious situation in which Theorem 1 applies is when U can be taken to be a sufficient statistic for [theta]. In that case m(t\u) = f(t\u), and the U-conditional predictive p value equals the frequentist similar p value. Another application of Theorem 1 is to [p.sub.cpred] (and [p.sub.ppost]) in Example 1. From (18), it is clear that their distributions do not depend on [[sigma].sup.2], because the distribution of [square root](n - 1)X/S is independent of [[sigma].sup.2]. Also, [[[sigma].sup.2].sub.cMLE] has a scale-parameter distribution, so it can be immediately concluded that [p.sub.cpred] and [p.sub.ppost] are frequentist p values for all [[sigma].sup.2].

In Example 1, [p.sub.plug] and [p.sub.post] will not be frequentist p values, but the extent to which they deviate from uniformity must be studied numerically. We thus turn to a simpler situation where exact computations can be performed.

Example 2. Assume that [X.sub.1], [X.sub.2],..., [X.sub.n] is a random sample from the exponential([lambda]) distribution. Let T = [X.sub.(1)](which could be used to investigate the lower tail of the null distribution) and assume that the usual noninformative prior [pi]([lambda]) = 1/[lambda] is to be used. In the following, [X.sub.(1)] [less than] [X.sub.(2)] [less than] ... [less than] [X.sub.(n)] denote the order statistics for the observations. Also, define S = [[[sigma].sup.n].sub.i=1] [X.sub.i] and let [s.sub.obs] be the sum of the observed [x.sub.i]. We derive the different p values and investigate their properties. The following fact, established in the Appendix, as used repeatedly:

Pr (T/S [less than or equal to] c) = 1 - [(1 - nc).sup.n-1]. (19)

[p.sub.plug]: Clearly, [lambda] = n/S and T [sim] Ex(n[lambda]), so that

{p.sub.plug] = [e.sup.[-n.sup.2][t.sub.obs]/s.sub.obs]]. (20)

That this is conditionally unsatisfactory can be seen by taking [nt.sub.obs]/[s.sub.obs] [right arrow] 1, in which case the model would clearly be contraindicated, yet [P.sub.plug] [right arrow] [e.sup.-n], a nonzero constant. To investigate whether [P.sub.plug] is a frequentist p value for all [lambda], an easy computation using (20) and (19) yields, for [alpha] [greater than] [e.sup.-n],

Pr([p.sub.plug](X) [less than or equal to] [alpha]) = Pr (T/S [greater than or equal to] - log [alpha]/[n.sup.2])

= [(1 + log [alpha]/n).sup.n-1]. (21)

Thus [p.sub.plug](X) does not have a U[0, 1] distribution and is not a frequentist p value. Figure 1 graphs the density corresponding to (21) when n = 2, to show the substantial deviation from uniformity that can occur. Note, however, that [p.sub.plug] (X) is an asymptotic frequentist p value. Indeed, for large n, (21) is approximately given by

Pr([p.sub.plug](X) [less than or equal to] [alpha]) [approximate] [alpha] [1 - 1/n (log [alpha] + 1/2 [log.sup.2] [alpha])],

which does go to a [alpha] as n [right arrow] [infinity].

[p.sub.sim]:Because S is sufficient, the distribution of [X.sup.1], [X.sub.2],...,[X.sub.n] given s is uniform on the set {X : [[[sigma].sup.n].sub.i=1] [X.sub.i] = s}, and so

[p.sub.sim] = Pr(T [greater than] [t.sub.obs]\[s.sub.obs]) = Pr ([W.sub.(1)] [greater than] [nt.sub.obs]/[s.sub.obs])

= [(1 - [nt.sub.obs]/[s.sub.obs]).sup.(n-1),

where [W.sub.(1)] = min ([W.sub.1], [W.sub.2],..., [W.sub.n] and the [W.sub.i] are iid U(0, 1). This will be seen to be equal to [p.sub.ppost].

[P.sub.prior]: The prior predictive p value cannot be computed for this example, because the prior distribution is improper.

[P.sub.post]: The posterior distribution of [lambda] is easily seen to be Ga(n, [s.sub.Sobs]), and the posterior predictive density of T is ([n.sup.2]/[s.sub.obs])[[(s.sub.obs]/(nt+[s.sub.obs]))].sup.n+1]. The posterior predictive p value can then be computed as

[p.sub.post] = [Pr.sup.mpost(t\xobs)] (T [greater than] [t.sub.obs) = [(1 + [nt.sub.obs]/[s.sub.obs]).sup.-n].

It can be seen that [p.sub.post] [right arrow] [2.sup.-n], a nonzero constant, as [nt.sub.obs]/[s.sub.obs] [right arrow] 1, which is not appropriate behavior. Moreover, the distribution of [p.sub.post] is not U[0,1]. Indeed, for [alpha] [greater than] [2.sup.-n],

Pr([p.sub.post](X) [less than or equal to] [alpha]) = Pr [T/S [greater than or equal to] - 1/n (1/[[alpha].sup.1/n])]

= (2 - [[[alpha].sup.-1/n]).sup.n-1]. (22)

The corresponding density function is graphed in Figure 1 when n = 2, and is even further from uniformity than is the density corresponding tO [p.sub.plug]. Again, however, [p.sub.post] can be shown to be asymptotically U[0, 1].

[p.sub.ppost]: An easy computation shows that

f(X)\t;[lambda]) [alpha] [[lambda].sup.n-1] exp {-[lambda] ([sigma] [x.sub.i] - nt)}, (23)

so that the partial posterior for [lambda] is

[pi]([lambda]\[x.sub.obs] \ [t.sub.obs]) = [[lambda].sup.n-2][e.sup.-[lambda]([s.sub.obs]-[nt.sub.obs])/T(n - 1)[([s.sub.obs] - [nt.sub.obs]).sup.-(n-1)]

and the partial posterior predictive density is

m(t\[x.sub.obs] \ [t.sub.obs]) = n(n - 1)[([s.sub.obs] - [nt.sub.obs]).sup.n-1]/[(nt + [s.sub.obs] - [nt.sub.obs]).sup.n]

The partial posterior predictive p value can then be computed as

[p.sub.ppost] = [Pr.sup.m(t\[x.sub.obs]\[t.sub.obs]) (T [greater than] [t.sub.obs)

= [(1 + [nt.sub.obs]/[s.sub.obs] - [nt.sub.obs]).sup.-(n-1)

= [(1 - [nt.sub.obs]/[s.sub.obs]).sup.n-1],

which is identical to the similar p value. It can be shown that [P.sub.ppost] [right arrow] 0 as [n.sub.obs]/[s.sub.obs] [right arrow] 1, so that there are no apparent conditional difficulties with the partial posterior predictive p value. It will also be seen that [p.sub.ppost] is a frequentist p value for all n.

[p.sub.cpred]: Maximization of (23) over [lambda] yields [[lambda].sub.cMLE] [alpha] [[[sigma].sup.n].sub.i=1] [X.sub.i] - n[X.sub.1] = S - nT. It can be seen that [X.sub.(i)] - [X.sub.(1)] is independent of [X.sub.(l)], from which it follows directly that [[lambda].sub.cMLE] is independent of T. As discussed in the paragraph preceeding Example 1, it follows that [p.sub.cpred] = [p.sub.ppost]. Note that the derivation of [p.sub.ppost] was considerably simpler than that of [p.sub.cpred]. Finally, as in the argument leading to (19), it can be shown that Pr([p.ppost])X) [less than or equal to] [alpha]) does not depend on [lambda]. Also, [[lambda].sub.cMLE] has a scale-parameter distribution, and so, by Theorem 1, [p.sub.cpred] (and hence also [p.sub.ppost] and [p.sub.sim]) is a frequentist p value. Notice that Theorem 1 cannot be directly applied to [p.sub.ppost].

It is something of a curiosity that in both Examples 1 and 2, [p.sub.sim] coincides with [p.sub.cpred] and [p.sub.ppost], especially because [p.sub.sim] and [p.sub.cpred] are determined from distributions on completely different (conditional) spaces. This is useful methodologically for those who wish to use [p.sub.cpred] or [p.sub.sim], because it is typically much easier to derive [p.sub.ppost] than either of the other two p values. The following theorem gives more general conditions under which this equivalence holds. (It is easy to see that Examples 1 and 2 both satisfy the conditions of the theorem.)

Theorem 2. Suppose that f(x;[theta]) is a continuous density from the natural scale exponential family and that statistics T [greater than] 0 and U [greater than] 0 exist such that S = T + U is sufficient and

f(t,U;[theta]) = k[[theta].sup.[alpha]][t.sup.[gamma]][u.sup.[alpha]-[gamma]-2] exp{-[theta](t + u)},

for some constants k, [gamma] [greater than] -1, and [gamma] [less than] [alpha] - 1. Under the usual noninformative prior, [pi]([theta]) = 1/[theta], it will be the case that [p.sub.cpred], [p.sub.ppost], and [p.sub.sim] are all equal.

Proof. That [p.sub.cpred] and [p.sub.ppost] are equal follows from direct calculation. To show their equality with [p.sub.sim], first integrate f(t,u;[theta]) with respect to [pi]([theta]) = 1/[theta], obtaining

m(t, u) [alpha] [t.sup.[gamma]][u.sup.[alpha]-[gamma]-2/[(t + u).sup.[alpha]].

Thus m(t/[u.sub.obs]) = [ct.sup.[gamma]][(t+[u.sub.obs]).sup.-[varies]], where c is the appropriate normalizing constant, and hence

[p.sub.cpred] = [[[integral].sup.[infinity]].sub.[t.sub.obs]] m (t/[u.sub.obs])dt = c [[[integral].sup.[infinity]].sub.[t.sub.obs]] [t.sup.[gamma]]/[(t + [u.sub.obs]).sup.[varies]] dt. (24)

An easy computation shows that the conditional density of T given S is

[f.sup.*](t/s) = [c.sup.*] [t.sup.[gamma]][(s - t).sup.[varies]-[gamma]-2]/[S.sup.[varies]], for 0 [less than] t [less than] s,

where [c.sup.*] is the appropriate normalizing constant. Hence [p.sub.sim] is given by

[p.sub.sim] = [[[integral].sup.[infinity]].sub.[t.sub.obs]] f(t\[S.sub.obs])dt = [c.sup.*] [[[integral].sup.[S.sub.obs]].sub.[t.sub.obs]] [t.sup.[gamma]][([S.sub.obs] - t).sup.[varies]-[gamma]-2] dt.

Changing variables to t = ([S.sub.obs][omega])/([omega] + [u.sub.obs]), the latter integral reduces to that in (24), and the thereon follows.

2.3 Computation

Simulation methods are typically needed to compute the partial posterior predictive p value. These simulations will typically be only modestly more difficult than those involved in computation of either the prior predictive p value or the posterior predictive p value, providing that f([t.sub.obs]; [theta]) is available in closed form.

Noting that the partial posterior predictive p value can be rewritten as

[p.sub.ppost] = [integral]Pr(T [greater than or equal to] [t.sub.obs];[theta])[pi]([theta]\[x.sub.obs] \ [t.sub.obs]) d[theta],

an obvious strategy is to repeatedly generate [theta] from [pi]([theta]\[x.sub.obs] \ [t.sub.obs]) and then T from f(t;[theta]) [which could of course be done by simply generating X from f(x;[theta]) and computing t(X)], estimating [p.sub.ppost] by the fraction of generated T that are greater than [t.sub.obs].

There are various possibilities for generating from [pi]([theta]\[x.sub.obs] \ [t.sub.obs]). If generation from the full posterior [pi]([theta]\[x.sub.obs]) is easy, then a natural possibility is to use the following simple Metropolis chain: use [pi]([theta]\[x.sub.obs] as the probing distribution to obtain a candidate [[theta].sup.*], and then move from the current [theta] to the candidate with probability min {1, f([t.sub.obs];[theta])/f([t.sub.obs];[[theta].sup.*])}.

In some sense, a "bad" discrepancy statistic T is one for which f([t.sub.obs];[theta])is highly variable in [theta]. (Casually chosen T for model checking will often have this property.) Whereas such T will not yield good p values by standard methods, the results in this article (and in RVV 2000) indicate that the new conditional p values will still be quite satisfactory. The price to be paid, however, is that computation of the new p values can be more difficult with such T, because [pi]([theta]\[x.sub.obs]) may no longer be a good probing distribution. A slight modification of the foregoing Metropolis chain can then be considerably more efficient: generate U, a uniform random variable on (0, 1); generate [theta]' from [pi]([theta]\[x.sub.obs]); and choose the candidate [theta]' = [theta]' + U ([theta] - [[theta].sub.cMLE]), where [theta] and [[theta].sub.cMLE] are the MLE and the conditional MLE [see (13)] of [theta]. Then, move from the current [theta] = [theta][degrees] + U[degrees]([theta] - [[theta].su b.cMLE]) to the candidate [[theta].sup.*] with probability

min {1, f([t.sub.obs];[theta])/ f([t.sub.obs];[[theta].sup.*]) [pi]([[theta].sup.*])/[pi]([theta]) [pi]([theta][degrees])/[pi]([theta]') f([x.sub.obs];[[theta].sup.*])/f([x.sub.obs];[theta]) f([x.sub.obs];[theta][degrees])/f([x.sub.obs];[theta]')}.

Alternatives to such direct Monte Carlo computation of [p.sub.ppost] include importance sampling schemes. For instance, if a (possibly dependent) sample [{[theta].sub.j],j = 1,...,m} from [pi]( [theta]\[x.sub.obs]) were available, then one could estimate [p.sub.ppost] by

[p.sub.ppost] = [[[sigma].sup.m].sub.j=1] Pr(T [greater than or equal to] [t.sub.obs];[[theta].sub.j])/f([t.sub.obs];[[theta].sub.j])/[[[sigma] .sup.m].sub.j=1] 1/f([t.sub.obs];[[theta].sub.j])

[This would work because f([x.sub.obs];[theta]) is typically considerably more concentrated than f([t.sub.obs];[theta]).]

When f([t.sub.obs];[theta]) is not available in closed form, it must be estimated, possibly through some type of kernel estimate; note that T typically is a one-dimensional statistic, and estimation of a one-dimensional density at a point usually is not excessively difficult. Of course, this estimation must be done in conjunction with the Metropolis or importance sampling schemes mentioned earlier, and efficiency might improve if one keeps only widely spaced (i.e., approximately independent) [theta].

Computing [p.sub.cpred](u) is usually considerably more difficult, unless the densities on the left side or the right side of (11) are available in closed form. Various Gibbs and Metropolis-Hasting schemes for its computation were given by Bayarri and Berger (1999) and are not repeated here. The computational difficulty of [p.sub.cpred](u) (see also Pauler 1999) is the main reason why we recommend [p.sub.ppost] for routine use.

3. COMPARISONS IN THE NORMAL LINEAR MODEL

In this section we derive the various p values for the normal linear model and give characterizations of their degree of uniformity (frequentist sense). This section was motivated by RVV (2000), who derive corresponding results under assumptions that yield asymptotic normality. Seeing the results in the finite-sample setting (under normality) should help alleviate any concerns about "asymptopia." Note that in this section we do not attempt to distinguish between the MLE and its value at the observed data; both are denoted by [theta].

Let Y = [([Y.sub.1], [Y.sub.2],...,[Y.sub.n]).sup.t] be the n x 1 vector of response variables, let [theta] = [([[theta].sub.l], [[theta].sub.2],...,[[theta].sub.k]).sup.t] be the k x 1 vector of regression coefficients, let V be a full-rank n x k matrix of covariables, and let [epsilon] be an n x 1 vector of errors. Assume that we are testing

[H.sub.0]: Y = V[theta] + [epsilon], [epsilon] [sim] [N.sub.n] (0, [[sigma].sup.2]1), [[sigma].sup.2] known. (25)

Consider a linear departure statistic T = [w.sup.t]Y, with given w = [([w.sub.1], [w.sub.2],...,[w.sub.n]).sup.t]. It follows from (25) that

T\[theta] [sim] N([w.sup.t]V[theta], [[sigma].sup.2] [\\w\\.sup.2]). (26)

Also, with the usual noninformative prior, [pi]([theta]) = 1, the posterior distribution, [pi]([theta]\y), is [N.sub.k]([theta]\[theta], [[sigma].sup.2] [([V.sup.t]V).sup.-1]), where [theta] = [([V.sup.t]V).sup.-1])[V.sup.t]y is the usual least squares estimate.

3.1 Plug-In p Value

It follows from (26) that [p.sub.plug] is given by

[p.sub.plug] = [Pr.sup.f(t;[theta])](T[greater than][t.sub.obs])=1-[phi] ([t.sub.obs]-[w.sup.t]V[theta]/[sigma]\\w\\).

To study the distribution of [p.sub.plug](Y) (to assess its frequentist uniformity), note that

T -[w.sup.t]V[theta][sim]N([w.sup.t]BV[theta], [[sigma].sup.2][w.sup.t][BB.sup.t]w), (27)

Where B = I-V[([V.sup.t]V).sup.-1][V.sup.t]. Because BV = 0 and [BB.sup.t] = B, it follows that

[p.sub.plug](Y) = 1 - [phi]([square root][w.sup.t]Bw/[\\w\\.sup.2]Z), (28)

where Z [sim] N(0, 1). Thus [p.sub.plug](Y) will have a U[0, 1] distribution only if [w.sup.t]Bw/[\\w\\.sup.2] = 1 which in turn can happen only if [V.sup.t]w = 0. Although the latter will be satisfied by common choices of T, such as a linear function of the vector of residuals, it clearly need not hold in general. When it does not hold, [w.sup.t]Bw/[\\w\\.sup.2] will be smaller than 1, so that [p.sub.plug] will be conservative.

3.2 Posterior Predictive p Value

The posterior predictive distribution of T, given [Y.sub.obs] is N([W.sup.t]V[theta],[[sigma].sup.2][w.sup.t]Cw), where C = I + V[([V.sup.t]V).sup.-1][V.sup.t]. It follows that the posterior predictive p value is given by

[p.sub.post] = [Pr.sup.[m.sub.post(t\[y.sub.obs])]](T[greater than][t.sub.obs] = 1-[phi]([t.sub.obs]-[w.sup.t]V[theta]/[sigma][square root][w.sup.t]Cw)

When considered as a random p value and using (27), [p.sub.post] can be expressed as

[p.sub.post](Y) = 1-[phi]([square root][w.sup.t]Bw/[w.sup.t]Cw Z) (29)

Again, this will be U[0, 1] only if [V.sup.t]w = 0. Otherwise, [w.sup.t]Cw will be larger than [\\w\\.sup.2] and, comparing (28) and (29), [p.sub.post] will then be even more conservative than [p.sub.plug] This observation was first made in the asymptotic setting by Robins (1999) and RVV (2000).

3.3 Partial Posterior Predictive p Value

Calculation yields

f(y\[t.sub.obs];[theta])[varies]exp{-[1/2[[sigma].sup.2][[([theta]-[t heta]).sup.t][V.sup.t]V([theta]-[theta])-[([w.sup.t]V[theta]-T).sup.t ] [([\\w\\.sup.2]).sup.-1]([w.sup.t]V[theta]-T)]}. (30)

Because the partial posterior distribution [pi]([theta]\[y.sub.obs]\[t.sub.obs]) is proportional to (30), expanding the quadratic forms in (30) and rearranging terms yields

[pi]([theta]\[y.sub.obs]\[t.sub.obs])=[N.sub.k]([theta]\[u.sub.obs], [[sigma].sup.2][sigma]), (31)

where U = [([V.sup.t]HV).sup.-1][V.sup.t]HY,[sigma]=[([V.sup.t]HV).sup.-1], H=[I-([ww.sup.t]/[\\w\\.sup.2])], and the right side of (31) denotes the k-variate normal density in [theta] with the given mean and covariance matrix. From (26) and (31), it follows that the partial predictive distribution of T is given by

T\[Y.sub.obs]\[t.sub.obs][sim]N([w.sup.t]V[u.sub.obs],[[sigma].sup.2] [w.sup.t] [I+V[sigma][V.sup.t]]w),

so that the partial posterior predictive p value is

[p.sub.ppost] = [Pr.sup.[m(t\[y.sub.obs]\[t.sub.obs])]] (T[greater than][t.sub.obs]) = 1 - [phi]([t.sub.obs]-[w.sup.t]V[u.sub.obs]/ [sigma][square root][w.sup.t][I+V[sigma][V.sup.t]w)

To study the distribution of [p.sub.ppost] (Y), note that

T -[w.sup.t]VU\[theta][sim]N([w.sup.t]DV[theta],[[sigma].sup.2][w.sup.t ] [DD.sup.t]w),

where D = I - V[([V.sup.t]HV).sup.-1][V.sup.t]H and H is as in (31). Algebra shows that [w.sup.t]DV = 0 and [w.sup.t][DD.sup.t] w = [w.sup.t][ I + V [sigma][V.sup.t]]w, so that [p.sub.ppost](Y) = 1-[phi](Z), where Z [sim] N(0,1). Thus [p.sub.ppost] is a valid frequentist p value.

3.4 Conditional Predictive p Value

It can easily be seen from (30) and (31) that the U maximizing (30) is precisely the statistic U given in (31); that is, U = [([V.sup.t]HV).sup.-1][V.sup.t]HY. Because T and U have a joint (k + 1)-variate normal distribution such that cov(T, U) = [w.sup.t]HV[([V.sup.t]HV).sup.-1] = 0, they are independent. It follows that [P.sub.cpred] equals [p.sub.ppost] and hence it is also a valid frequentist p value.

4. DISCRETE SAMPLE SPACES

For discrete sample spaces, the most common classical approach to defining p values is to condition on a statistic U such that f(x\u;[theta]) does not depend on 9. The Fisher exact test is the prototypical example that we consider here. Note that conditioning on any U can severely constrain the sample space, resulting in serious conservatism of the resulting p value (because there may then be very few possible observations in the "tail" of the departure statistic, T). We see that [p.sub.ppost] can substantially overcome this difficulty. [We do not consider [p.sub.cpred], because the choice of the conditioning statistic in (13) typically will not work in discrete problems, and because any conditional p value can fall prey to the same difficulty mentioned earlier.]

We specifically consider the problem of testing homogeneity and independence in 2 x 2 contingency tables, comparing the similar p value (which is the Fisher exact test) and the partial posterior predictive p value. Many other p values for contingency tables have been proposed (a nice survey of proposed tests was given in Agresti 1992; see also Hwang and Yang 1997), and many of these perform considerably better than the Fisher exact test. Our attitude here is not that of seeking to determine an optimal p value for these situations, but rather to see if straightforward implementation of [p.sub.ppost] can offer significant gains. (Recall that we hope to see [p.sub.ppost] used in situations of considerable complexity, in which there is little hope of determining optimal p values; in judging the effectiveness of [p.sub.ppost], however, it is useful to consider moderately difficult situations, such as this, to see whether an easy implementation works.)

Consider the following 2 x 2 contingency table: We analyze two common scenarios involving such tables.

Case 1. One of the margins, say [X.sub.+1] = [n.sub.1], [X.sub.+2] = [n.sub.2], is fixed by the design, so that [X.sub.11] and [X.sub.12] can be viewed as independent binomial random variables. We want to study the null model of homogeneity, that the two binomial distributions have the same success probability, [theta].

Case 2. The design fixes only the overall sample size, n. A common null model is that classification by A and B is independent, so that the probability of each cell is the product of the corresponding marginal probabilities.

For convenience of notation in this section, we denote an observed value of X.. by a superscript "o"; that is, [[x.sup.o].sub...].

4.1 Case 1: Test of Homogeneity

Here the null model is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

[x.sub.11] = 0,..., [n.sub.1], [x.sub.12] = 0,..., [n.sub.2]. (32)

The Fisher exact test conditions on the other marginal total, say [X.sub.1+]. It is common in textbooks to then take the test statistic (in the conditional problem) to be T = [X.sub.11]. An easy computation shows that the (one-tailed) p value corresponding to the Fisher exact test ([p.sub.fet]) is given by

[p.sub.fet] = [[[sigma].sup.min{[[x.sup.o].sub.1+],[n.sub.1]}].sub.j=[t.sub.obs]] f(j\[[x.sup.o].sub.1+])

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Unconditionally, T = [X.sub.11] is not a particularly sensible statistic for measuring departure from homogeneity. Indeed, Suissa and Shuster (1985) proposed a sensible unconditional T for this particular problem. In illustrating [p.sub.ppost], however, we first study what happens if one naively "follows the textbooks" and chooses T = [X.sub.11], even though this is not sensible unconditionally. (Our point is to show that [p.sub.ppost] behaves admirably even with a simple, but rather inappropriate, choice of T.) Then we consider a choice of T that is more reasonable from an unconditional perspective.

Choosing the constant prior [pi]([theta]) = 1, an easy computation shows that the partial posterior distribution is beta([x.sup.o].sub.12] + 1, [n.sub.2] - [[x.sup.o].sub.12] + 1) and

[p.sub.ppost] = [[[sigma].sup.[n.sub.1]].sub.j=[t.sub.obs]] m(t\[X.sub.obs] \ [t.sub.obs])

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

Incidentally, it can be shown in this problem (for the given choice of T) that [p.sup.cpred] = [p.sub.ppost]; this is thus a discrete situation in which conditioning as in (13) does not unduly restrict the sample space.

A more sensible unconditional choice of the discrepancy statistic is T [(1/[n.sub.1])[X.sub.11] - (1/[n.sub.2])[X.sub.22]] (because the null model is that the two binomial populations have the same success probability). The partial posterior predictive p value for this choice does not admit a simple closed-form expression but can be readily computed numerically.

Example 3. As a rather extreme test case, consider [n.sub.1] = [n.sub.2] = 3. Here conditioning on [x.sub.1+] severely restricts the support of the distribution of [p.sub.fet], which reduces to {.05, .2, .5, .8, .95, 1}. The supports of the distributions of [p.sub.ppost], for either choice of T, are considerably richer and include more values closer to 0 and 1.

Figure 2 gives the distribution functions of [p.sub.fet] (X) and [p.sub.ppost] (X) (for both. choices of T) at two different values of [theta]. Recall that the goal is to have p values with close to uniform distributions, and [p.sub.ppost] clearly fares much better in this regard (the straight dotted lines being the unattainable uniform ideal). As expected, the Fisher exact test is very conservative, which translates into a severe lack of discriminatory power.

Note that [p.sub.ppost] seems to perform somewhat better with the "sensible" discrepancy statistic T = [(1/[n.sub.i])[X.sub.11] - (1/[n.sub.2])[X.sub.22]] than with T = [X.sub.11], in the sense that it then has a distribution somewhat closer to uniform in the most interesting region of small values of p. However, [p.sub.ppost] seems to be quite satisfactory (and much better than the Fisher exact test), even when the intuitively unsuitable T = [X.sub.11] is used.

Various other [theta] were also considered. The distribution of [p.sub.ppost] for the "sensible" choice of T is remarkably stable and performs very well for all values of [theta]. The other two p values ([p.sub.fet] and [p.sub.ppost] with T = [X.sub.11]) were excessively conservative for small [theta], although [p.sub.ppost] began to perform noticeably better even for values of [theta] as small as .2. For large values of [theta], [p.sub.fet] was again very conservative, whereas [p.sub.ppost] performed remarkably well, unless [theta] was very large; we discuss this latter situation more fully later.

4.2 Case 2: Test of Independence

Referring to the contingency table with fixed n and defining [theta] = Pr([A.sub.1]) and [xi] = Pr([B.sub.1]), the null model under independence of classification can be expressed as

f(X;[theta],[xi]) = (n!/[x.sub.11]![x.sub.12]![x.sub.21]![x.sub.22]!) [[theta].sup.[x.sub.+1]][(1-[theta]).sup.x+2][[xi].sup.[x.sub.1]+][(1 -[xi]).sup.[x.sub.2+].

Here the Fisher exact test conditions on both margins, and again the "textbook" conditional departure statistic is typically chosen to be T = [X.sub.11]. The ensuing conditional density of T is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]

which produces the same p value as in the test for homogeneity. (Note that here, [[x.sup.o].sub.1+] plays the role of [n.sub.1] in Case 1.)

In deriving [p.sub.ppost], we restrict attention to the statistic T = [X.sub.11], even though this is not particularly sensible from an unconditional perspective. We do this in part so that it cannot be argued that we obtain better results than [p.sub.fet] by choice of a better T and, in part, to indicate the quality of [p.sub.ppost] even with an inferior choice of T.

Using uniform independent priors for [theta] and [xi], the partial posterior, [pi]([theta], [xi]\[x.sub.obs]\[t.sub.obs]), is proportional to f(X\[t.sub.obs]; [theta], [xi]), and [p.sub.ppost] can most conveniently be expressed as

[p.ppost] = [[[integral].sup.1].sub.0] [[[integral].sup.1].sub.0] [pi]([theta],[xi]\[x.sub.obs] \ [t.sub.obs] [[[sigma].sup.n].sub.t=[t.sub.obs]]

binomial(t\n, [theta][xi]) d[theta] d[xi], (33)

where

[pi]([theta],[xi]\[X.sub.obs] \ [t.sub.obs])

[varies] [[theta].sup.[[x.sup.0].sub.21]] [(1 - [theta]).sup.[[x.sup.0].sub.+2]][[xi].sup.[[x.sup.0].sub.2+]][(1 - [xi]).sup.[[x.sup.0].sub.2+]][(1 - [theta][xi]).sup.-(n-[t.sub.obs])]. (34)

(Note that for this case, [p.sup.ppost] and [p.sub.cpred] differ, and indeed the latter can be problematical because of possibly restrictive conditioning.)

To compute [p.sub.ppost], we use importance sampling based on the importance function

1/2U([theta]\0, 1)beta([xi]\[[x.sup.0].sub.12] + 1,[[x.sup.0].sub.22] + 1) - 1/2 beta([theta]\[[x.sup.0].sub.22] + 1)U([xi]\0, 1). (35)

Not only is this an easy importance function to use in terms of random variable generation, but it also is highly efficient computationally, for even very large n. The reasons for this are given in the Appendix, which also presents other computational details.

Example 4. We again consider a rather extreme case, namely n = 5. It can be shown that the support of [p.sub.fet] (X) is limited to {.l, .2, .3, .4, .6, .7, .8, .9}, whereas the support of [p.sub.ppost] (X) is noticeably richer. The distribution functions of these two p values are given in Figure 3 for various values of the parameters. For all but very large values of the parameters, [p.sub.ppost] seems considerably more uniform than [p.sub.fet].

Further investigations revealed that when both [theta] and [xi] are small, either p value is quite conservative, with [p.sub.ppost] the less conservative. Both p values perform at their best for [theta], [xi] [approximate] .5, with [p.sub.ppost] performing much better. When one of [theta] or [xi] is small and the other is large, [p.sub.fet] is again very conservative, whereas [p.sub.ppost] performs remarkably well.

Both [p.sub.fet] and [p.sub.ppost] are probably asymptotic frequentist p values, and it is of interest to ask how large n must be for their distributions to be approximately uniform. This is especially interesting for smaller values of p, which are typically of most interest when n is large. An illustrative such feature is the sample size needed for the distribution function of a p value at .05 to be within 20% of .05 (the value under the desired uniformity). When ([theta], [xi]) = (.6, .5), this obtains for [p.sub.fet] only when n [approximate] 500; in contrast, this occurs for [p.sub.ppost] when n [approximate] 10. When ([theta], [xi]) (.3, .9), [p.sub.fet] requires n [approximate] 1200, whereas [p.sub.ppost] needs only n [approximate] 110.

The apparent breakdown of both [p.sub.fet] and [p.sub.ppost] for large values of ([theta], [xi]) [such as (.9, .9) in Fig. 3] deserves special discussion. First, note that [p.sub.fet] becomes almost hopelessly conservative, never stating that the data is incompatible with the model. In contrast, [p.sub.ppost] is markedly anticonservative for this situation. At a very intuitive level, the behavior of [p.sub.ppost] seems more sensible. After all, we declared large values of T to be evidence against the null model, and when ([theta], [xi]) are both large, the values of T = [X.sub.11] clearly will typically be very large; [p.sub.ppost] reacts to this with ready "rejection" of the null model, whereas [p.sub.fet] ignores all but incredibly large T. This anticonservative behavior of [p.sub.ppost] arises because a very large value of T = [X.sub.11] contains a great deal of information about the parameters, but relatively little information about deviance from the model. This is one negative consequence of using an in ferior choice of T.

The most extreme example of an inappropriate choice of T is a sufficient statistic for the parameter; such a statistic is nearly useless for model checking. We examine this further in a very simple example, so as to better understand the nature of [p.sub.ppost] in such a situation.

Example 5. Assume that the null model is [X.sub.i] [sim] Bernoulli([theta]), i = 1,... ,n, and that T = [sigma] [X.sub.i], a sufficient statistic. Here m(t/u) = m(t/[x.sub.obs] \ [t.sub.obs]) = m(t) = 1/(n + 1) for t = 0,...,n, and [p.sub.ppost] 1 - [t.sub.obs]/(n + 1). For large n, the distribution of [p.sub.ppost] is approximately N(1 - [theta], [theta](1 - [theta])/n), which concentrates tightly around 1 - [theta]. Thus when [theta] is large, the distribution function of [p.sub.ppost] jumps immediately, giving rise to anticonservative p values; in contrast, for small [theta], the situation reverses, and [p.sub.ppost] is conservative. Figure 4 shows the resulting distribution functions for three values of [theta] when n = 100.

Of course, this behavior of [p.sub.ppost] is entirely natural according to Bayesian intuition; large values of T are essentially equivalent to large values of [theta], and as such are declared to be "surprising." As another argument, note that [p.sub.ppost] is equivalent to [p.sub.prior] here, and choosing T to be sufficient is effectively stating that we will also allow its presence in the tail of the prior to discredit the model.

In contrast, the distributions of both [p.sub.plug] and [p.sub.ppost] can be seen to concentrate tightly about 1/2 when n is large, for any value of [theta]. (That [p.sub.post], when it differs from uniformity, does so by concentrating closer to 1/2 was discussed in Meng 1994; see also Rubin 1996.) This is illustrated in Figure 4 for [p.sub.plug] when n = 100. Thus, unlike [p.sub.ppost] which provides some kind of information, [p.sub.plug] and [p.sub.post] provide completely useless answers here. (Of course, non-Bayesians may argue that it is better to infer nothing than to in effect base a conclusion on the prior; but recall that in the context we are considering, this means essentially refusing to consider alternatives to the null model, at least when T is chosen poorly.)

As a final comment concerning this issue, recall that requiring uniformity of p values for all values of [theta] might well be too restrictive for a Bayesian (and also possibly for a frequentist). The natural (Bayesian) requirement (see Meng 1994) is to require a p value to be uniform under the prior predictive distribution. Because in this example, the partial posterior predictive reduces to the prior predictive, it follows that [P.sub.ppost] is indeed a p value for a Bayesian. (The "average" of all the distribution functions of [P.sub.ppost] in Fig. 4 is uniform.) In contrast, no Bayesian averages of the distribution functions of [P.sub.plug] (or [P.sub.post]) can be uniform. Of course, the Bayesian reasoning in this example is facilitated by the fact that the noninformative prior is actually proper, but related arguments involving averages with respect to classes of priors probably can be made for the improper case; we do not pursue this here.

When considering the choice of T in discrete situations, this issue of sufficiency can arise in subtle ways. In Example 3, for instance, consider T = [(1/[n.sub.1])[X.sub.11] -- (1/[n.sub.2])[X.sub.22]] when [n.sub.1] and [n.sub.2] are different prime numbers. From a given nonzero value of T, it is then clear that [X.sub.11] and [X.sub.22] can be completely reconstructed. In this situation, T thus is nearly sufficient, and its use would encounter the aforementioned difficulties. (In contrast, when [n.sub.1] = [n.sub.2], this problem with T does not arise.) Such "technical" near-sufficiency of T in discrete settings can be eliminated by the simple device of replacing T by a binned version, [T.sup.*], with the bin size chosen so that each value of [T.sup.*] corresponds to several sample points.

5. CONCLUSIONS

Our comparisons have not included [P.sub.prior], because this cannot typically be used with noninformative priors. Also, [P.sub.sim] is just a version of a conditional predictive p value, obtained by choosing the conditioning statistic U to be a sufficient statistic (when available). Indeed, in all of our continuous examples it happened that [P.sub.sim] was equal to [P.sub.cpred], although this certainly is not true in general. For those wishing to use [P.sub.sim], this equality is a fortunate occurrence when it obtains (see Theorem 3), because [.sub.cpred] is typically much easier to compute directly in those situations than [P.sub.sim]. The following discussion is thus limited to the other four p values.

A surprising observation in our examples (first discussed in Robins 1999) is that [P.sub.plug] seems superior to [P.sub.post], in the sense that it is closer to being a frequentist p value; in particular, it is less conservative. This would seem to contradict the common Bayesian intuition that it is better to account for parameter uncertainty by using a posterior than by simply replacing [theta] by [theta]. The explanation is that [P.sub.post] does not account for parameter uncertainty in a legitimate Bayesian way, because it involves a double use of the data. (Indeed, the original motivation for [P.sub.cpred] and [P.sub.ppost] was precisely to account for parameter uncertainty in a legitimate Bayesian fashion.) We have considered only a few situations here, but, together with the similar asymptotic conclusion of RVV (2000), it would seem that [P.sub.plug] should be preferred to [P.sub.post] in practice. This is especially so because [P.sub.plug] is typically easier to compute than [P.sub.post]. (In some sit uations in which Bayesian analysis is being performed via MCMC, a posterior sample of [theta]'s might be more readily available than an MLE, but one could then plug in, say, a posterior mean for [theta] rather than the MLE.) At the very least, our observations here and those of RVV (2000) indicate that it cannot simply be assumed that [P.sub.post] is better than [P.sub.plug], as has typically been the case in the literature. It should be noted, however, that posterior predictive p values are also commonly used today with discrepancy statistics that depend on [theta], as well as on x, and there are currently no alternatives to their use in such situations (although see RVV 2000).

In all our continuous examples, [P.sub.plug] performed worse in the frequentist sense than either [P.sub.ppost] or [P.sub.cpred] This again supports the asymptotic conclusions of RVV (2000) and suggests that the latter p values, if available, are to be preferred in practice. Computation is clearly an issue, however, in that [P.sub.plug] is typically easier to compute than the new p values, especially [P.sub.cpred] (see also Pauler 1999). Computation of [P.sub.ppost] is usually not difficult if f(t; [theta]) is available in closed form, and we would definitely recommend its use in that case.

The (asymptotic) superiority of [P.sub.ppost] and [P.sub.cpred] arises when the departure statistic T is not appropriately "centered," as discussed by RVV (2000). In a sense, the new p values can be viewed as automatically "centering" a departure statistic T, which can be a considerable simplification in practice, avoiding the need for asymptotics or clever statistical intuition. Indeed, in model checking one often wishes to try a series of rather generic possible discrepancy statistics T, and having an automatic centering mechanism is a considerable simplification.

On a more speculative note, it is quite plausible that use of [P.sub.ppost] and [P.sub.cpred] can result in an improvement (over, say, [P.sub.plug]) with even "centered" choices of T (as long as the distribution of T still depends on [theta] to some extent). This could be improvement in finite-sample performance or in higher-order asymptotic terms.

The situation involving discrete distributions is more complex, but the gains through use of the new p values, especially [P.sub.ppost] can be quite dramatic. Discreteness of the sample space can cause common p values, such as those from the Fisher exact test, to be very conservative in small samples, whereas the partial posterior p value is rather remarkably uniform, especially if a reasonable discrepancy statistic T is used.

M. J. Bayarri is Professor of Statistics and O.R., Department of Statistics and O.R., University of Valencia, 46100 Burjassot, Valencia, Spain (E-mail: susie.bayarri@uv.es). James O. Berger is Arts and Sciences Professor of Statistics, Institute of Statistics and Decision Sciences, Duke University, Durham, NC 27708 (E-mail: berger@stat.duke.edu). This work was supported in part by National Science Foundation grants DMS-9303556 and DMS-9802261, and by Ministry of Education and Culture and the Generalitat Valenciana (Spain) grants PB96-0776 and POST99-0l-7. The authors thank George Casella and Martin Tanner for organizing these contributions to JASA, and the associate editor and three referees for numerous helpful comments that greatly improved the article.

REFERENCES

Agresti, A. (1992), "A Survey of Exact Inference for Contingency Tables" (with discussion), Statistical Science, 7, 131-177.

Aitkin, M. (1991), "Posterior Bayes Factors" (with discussion), Journal of the Royal Statisticol Society, Ser. B, 53, 111-142.

Bayarri, M. J., and Berger, J. 0. (1997), "Measures of Surprise in Bayesian Analysis," ISDS Discussion Paper 97-46, Duke University.

-----(1999), "Quantifying Surprise in the Data and Model Verification," in Bayesian Statistics 6, eds. J. M. Bernardo, J. 0. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 53-82.

Berger, J. 0., Boukai, B., and Wang, W. (1997), "Unified Frequentist and Bayesian Testing of Precise Hypotheses," Statistical Science, 12, 133-160.

Berger, J. 0., Brown, L. D., and Wolpert, R. L. (1994), "A Unified Conditional Frequentist and Bayesian Test for Fixed and Sequential Simple Hypothesis Testing," The Annals of Statistics, 22, 1787-1807.

Berger, J. 0., and Delampady, M. (1987), "Testing Precise Hypotheses" (with discussion), Statistical Science, 2, 317-352.

Berger, J. 0., and Sellke, T. (1987), "Testing a Point Nutt Hypothesis: The Irreconciability of p Values and Evidence," Journal of the American Statistical Association, 82, 112-122.

Berger, R. L., and Boos, D. D. (1994), "P Values Maximized Over a Confidence Set for the Nuisance Parameter," Journal of the American Statistical Association, 89, 1012-1016.

Blyth, C. R., and Staudte, R. G. (1995), "Estimating Statistical Hypotheses," Probability and Statistics Letters, 23, 45-52.

Box, G. E. P. (1980), "Sampling and Bayes Inference in Scientific Modeling and Robustness," Journal of the Royal Statistical Society, Set. A, 143, 383-430.

Carlin, B. P. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. 0. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. 0. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 73-74.

De la Horra, J., and Rodriguez-Bernal, M. T. (1997), "Asymptotic Behaviour of the Posterior Predictive P-Value," Communications in Statistics, Part A--Theory and Methods, 26, 2689-2699.

Delampady, M., and Berger, J. 0. (1990), "Lower Bounds on Bayes Factors for Multinomial Distributions, With Applications to Chi-Squared Tests of Fit," The Annals of Statistics, 18, 1295-1316.

Edwards, W., Liodman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.

Evans, M. (1997), "Bayesian Inference Procedures Derived via the Concept of Relative Surprise," Communications in Statistics, Part A--Theory and Methods, 26, 1125-1143.

Gelfand, A. E., Dey, D. K., and Chang, H. (1992), "Model Determination Using Predictive Distributions With Implementation via Sampling-Based Methods," in Bayesian Statistics 4, eds. J. M. Bernardo, J. 0. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 147-167.

Gelman, A., Carlin, J. B., Stern, H., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.

Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.

Guttman, I. (1967), "The Use of the Concept of a Future Observation in Goodness-of-Fit Problems," Journal of the Royal Statistical Society, Ser. B, 29, 83-100.

Hwang, J. T., Casella, G., Robert, C., Wells, M., and Farrell, R. (1992), "Estimation of Accuracy of Testing," The Annals of Statistics, 20, 490-509.

Hwang, J. T., and Pemantle, R. (1997), "Estimating the Truth Indicator Function of a Statistical Hypothesis Under a Class of Proper Loss Functions," Statistics and Decisions, 15, 103-128.

Hwang, J. T., and Yang, M.-C. (1997), "Evaluate the P-Values for Testing the Independence in 2 x 2 Contingency Tables Using the Estimated Truth Approach--One Way to Resolve the Controversy Relating to Fisher's Exact Test," technical report, Cornell University, Dept. of Statistical Science.

Meng, X. L. (1994), "Posterior Predictive P-Values," The Annals of Statistics, 22, 1142-1160.

Pauler, D. K. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. 0. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. 0. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 70-72.

Robins, J. M. (1999), Discussion of "Quantifying Surprise in the Data and Model Verification," by M. J. Bayarri and J. 0. Berger, in Bayesian Statistics 6, eds. J. M. Bernardo, J. 0. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 67-70.

Robins, J. M., van der Vaart, A., and Ventura, V. (2000), "The Asymptotic Distribution of p Values in Composite Null Models," Journal of the American Statistical Association, 95, 1143-1156.

Rubin, D. B. (1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.

-----(1996), Discussion of "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies," by A. Gelman, X. Meng, and H. S. Stern, Statistica Sinica, 6, 787-792.

Schaafsma, W., Tolboom, J., and Van Der Meulen, E. A. (1989), "Discussing Truth or Falsity by Computing a Q-Value," in Statistical Data Analysis and Inference, eds. Y Dodge, Amsterdam: North-Holland, pp. 85-100.

Sellke, T., Bayarri, M. J., and Berger, J. 0. (1999), "Calibration of P Values for Precise Null Hypotheses," ISDS Discussion Paper 99-13, Duke University.

Suissa, S., and Shuster, J. J. (1985), "Exact Unconditional Sample Sizes for the 2 x 2 Binomial Trial," Journal of the Royal Statistical Society, Ser. A, 148, 317-327.

Thompson, P. (1997), "Bayes P-Values," Statistics and Probability Letters, 31, 267-271.

Tsui, K.-W., and Weerahandi, S. (1989), "Generalized P Values in Significance Testing of Hypothesis in the Presence of Nuisance Parameters," Journal of the American Statistical Association, 84, 602-607.

APPENDIX: TECHNICAL AND COMPUTATIONAL DETAILS

Details for Theorem 1

Suppose that [pi]([theta]) is improper but that there exists a sequence of increasing compact sets [[theta].sub.k] [subset] [theta] such that [[union].sub.k[greater than or equal to]1] [[theta].sub.k] = [theta], 0 [less than] [m.sub.k] = [[integral].sub.[theta]k] [pi]([theta]) d[theta] [less than] [infinity], 0 [less than] m(u) = [[integral].sub.[theta]] f(u; [theta]) [pi]([theta]) d[theta] [less than] [infinity],

and

[lim.sub.k[right arrow][infinity]] [m.sub.k] [integral] [([m.sub.k](u)).sup.2]/m(u) du = 1, (A.1)

where [m.sub.k] (u) = ([[integral].sub.[theta]k]] f(u;[theta])[pi] d[theta]) / [m.sub.k]. Then the conclusion of Theorem 1 holds.

Proof. Define h(u, [theta]) = Pr(p(X) [less than or equal to] [alpha]\u; [theta]). From the definition of [p.sub.cpred(u)], is follows that

[integral] h(u, [theta])[pi]([theta]\u)d[theta]= [alpha]. (A.2)

By the assumption that p(X) has a distribution that does not depend on [theta], it follows that for some constant c,

[integral] h(u, [theta])f(u; [theta]) du = E[Pr(p(X) [less than or equal to][alpha]); [theta]]= c. (A.3)

It is immediate from (A.2) and (A.3) that

[integral]h(u, [theta])[pi]([theta]\u)[m.sub.k](u)du d[theta] = [alpha] (A.4)

and

1/[m.sub.k] [integral] h(u; [theta])f(u; [theta])[pi]([theta])[1.sub.[[theta].sub.k]]d[theta]du = c. (A.5)

To prove that [alpha] = c, completing the proof, we need only show that the difference of the left sides of (A.4) and (A.5) goes to 0 as k [right arrow] [infinity]. Breaking the left side of (A.4) into integrals over [[[theta].sup.C].sub.k] and [[theta].sub.k], and using the fact that

[m.sub.k](u) = 1/[m.sub.k][m(u) - [[integral].sub.[[[theta].sup.C].sub.k]] f(u; [[theta].sup.*])[pi]([[theta].sup.*])d[[theta].sup.*]]

in the second of these integrals, the difference of the left sides of (A.4) and (A.5) can be written

[[integral].sub.[[[theta].sup.C].sub.k]] h(u, [theta])[pi]([theta]\u)[m.sub.k](u)d[theta] du - 1/[m.sub.k] [[integral].sub.[[theta].sub.k]] h(u, [theta])[pi]([theta]\u)

x [ [[integral].sub.[[[theta].sup.C].sub.k]] f(u; [[theta].sup.*])[pi]([[theta].sup.*])d[[theta].sup.*] ] d[theta] du.

Because h(u, [theta]) [less than or equal to] 1, algebra shows that each of these terms is bounded in absolute value by

[[integral].sub.[[[theta].sup.C].sub.k]] [pi]([theta]\u)[m.sub.k](u) d[theta] du

= 1 - [m.sub.k] [integral] [([m.sub.k](u)).sup.2]/m(u) du,

which goes to 0 by (A.1) and completes the proof.

Verification of (A.1) When U has a Location or Scale Distribution. For convenience, we assume that U has a location distribution with range R and that [theta] = R. Other cases can be handled similarly. Write f(u; [theta]) = g(u - [theta]), let G(.) denote the cdf corresponding to g(.), and choose [[theta].sub.k] = (-k,k). Then m(u) = [integral] g(u - [theta])d[theta] = 1, [m.sub.k] = [[[integral].sup.k].sub.-k] (1) d[theta] = 2k, [m.sub.k](u) = (1/2k) [[[integral].sup.k].sub.-k] g(u - [theta]) d[theta] = (1/2k)[G(u+k) - G(u - k)], and (A.1) becomes

[lim.sub.k[right arrow][infinity]] 2k [integral] [([m.sub.k](u)).sup.2] du = 1. (A.6)

Note first that 2k[m.sub.k](u) = [G(u + k) - G(u - k)] [less than or equal to] 1, so that (A.6) is trivially bounded above by 1. To establish a suitable lower bound, note that [G(log k) - G(- log k)] [greater than] (1 - [epsilon]) for any given [epsilon] [greater than] 0 and sufficiently large k, so that

2k [integral] [([m.sub.k](u)).sup.2] du [greater than or equal to] 1/2k [[[integral].sup.k-logk].sub.-k+logk] [[G(u + k) - G(u - k)].sup.2] du

[greater than] [(1 - [epsilon]).sup.2] 2(k - log k)/2k.

Because [epsilon] was arbitrary, (A.6) is clearly satisfied.

Verification of (19)

Lemma A.1. Let W = ([W.sub.1],[W.sub.2],...,[W.sub.n]) be a random vector with uniform distribution on the simplex [[[sigma].sup.n].sub.i=1] [W.sub.i] = 1, and let [W.sub.(1)] = min{[W.sub.i]}. Then Pr([W.sub.(1)] [less than or equal to] c) = 1 - [(1 - nc).sup.n-1].

Proof. We give a geometric argument. The probability to be computed is

Pr([W.sub.(1)] [less than or equal to] c) = 1 - Pr(all [W.sub.i] [greater than] c) = 1 - q. (A.7)

Note that the conditional distribution of W on the set {W : [W.sub.i] [greater than] c for all i} is also uniform, and that this set is itself a simplex of the same shape as the original simplex but with "corners" (c,...,c,1-(n - 1)c),(c,...,1 - (n - 1)c,c),...(1 - (n - 1)c,...,c,c). The edges of the original simplex have length [square root]2, whereas those of the smaller simplex have length [square root]2(1 - nc). It follows that q in (A.7) is given by

[([square root]2(1 - nc)/[square root]2).sup.n-1] = [(1 - nc).sup.n-1],

and the lemma follows.

To establish (19), note that

Pr (T/S [less than or equal to] c) = [E.sup.f(s;[lambda])] [Pr.sup.f(t\s;[lambda])] (T/S [less than or equal to] c).

But given s, the distribution of [X.sub.1],[X.sub.2],...,[X.sub.n] is uniform on the set {X : [[[sigma].sup.n].sub.i=1] [X.sub.i] = s}. Defining [W.sub.i] = [X.sub.i]/s, the conditions of Lemma A.1 clearly apply, with [W.sub.(1)] = T/s, and the result follows.

Computation of [p.sub.ppost] for Independence in Contingency Tables

From (33) and (34), it is clear that a Monte Carlo importance sampling approximation to [p.sub.ppost] in (33) is given by

[p.sub.ppost] [approximate] [[[sigma].sup.L].sub.i=1] [1 - B([t.sub.obs] - 1; n, [[theta].sub.i][[xi].sub.i])] x [pi]([[theta].sub.i],[[xi].sub.i]\[x.sub.obs] \ [t.sub.obs])/h([[theta].sub.i],[[xi].sub.i])/[[[sigma].sup.L].sub.i=1 ] [pi]([[theta].sub.i],[[xi].sub.i]\[x.sub.obs] \ [t.sub.obs])/h([[theta].sub.i],[[xi].sub.i]),

where B(x; n, [""]) is the distribution function at x of the Bi(n, [""]) distribution, h([theta],[xi]) is some importance function, and ([[theta].sub.1],[[xi].sub.1]), ([[theta].sub.2],[[xi].sub.2]),...,([[theta].sub.L],[[xi].sub.L]) are L random draws from h(.).

Importance functions that have a bounded importance ratio, [pi]([[theta].sub.i],[[xi].sub.i]\[x.sub.obs] \ [t.sub.obs])/h([[theta].sub.i],[[xi].sub.i]), and that reasonably approximate the desired distribution are useful for several reasons. First, convergence is typically rapid. Second, an explicit formula for the Monte Carlo variance is then available. Third, the scheme can be readily adapted, via acceptance-rejection, to generate an actual sample from the partial posterior, if desired. The importance function in (35) can be seen to have these properties. In particular, the importance ratio can be computed to be

[pi]([theta],[xi]\[x.sub.obs] \ [t.sub.obs])/h([theta],[xi])

= [{[c.sub.1]/2 [(1 - [theta][xi]).sup.n-[[x.sup.o].sub.11]]/[[theta].sup.[[x.sup.o].sub.21 ]] [(1 - [theta]).sup.n-[[x.sup.o].sub.11]-[[x.sup.o].sub.21]] [(1 - [xi]).sup.[[x.sup.o].sub.21]]

+ [c.sub.2]/2 [(1 - [theta][xi]).sup.n-[[x.sup.o].sub.11]]/[(1 - [theta]).sup.[[x.sup.o].sub.12]][[xi].sup.[[x.sup.o].sub.12]][(1 - [xi]).sup.n-[[x.sup.o].sub.11]-[[x.sup.o].sub.12]]}.sup.-1],

[c.sub.1] = [gamma](n - [[x.sup.o].sub.11] - [[x.sup.o].sub.21] + 2)/[gamma]([[x.sup.o].sub.12] + 1)[gamma]([[x.sup.o].sub.22] + 1)

where

And

[c.sub.2] = [gamma](n - [x[degrees].sub.11] - [x[degrees].sub12] + 2)/[gama]([x[degrees].sub.21] + 1)[gama]([x[degrees].sub.22] + 1)

It is straightforward to show that this is bounded by 2/([c.sub.1] + [c.sub.2]), using the inequalities [theta](1 - [xi])[less than or equal to](1 - [theta][xi]) and (1 - [theta]) [less than or equal to] (1 - [theta][xi]) judiciously. This importance function works well even for very large values of n and extreme values of [theta] and [xi].

-1-

End of free preview...

 To continue reading this publication, you must have a Questia Subscription.

Try Us Today! Click Here

Questia provides the world's largest online library of scholarly books and journal articles, with integrated footnote and bibliography tools, highlighting, note taking and book marking. With a Questia subscription, you'll have access to the full text of more than 67,000 books and 1.5 million articles.

Already a subscriber? Login:

Sponsored Links
Read more than 5,000 classic books FREE!
Free Newsletter
Get helpful how-to's, writing tips, search strategies, quizzes & more!
Search the Library

Customize your search: Search within the topic


Search in:
Books Journals Magazines
Newspapers Encyclopedia Research Topics
  • Type your specific word or phrase in the box above after the word and, then click Search.
  • Put exact phrases in double quotation marks. Do not put single words in quotation marks.
Back to top