More Powerful Tests from Confidence Interval P Values

Journal article by Roger L. Berger; The American Statistician, Vol. 50, 1996

1. INTRODUCTION

The problem of comparing two binomial proportions has been considered for many years. The most commonly used test is Fisher's Exact Test (Fisher 1935), a conditional test. Barnard (1945, 1947) proposed an unconditional test for this problem. Although unconditional tests are usually more powerful than conditional tests, they are computationally much more complex. But recent advances in computing have made unconditional tests practical, and they are beginning to appear in statistical software packages such as StatXact 3 for Windows. In this article it is shown that unconditional tests based on the confidence interval p value of Berger and Boos (1994) are often uniformly more powerful than the standard unconditional tests.

Let X and Y be independent binomial random variables. The sample size for X is m and the success probability is [p.sub.1]. The sample size for Y is n and the success probability is [p.sub.2]. The binomial probability mass function of X will be denoted by

$$b(x; m, p_1) = \binom{m}{x} p_1^{x} (1 - p_1)^{m - x}, \quad x = 0, \ldots, m.$$

Similarly, b(y;n, [p.sub.2]) will denote the binomial probability mass function of Y. The sample space of (X, Y) will be denoted by X = {0, . . ., m} x {0, . . ., n}. X contains (m + 1)(n + 1) points.

This kind of data is often displayed in a 2 x 2 contingency table as follows:

                    yes          no         total

Population 1         X         m - X          m
Population 2         Y         n - Y          n

Total            R = X + Y     t - R      t = m + n

In this table uppercase letters denote random variables and lowercase letters denote known constants fixed by the sampling scheme. Hence t is the total sample size and R is the observed number of successes. Conditional inference is based on the conditional distribution of X and Y, given the observed marginal R = r = x + y. Consider the problem of testing

[H.sub.0]: [p.sub.1] = [p.sub.2] versus [H.sub.a]: [p.sub.1] [less than] [p.sub.2]. (1)

Exact tests for this problem will be considered. The sizes of the tests are computed using the exact binomial distributions, not normal or chi-squared approximations. The standard Neyman-Pearson paradigm of restricting consideration to level-[Alpha] tests and then comparing the powers of these tests will be followed. For a specified error probability [Alpha] all tests considered are level-[Alpha] tests. Tests that are liberal, that sometimes have type-I error probabilities that are greater than [Alpha], are not considered. However, the tests do not have sizes exactly equal to the specified [Alpha]. Because of the discrete nature of these data, equality can (usually) be achieved only with a randomized test. Because randomized tests are not of any practical interest, this paper considers only nonrandomized tests.

The analysis in this article is unconditional. That is, the size and power comparisons are based on the binomial distributions of the model. There is continuing debate as to whether conditional or unconditional calculations are more relevant for these problems. Little (1989) and Greenland (1991) provided good recent summaries of the issues in this debate. The purpose of this paper is not to continue this debate. Rather, suffice it to say that this paper is relevant to those situations in which the unconditional analysis is appropriate.

2. USUAL UNCONDITIONAL TEST

Barnard (1945, 1947) first proposed an unconditional test for this problem. Because of the computational difficulty of unconditional tests, they were not widely used until recently. Now, computing technology makes the use of unconditional tests feasible.

A commonly used unconditional test is the Z test proposed by Suissa and Shuster (1985) and Haber (1986). Define the Z-pooled statistic (score statistic) as

$$Z(x, y) = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)}},$$

where $\hat{p}_1 = x/m$, $\hat{p}_2 = y/n$, and $\hat{p} = (x + y)/(m + n)$, the pooled estimate of [p.sub.1] = [p.sub.2] = p under [H.sub.0]. Then the p value for testing (1), using the test statistic Z, is

$$p_Z(x, y) = \sup_{0 \le p \le 1} \sum_{(a,b) \in R_Z(x,y)} b(a; m, p)\, b(b; n, p), \qquad (2)$$

where [R.sub.Z](x,y) = {(a,b): (a,b) [element of] X and Z(a,b) [greater than or equal to] Z(x, y)}. The p value is the maximum probability under [H.sub.0] of observing a value of the test statistic equal to or more extreme than the value observed in the data. This is a standard definition of a p value, such as is found in Bickel and Doksum (1977, Sec. 5.2.B). Rejection of [H.sub.0] if and only if [p.sub.Z] [less than or equal to] [Alpha] defines a level-[Alpha] test of (1). The calculation of the supremum in (2) must be done numerically. Typically there is no simple formula for this value. This numeric maximization has been the cause of the computational difficulty of unconditional tests.
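As a rough illustration of the numeric maximization in (2), the following Python sketch approximates [p.sub.Z] by evaluating the null rejection probability on an evenly spaced grid of p values. The function names, the grid size, and the convention Z = 0 when the pooled estimate is 0 or 1 are assumptions made here, not details taken from the article, and a grid search only approximates the supremum from below.

```python
from math import comb, sqrt

def binom_pmf(k, size, p):
    """Binomial probability mass function b(k; size, p)."""
    return comb(size, k) * p**k * (1.0 - p)**(size - k)

def z_pooled(x, y, m, n):
    """Z-pooled (score) statistic; large values favor p1 < p2."""
    p_hat = (x + y) / (m + n)
    if p_hat in (0.0, 1.0):
        return 0.0  # assumed convention when the pooled variance is zero
    return (y / n - x / m) / sqrt(p_hat * (1.0 - p_hat) * (1.0 / m + 1.0 / n))

def p_value_z(x, y, m, n, grid_size=2000):
    """Approximate p_Z(x, y) by a grid maximum over 0 <= p <= 1."""
    z_obs = z_pooled(x, y, m, n)
    # R_Z(x, y): sample points at least as extreme as the observed one
    # (small tolerance so floating-point ties are kept together).
    region = [(a, b) for a in range(m + 1) for b in range(n + 1)
              if z_pooled(a, b, m, n) >= z_obs - 1e-12]
    best = 0.0
    for i in range(grid_size + 1):
        p = i / grid_size
        bx = [binom_pmf(a, m, p) for a in range(m + 1)]
        by = [binom_pmf(b, n, p) for b in range(n + 1)]
        best = max(best, sum(bx[a] * by[b] for a, b in region))
    return best

print(round(z_pooled(23, 15, 33, 17), 3))   # 1.454
print(round(p_value_z(23, 15, 33, 17), 4))  # grid approximation of p_Z(23, 15)
```

For m = 33 and n = 17 the grid value for (23, 15) should land close to the .0823 reported in Section 4; a finer grid or a local optimizer around the grid maximum would tighten the approximation.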

3. CONFIDENCE INTERVAL p VALUE

Berger and Boos (1994) proposed a new method of computing a p value. In the problem of comparing two binomial proportions, if [H.sub.0] is true, p = [p.sub.1] = [p.sub.2] is a nuisance parameter. Let [C.sub.[Beta]](x, y) denote a 100(1 - [Beta])% confidence interval for p calculated from the data (x, y) and assuming [p.sub.1] = [p.sub.2] = p. The confidence interval used in this paper is the Clopper and Pearson (1934) interval based on X + Y, a binomial (m + n, p) random variable if [p.sub.1] = [p.sub.2] = p. This interval is easily computed from the formula

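$$C_\beta(x, y) = \left[ \frac{a}{a + (b - a + 1) F_{2(b-a+1),\,2a,\,\beta/2}},\ \frac{(a + 1) F_{2(a+1),\,2(b-a),\,\beta/2}}{b - a + (a + 1) F_{2(a+1),\,2(b-a),\,\beta/2}} \right], \qquad (3)$$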

where a = x + y, b = m + n, and [F.sub.[Nu],[Eta],[Beta]/2] is the upper 100([Beta]/2) percentile of an F distribution with [Nu] and [Eta] degrees of freedom.
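Formula (3) requires F percentiles. The same Clopper-Pearson limits can also be obtained by solving the two binomial tail equations directly, and the Python sketch below takes that route with plain bisection so that it needs only the standard library. This is a substitute for the F-based formula, not the article's own computation, and the helper names and iteration count are illustrative choices.

```python
from math import comb

def binom_cdf(k, size, p):
    """P(T <= k) for T ~ binomial(size, p)."""
    return sum(comb(size, j) * p**j * (1.0 - p)**(size - j) for j in range(k + 1))

def bisect_root(f, lo, hi, iters=60):
    """Root of an increasing function with f(lo) < 0 < f(hi), by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(a, b, beta):
    """100(1 - beta)% Clopper-Pearson interval for p from a successes in b trials."""
    # Lower limit solves P(T >= a) = beta/2; it is 0 when a = 0.
    lower = 0.0 if a == 0 else bisect_root(
        lambda p: (1.0 - binom_cdf(a - 1, b, p)) - beta / 2.0, 0.0, 1.0)
    # Upper limit solves P(T <= a) = beta/2; it is 1 when a = b.
    upper = 1.0 if a == b else bisect_root(
        lambda p: beta / 2.0 - binom_cdf(a, b, p), 0.0, 1.0)
    return lower, upper

low, high = clopper_pearson(35, 50, 0.001)
print(f"[{low:.3f}, {high:.3f}]")
```

With a = 35 successes in b = 50 trials and [Beta] = .001, the limits should be close to the interval [.459, .881] quoted in Section 4 for the data (21, 14).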

The confidence interval p value, based on the statistic Z, is defined by

$$p_C(x, y) = \sup_{p \in C_\beta(x, y)} \sum_{(a,b) \in R_Z(x,y)} b(a; m, p)\, b(b; n, p) + \beta,$$

where [R.sub.Z](x, y) is the same as in the definition of [p.sub.Z]. [p.sub.C] differs from [p.sub.Z] in that the supremum is taken over the confidence interval [C.sub.[Beta]](x, y) rather than over the whole range 0 [less than or equal to] p [less than or equal to] 1, and the error probability [Beta] is added to the supremum. If [Beta] = 0, [p.sub.C] is the same as [p.sub.Z]. Berger and Boos (1994) showed that this modification of the usual definition of a p value yields a valid p value. That is, the test that rejects [H.sub.0] if and only if [p.sub.C](X, Y) [less than or equal to] [Alpha] is an unconditional level-[Alpha] test. The error probability [Beta] is specified by the experimenter. Different values of [Beta] yield different p values and tests. In this paper [Beta] = .001 is used (as suggested by Berger and Boos).

Berger and Boos proposed the confidence-interval-based p value for two reasons. The first is computational. In both [p.sub.Z] and [p.sub.C] the function to be maximized is the same. The maximization over the smaller set, [C.sub.[Beta]], can be much simpler. The second is statistical. Having observed the data we should be able to estimate p, and should not need to consider values of p that are completely unsupported by the data. In [p.sub.C] only those "plausible" values that are in the confidence interval are considered.

This article points out that the confidence interval p value can have a third advantage. It can produce tests with higher power than the usual p value. And, remember, this is achieved with less computational effort.

4. EXAMPLE

To see the improvement that can be obtained by using [p.sub.C] rather than [p.sub.Z] consider constructing a level-[Alpha] test with [Alpha] = .10 for sample sizes m = 33 and n = 17.

An enumeration of [p.sub.Z] (x, y) for all the sample points in X shows that the point (x, y) = (23, 15) has the largest value of [p.sub.Z](x, y) that is less than or equal to [Alpha] = .10. Z(23, 15) = 1.454 and [p.sub.Z](23, 15) = .0823. The sample point with the next smaller value of Z is (x,y) = (0, 1) with Z(0, 1) = 1.407 and [p.sub.Z](0, 1) = .1548. So (0, 1), or any other sample point with a smaller value of Z, is not in the level [Alpha] = .10 rejection region based on [p.sub.Z].

Let the p value function for a sample point (x, y) be defined by

[[Alpha].sup.Z](p; x, y) = [summation of] b(a; m, p)b(b; n, p) where (a,b)[element of][R.sub.Z](x,y). (4)

The p value function is the function that is maximized in calculating [p.sub.Z] and [p.sub.C]. In Figure 1 [[Alpha].sup.Z](p; 23, 15) (solid line) and [[Alpha].sup.Z](p; 0, 1) (long dashed line) are shown. [[Alpha].sup.Z](p; 23, 15) [less than] .10 = [Alpha] for all values of p. Its maximum value is [p.sub.Z](23, 15) = .0823, which occurs at p = .524. Because [p.sub.Z](23, 15) is the largest p value not exceeding [Alpha] = .10, .0823 is also the actual size of the level [Alpha] = .10 test constructed using [p.sub.Z]. The addition of the single sample point (0, 1) to the rejection region causes a large spike to appear in [[Alpha].sup.Z](p; 0, 1). The maximum of [[Alpha].sup.Z](p; 0, 1) is [p.sub.Z](0, 1) = .1548, which occurs at p = .026. It is not unusual for the addition of a single sample point with x + y close to 0 or m + n to create a spike like this.

In Figure 2 the sample points in X with [p.sub.Z](x, y) [less than or equal to] .10 are marked by squares. These are the elements of the level [Alpha] = .10 rejection region defined by [p.sub.Z]. Because the actual size of this test is only .0823, not too close to [Alpha] = .10, it seems possible that some more points, some of the points marked by x's and +'s, for example, might be added to this rejection region, and the resulting test could still be level [Alpha] = .10. The points marked by +'s and x's are the points that satisfy the "convexity" property of Barnard (1947). It would take a great deal of computation to try each point individually, then try pairs or triples of points, to determine points that could be added. But the use of the confidence interval p value easily identifies some points that can be added.

Consider (x, y) = (21, 14) with Z(21, 14) = 1.368. There are four sample points with Z(21, 14) [less than] Z(x, y) [less than] Z(23, 15), namely, Z(0, 1) = 1.407, Z(26, 16) = 1.401, Z(9, 8) = 1.399, and Z(3, 4) = 1.394. The p value function [[Alpha].sup.Z](p; 21, 14) is shown in Figure 3. The maximum of this function, .1549, is greater than .10 because Z(0, 1) [greater than] Z(21, 14). But [p.sub.C](21, 14) is calculated by maximizing over the 99.9% confidence interval for p, calculated from the data (x, y) = (21, 14). Using (3) this confidence interval is [.459, .881]. These confidence limits are shown in Figure 3. The maximum of [[Alpha].sup.Z](p; 21, 14) over this interval is .0946, and this maximum occurs at p = .830. Therefore, [p.sub.C](21, 14) = .0946 + [Beta] = .0946 + .0010 = .0956. Because [p.sub.C](21, 14) [less than or equal to] .10, (21, 14) is in the level [Alpha] = .10 rejection region defined by [p.sub.C]. In addition, two other points are in the level [Alpha] = .10 rejection region defined by [p.sub.C]. These are (26, 16) with [p.sub.C](26, 16) = .0949 and (9, 8) with [p.sub.C](9, 8) = .0906. These three added points are the points marked with x's in Figure 2.
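The calculation for (21, 14) can be retraced numerically. The Python sketch below maximizes the p value function (4) once over the whole range of p and once over the interval [.459, .881], which is copied from the text rather than recomputed, and then adds [Beta]. The grid resolution, the function names, and the convention Z = 0 for a zero pooled variance are assumptions of this sketch.

```python
from math import comb, sqrt

M, N, BETA = 33, 17, 0.001

def pmf_table(size, p):
    """Binomial probabilities b(k; size, p) for k = 0, ..., size."""
    return [comb(size, k) * p**k * (1.0 - p)**(size - k) for k in range(size + 1)]

def z_pooled(x, y):
    """Z-pooled statistic for sample sizes M and N."""
    p_hat = (x + y) / (M + N)
    if p_hat in (0.0, 1.0):
        return 0.0  # assumed convention when the pooled variance is zero
    return (y / N - x / M) / sqrt(p_hat * (1.0 - p_hat) * (1.0 / M + 1.0 / N))

def sup_p_value_function(x, y, lo, hi, steps=2000):
    """Grid maximum of alpha^Z(p; x, y) over lo <= p <= hi."""
    z_obs = z_pooled(x, y)
    region = [(a, b) for a in range(M + 1) for b in range(N + 1)
              if z_pooled(a, b) >= z_obs - 1e-12]
    best = 0.0
    for i in range(steps + 1):
        p = lo + (hi - lo) * i / steps
        bx, by = pmf_table(M, p), pmf_table(N, p)
        best = max(best, sum(bx[a] * by[b] for a, b in region))
    return best

# Usual p value: maximize over every p between 0 and 1.
p_z = sup_p_value_function(21, 14, 0.0, 1.0)
# Confidence interval p value: maximize over the 99.9% interval, then add beta.
p_c = sup_p_value_function(21, 14, 0.459, 0.881) + BETA
print(f"p_Z(21, 14) ~ {p_z:.4f}, p_C(21, 14) ~ {p_c:.4f}")
```

The two printed values should be near .1549 and .0956, the figures given above, which places (21, 14) outside the [p.sub.Z] rejection region but inside the [p.sub.C] rejection region at [Alpha] = .10.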

The probability of the level [Alpha] = .10 rejection region defined by [p.sub.C], that is, the rejection region consisting of all the squares and x's in Figure 2, as a function of p = [p.sub.1] = [p.sub.2] is graphed in Figure 1 with a short dashed line. This probability is less than .10 for all values of p because this is a level [Alpha] = .10 test. But this probability is much closer to .10 than the probability of the rejection region for the Suissa and Shuster test defined by [p.sub.Z]. The actual size of the [p.sub.C] test is .0946, the maximum of this function.

Because the rejection region of the level [Alpha] = .10 test defined by [p.sub.Z] is a proper subset of the rejection region of the level [Alpha] = .10 test defined by [p.sub.C] the test defined by [p.sub.C] is uniformly more powerful. The maximum absolute increase in power occurs near [p.sub.1] = .785 and [p.sub.2] = .935. Here the power of the [p.sub.Z] test is .492 and the power of the [p.sub.C] test is .557. This is a 13% increase in power. The maximum relative increase in power occurs near the boundary of [H.sub.0] and [H.sub.a], near [p.sub.1] = .835 and [p.sub.2] = .836. Here the power of the [p.sub.Z] test is .073 and the power of the [p.sub.C] test is .095. This is a 30% increase in power.
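The size and power figures above can be checked with a short enumeration. The Python sketch below orders the sample points by Z so that every [R.sub.Z](x, y) is a prefix of the ordering, obtains all [p.sub.Z] values from running sums, and then evaluates size and power for both rejection regions. The three points added by [p.sub.C] are taken from the text rather than recomputed, and the grid, names, and zero-variance convention for Z are assumptions of this sketch.

```python
from math import comb, sqrt

M, N, ALPHA = 33, 17, 0.10

def pmf_table(size, p):
    """Binomial probabilities b(k; size, p) for k = 0, ..., size."""
    return [comb(size, k) * p**k * (1.0 - p)**(size - k) for k in range(size + 1)]

def z_pooled(x, y):
    p_hat = (x + y) / (M + N)
    if p_hat in (0.0, 1.0):
        return 0.0  # assumed convention when the pooled variance is zero
    return (y / N - x / M) / sqrt(p_hat * (1.0 - p_hat) * (1.0 / M + 1.0 / N))

def prob(region, p1, p2):
    """P((X, Y) in region) for X ~ binomial(M, p1), Y ~ binomial(N, p2)."""
    bx, by = pmf_table(M, p1), pmf_table(N, p2)
    return sum(bx[a] * by[b] for a, b in region)

# Most extreme points first, so R_Z(x, y) is a prefix of this list.
points = sorted(((a, b) for a in range(M + 1) for b in range(N + 1)),
                key=lambda pt: -z_pooled(*pt))
zs = [z_pooled(*pt) for pt in points]
# last[i] = index of the last point whose Z ties with point i.
last = list(range(len(points)))
for i in range(len(points) - 2, -1, -1):
    if zs[i] - zs[i + 1] < 1e-12:
        last[i] = last[i + 1]

grid = [i / 1000 for i in range(1001)]
p_z = [0.0] * len(points)
for p in grid:
    bx, by = pmf_table(M, p), pmf_table(N, p)
    cum, total = [], 0.0
    for a, b in points:
        total += bx[a] * by[b]
        cum.append(total)
    for i in range(len(points)):
        p_z[i] = max(p_z[i], cum[last[i]])

region_z = [pt for pt, pv in zip(points, p_z) if pv <= ALPHA]
region_c = region_z + [(21, 14), (26, 16), (9, 8)]  # added points, as reported in the text

size_z = max(prob(region_z, p, p) for p in grid)
size_c = max(prob(region_c, p, p) for p in grid)
power_z = prob(region_z, 0.785, 0.935)
power_c = prob(region_c, 0.785, 0.935)
print(f"size:  p_Z test {size_z:.4f}, p_C test {size_c:.4f}")
print(f"power: p_Z test {power_z:.3f}, p_C test {power_c:.3f}")
```

Because the second region contains the first, its probability is at least as large at every (p1, p2), which is the sense in which the [p.sub.C] test is uniformly more powerful. The printed values should be close to the sizes .0823 and .0946 and the powers .492 and .557 quoted above.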

5. CONSISTENCY OF IMPROVEMENT

In the previous section it was shown that, for [Alpha] = .10 and (m, n) = (33, 17), the confidence interval p value defines a uniformly more powerful level-[Alpha] test than the usual unconditional p value. The question arises as to the generality of this phenomenon. To investigate this we enumerated the level-[Alpha] rejection regions defined by [p.sub.Z] and [p.sub.C] for [Alpha] = .10, .05, and .01 for each of nine different sample sizes, (m, n) = (10, 10), (13, 7), (16, 4), (25, 25), (33, 17), (40, 10), (50, 50), (65, 35), and (80, 20). These sample sizes were chosen to represent small to moderately large total sample sizes and balanced to 4:1 unbalanced designs.

In 15 out of the 27 cases the rejection region defined by [p.sub.Z] is a proper subset of the rejection region defined by [p.sub.C]. So the confidence interval p value defines a uniformly more powerful level-[Alpha] test. In another 9 out of the 27 cases the rejection regions defined by the two p values are exactly the same. In one case, [Alpha] = .01 and (m, n) = (50, 50), neither rejection region contained the other and the power functions of the two tests crossed. In the remaining two cases, [Alpha] = .01 and (m, n) = (13, 7) and (25, 25), the rejection region defined by [p.sub.C] is a proper subset of the rejection region defined by [p.sub.Z], and [p.sub.Z] defines a uniformly more powerful test. The power functions for the nine tests with [Alpha] = .10 are described more fully by Berger (1994).

Thus in most cases the confidence interval p value defines a test that is the same or uniformly more powerful than the test defined by the usual unconditional p value. Only infrequently will the test defined by [p.sub.C] be inferior to the test defined by [p.sub.Z]. And in all cases the computation required for [p.sub.C] is less than that required for [p.sub.Z].

The reason that the rejection region defined by [p.sub.C] usually contains the rejection region defined by [p.sub.Z] is the following fact. If there is no sample point in X such that [Alpha] - [Beta] [less than] [p.sub.Z](x, y) [less than or equal to] [Alpha], then every sample point with [p.sub.Z](x, y) [less than or equal to] [Alpha] also satisfies [p.sub.C](x, y) [less than or equal to] [Alpha]. That is, every sample point in the level-[Alpha] rejection region defined by [p.sub.Z] is also in the level-[Alpha] rejection region defined by [p.sub.C]. This fact is true because, if [p.sub.Z](x, y) [less than or equal to] [Alpha], then [p.sub.Z](x, y) [less than or equal to] [Alpha] - [Beta], and hence

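$$p_C(x, y) = \sup_{p \in C_\beta(x, y)} \sum_{(a,b) \in R_Z(x,y)} b(a; m, p)\, b(b; n, p) + \beta \le p_Z(x, y) + \beta \le \alpha.$$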

When [Beta] is small compared to [Alpha], as with the [Beta] = .001 recommended by Berger and Boos (1994) that is used in this paper, it often happens that there is no sample point with [Alpha] - [Beta] [less than] [p.sub.Z](x, y) [less than or equal to] [Alpha]. In such cases the test defined by [p.sub.C] will be at least as powerful as the test defined by [p.sub.Z]. Note that this property applies in general to confidence interval p values, not just this problem and this test statistic Z.

6. OTHER TEST STATISTICS

Other statistics besides Z(x, y), such as the likelihood ratio test statistic and [Mathematical Expression Omitted], can be used to test (1). Santner and Duffy (1989, Exercises 5.11 and 5.12), Haber (1987), and Martin and Silva (1994) list several possible statistics. The experience with Z suggests that if another statistic is used, the confidence interval p value might provide improved power over the usual unconditional p value.

The power comparisons of Haber (1987) and Martin and Silva (1994) suggest that the two statistics Z(x, y) and

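$$B(x, y) = P(X \le x \mid X + Y = x + y) = \sum_{a = \max(0,\, x + y - n)}^{x} \frac{\binom{m}{a} \binom{n}{x + y - a}}{\binom{m + n}{x + y}}$$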

produce tests with the highest power. The statistic B(x, y) was first proposed by Boschloo (1970) and McDonald, Davis, and Milliken (1977). B(x, y) is the conditional p value of Fisher's Exact Test (Fisher 1935). Here B(x, y) is not used as a p value, but rather as a statistic to order the sample points. Small values of B(x, y) give evidence for [H.sub.a] so the unconditional p value based on B(x, y) is

$$p_B(x, y) = \sup_{0 \le p \le 1} \sum_{(a,b) \in R_B(x,y)} b(a; m, p)\, b(b; n, p),$$

where [R.sub.B](x, y) = {(a, b): (a, b) [element of] X and B(a, b) [less than or equal to] B(x, y)}. The confidence interval p value based on B is defined as in (4), namely,

$$p_{CB}(x, y) = \sup_{p \in C_\beta(x, y)} \sum_{(a,b) \in R_B(x,y)} b(a; m, p)\, b(b; n, p) + \beta.$$
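To make the role of B as an ordering statistic concrete, the Python sketch below computes B(x, y) as the one-sided hypergeometric tail probability P(X [less than or equal to] x | X + Y = x + y) and then approximates the unconditional p value [p.sub.B] on a grid of p values. The hypergeometric form of B, the function names, the tie tolerance, and the grid size are assumptions of this sketch rather than details given in the article.

```python
from math import comb

def fisher_b(x, y, m, n):
    """One-sided Fisher conditional p value, P(X <= x | X + Y = x + y)."""
    r = x + y
    favorable = sum(comb(m, a) * comb(n, r - a)
                    for a in range(max(0, r - n), x + 1))
    return favorable / comb(m + n, r)

def p_value_b(x, y, m, n, steps=1000):
    """Approximate p_B(x, y) by a grid maximum over 0 <= p <= 1."""
    b_obs = fisher_b(x, y, m, n)
    # R_B(x, y): points whose Fisher p value is no larger than the observed one.
    region = [(a, b) for a in range(m + 1) for b in range(n + 1)
              if fisher_b(a, b, m, n) <= b_obs * (1.0 + 1e-12)]
    best = 0.0
    for i in range(steps + 1):
        p = i / steps
        bx = [comb(m, k) * p**k * (1.0 - p)**(m - k) for k in range(m + 1)]
        by = [comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]
        best = max(best, sum(bx[a] * by[b] for a, b in region))
    return best

b_obs = fisher_b(23, 15, 33, 17)
p_b = p_value_b(23, 15, 33, 17)
print(f"B(23, 15) = {b_obs:.4f}, p_B(23, 15) ~ {p_b:.4f}")
```

Because Fisher's conditional p value is valid given every value of X + Y, the unconditional p value built from it can never exceed B(x, y) itself, so the second printed number should be no larger than the first. Restricting the grid to a confidence interval for p and adding [Beta] would give the corresponding [p.sub.CB].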

Berger (1994) found that the p value function [[Alpha].sup.B] (p; x, y) tends to be flatter than [[Alpha].sup.Z](p; x, y), especially for unequal sample sizes. This agrees with Martin and Silva's (1994) finding that the unconditional test based on B usually has higher power than the test based on Z, especially when m [not equal to] n. So there is less room for improvement of Boschloo's test. But Berger (1994) found that, as with [p.sub.Z] and [p.sub.C], the confidence interval p value, [p.sub.CB], usually defined a test that was the same or uniformly more powerful than the test defined by [p.sub.B].

In comparing the tests based on the two confidence interval p values, Berger (1994) did not find a clear preference. Usually the power functions of these two tests crossed, with one test having higher power for some parameter values and the other having higher power for other parameter values. Usually the power function defined by [p.sub.CB] was higher on a majority of the parameter space.

In their power comparison of tests for (1) Martin and Silva (1994) considered two computationally intensive tests they called M and M[prime]. M is the test proposed by Barnard (1945, 1947), and M[prime] is a simplified version of M. Both methods involve construction of a rejection region by adding one sample point at a time, with a good deal of computation required to determine which point is added next. Martin and Silva report that M[prime] and M require about 10 and 85 times the computation time required by [p.sub.Z] or [p.sub.B], respectively. But M and M[prime] do provide some improvement in power. In this paper it has been shown that confidence interval p values provide an improvement in power over [p.sub.Z] or [p.sub.B], but with less computation. It remains to be determined if the improvement in power provided by [p.sub.C] or [p.sub.CB] is comparable to the improvement provided by M[prime] or M.

7. CONCLUSIONS

Confidence interval p values can improve the power of standard unconditional tests for comparing two binomial populations. They also require less computational effort. Thus they offer a promising new method for the analysis of 2 x 2 tables.

Similar, but less extensive, comparisons have been made for two-sided tests. The results are qualitatively the same. The confidence interval p value often defines a more powerful test than the standard p value.

XUN2X2 is a Fortran program that will compute the standard and confidence interval p values discussed in this paper. The program will also perform unconditional tests for multinomial, rather than two independent binomial, 2 x 2 tables. XUN2X2 may be obtained by sending the one-line message "get exact from general" to statlib@lib.stat.cmu.edu.

REFERENCES

Barnard, G. A. (1945), "A New Test for 2 x 2 Tables," Nature, 156, 177.

----- (1947), "Significance Tests for 2 x 2 Tables," Biometrika, 34, 123-138.

Berger, R. L. (1994), "Power Comparison of Exact Unconditional Tests for Comparing Two Binomial Proportions," Technical Report 2266, North Carolina State University, Statistics Department.

Berger, R. L., and Boos, D. D. (1994), "P Values Maximized Over a Confidence Set for the Nuisance Parameter," Journal of the American Statistical Association, 89, 1012-1016.

Bickel, P. J., and Doksum, K. A. (1977), Mathematical Statistics: Basic Ideas and Selected Topics, San Francisco: Holden-Day.

Boschloo, R. D. (1970), "Raised Conditional Level of Significance for the 2 x 2-Table when Testing the Equality of Two Probabilities," Statistica Neerlandica, 24, 1-35.

Clopper, C. J., and Pearson, E. S. (1934), "The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial," Biometrika, 26, 404-413.

Fisher, R. A. (1935), "The Logic of Inductive Inferences," Journal of the Royal Statistical Society, Ser. A, 98, 39-54.

Greenland, S. (1991), "On the Logical Justification of Conditional Tests for Two-by-Two Contingency Tables," The American Statistician, 45, 248-251.

Haber, M. (1986), "An Exact Unconditional Test for the 2 x 2 Comparative Trial," Psychological Bulletin, 99, 129-132.

----- (1987), "A Comparison of Some Conditional and Unconditional Exact Tests for 2 x 2 Contingency Tables," Communications in Statistics - Simulation and Computation, 16, 999-1013.

Little, R. J. A. (1989), "Testing the Equality of Two Independent Binomial Proportions," The American Statistician, 43, 283-288.

Martin Andres, A., and Silva Mato, A. (1994), "Choosing the Optimal Unconditioned Test for Comparing Two Independent Proportions," Computational Statistics and Data Analysis, 17, 555-574.

McDonald, L. L., Davis, B. M., and Milliken, G. A. (1977), "A Nonrandomized Unconditional Test for Comparing Two Proportions in 2 x 2 Contingency Tables," Technometrics, 19, 145-157.

Mehta, C., and Patel, N. (1995), StatXact 3 for Windows: User Guide, Cambridge, MA: Cytel Software.

Santner, T. J., and Duffy, D. E. (1989), The Statistical Analysis of Discrete Data, New York: Springer-Verlag.

Suissa, S., and Shuster, J. J. (1985), "Exact Unconditional Sample Sizes for the 2 x 2 Binomial Trial," Journal of the Royal Statistical Society, Ser. A, 148, 317-327.

Roger L. Berger is Professor, Statistics Department, North Carolina State University, Raleigh, NC 27695-8203.
