Academic journal article Canadian Journal of Experimental Psychology

The P-Value Fallacy and How to Avoid It


Abstract

Null hypothesis significance tests are commonly used to provide a link between empirical evidence and theoretical interpretation. However, this strategy is prone to the "p-value fallacy," in which effects and interactions are classified as either "noise" or "real" according to whether the associated p value is greater or less than .05. This dichotomous classification can lead to dramatic misconstruals of the evidence provided by an experiment. For example, similar patterns of means can lead to entirely different patterns of significance, and the same pattern of significance can be associated with completely different patterns of means. Describing data in terms of an inventory of significant and nonsignificant effects can thus completely misrepresent the results. An alternative analytical technique is to identify competing interpretations of the data and then use likelihood ratios to assess which interpretation provides the better account. Several methods of calculating such likelihood ratios are illustrated. It is argued that this approach satisfies a principle of "graded evidence," according to which similar data should provide similar evidence.

Null hypothesis significance tests are commonly used to provide a link between empirical evidence and theoretical interpretation. For example, after conducting a series of tests, one may characterize the results in terms of the presence or absence of a collection of main effects and interactions. This catalogue of effects provides an abstract description of the results that subsequently forms the basis for assessing the adequacy of various theoretical interpretations. However, significance tests were designed as a basis for behavioural decisions, not as a description of results, and they are in fact ill suited to that descriptive role. Here, I argue that using significance tests to describe results in this way can easily generate faulty or misleading accounts of the evidence. For example, similar patterns of means can lead to entirely different patterns of significance and nonsignificance, and the same pattern of significant effects can be associated with completely different patterns of means. Describing results solely in terms of which effects and interactions are significant can thus be very misleading. Further, the use of significance tests as an intermediate step between data and interpretation is unnecessary; it is much easier to describe the evidence relevant to possible interpretations directly. A method based on likelihood ratios is presented for providing such descriptions.
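The general idea of comparing competing interpretations by their likelihoods can be sketched in a few lines. The Python snippet below is a minimal illustration of my own, not the procedure developed in this article: it compares two accounts of a two-condition data set, a common-mean model (the conditions do not differ) and a separate-means model (they do), each fit by maximum likelihood under an assumed normal error with a shared variance. The function names and the shared-variance assumption are mine.

```python
import math

def normal_loglik(xs, mu, sigma2):
    """Gaussian log-likelihood of the data xs given mean mu and variance sigma2."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def likelihood_ratio(group_a, group_b):
    """Ratio of maximized likelihoods: separate-means model vs. common-mean model.

    Values near 1 favour neither account; large values favour a condition
    difference. (Illustrative only; assumes normal error, shared variance.)
    """
    pooled = group_a + group_b
    n = len(pooled)

    # Common-mean model: one mean for all observations, MLE variance.
    mu0 = sum(pooled) / n
    var0 = sum((x - mu0) ** 2 for x in pooled) / n
    ll_null = normal_loglik(pooled, mu0, var0)

    # Separate-means model: one mean per condition, shared MLE variance.
    mu_a = sum(group_a) / len(group_a)
    mu_b = sum(group_b) / len(group_b)
    resid = ([(x - mu_a) ** 2 for x in group_a]
             + [(x - mu_b) ** 2 for x in group_b])
    var1 = sum(resid) / n
    ll_alt = (normal_loglik(group_a, mu_a, var1)
              + normal_loglik(group_b, mu_b, var1))

    return math.exp(ll_alt - ll_null)
```

Because the common-mean model is nested within the separate-means model, the ratio is never less than 1; the question is how much greater than 1 it is, which gives a graded rather than dichotomous summary of the evidence.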

The p-Value Fallacy

In introductions to null hypothesis significance testing, the mechanics are typically cast in terms of decisions and their behavioural consequences. For example, Gravetter and Wallnau (1992) state that a Type I error "... is a false report and can have serious consequences. For one, other researchers may spend precious time and resources trying to replicate the findings to no avail" (p. 211). They also compare the hypothesis testing framework to the legal judgment process: "A Type I error [in legal decisions] is serious: An innocent person would be sent to prison. Just as in scientific research, it is desirable to avoid making a Type I error" (p. 212). Similarly, May, Masson, and Hunter (1990) indicate that the choice of α should depend on the consequences of rejecting or failing to reject the null hypothesis:

Although conventions adopted by researchers favor .05, there are circumstances that justify the use of higher or lower values of α. For example, when carrying out a pilot study to determine whether a full-scale study might lead to rejection of the null hypothesis, a researcher might set α = .10. On the other hand, consider a researcher who tests a drug that may reduce depression but has undesirable side effects. The null hypothesis is that the drug has no influence on depression, so it should not be used. …
