considered here, the field of aptitude testing devotes a good deal of sophisticated
effort to validity questions.
What will the future bring? A basic assumption of this chapter is that testing
the accuracy of diagnostic systems is often desirable and feasible and is sometimes crucial. Although individual diagnosticians are treated here only in passing, a similar case could be made for the importance of testing them. I suggest
that a wider and deeper understanding of the needs and the possibilities would be
beneficial in science, technology, and society, and that it is appropriate for
scientists to take the lead in enhancing that understanding. Scientists might help
society overcome the resistance to careful evaluation that is often shown by
diagnosticians and by designers and managers of diagnostic systems, and help to
elevate the national priority given to funding for evaluation efforts. Specifically, I
submit that scientists can increase general awareness that the fundamental factors
in accuracy testing are the same across diagnostic fields and that a successful
science of accuracy testing exists. Instead of making isolated attempts to develop
methods of testing for their own fields, evaluators could adapt the proven methods to specific purposes and contribute mutually to their general refinement.
REFERENCES AND NOTES
The measurement of efficacy in the context of the present approach to accuracy is treated in some
detail elsewhere (9, 10). The usefulness of empirical measures of diagnostic, and especially
predictive, accuracy was further set in a societal context in a recent editorial:
D. E. Koshland, Jr., Science 238, 727 (1987).
W. W. Peterson, T. G. Birdsall, W. C. Fox, IRE Trans. Prof. Group Inf. Theory PGIT-4, 171
With a human decision-maker, one can simply give instructions to use a more or less strict
criterion for each group of trials. Alternatively, one can induce a change in the criterion by
changing the prior probabilities of the two events or the pattern of costs and benefits associated
with the four decision outcomes. If, on the other hand, the decision depends on the continuous
output of some device, say, the intraocular pressure measured in a screening examination for
glaucoma, then, in different groups of trials, one simply takes successively different values along
the numerical (pressure) continuum as partitioning it into two regions of values that lead to
positive and negative decisions, respectively. This example of a continuous output of a system
suggests the alternative to the binary procedure, namely, the so-called "rating" procedure.
Thus, to represent the strictest criterion, one takes only the trials given the highest category
rating and calculates the relevant proportions from them. For the next strictest criterion, the trials
taken are those given either the highest or the next highest category rating, and so on to what
amounts to a very lenient criterion for a positive response. The general idea is illustrated by
probabilistic predictions of rain: first, estimates of 80% or higher may be taken as positive
decisions, then estimates of 60% or higher, and so on, until the pairs of true- and false-positive
proportions are obtained for each of several decision criteria.
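
The cumulation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a procedure taken from the sources cited here: the function name roc_points, the forecast values, and the rain outcomes are all hypothetical. The same function applies to a continuous device output, such as a pressure reading, if the cutoffs are taken along that continuum.

```python
from typing import List, Sequence, Tuple


def roc_points(
    values: Sequence[float],
    truth: Sequence[bool],
    cutoffs: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (cutoff, false_positive_proportion, true_positive_proportion)
    for each cutoff, ordered from the strictest criterion to the most lenient.

    A trial counts as a positive decision when its value is >= the cutoff.
    """
    if len(values) != len(truth):
        raise ValueError("values and truth must have the same length")
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative event")

    points = []
    for cutoff in sorted(cutoffs, reverse=True):  # strictest criterion first
        true_pos = sum(1 for v, t in zip(values, truth) if t and v >= cutoff)
        false_pos = sum(1 for v, t in zip(values, truth) if not t and v >= cutoff)
        points.append((cutoff, false_pos / n_neg, true_pos / n_pos))
    return points


# Hypothetical probabilistic rain forecasts and whether rain actually occurred.
forecasts = [0.9, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4, 0.2, 0.2, 0.0]
rained = [True, True, False, True, False, True, False, False, False, False]

points = roc_points(forecasts, rained, cutoffs=[0.8, 0.6, 0.4, 0.2])
for cutoff, fp, tp in points:
    print(f"criterion >= {cutoff:.1f}: false-positive {fp:.2f}, true-positive {tp:.2f}")
```

Because each more lenient criterion includes every trial admitted by the stricter ones, both proportions can only stay the same or increase from one criterion to the next.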
D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics (Wiley, New York, 1966; reprinted with updated topical bibliographies by Krieger, New York, 1974, and Peninsula Publishing, Los Altos, CA, 1988).
J. A. Swets, Science 134, 168 (1961); ibid. 182, 990 (1973). See Chapter 1.