collected with the tested one. Retrospective sampling requires first that the data be accessible, and so far they usually have not been. Such sampling also requires great care. For example, rare cases must be present in at least minimal numbers to represent the rarity fairly, and having that number of them may distort the relative proportions.
In information retrieval, it is difficult to say whether a representative sample of documents is acquired for a general assessment of a system. Working with special subject matters seems appropriate for a given test, but most systems, as illustrated earlier, are tested with just a few of them. Across the few mentioned above, accuracy varies considerably and seems to covary with the "hardness," or technical nature, of the language used for the particular subject matter.
The ability of weather forecasters to assemble large and representative samples for certain weather events is outstanding. Prediction of precipitation at Chicago was tested against 17,000 instances, and even individual forecasters were measured on 3,000 instances. Of course, some weather events are so rare that few positive events are on record, and for such events the precision as well as the generality of the measurements will be low (42).
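The loss of precision for rare events can be made concrete with a minimal sketch. It assumes that the quantity being estimated is a simple proportion (for example, a hit rate), that the cases are independent, and that the normal approximation to the binomial is adequate. The hit rate of 0.80 and the sample of 30 cases are hypothetical values chosen for illustration; only the counts of 17,000 and 3,000 come from the text above.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an observed proportion p based on n independent cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical hit rate, assumed only for illustration (not a value from the article).
ASSUMED_HIT_RATE = 0.80

# 17,000 and 3,000 are the sample sizes quoted in the text; 30 is a hypothetical
# stand-in for a rare event with few recorded cases.
for n_cases in (17_000, 3_000, 30):
    se = proportion_standard_error(ASSUMED_HIT_RATE, n_cases)
    half_width = 1.96 * se  # approximate 95% interval under the normal approximation
    print(f"n = {n_cases:>6}: {ASSUMED_HIT_RATE:.2f} +/- {half_width:.3f}")
```

Because the standard error falls only as the square root of the number of cases, an estimate based on a few dozen cases is far less precise than one based on thousands, which is the sense in which measurements for rare events have low precision.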
Can we say how accurate our diagnostic systems are? According to the evidence collected here, the answer is a quite confident "yes" in the fields of medical imaging, information retrieval, and weather forecasting, and, at least for now, a "not very well" in most if not all other fields, as exemplified here by polygraph lie detection, materials testing, and (except for the few analyses mentioned above) aptitude testing for predicting a binary event. ROC measures of accuracy are widely used in medical imaging (5, 10, 25), have been advocated and refined within the field of information retrieval (20, 43), and have been effectively introduced in weather forecasting (15, 17, 18, 44). Although problems of bias in test data do not loom as large in information retrieval and weather forecasting as elsewhere, those fields have shown a high degree of sophisticated concern for such problems, as has medical imaging, where the problems are greater (45). So, in medical imaging we can be quite confident, for example, about A values of 0.90 to 0.98 for prominent applications of CT and chest x-ray films and A values of 0.80 to 0.90 for mammography. Similarly, in weather forecasting we can be confident about A values of 0.75 to 0.90 for rain, depending largely on lead time, and of 0.65 to 0.80, depending on definitions, for temperature intervals and fog; and in information retrieval, about A values ranging from 0.75 to 0.95, depending on subject matter. A positive aspect of the field of polygraph lie detection is that it recognizes the need for accuracy testing and attempts to identify and cope with inherently difficult data-bias problems, and the field of materials testing is making some beginnings in these respects. Of course, for other than the special case