It is often necessary to dichotomize a continuous scale to separate respondents into normal and abnormal groups. However, because the distributions of the scores in these 2 groups most often overlap, any cut point that is chosen will result in 2 types of errors: false negatives (that is, abnormal cases judged to be normal) and false positives (that is, normal cases placed in the abnormal group). Changing the cut point will alter the numbers of erroneous judgments but will not eliminate the problem. A technique called receiver operating characteristic (ROC) curves allows us to determine the ability of a test to discriminate between groups, to choose the optimal cut point, and to compare the performance of 2 or more tests. We discuss how to calculate and compare ROC curves and the factors that must be considered in choosing an optimal cut point.
(Can J Psychiatry 2007;52:121-128)
Information on funding and support and author affiliations appears at the end of the article.
* ROC analysis is used to select the optimal cut point to dichotomize a continuous scale.
* The usual choice of cut points minimizes the overall number of false positive and false negative errors.
* The cut point may be shifted if the cost of false positives is higher than that of false negatives, or vice versa.
* The accuracy of ROC analysis depends on the quality of the gold standard, which is usually far from perfect in psychiatry.
* Changing the purpose of the test (for example, from diagnosis to screening) requires a shift in cut points.
* A cut point that is ideal for one group may be less than ideal for another.
Key Words: receiver operating characteristic, ROC, area under the curve, AUC, test performance, diagnosis, sensitivity, specificity
Abbreviations used in this article
AUC area under the curve
CAT computed axial tomography
CCHS Canadian Community Health Survey
CI confidence interval
FN false negative
FP false positive
pAUC partial area under the curve
PPV positive predictive value
ROC receiver operating characteristic
SE standard error
SPNP Scale of Photonumerophobia
TN true negative
TP true positive
Those of you who have read this series of articles religiously know that, because of the tremendous loss of information incurred, you should never dichotomize continuous variables.1 Never! Nohow! Ever! Under no circumstances! Except, of course, when it makes sense to do so. One legitimate reason for dichotomizing occurs when a statistical test requires a linear relationship between variables (for example, multiple regression) but the actual relationship isn't linear. It then makes sense to dichotomize or trichotomize the predictor variable. A more common reason occurs when a dichotomous decision must be predicated on a continuous scale. For clinical or research purposes, it may be necessary to divide individuals into 2 groups-say, with or without depression-on the basis of an interview or a scale of depressive symptoms. This is done, for example, with the Center for Epidemiologic Studies Depression Scale,2 where the score can range from 0 to 60. Those who score 17 or more are deemed to suffer from depression; those with lower scores are classified as being without depression. The issue now becomes how we choose the cut point that best divides the sample into these 2 groups.
For historical reasons, the method that's used is called ROC analysis. The name dates back to World War II and the merging of signal detection theory with the development of radar. When the gain of the radar set (comparable to the volume control on a radio) is at zero, no signal (in this case representing an enemy plane) is detected. Increasing the gain lets more signals in, but it also increases the amount of noise that is picked up and possibly misinterpreted as a true signal. At low levels of gain, the noise is very weak and unlikely to be falsely labelled, but at the same time, only very strong signals (very large or close planes, to continue the example) are detected, and many true signals are missed. As the gain is turned up even further, weaker signals are picked up, but so is more noise (things that can seem like planes but are not, such as rain clouds or a flock of birds). At some point, further increases become counterproductive, in that the noise (false positives) begins to outweigh the signals (true positives).
If this terminology reminds you of the language we use in evaluating diagnostic tests, it's not a coincidence. Soon after the war, it was noted that ROC curves could be used in experimental psychology and psychophysics for studies of signal detection.3 People then realized that signal detection theory is exactly what is being used in laboratory medicine and radiology,4,5 albeit with most practitioners unaware of the fact (much like Molière's M Jourdain being unaware that he had been speaking prose for more than 40 years). The "signal" is a test finding indicative of the presence of some abnormality, and the "noise" consists of spurious images that could be misinterpreted as a true signal. The lowest level of "gain" would correspond to a judgment of "definitely, or almost definitely, normal" after, say, reading a computed axial tomography or radionuclide scan to detect a brain tumour.6 No false positives are found, but then again, no abnormalities are detected either. A moderate level of gain would be a judgment of "abnormal," and the highest level is "definitely, or almost definitely, abnormal." Giving this last label to all suspect scans would most definitely catch nearly all the tumours but at the cost of subjecting many patients to unnecessary follow-up investigations or even operations. Obviously, the radiologist wants to choose a subjective cut point that, in his or her own mind, balances these 2 types of mistakes.
Deriving the ROC Curve
To see how the choice of cut point is made in signal detection theory, we'll return to PNP, that dreaded disorder of fear of numbers we've discussed before, and the scale used to detect it.7 We'll assume that the scale to measure PNP, the SPNP, goes from 1 (no phobia) to 10 (the highest degree of phobia). To do a study that will derive the ideal cut point, we will administer the scale to, say, 50 individuals with phobia and 50 without phobia. We also need one other thing: an independent criterion of whether or not the individual suffers from PNP. Although this is commonly referred to as the gold standard, we are often in a position where the quality of the standard, such as clinical judgment or chart diagnosis, is closer to tin or lead. However, in the absence of anything better, it must serve as the gold standard.
The rationale behind finding a cut point is based on the assumption that we are dealing with 2 distributions of scores that are relatively normal: one distribution of scores from individuals who have PNP, as determined by the gold standard, and one of scores from individuals who do not have the disorder (as shown in Figure 1). Notice that the curves overlap, as they almost always do in real life. Some individuals with phobia have SPNP scores that are lower than those of some individuals without phobia. That means that, no matter what we choose as the value of the cut point, we're going to make some mistakes. If we use A on the graph as the dividing line, then we'll catch all of the individuals with phobia, but at the cost of erroneously labelling as phobic a large number of nonfearful individuals. Conversely, C doesn't result in any false positives, but we'll miss a large number of those with phobia. Cut point B is a compromise position: there will be both false positives and false negatives, but the total number of erroneously classified individuals is smallest at this point. Later, we'll discuss why B isn't always the best option, but first, let's see how to quantify the errors at each possible cut point and turn those into an ROC curve.
We begin by making a table, as in Table 1, that shows the number of individuals in each group, as defined by the gold standard, who receive each score on the test. From this, we (or, more accurately and fortuitously, the computer) derive a series of 2 x 2 tables. Because the scale has 10 points, we'll get 9 tables: 1 for each cut point. The first table is derived from the number of those who score 1, compared with those who score 2 or more; the second table gives the data for those who score 1 or 2, compared with those who score 3 or more, and so on, with each table reflecting a progressively higher cut point. In Table 2, we show the results for one of these tables, using a cut point of 5/6. The letters in the table simply identify the 4 cells; we'll use them a bit later.
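The bookkeeping behind these 9 tables is easy to automate. The sketch below uses made-up score lists, not the actual Table 1 data; the function simply counts, for a given cut point, how many individuals in each gold-standard group score above it.

```python
# Sketch of deriving the 9 two-by-two tables from SPNP scores (1-10).
# The score lists are illustrative, not the actual study data.

def two_by_two(phobic_scores, normal_scores, cut):
    """Counts for the table at cut point `cut`/`cut`+1:
    a score above `cut` counts as 'test positive'."""
    tp = sum(s > cut for s in phobic_scores)   # phobic, test positive
    fn = len(phobic_scores) - tp               # phobic, test negative
    fp = sum(s > cut for s in normal_scores)   # normal, test positive
    tn = len(normal_scores) - fp               # normal, test negative
    return tp, fp, fn, tn

phobic = [6, 7, 7, 8, 9, 10]   # hypothetical gold-standard positives
normal = [1, 2, 3, 4, 5, 6]    # hypothetical gold-standard negatives

for cut in range(1, 10):       # cut points 1/2 through 9/10
    tp, fp, fn, tn = two_by_two(phobic, normal, cut)
    print(f"cut {cut}/{cut + 1}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Each pass through the loop produces the 4 cell counts for one of the 9 tables.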
From this table, we can derive the 2 indices that we need: the sensitivity and the specificity8 of the SPNP. Sensitivity refers to the ability of the test to detect individuals who actually have the disorder. The formula is:
Sensitivity = TP / (TP + FN) = a / (a + c)
For a cut point of 5/6, the sensitivity is therefore 44/50 = 0.88. The term specificity means that the test is specific to the disorder being assessed and that it does not give a positive result because of other conditions. The formula is:
Specificity = TN / (TN + FP) = d / (b + d)
which in this case is 36 / 50 = 0.72.
This is done for each of the 9 tables, and we then list the sensitivity and (1 - specificity), as in Table 3. Note that we've added 2 extra rows to the table, 1 above the lowest point (a cut point below 1) and 1 below the highest point (for a cut point over 10). These are obviously impossible values, but we use them so that the curve that we draw will begin in the lower left corner and end at the upper right one. These pairs of values are plotted, with (1 - specificity) on the X axis and the sensitivity on the Y axis, yielding the curve in Figure 2. Note that the TP rate is synonymous with the term sensitivity, the TN rate is the same as specificity, and the FP rate means the same as (1 - specificity); they're simply alternative terms for the same parameters. (To be totally accurate, although a bit pedantic, they are commonly called rates, but strictly speaking they're not; they are proportions, because a rate has time in the denominator.)
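The whole procedure, from cut-point tables to plotted pairs, can be sketched in a few lines. The scores here are again illustrative; the 2 appended points are the impossible end cut points just described.

```python
# Turning the cut-point tables into ROC coordinates: each cut point gives
# (1 - specificity, sensitivity); we add (0, 0) and (1, 1) for the impossible
# end cut points so the curve spans the whole plot. Scores are illustrative.

def roc_points(phobic_scores, normal_scores, cuts):
    points = [(1.0, 1.0)]     # "cut point below 1": everyone called positive
    for cut in cuts:
        sens = sum(s > cut for s in phobic_scores) / len(phobic_scores)
        spec = sum(s <= cut for s in normal_scores) / len(normal_scores)
        points.append((1 - spec, sens))   # FP rate on X, TP rate on Y
    points.append((0.0, 0.0))  # "cut point over 10": everyone called negative
    return sorted(points)      # left to right along the X axis

pts = roc_points([6, 7, 7, 8, 9, 10], [1, 2, 3, 4, 5, 6], range(1, 10))
print(pts)
```

Feeding these pairs to any plotting routine reproduces the jagged empirical curve of Figure 2.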
We've drawn the graph with 2 extra axes: 1 on top and 1 on the right side. These are not usually shown when the results of a study are presented, but we've included them to illustrate the relations among the parameters. As you can see, the sum of the TP rate and the FN rate is 1; when the TP rate increases, the FN rate decreases. Similarly, the TN and FP rates add to 1 and have the same reciprocal relation to one another.
The lower left-hand corner of the curve is analogous to setting the gain to zero on the radar set. No spurious signals are misidentified (the FP rate is zero), but by the same token, no true signals are detected either (the TP rate is zero). Increasing the gain means moving up the curve; the TP rate increases, and so does the FP rate. Initially, the TP rate increases faster than the FP rate until the curve is nearest to the upper left-hand corner. After this point, the TP rate continues to increase, but the FP rate starts to increase faster.
Before we begin to discuss some properties of the ROC, there are a couple of points to note about sensitivity and specificity. First, sensitivity depends only on those who have the disorder and specificity only on those who do not. Consequently, we can derive these estimates, and the ROC curve itself, without having to worry whether the proportion of individuals in each group is representative of the prevalence in the population. Second, as we mentioned before, sensitivity and specificity are like the 2 ends of a seesaw; if one goes up, then the other goes down. Changing the cut point favours one over the other, but you can't increase both without going back and improving the overall performance of the test.
Properties of the ROC Curve
As we'll see, there are several statistics that can be derived from the ROC, but as is often the case, it's best to begin just by looking at the curve. The dotted line indicates the curve for a useless test-one that does not discriminate at all between individuals with and without phobia. A perfect test (which exists only in the dreams of test developers) would run straight up the Y axis until the top and then run horizontally to the right. The more the ROC curve deviates from the dotted line and tends toward the upper left-hand corner, the better the sensitivity and specificity of the test. Further, the cut point that's closest to this corner is the one that minimizes the overall number of errors; in this case, it is 7/8.
The primary statistic we get from the ROC is the AUC. In this case, it is 0.899. This can be compared with the null hypothesis-that the test is useless-which has an AUC of 0.50; that is, one-half of the area in the graph falls below the dotted line, so that 0.399 is between the line and the ROC curve. The AUC can be interpreted in a very useful way. It is the probability that the test will yield a higher value for a randomly chosen individual with the disorder than for a randomly chosen individual who does not have the disorder.9
That means, in this example with an AUC of 0.899, that if we take 2 individuals at random-one with and one without PNP-the probability is nearly 90% that the first individual will have a higher score than the second. (For those who are more statistically inclined, it can be shown that the value of the AUC is identical to the result you would get doing a Wilcoxon or Mann-Whitney U test.10,11) A rough rule of thumb is that the accuracy of tests with AUCs between 0.50 and 0.70 is low; between 0.70 and 0.90, the accuracy is moderate; and it is high for AUCs over 0.90.12
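This probabilistic interpretation can be verified by brute force: take every (phobic, nonphobic) pair, count how often the phobic individual scores higher, and give ties half credit. A small sketch with made-up scores:

```python
# The AUC as a probability: over all (positive, negative) pairs, the
# proportion in which the positive individual scores higher, counting
# ties as one-half. Score lists are illustrative, not study data.

def auc_by_pairs(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_by_pairs([6, 7, 7, 8, 9, 10], [1, 2, 3, 4, 5, 6])
print(auc)
```

Dividing the Mann-Whitney U statistic for the same 2 samples by the product of the group sizes yields the identical value, which is the equivalence noted above.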
As with any parameter, the AUC is an estimate, so an SE is associated with it (0.031 for these data). The SE can be used in 2 ways. First, the ratio of the AUC to the SE is a t test that can be used to see whether the AUC differs significantly from the null. Second, we can put a 95% CI around the estimate of the AUC with the formula:
95% CI = AUC ± 1.96 × SE
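Plugging in the values for these data (AUC = 0.899, SE = 0.031) gives the interval directly:

```python
# Direct application of the CI formula to the article's AUC and SE.
auc, se = 0.899, 0.031
lower, upper = auc - 1.96 * se, auc + 1.96 * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Because the resulting interval lies well above 0.50, we would reject the null hypothesis that the test is useless.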
One problem with the AUC is that there are many ways to estimate it. The issue here is that, at a theoretical level, the ROC curve is a smooth, continuous function. However, we are trying to estimate it from a finite number of points, leading to the somewhat jagged shape seen in Figure 2. No single method of determining the AUC is ideal because the choice depends on the number of points on the scale, the sample sizes of the 2 groups, and the degree of separation between the groups. A fuller, albeit somewhat mathematical and technical, explanation of the various methods is given by Lasko et al,9 who also list some of the free and commercially available software.
Comparing ROC Curves
At first glance, it may seem tempting to compare 2 tests or 2 versions of the same test by simply seeing which one has the larger AUC. However, like many things in life that are tempting, it's not necessarily the wisest thing to do. Simply comparing AUCs works only if the 2 ROC curves do not cross at any point; that is, one curve is consistently higher than the other. However, Figure 3 shows 2 ROC curves with equivalent AUCs but having very different properties. Even when the AUCs differ, if the lines cross, it's necessary to choose between them by using a finer-grained test, called the partial AUC, or pAUC.9,13 Rather than calculating the AUC over the entire range of the test, we focus our attention on the portion of the curves that is of most interest to us when actually using the test, such as the portion between an FP rate of 0.2 and 0.4, or between sensitivities of 0.7 and 0.8. Calculating the pAUC can be difficult, but we can simplify the situation by looking at a specific FP rate rather than a range. Whichever test produces the higher curve at that specific rate is the more useful one. The problem with this simplification is that a small change in the chosen FP rate may change the conclusion if the curves cross in that general area. It's also worth noting that, by using the pAUC, we may end up selecting the test that has the smaller AUC. The consequence is that we might opt for one test on the basis of its pAUC at a specific point, but we'd end up with the poorer test if it were to be used in a different situation where we want a higher or lower FP rate (a point we'll discuss in the next section). This further reinforces the fact that validity is not a property of a test; rather, it depends on the use to which the test is put and on a specific population.14 An example that compares 2 ROC curves, albeit noncrossing ones, is discussed by Cairney et al.15
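For the curious, a pAUC can be computed with nothing more than the trapezoidal rule applied to a clipped band of FP rates. The curve below is a made-up set of ROC points, not one of the tests in Figure 3.

```python
# Minimal sketch of a partial AUC: trapezoidal area under a piecewise-linear
# ROC curve, restricted to a band of FP rates. The points are illustrative;
# in practice they come from the cut-point tables.

def pauc(points, fpr_lo, fpr_hi):
    """points: (FP rate, sensitivity) pairs sorted by FP rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 == x0:
            continue                     # vertical segments add no area
        lo, hi = max(x0, fpr_lo), min(x1, fpr_hi)
        if lo >= hi:
            continue                     # segment lies outside the band
        slope = (y1 - y0) / (x1 - x0)    # interpolate linearly within it
        y_lo = y0 + slope * (lo - x0)
        y_hi = y0 + slope * (hi - x0)
        area += (hi - lo) * (y_lo + y_hi) / 2
    return area

curve = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]
print(pauc(curve, 0.2, 0.4))   # area over FP rates 0.2 to 0.4 only
```

Setting the band to the full range 0 to 1 recovers the ordinary (trapezoidal) AUC, which is a handy sanity check.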
Choosing a Cut Point
Using the cut point that is nearest to the upper left-hand corner is equivalent to selecting B in Figure 1 as the dividing line between normal and abnormal. It's the point that will result in the lowest number of overall errors: FN + FP. This is our goal in many instances, so that is the cut point that is chosen. However, the assumption, either explicit or (more often) implicit, is that the cost of making an FP mistake is the same as the cost of making an FN one. "Cost" in this case doesn't just mean a financial burden. It also includes the consequences of missing a true case or erroneously labelling an individual as abnormal. For example, in screening blood for hepatitis or HIV-AIDS, the cost of an FN must take into account the risk of infecting a blood recipient, whereas the cost of an FP is merely that of discarding a unit of blood that otherwise could have been used. Conversely, giving drugs to a child who has been erroneously diagnosed as having attention-deficit hyperactivity disorder exposes him or her to all the risks of the medication in addition to the adverse consequences of labelling,16,17 whereas missing the diagnosis may simply delay the intervention until the next trip to a specialist.
Swets18 quantifies the considerations that go into the choice of cut point with the formula:
S_opt = [P(neg) / P(pos)] × [(B_TN + C_FP) / (B_TP + C_FN)]
where S_opt is the optimal slope of the ROC (more about this in a minute); P(neg) is the prior probability, or baserate, of a negative finding; P(pos) that of a positive finding; and the Bs and Cs are the benefits and costs, respectively, of the various outcomes-TN, FP, and so on. If you remember high school geometry, you'll recall that a slope of 1 represents a line, tangent to the curve, going up at a 45° angle; that's just where the curve is nearest the upper left corner. Steeper slopes are further down the curve toward the lower left corner, reflecting more stringent criteria; and shallower slopes are nearer to the top of the curve, representing more lenient cut points.
What, then, does this formula tell us? First, let's keep a fixed set of costs and benefits; that is, we'll ignore what's to the right of the multiplication sign. Then, if the baserate of the disorder is low-that is, if we're trying to diagnose a rare condition-P(neg) is greater than P(pos), so the slope is greater than 1. The less prevalent the condition, the more stringent our criterion must be. This reflects the old medical school adage that, if you hear the sound of footsteps, it's more likely to be a horse than a zebra (at least, outside Africa). Conversely, for common disorders, we should use a more lenient criterion. Thus, if we were diagnosing PNP in a community survey, we would select a higher cut point than if we were to use the SPNP in an anxiety disorders clinic. In this latter situation, a higher FP rate is more acceptable because the number of negative cases is relatively small, compared with a community sample.
Now let's keep the ratio of the prior probabilities constant and look at what the right-most part of the formula is telling us. When an intervention's effectiveness is less than ideal and the cost of treating someone without the disorder is high, the cut point should be set at a very stringent level. Conversely, when the benefit of treatment is high and the cost of missing a case is low, we set the cut point much lower.
It's often difficult to quantify the various costs and benefits in units that can be plugged into the equation. One approach is to use relative rather than absolute values for the costs and benefits.19 For example, we could assume that a given disorder is so serious that the cost of an FN is 5 times that of a TP diagnosis. In this case, C_FN would be 5 and B_TP = 1. We would do the same with the cost of an FP diagnosis relative to the benefit of a TN. Even though these are just rough approximations, we should still use the formula at a conceptual level so that we keep in mind the trade-offs we make in selecting any given cut point.
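To see this at work with relative values, Swets' formula, S_opt = [P(neg)/P(pos)] × [(B_TN + C_FP)/(B_TP + C_FN)], can be evaluated with assumed numbers; everything below is purely illustrative.

```python
# Worked example of Swets' optimal-slope formula with assumed values:
# a disorder with a 10% baserate, and relative costs/benefits in which
# missing a case (C_FN) is 5 times the benefit of a correct diagnosis.

def optimal_slope(p_pos, b_tn, c_fp, b_tp, c_fn):
    p_neg = 1 - p_pos
    return (p_neg / p_pos) * ((b_tn + c_fp) / (b_tp + c_fn))

s = optimal_slope(p_pos=0.10, b_tn=1, c_fp=1, b_tp=1, c_fn=5)
print(s)
```

The result, a slope of about 3, shows the two forces at work: the low baserate pushes the slope up by a factor of 9, while the high cost of missed cases pulls it back down by a factor of 3, leaving a fairly stringent cut point overall.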
Sample Size
We need to figure out sample sizes for 2 situations: to calculate the AUC for 1 ROC curve and to compare 2 ROC curves. We can approach the sample size problem for 1 ROC curve in 2 different ways: by testing the hypothesis that the AUC is significantly different from the null hypothesis (that it is equal to 0.50) or by specifying the size of the SE. Let's start off with the latter approach because the former is very similar to testing the difference between 2 AUCs and we can therefore deal with them together.
The rationale for determining the sample size by setting the magnitude of the SE is predicated on the fact that we are not testing a hypothesis; rather, we are estimating a parameter, the AUC. As is always the case in parameter estimation, sample size determines the SE and, by extension, the width of the CI. That means that we have to approach the problem backwards by deciding ahead of time what should be the maximum size of the SE or width of the CI and then seeing, on the basis of our best guess of what the AUC will be, what sample size we'll need to achieve this. The calculations for this can be quite tedious, but Hanley and McNeil11 give a nomogram of sample sizes for various AUCs and SEs.
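The search itself is simple to sketch. The code below uses the Hanley and McNeil11 SE formula for a single AUC (with Q1 = A/(2 - A) and Q2 = 2A^2/(1 + A)) and assumes equal group sizes and an anticipated AUC of 0.85; both of those assumptions are ours, for illustration.

```python
# Sample-size planning by setting a maximum SE: using the Hanley-McNeil
# SE formula for the AUC, increase n until the SE drops to the target.
# Equal group sizes and an anticipated AUC of 0.85 are assumed here.

def auc_se(a, n_pos, n_neg):
    q1 = a / (2 - a)                 # Hanley-McNeil intermediate quantities
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return var ** 0.5

def n_per_group(a, target_se):
    n = 2
    while auc_se(a, n, n) > target_se:
        n += 1
    return n                         # smallest equal group size meeting target

print(n_per_group(0.85, 0.05))
```

The printed value is the number of individuals needed in each group to keep the SE of the AUC at or below 0.05.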
Testing whether an AUC is significantly different from the null can be seen as being the same as determining whether 2 AUCs differ from one another, with the second test having an area of 0.50. However, when we test for the difference between 2 AUCs, we have to take one other factor into consideration: the correlation between the 2 tests. Again, the formulae are fairly formidable, but using a program for sample size calculation,20 we generated Table 4, which gives the sizes for each group for an alpha of 0.05 and beta of 0.20, as well as for 2 values of the correlation: 0.6 and 0.3. As with all sample size calculations, these should be regarded as rough approximations, not as additions to the Gospels.
From detecting enemy planes to assessing the efficiency of a diagnostic test, signal detection theory has produced many interesting applications (our personal preference is for nonmilitary applications, but that's another story). Increasingly, ROC analysis is being adopted in psychiatry to evaluate the accuracy of field-based methods for identifying cases of disorder in population studies21,22 and for screening for disorder in clinical settings,23 to give just 2 examples. With the attendant limitations we have identified, ROC analysis is nevertheless a useful tool for what has always been a dilemma for clinical medicine (whether some clinicians choose to recognize it or not): the trade-off between being right or wrong and the costs of making mistakes in either direction. Although these statistics cannot substitute for thinking (what we sometimes refer to as clinical judgment), they do provide a systematic approach to dealing with this problem.
Funding and Support
No funding or other support was received for this article.
1. Streiner DL. Breaking up is hard to do: the heartbreak of dichotomizing continuous data. Can J Psychiatry. 2002;47(3):262-266.
2. Radloff LS. The CES-D scale: a self-report depression scale for research in the general population. Appl Psych Meas. 1977;1:385-401.
3. Green DM, Swets JA. Signal detection theory and psychophysics. New York (NY): John Wiley & Sons; 1966.
4. Goodenough DJ, Rossmann K, Lusted LB. Radiographic applications of receiver operating characteristic (ROC) curves. Radiology. 1974;110(1):89-95.
5. Swets JA. ROC analysis applied to the evaluation of medical imaging techniques. Invest Radiol. 1979;14(2):109-21.
6. Swets JA, Pickett RM, Whitehead SF, et al. Assessment of diagnostic technologies. Science. 1979;205(4408):753-759.
7. Streiner DL. Building a better model: an introduction to structural equation modelling. Can J Psychiatry. 2006;51(5):317-324.
8. Streiner DL. Diagnosing tests: using and misusing diagnostic and screening tests. J Pers Assess. 2003;81(3):209-219.
9. Lasko TA, Bhagwat JG, Zou KH, et al. The use of receiver operating characteristic curves in biomedical informatics. J Biomed Inform. 2005;38(5):404-415.
10. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating graph. J Math Psychol. 1975;12:387-415.
11. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143(1):29-36.
12. Fischer JE, Bachmann LM, Jaeschke R. A readers' guide to the interpretation of diagnostic test properties: clinical example of sepsis. Intensive Care Med. 2003;29(7):1043-1051.
13. Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology. 2003;229(1):3-8.
14. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 3rd ed. Oxford (UK): Oxford University Press; 2003.
15. Cairney J, Veldhuizen S, Wade TJ, et al. Evaluation of two measures of psychological distress as screeners for depression in the general population. Can J Psychiatry 2007;52(2):111-120.
16. Alderman MH, Charlson ME, Melcher LA. Labelling and absenteeism: the Massachusetts Mutual experience. Clin Invest Med. 1981;4(3-4):165-171.
17. Bergman AB, Stamm SJ. The morbidity of cardiac nondisease in schoolchildren. N Engl J Med. 1967;276(18):1008-1013.
18. Swets JA. The science of choosing the right decision threshold in high-stakes diagnostics. Am Psychol. 1992;47(4):522-532.
19. Erdreich LS, Lee ET. Use of relative operating characteristic analysis in epidemiology. Am J Epidemiol. 1981;114(5):649-662.
20. Hintze JL. PASS user's guide-II. Kaysville (UT): NCSS; 2002.
21. Furukawa T, Kessler RC, Slade T, et al. The performance of the K6 and K10 screening scales for psychological distress in the Australian National Survey of Mental Health and Well-Being. Psychol Med. 2003;33(2):357-362.
22. Kessler RC, Andrews G, Colpe LJ, et al. Short screening scales to monitor population prevalences and trends in non-specific psychological distress. Psychol Med. 2002;32(6):959-976.
23. Nasr S, Popli A, Wendt B. Can the MiniSCID improve the detection of bipolarity in private practice? J Affect Disord. 2005;86(2-3):289-293.
David L Streiner, PhD1, John Cairney, PhD2
Manuscript received April 2006; revised and accepted September 2006.
This is the 26th article in the series on Research Methods in Psychiatry.
For previous articles please see Can J Psychiatry 1990;35:616-20, 1991;36:357-62, 1993;38:9-13, 1993;38:140-8, 1994;39:135-40, 1994;39:191-6, 1995;40:60-6, 1995;40:439-44, 1996;41:137-43, 1996;41:491-7, 1996;41:498-502, 1997;42:388-94, 1998;43:173-9, 1998;43:411-5, 1998;43:737-41, 1998;43:837-42, 1999;44:175-9, 2000;45:833-6, 2001;46:72-6, 2002;47:68-75, 2002;47:262-6, 2002;47:552-6, 2003;48:756-61, 2005;50:115-122, 2006;51:317-24.
1 Director, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre; Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario.
2 Research Scientist, Centre for Addiction and Mental Health; Assistant Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario.
Address for correspondence: Dr DL Streiner, Director, Kunin-Lunenfeld Applied Research Unit, Baycrest, 3560 Bathurst Street, Toronto, ON M6A 2E1; email@example.com…