EDITORIAL

Methodologic Standards for Diagnostic Test Assessment Studies

THE LITERATURE contains numerous reports of assessments of the accuracy of diagnostic tests. These reports usually involve estimation of the sensitivity and specificity of the tests, and occasionally the predictive values. However, beyond the general acceptance of these measures as summaries of test accuracy, there is wide variation in the manner in which the studies are conducted and the results reported. This is unfortunate, because studies of diagnostic tests are susceptible to several common and potentially serious biases.1,2 These include verification bias, caused by selective inclusion of patients on whom the reference test (sometimes referred to as the "gold standard") is performed; test interpretation biases; the exclusion of cases with failed or uninterpretable tests; and the absence of a definitive reference test. Also, the results may be influenced by the composition, or spectrum, of the population studied, the investigators who interpret the tests, and the criterion used to define what is meant by a positive test result.

The four articles on diagnostic test assessment in this issue of the Journal illustrate many of these problems. Arroll and colleagues3 have repeated a previous survey of the medical literature to track changes in the quality of published studies of diagnostic tests. Although they show some improvement since the previous survey, they conclude that the implications of the results of individual studies are often neither fully understood nor adequately described, as evidenced by the lack of attention to the concept of prevalence and the estimation of predictive values. I agree with their conclusion that more attention needs to be paid to the development of methodologic standards for these studies.
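The dependence of the predictive values on prevalence, which the editorial emphasizes, can be made concrete with a short calculation. The following sketch is not from any of the papers discussed; the function name and all operating characteristics are hypothetical choices used only to illustrate Bayes' theorem:

```python
# Illustration: predictive values depend on prevalence as well as on
# sensitivity and specificity. All numbers here are hypothetical.

def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    # Marginal probability of a positive test result
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos
    npv = specificity * (1 - prevalence) / (1 - p_pos)
    return ppv, npv

# The same 90%-sensitive, 90%-specific test at two prevalences:
ppv_low, _ = predictive_values(0.90, 0.90, 0.01)   # screening setting
ppv_high, _ = predictive_values(0.90, 0.90, 0.30)  # referral setting
print(round(ppv_low, 3), round(ppv_high, 3))
```

With these hypothetical numbers the positive predictive value is roughly 8% at 1% prevalence but nearly 80% at 30% prevalence, which is why reporting sensitivity and specificity without attention to prevalence leaves the clinical implications unclear.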
However, I feel that the list of methodologic criteria used by Arroll et al. needs to be expanded to encompass the potential biases and extrapolation factors mentioned earlier. In particular, it is critical that the mechanism by which patients are selected be described, and that it be clarified that such selection is not influenced indirectly by the result of the test, or else verification bias may be a serious possibility. For instance, one should be suspicious of studies where it is stated that, say, the analysis is restricted to biopsy-verified cases. In this circumstance it is very likely that the decision to biopsy might be prompted by positive or suspect test results, as would be the case in screening for or staging of cancer. Similarly, it is important that patients with failed, uninterpretable, or equivocal test results be included in the analysis, since failure to do so may artificially inflate the performance of the test.4,5 Although Arroll et al. pay careful attention to the "gold standard" reference test and are aware of the possibility that it may not be definitive, it is worthwhile in such circumstances to attempt some follow-up of the patients in an effort to detect possible misclassifications. Finally, as regards the issue of blinding, it is certainly important that the two tests be assessed independently. However, blinding is often used for other purposes in studies of diagnostic tests. For example, in the radiologic setting, it is common practice for the "reader" of the test under investigation to be blinded to available clinical information that might influence his or her interpretation. Such blinding is of dubious value as a methodologic device, since it creates an artificiality that is not present in the clinical setting.6

Probably the most common and serious methodologic problem is verification bias, and its impact is dramatically illustrated in the paper by Simel and colleagues.7 For example, in their Table 3, the sensitivity estimate for PG-2,3 residents changes from 82% to 10% after adjustment for the bias. The reason for this phenomenon can be seen from their appendix. Ignoring the intermediates, the naive sensitivity estimate is the proportion of positive results among patients who have ascites, i.e., 82% (9/11). However, the denominator of this estimate is underestimated, since most of the patients who had negative clinical examinations did not have these results verified by ultrasonography, while the numerator is appropriate since all patients who had positive examinations had such verification.
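In weighting terms, the correction worked through in the next paragraph amounts to dividing the count of verified negatives by the sampling fraction. The sketch below is mine, not Simel et al.'s appendix verbatim; the true-positive count of 9 is inferred from the reported naive (82%) and adjusted (10%) sensitivities rather than stated in the editorial:

```python
# Sketch of the verification-bias correction: unverified clinical negatives
# are weighted up by the inverse of the verification fraction.
# The true-positive count of 9 is an inference, not a figure quoted directly.

def adjusted_sensitivity(tp, fn_verified, verification_fraction):
    """Weight verified false negatives up to the full negative-exam population."""
    fn_total = fn_verified / verification_fraction
    return tp / (tp + fn_total)

naive = 9 / (9 + 2)  # 0.818..., the 82% naive estimate
adjusted = adjusted_sensitivity(9, 2, 0.025)  # 2 / 0.025 = 80 estimated missed cases
print(round(naive, 2), round(adjusted, 2))
```

Setting the verification fraction to 1 recovers the naive estimate, which makes explicit that the naive calculation silently assumes complete verification of negative examinations.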
To make the correction we must recognize that in fact only 2.5% of the patients who had negative clinical examinations are reported in the table. Therefore, rather than 35 patients with negative clinical examinations, of whom two had ascites, there really were approximately 1,400 patients (35/0.025), of whom we estimate that 80 (2/0.025) had ascites. Therefore, the adjusted estimate of sensitivity is 10% (9/89), as indicated. Although the bias is extreme in this example, this is a frequent problem, and will occur to a greater or lesser extent depending on how strongly the decision to obtain the reference test is influenced by the result of the study test. This relationship tends to be very strong for screening tests, as in this example. The correction technique outlined by Simel et al. can be modified to adjust for the impact of clinical factors that affect the decision to verify, and indeed such modification is necessary if such factors play a role in the selection process.8

JOURNAL OF GENERAL INTERNAL MEDICINE, Volume 3 (Sept/Oct), 1988

Verification bias may also be a problem in the article by Kinney.9 He studied 521 consecutive patients who, among other things, each had a Doppler echocardiography study performed. However, it is not clear from the study description how many patients there were who had the routine auscultation but who were not referred for echocardiography. Failure to refer patients with negative clinical examinations would lead to overestimation of the sensitivity. However, a major finding of the study is that the sensitivities are consistently low, and verification bias, if present, would mean that the estimates should be even lower.

An important message from the study by Kinney is the relation of covariates (i.e., patient and physician characteristics) to the performance of the test. This is often overlooked in diagnostic assessment studies, the norm being merely to provide aggregated sensitivity and specificity estimates when in fact the test may be useful in some settings but not in others. Of special importance is the demonstration of interobserver variation, since, in principle, the poor diagnostic performance of a particular physician might be improved by training intervention. This study is also a good example of how "messy" a study can be due to unavoidable complexities, such as the 112 missing Doppler examinations, the problems with equivocal and poor-quality tests, and the fact that many patients received more than one examination. Finally, there is the problem that the reference test is an imperfect "gold standard," which is the subject of the remaining paper. Boyko et al.10
have used a formula developed by Gart and Buck11 to examine anticipated biases in the estimates of the test characteristics, as a function of the error rates in the reference test and the prevalence of disease in the study population. They conclude that the bias in sensitivity can be large if the prevalence is low, and the bias in the specificity can be large if the prevalence is high, and that these biases always result in underestimation of the test characteristics, except under "unlikely conditions." Unfortunately, the conditional independence assumption on which these results are based is typically implausible, and generally untestable. If the results of two tests are positively correlated, as is likely (especially if they are primarily detecting the same indication), then the test characteristics are more likely to be overestimated. In any case, it is certainly true that, regardless of the presence or absence of conditional independence, biases are likely if the reference test is imprecise, and a good study design involves follow-up to try to identify the specific misclassifications.

However, as regards the examples studied by Boyko et al., I am inclined to believe that other interpretations of the observed variations in test characteristics are more plausible. For example, it is quite reasonable that a test be more "accurate" in a high-prevalence setting where the disease is more commonly severe. Indeed, this is the reason used by Hlatky et al.12 to explain the results of their study of exercise electrocardiography. Also, these data were obtained retrospectively from a database composed of patients who underwent cardiac catheterization. Since the result of exercise electrocardiography could influence the decision to obtain the reference test, there is the potential for verification bias. This selection probably would also be influenced by the symptoms of angina, so that differential bias for different symptom combinations is quite plausible, leading to the observed variations. Variations in the testing criteria often lead to patterns like those observed, with relatively high sensitivity and low specificity in the high-risk subgroup, although in this particular study the test criterion appears to be fairly objective. Unfortunately, all this is speculative. To resolve these issues one would need more extensive data: the test result recorded in more detail on an ordinal scale to facilitate ROC analysis, and information about the symptoms and test results of patients not referred for the reference test.

It is clear from these examples that interpretation and reporting of diagnostic assessment studies is complex and riddled with pitfalls. How can the methodology be improved? Earlier, I described some of the potential biases that are rarely recognized but can have a significant impact and outlined some rudimentary strategies for circumventing them.
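The direction of the imperfect-reference bias discussed above can be illustrated numerically. The sketch below is not taken from Boyko et al.; the function name and the 90%/90% test and 95%/95% reference operating characteristics are hypothetical choices. It computes the "apparent" characteristics of a test scored against an imperfect reference, under the conditional independence assumption whose plausibility is questioned in the text:

```python
# Apparent sensitivity/specificity of a study test when the "gold standard"
# itself misclassifies, assuming the two tests are conditionally independent
# given true disease status. All operating characteristics are hypothetical.

def apparent_characteristics(se_t, sp_t, se_r, sp_r, prev):
    """Apparent Se/Sp of the study test scored against the reference test."""
    # Joint probabilities, mixing over true disease status
    p_both_pos = prev * se_t * se_r + (1 - prev) * (1 - sp_t) * (1 - sp_r)
    p_ref_pos = prev * se_r + (1 - prev) * (1 - sp_r)
    p_both_neg = prev * (1 - se_t) * (1 - se_r) + (1 - prev) * sp_t * sp_r
    app_se = p_both_pos / p_ref_pos
    app_sp = p_both_neg / (1 - p_ref_pos)
    return app_se, app_sp

# A 90%/90% test against a 95%/95% reference: the apparent sensitivity is
# badly attenuated at low prevalence but much less so at high prevalence.
se_low, _ = apparent_characteristics(0.90, 0.90, 0.95, 0.95, 0.01)
se_high, _ = apparent_characteristics(0.90, 0.90, 0.95, 0.95, 0.50)
print(round(se_low, 2), round(se_high, 2))
```

Under these hypothetical inputs the apparent sensitivity falls from the true 90% to roughly 23% at 1% prevalence, consistent with the conclusion that underestimation dominates at low prevalence; as the text notes, positive correlation between the tests would push the apparent values in the opposite direction.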
In addition to these biases and the other issues covered in the checklist used by Arroll et al., we need to be concerned about statistical variation in the estimates. Sample sizes are often small, and so the use of confidence intervals would help to clarify the lack of precision. Researchers may want to aggregate data from different published studies in order to achieve adequate precision. Consequently, the criterion of positivity needs to be clearly reported in order that we do not mix apples and oranges. If the criterion varies, then the use of ROC analysis to reconcile the trade-off between sensitivity and specificity is necessary.13 Likewise, variation in the composition of the source population mandates analytic adjustments, as does the presence of potential bias.

In summary, we need further research on clarifying the design options for diagnostic assessment studies, and on providing guidelines for analyzing and reporting the results, in order to improve the literature as a reliable source of information.

-- Colin B. Begg, PhD, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115
Supported by the National Cancer Institute, Award CA-31247. I wish to thank Peter Doubilet for helpful consultations in preparing this editorial.
REFERENCES
1. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411-23.
2. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-30.
3. Arroll B, Schechter MT, Sheps SB. The assessment of diagnostic tests: a comparison of the recent medical literature -- 1982 versus 1985. J Gen Intern Med 1988;3:443-7.
4. Poynard T, Chaput JC, Etienne JP. Relations between effectiveness of a diagnostic test, prevalence of the disease, and percentages of uninterpretable results. Med Decis Making 1982;2:285-302.
5. Begg CB, Greenes RA, Iglewicz B. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis 1986;39:575-84.
6. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988;167:565-9.
7. Simel DL, Halvorsen RA, Feussner JR. Quantitating bedside diagnosis: clinical evaluation of ascites. J Gen Intern Med 1988;3:423-8.
8. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983;39:207-15.
9. Kinney EL. Causes of false-negative auscultation of regurgitant lesions: a Doppler echocardiographic study of 294 patients. J Gen Intern Med 1988;3:429-34.
10. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med 1988;3:476-81.
11. Gart JJ, Buck A. Comparison of a screening test and a reference test in epidemiologic studies. Am J Epidemiol 1966;83:593-602.
12. Hlatky MA, Pryor DB, Harrell FE, Califf RM, Mark DB, Rosati RA. Factors affecting sensitivity and specificity of exercise electrocardiography. Am J Med 1984;77:64-71.
13. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-98.
LETTERS TO THE EDITORS

Expanding the Ranks of Academic General Internal Medicine -- Not Just Any Port in a Storm

To the Editors:--In view of the "storm warnings" of decreasing general internal medicine fellowship opportunities forecast in the Journal,1 the Society of General Internal Medicine (SGIM) should reconsider the signals it sends to graduates of internal medicine residency programs. SGIM appears to have defined the research domain of the academic general internist in a way that excludes support for basic biomedical research. Meanwhile, medical subspecialty societies eagerly welcome the basic science contributions of those internists who are willing to "trim their sails" in their fellowship programs. Basic research often provides a foundation from which the research and practice behaviors of academic general internists arise, and confirms the appropriateness of supporting all of the research interests of those who declare general internal medicine to be their main clinical vocation. To do otherwise is to encourage potential academic general internists to seek "any port in a storm" when selecting a clinical specialty that also advances their research interests. SGIM may do well to reassess its options for expanding its ranks by legitimizing basic science research training as a discipline offered in general internal medicine fellowship programs. Such programs would train academic general internists with special perspectives for integrating basic biomedical advances with clinical medical practice and societal priorities. In academic general internal medicine, where the waters are getting rougher, SGIM is in the important position of determining which sailors can call this dock their home.
For some of us, it will mean the difference between "any port in a storm" and "full steam ahead."--Michael H. Zaroukian, MD, PhD, Division of General Medicine, Department of Medicine, College of Human Medicine, Michigan State University, East Lansing, Michigan 48824

References
1. Fletcher SW, Fletcher RH. Storm warnings. J Gen Intern Med 1988;3:205-6.
Will Sharing Uncertainty Reduce Physician Effectiveness?

To the Editors:--In their study of physician uncertainty, Johnson et al. convincingly demonstrate that physician disclosure of uncertainty reduces patient satisfaction,1 but argue for just such disclosure. I find that position unjustified. The authors' findings illustrate the need for idealization in doctor-patient relationships. Traditional psychoanalytic thinking supports their argument, viewing patients' idealizations of doctors as unrealistic distortions which foster dependence.2 Recently, Heinz Kohut and the self-psychologists described idealization as an important and healthy part of psychological development.3 We all tend to idealize important figures. Our security and self-esteem benefit from seeing them as powerful and gifted. Their failings leave us shaken and disappointed. The importance of idealized relationships increases for those who feel ill or frightened, as our patients often do. Clinical investigation supports the importance of idealization to the healing role of doctors. Studies of responses to placebo find idealizing attitudes toward doctors and medical treatment to be the most important predictors of therapeutic effect.4,5 Idealization of doctors probably contributes to the beneficial effect of all doctor-patient encounters. Johnson et al. found that their patients who preferred a more idealized relationship were most dissatisfied with physician uncertainty. Other studies have shown that many patients have similar preferences.6 This appeal to discuss uncertainty seems more a moral belief than a scientific conclusion. It prescribes a uniform relationship with all patients, believing that we doctors know best what patients need. This seems to me paternalism revisited. It argues that acknowledging uncertainty will foster trust and respect.
Patients who request more idealized relationships, though, are telling us the conditions under which they would feel more supported, respected, and able to trust.--Gregory E. Simon, MD, Primary Care Psychiatry, Massachusetts General Hospital, Boston, MA 02114