World J. Surg. 29, 561–566 (2005) DOI: 10.1007/s00268-005-7913-y
How to Appraise a Diagnostic Test

Mohit Bhandari, M.D., M.Sc., Gordon H. Guyatt, M.D., M.Sc.

Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Center, McMaster University, Room 2c12, 1200 Main Street West, Hamilton, Ontario, L8N 3Z5, Canada

Published Online: April 14, 2005

Abstract. Clinicians frequently confront challenges when using diagnostic tests to help them decide whether the patient before them suffers from a particular target condition or diagnosis. The primary issues to consider when determining the validity of a diagnostic test study are how the authors assembled the patients and whether they used an appropriate reference standard in all patients to determine whether the patients did or did not have the target condition. Surgeons should be interested in the characteristic of the test that indicates the direction and magnitude of the change in the probability of the target condition associated with a particular test result. The likelihood ratio best captures the link between the pretest probability of the target condition and the probability after the test results are obtained (also called the posttest probability). Many studies, however, present the properties of diagnostic tests in less clinically useful terms: sensitivity and specificity. Sensitivity denotes the proportion of people with the disorder in whom the test result is positive. Specificity denotes the proportion of people without the disorder in whom the test result is negative. Application of the guides presented in this article can allow surgeons to critically assess studies regarding a diagnostic test.
Clinicians frequently confront challenges when ordering and interpreting diagnostic tests. Investigators studying a diagnostic test hope to establish the power of that test to differentiate between patients with the target condition (the disease or health state) and those who are free of the target condition. Patients free of the target condition may be healthy or have one of the competing diagnoses. The credibility, believability, or validity of a study is only as good as the methods used in its conduct [1].

Are the Results of the Study Valid?

The primary issues to consider when determining the validity of a diagnostic test study are how the authors assembled the patients and whether they used an appropriate reference standard in all patients to determine whether patients did or did not have the target condition (Table 1).
Correspondence to: Mohit Bhandari, M.D., M.Sc., Hamilton Health Sciences-General Site, 7 North Wing, Suite 727, 237 Barton St., East, Hamilton, ON, L8L 2X2, Canada, e-mail:
[email protected]
Was There Diagnostic Uncertainty?

How do you know whether the investigators chose a suitable population or whether their choice threatens the study's validity? The specific question to ask is whether the surgeons caring for the study patients faced genuine diagnostic uncertainty. Tests are able to distinguish easily between severely affected and healthy patients (if, in preliminary studies, they fail to have even this degree of discriminative capacity, they can readily be discarded). The reason for this excellent diagnostic performance relates to the minimal overlap between test results in severely ill patients and test results in healthy volunteers. Clinicians do not need tests to distinguish between healthy individuals and severely affected patients but, rather, to help resolve diagnostic uncertainty. Unfortunately, diagnostic tests typically do less well distinguishing target positive patients from target negative patients suffering from conditions easily confused with the target condition than distinguishing normals from those severely affected. A study of bias in studies of diagnostic tests revealed that choosing target positive and target negative patients from different populations leads to threefold overestimates of the power of the test (relative diagnostic odds ratio = 3.0, 95% CI 2.0–4.5) [2]. For instance, a shoulder ultrasound scan is almost always positive in patients with a severe rotator cuff tear who are unable to abduct their shoulder. On the other hand, it is almost never positive in healthy individuals. The challenge is to distinguish those with mild tears from those with shoulder problems who do not have rotator cuff tears. The use of carcinoembryonic antigen (CEA) to detect colorectal cancer provides a striking example of the variable utility of a diagnostic test in populations with different disease severity. CEA levels were elevated in 35 of 36 patients with established cancer, whereas they were much lower in patients without cancer [3].
However, when a subsequent study applied CEA testing in patients with less advanced stages of colorectal cancer, the CEA test results were similar enough to those in patients without cancer that the ability of the test to distinguish the two groups declined significantly. Accordingly, the use of CEA in the diagnosis of cancer was abandoned [4]. Prickett and colleagues, in a study evaluating the utility of postoperative shoulder ultrasound scan, included a wide spectrum of patients with low, moderate, and high clinical suspicion of a
Table 1. Guidelines for evaluating studies about a diagnostic test.

1. Are the results of the study valid?
   a. Primary guides
      Did clinicians face diagnostic uncertainty?
      Was there an independent, blind comparison with a reference standard?
   b. Secondary guides
      Did the results of the test being evaluated influence the decision to perform the reference standard?
      Were the methods for performing the test described in sufficient detail to permit replication?
2. What are the results?
   a. Are likelihood ratios of the test being evaluated, or the data necessary for their calculation, provided?
3. Will the results help me care for my patients?
   a. Will the reproducibility of the test result and its interpretation be satisfactory in my setting?
   b. Are the results applicable to my patient?
   c. Will the results change my management?
   d. Will patients be better off as a result of the test?
rotator cuff tear [5]. In doing so, the authors assembled an appropriate spectrum of patients.

Was There an Independent Comparison with a Reference Standard?

The accuracy of a diagnostic test is best determined by comparing it with the truth. Truth about whether the disease is present is usually defined by the presence or absence of a pathologic finding that represents the condition (i.e., an essential lesion). A reference standard that uses that pathologic finding is most desirable. Reference standards that do not use an essential lesion are at risk of miscategorizing patients. Therefore, judgment should be used to decide whether the chosen reference standard is appropriate. Accordingly, readers must assure themselves that the investigators have applied both the test under investigation and an appropriate reference standard (e.g., biopsy, surgery, autopsy, or long-term follow-up) to every patient. By independent we mean that the individual interpreting the reference standard should be unaware of (blind to) the results of the test, and the person interpreting the test should be unaware of the findings from the reference standard. To the extent that this blinding is not achieved, the investigation is likely to overestimate the diagnostic power of the test. In the study by Lijmer et al., lack of blinding significantly overestimated test performance (relative diagnostic odds ratio = 1.3, 95% CI 1.0–1.9) [2]. For example, surgeons who find a hip fracture using a nuclear bone scan or magnetic resonance imaging (MRI) are more likely to identify a previously undetected fracture line on plain radiographs. In one study evaluating the use of plain radiography and MRI to detect avascular necrosis following hip fractures, investigators did not report independent assessments of the two modalities (plain radiography and MRI).
Thus, the investigators who identified MRI changes at 2 months may have been more suspicious of plain radiographs that initially appeared normal but were ultimately classified as abnormal [6]. Another way a lack of independence can mislead is if the test under evaluation is a component of the reference standard. For example, in one study that evaluated the utility of the serum and urinary amylase tests for diagnosing pancreatitis [7], the investigators constructed a reference standard that included serum and urinary amylase assays. This incorporation of the test into the
reference standard overestimates the utility of the test under evaluation. Thus, clinicians should assure themselves that the test under evaluation is not part of the reference standard. In the study by Prickett and colleagues, all patients underwent preoperative shoulder ultrasound scans to determine the presence or absence of a tear. The authors further described that ultrasound assessments were performed by two independent and experienced radiologists (i.e., >2500 ultrasound scans over 10 years) [5]. The investigators defined a tear as one requiring operative repair (i.e., >75% of rotator cuff thickness or >5 mm of insertional loss lateral to the humeral head articular surface) [5]. Having asked the most critical questions that assist in determining study validity, you can further reduce your chances of being misled by asking an additional question.

Did the Results of the Test Being Evaluated Influence the Decision to Perform the Reference Standard?

The properties of a diagnostic test are distorted if its results influence the decision to carry out the reference standard. This situation, called verification bias [7, 8] or workup bias [9, 10], applies when, for example, investigators conduct further evaluation with the reference standard only in patients with a positive test and assume that those with a negative test do not have the disease. In practice, this leads to an overly sanguine estimate of the ability of the test under evaluation to differentiate target positive from target negative patients. Generally, when the reference standard is invasive (e.g., surgical biopsy), surgeons are less likely to apply it when the probability of disease is low. Verification bias occurred in a study evaluating the diagnostic utility of fine-needle aspiration biopsy (FNAB) to determine thyroid malignancy [11].
When investigators identified benign lesions on FNAB, the thyroid nodule was not resected for definitive pathologic diagnosis, whereas patients with malignant or uncertain lesions on FNAB underwent further reference standard examination with surgical resection and pathology. This study is likely to overestimate the power of the test to exclude a malignancy. Examining the methods section of the article by Prickett et al., one notes that all patients underwent arthroscopy to determine whether a cuff tear was present. Thus, the results of the ultrasound scans did not influence the decision to conduct the reference standard investigation (i.e., arthroscopy) in these patients.
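The distortion that verification bias produces is easy to demonstrate numerically. The sketch below uses invented counts (a hypothetical test with true sensitivity 80% and specificity 90%) and assumes, as in the FNAB example, that only test-positive patients receive the reference standard while test-negative patients are simply assumed to be target negative:

```python
# True 2x2 table in 200 patients (hypothetical numbers for illustration)
tp, fn = 80, 20    # 100 target-positive patients; true sensitivity 0.80
fp, tn = 10, 90    # 100 target-negative patients; true specificity 0.90

# Verification bias: only the 90 test-positive patients are verified with
# the reference standard; the 110 test-negative patients are assumed healthy.
observed_tp, observed_fp = tp, fp   # verified, so counted correctly
observed_fn = 0                     # the 20 missed cases are never found
observed_tn = tn + fn               # false negatives mislabeled as true negatives

apparent_sens = observed_tp / (observed_tp + observed_fn)   # 1.00, not 0.80
apparent_spec = observed_tn / (observed_tn + observed_fp)   # 0.92, not 0.90
print(apparent_sens, round(apparent_spec, 2))
```

The false negatives are silently relabeled as true negatives, so the apparent sensitivity jumps from 80% to a perfect 100%: exactly the overly sanguine estimate described above.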
What Are the Results?

The starting point for using any diagnostic process is to determine the probability with which the target disease is present in a given patient group before obtaining the next diagnostic test. Let us consider two patients: a 55-year-old woman who presents with intense shoulder pain and the inability to abduct her right arm following subacromial shoulder decompression surgery; and a 50-year-old otherwise healthy woman with intermittent right shoulder pain and a normal physical examination 1 week following repair of a partial-thickness tear of her rotator cuff. Most surgeons would consider that the probability of a rotator cuff tear in these two patients is different. The probability, referred to as the pretest probability, of a rotator cuff tear in the 55-year-old with shoulder pain and lack of shoulder abduction is much higher than the
probability of a tear in the 50-year-old even before we conduct additional diagnostic tests. How can surgeons estimate the pretest probability? Literature on disease probability given a certain presentation (e.g., discussing the probability of a rotator cuff tear in patients presenting with pain and a positive physical examination), similar data derived from the hospital's registry, and surgeons' clinical experience and intuition can help surgeons estimate the pretest probability. Another source of information for estimating pretest probabilities comes from the same studies that provide data about a diagnostic test. For instance, in the study of Prickett et al., 50% (22/44) of the patients were found to have a rotator cuff tear. Based on a patient's history and clinical examination, surgeons can estimate the pretest probability of a rotator cuff tear [5]. The next step is to decide how the results of the ultrasound scans change this estimate of the probability of a rotator cuff tear. In other words, surgeons should be interested in the characteristic of the test that indicates the direction and magnitude of this change. This characteristic of the test is best captured in the likelihood ratio [1]. The likelihood ratio is the characteristic of the test that links the pretest probability to the probability of the target condition after the test results are obtained (also called the posttest probability).

What Are the Likelihood Ratios Associated with the Test Results?

Figure 1 presents results from the study by Prickett and colleagues. There were 22 patients with a proven rotator cuff tear and 22 patients in whom a tear was ruled out. How likely is a negative ultrasound scan among those patients who have a tear? Figure 1 reveals that 2 of 22 patients (i.e., 0.09) with a tear had a negative ultrasound scan. The investigators found a normal ultrasound test in 20 of 22 (0.91) patients without a tear.
The ratio of these two proportions is the likelihood ratio for a negative ultrasound scan and equals 0.099. In other words, a negative ultrasound scan is 10.1 times (1/0.099) less likely to occur in patients with a tear than in those without one. Alternatively, a positive ultrasound scan is 10.1 times more likely to occur in patients with a rotator cuff tear than in those without one (Fig. 1). How can we use the likelihood ratio (LR)? The LR tells you how much the pretest probability increases or decreases. For instance, an LR of 1.0 does not change the pretest probability, whereas an LR of >1 increases it. A rough guide to the interpretation of LRs is as follows: LRs >10 or LRs <0.1 generate large, often conclusive changes in the posttest probability; LRs of 5 to 10 or 0.1 to 0.2 generate moderate shifts in posttest probability; LRs of 2 to 5 or 0.2 to 0.5 generate small (but sometimes important) changes in probability; and LRs of 1 to 2 or 0.5 to 1.0 alter posttest probability to a small degree [1]. Having determined the LRs, how do we use them to link the pretest to the posttest probability? A simple but tedious calculation converts the pretest probability to pretest odds [odds = probability/(1 − probability)]. The clinician can then multiply the pretest odds by the LR to obtain the posttest odds. Using another calculation, the posttest odds can be converted back to the posttest probability [probability = odds/(1 + odds)]. To save time and avoid computations, Fagan proposed a nomogram for converting pretest probability to posttest probability using LRs [12]. The clinician obtains the posttest probability
by placing a straight edge so that it aligns the pretest probability with the likelihood ratio for the diagnostic test. Table 2 describes this approach across varying pretest probabilities. Formally, new knowledge (posttest probability) derived from the revision of previous knowledge (pretest probability) when new information arrives (likelihood ratio) is an application of Bayes' theorem to diagnosis. Table 2 shows us that a negative test effectively rules out a rotator cuff tear when the pretest probability is 50% or less. If the pretest probability is higher than 50%, a negative result leaves appreciable uncertainty. Similarly, a positive result leaves the surgeon extremely confident of the diagnosis of a tear if the pretest probability is above 50%. Below that value, even a positive test leaves some doubt. As is evident from the above examples, likelihood ratios provide the most effective approach to using diagnostic tests. However, many studies present the properties of diagnostic tests in less clinically useful terms: sensitivity and specificity. Sensitivity denotes the proportion of people with the disorder in whom the test result is positive. Specificity denotes the proportion of people without the disorder in whom the test result is negative. Using the rules provided in Figure 1, we can calculate the sensitivity and specificity of ultrasonography for detecting rotator cuff tears. To calculate sensitivity, we divide the total number of patients with a rotator cuff tear and a positive ultrasound scan (true positives, n = 20) by the total number of patients who had a proven tear (true positives + false negatives, n = 22). Thus, the sensitivity is 91%. We can obtain the specificity by dividing the total number of patients with no tear and a negative ultrasound scan (true negatives, n = 20) by the total number of patients with no tear (true negatives + false positives, n = 22). Therefore, the specificity is also 91%.
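The calculations in this section are simple enough to sketch in a few lines of Python. The counts are those reported by Prickett et al.; the variable and function names are ours:

```python
# Prickett et al.'s 2x2 counts, with arthroscopy as the reference standard
tp, fn = 20, 2   # patients with a tear: positive / negative ultrasound
fp, tn = 2, 20   # patients without a tear: positive / negative ultrasound

sensitivity = tp / (tp + fn)   # 20/22 = 0.91
specificity = tn / (tn + fp)   # 20/22 = 0.91

# The likelihood ratios follow directly from the same counts
lr_pos = sensitivity / (1 - specificity)   # (20/22) / (2/22) = 10
lr_neg = (1 - sensitivity) / specificity   # (2/22) / (20/22) = 0.1

def posttest_probability(pretest: float, lr: float) -> float:
    """Pretest probability -> pretest odds -> multiply by LR -> posttest probability."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

# Reproduce one row of Table 2: a 50% pretest probability
print(round(posttest_probability(0.50, 10), 2))    # positive scan -> 0.91
print(round(posttest_probability(0.50, 0.09), 3))  # negative scan -> 0.083
```

Note that the raw counts give likelihood ratios of exactly 10 and 0.1; the 10.1 and 0.099 quoted earlier arise from rounding the sensitivity and specificity to 0.91 and 0.09 before dividing.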
Tests with high sensitivity are useful for ruling out disease, and tests with high specificity are useful for ruling in disease. For example, because almost all patients with scaphoid fractures have anatomic snuffbox tenderness (a highly sensitive test), its absence virtually rules out a scaphoid fracture [13]. In patients with neck injuries, the absence of five clinical features (midline cervical tenderness, focal neurologic deficit, impaired alertness, intoxication, history of distracting injury) reduces the probability of important cervical spine injury to less than 1% [14]. In patients suspected of having full-thickness tears of the rotator cuff, because ultrasonography has a sensitivity of 100%, a normal ultrasound scan rules out a full-thickness tear [15]. The three preceding examples are all situations in which highly sensitive tests, if negative, can rule out a target condition. The posterior drawer test for the diagnosis of posterior cruciate ligament injuries is highly specific. Rubinstein and colleagues conducted a study to determine the diagnostic utility of this test among a varied population of patients, including those with normal knees, anterior cruciate-deficient knees, and posterior cruciate-deficient knees. Among blinded assessors, a specificity of 99% was reported [16]. Thus, its presence makes the diagnosis of posterior cruciate ligament injury virtually certain [16].

Sensitivity and specificity exhibit drawbacks. When calculating sensitivity and specificity, important information is often discarded to collapse data into the 2 × 2 table format. Moreover, multiple recalculations of sensitivity and specificity are often necessary at each potential cutoff point (or division) when considering the results of a continuous variable (e.g., blood pressure) or a test that is reported as one of a number of categories (e.g., high, intermediate, or low probability on a ventilation-perfusion scan). Finally, there is no convenient nomogram that allows us, with knowledge of sensitivity, specificity, and a particular test result, to convert the pretest to the posttest probability. However, one can translate these measures into likelihood ratios (Fig. 1). Similar drawbacks affect the calculation of predictive values (Fig. 1).

Fig. 1. Diagnostic thresholds.

Can the Results Help Me Care for My Patient?

Having assessed the validity of the article and performed the necessary simple calculations to understand its results, we can ask ourselves whether these results can help us care for the patient.
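One drawback of predictive values noted above is worth making concrete: unlike sensitivity and specificity, they shift with the prevalence of the target condition in the population tested. A brief sketch (the function name and the 10% prevalence figure are our illustration, not from the study):

```python
def predictive_values(sens: float, spec: float, prevalence: float):
    """Positive and negative predictive values at a given prevalence.

    Works on expected proportions rather than raw counts, so any
    prevalence can be plugged in for the same sensitivity/specificity.
    """
    tp = sens * prevalence               # diseased, test positive
    fp = (1 - spec) * (1 - prevalence)   # healthy, test positive
    fn = (1 - sens) * prevalence         # diseased, test negative
    tn = spec * (1 - prevalence)         # healthy, test negative
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# The same 91%-sensitive, 91%-specific ultrasound test at two prevalences
for prev in (0.50, 0.10):
    ppv, npv = predictive_values(0.91, 0.91, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With identical test properties, the positive predictive value falls from 91% at a 50% prevalence to roughly 53% at a 10% prevalence, which is why sensitivity, specificity, and likelihood ratios travel between settings better than predictive values do.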
Table 2. Pretest probabilities, likelihood ratios, and posttest probabilities.

Pretest probability (%)    Ultrasound result (LR)    Posttest probability (%)
Negative test
80 (high probability)      0.09                      26
50                         0.09                      8.3
30                         0.09                      3.7
10 (low probability)       0.09                      0.9
Positive test
80 (high probability)      10                        98
50                         10                        91
30                         10                        81
10 (low probability)       10                        53

LR: likelihood ratio.

The value of a diagnostic test often depends on its reproducibility when applied to patients. If a test requires much interpretation (electrocardiograms, pathology specimens) or uses laboratory assays (stains, biochemical assays), variations in test results can occur. If a study reports a test as being highly reproducible, two possibilities are likely: either the test is quite simple and easy to apply to patients, or the investigators in the study are highly skilled in applying this diagnostic test to the study patients. If the latter is true, the diagnostic test may not be useful in a setting in which less skilled interpretation of the test is likely to occur. Another important issue to consider is the similarity of your patient to those included in the study. The properties of a diagnostic test can change with different disease severity. For instance, the test may not perform as well in community practice, where less complicated cases must be distinguished from multiple competing diagnoses. On the other hand, in the study by Prickett et al., the patients were assessed in a referral practice setting (a university hospital). In that setting, surgeons are more likely to encounter patients with more severe or complicated disease, in whom the diagnostic test (ultrasound) is likely to perform better (likelihood ratio >> 1). In such a setting, alternative diagnoses may have already been explored and ruled out. Likelihood ratios tend to move away from the value of 1 when patients with the target disorder have severe disease and tend to move toward the value of 1 when patients with the target disorder have mild disease [1]. In general, however, if you practice in a setting similar to that presented in the study and your patient meets the study eligibility criteria, you can be confident when applying the results of the study to your patient. Once you have decided that the results are, in fact, applicable to your patient, you must decide whether they will change your management. Before making any decisions, surgeons must have a sense of what probabilities would confirm or refute the target diagnosis. For example, suppose you are willing to proceed with rotator cuff repair without further testing in patients with a probability of 90% or more of having a tear (realizing that you will be operating on 10% of patients unnecessarily). Moreover, let us suppose that you are willing to reject the diagnosis of a tear if the posttest probability is 10% or lower.
One may wish to apply different numbers here; the treatment and test thresholds are a matter of values (ideally, the patient's values), and they differ for different conditions depending on the risks of therapy (i.e., if the therapy is associated with severe side effects, you may want to be more certain of your diagnosis before recommending it) and the danger of the disease if left untreated (if the danger of missing the disease is high, such as with pulmonary embolism, you may want your posttest probability to be extremely low before abandoning diagnostic testing) (Fig. 2). Finally, you can ask yourself whether your patient will be better off having had the test. A test becomes more valuable when it has acceptable risks, the target disorder if left untreated has major consequences, and the target disorder can be readily treated if diagnosed. Ultrasound testing poses minimal risk to the patient and may be extremely valuable for ruling in or ruling out a rotator cuff tear.
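As an illustration of the threshold logic described above (the 10% and 90% cutoffs are the example values from the text, not universal constants), the decision can be sketched as:

```python
def next_step(posttest_prob: float,
              test_threshold: float = 0.10,
              treatment_threshold: float = 0.90) -> str:
    """Map a posttest probability onto the test/treatment thresholds.

    Hypothetical illustration: in practice the two thresholds depend on
    the risks of therapy, the danger of a missed diagnosis, and the
    patient's values, so they differ condition by condition.
    """
    if posttest_prob >= treatment_threshold:
        return "treat"                  # confident enough to operate
    if posttest_prob <= test_threshold:
        return "no further testing"     # diagnosis effectively ruled out
    return "test further"               # residual uncertainty remains

# A positive scan at a 50% pretest probability (posttest ~91%) crosses the
# treatment threshold; a negative scan (posttest ~8%) falls below the test
# threshold; intermediate probabilities call for more investigation.
print(next_step(0.91))  # treat
print(next_step(0.08))  # no further testing
print(next_step(0.53))  # test further
```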
Fig. 2. In every diagnostic process, surgeons identify a threshold probability of disease beyond which they will request additional tests or below which no further testing is necessary.
Conclusions

Application of the guides presented in this article can allow surgeons to critically assess studies about a diagnostic test. Surgeons are continuously exposed to a variety of new and innovative diagnostic tests and to the studies describing their diagnostic properties. Determining the validity of these studies, understanding the study results, and judging the applicability of these results to your patients are three fundamental steps toward choosing and interpreting diagnostic tests.

References

1. Jaeschke R, Guyatt GH, Sackett DL. Users' guide to the medical literature: how to use an article about a diagnostic test; are the results of the study valid? J.A.M.A. 1994;271:389–391
2. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. J.A.M.A. 1999;282:1061–1066
3. Fletcher R. Carcinoembryonic antigen. Ann. Intern. Med. 1986;104:66–73
4. Thompson DMP, Krupey J, Freedman SO, et al. The radioimmunoassay of circulating carcinoembryonic antigen of the human digestive system. Proc. Natl. Acad. Sci. U.S.A. 1969;64:161–167
5. Prickett W, Teefey S, Galatz L, et al. Accuracy of ultrasound imaging of the rotator cuff in shoulders that are painful postoperatively. J. Bone Joint Surg. Am. 2003;85:1084–1089
6. Kawasaki M, Hasegawa Y, Sakano S, et al. Prediction of osteonecrosis by magnetic resonance imaging after femoral neck fractures. Clin. Orthop. 2001;385:157–164
7. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983;39:207–215
8. Gray R, Begg CB, Greenes RA. Construction of receiver operating characteristic curves when disease verification is subject to selection bias. Med. Decis. Making 1984;4:151–164
9. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N. Engl. J. Med. 1978;299:926–930
10. Choi BC. Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. J. Clin. Epidemiol. 1992;45:581–586
11. Hamming JF, Goslings BM, van Steenis GJ, et al. The value of fine-needle aspiration biopsy in patients with nodular thyroid disease divided into groups of suspicion of malignant neoplasms on clinical grounds. Arch. Intern. Med. 1990;150:113–116
12. Fagan TJ. Nomogram for Bayes's theorem. N. Engl. J. Med. 1975;293:257
13. Parvizi J, Wayman J, Kelly P, et al. Combining the clinical signs improves diagnosis of scaphoid fractures. J. Hand Surg. [Br.] 1998;23:324–327
14. Hoffman JR, Mower WR, Wolfson AB, et al. Validity of a set of clinical criteria to rule out injury to the cervical spine in patients with blunt trauma: National Emergency X-Radiography Utilization Study Group. N. Engl. J. Med. 2000;343:94–99
15. Teefey SA, Hasan SA, Middleton WD, et al. Ultrasonography of the rotator cuff: a comparison of ultrasonographic and arthroscopic findings in one hundred consecutive cases. J. Bone Joint Surg. Am. 2000;82:498–504
16. Rubinstein RA Jr, Shelbourne KD, McCarroll JR, et al. The accuracy of the clinical examination in the setting of posterior cruciate ligament injuries. Am. J. Sports Med. 1994;22:550–557