Assessing Quality of a Diagnostic Test Evaluation

CYNTHIA D. MULROW, MD, MSc, WILLIAM D. LINN, PharmD, MARY K. GAUL, PharmD, JACQUELINE A. PUGH, MD

Objective: To develop a standardized scale for assessing the quality of a diagnostic test evaluation.
Design: Fourteen participants with formal and practical experience in evaluating diagnostic tests formed a consensus panel. Panel members identified and weighted questions that should be addressed when assessing the quality of a diagnostic test evaluation.
Setting: General internal medicine division at an academic medical center.
Results: A 19-item weighted scale was developed. It prioritizes and addresses issues such as description of the proposed purpose of the test; appropriate selection and description of the study population; appropriate performance and description of the diagnostic test; appropriate selection and performance of the reference standard; and adequate presentation of test characteristics.
Conclusions: The scale is proposed as a useful instrument for readers, investigators, reviewers, and editors, because it represents an updated synthesis of important criteria to consider when evaluating diagnostic tests. It can also be used to rate quantitatively the quality of diagnostic test evaluations.
Key words: diagnostic test; methodology; evaluation; quality control. J Gen Intern Med 1989;4:288-295.

THE PRACTICE OF MEDICINE has radically changed over the last 30 years, which has been reflected in increasing volumes of diagnostic tests being performed. Not only do clinicians order tests often in their practice, but the variety of tests available to be ordered is staggering. Every month physicians are confronted with new tests that are touted to diagnose better or monitor more closely common clinical conditions. Unfortunately for the clinician, studies assessing diagnostic tests often do not include all of the information necessary for the intelligent evaluation of the tests' performance and applicability. Confusing the clinician further, ideas about what is necessary for evaluating a diagnostic test have evolved, and guidelines for applying these ideas have proliferated.15, 16, 20, 23 Most guidelines have included lists of criteria necessary for proper evaluation of diagnostic tests. These criteria have not been uniform. Actual numbers of suggested criteria have ranged from 4 to 21. In addition, methods of development of
Received from the Division of General Internal Medicine, University of Texas Health Science Center and the Ambulatory Care Section, Audie L. Murphy Memorial Veterans' Administration Hospital, San Antonio, Texas. Supported by a Milbank Memorial Scholarship and an American College of Physicians' Teaching and Research Scholar Award. Address correspondence and reprint requests to Dr. Mulrow: Office of ACOS for Ambulatory Care (11C), Audie L. Murphy VA Hospital, 7400 Merton Minter Blvd., San Antonio, TX 78284.
criteria have not been well described, and quantitative techniques for assessing quality have not been incorporated. We used expert consensus techniques to develop a standardized scale that could be used to score the quality of an evaluation that aims to assess the discriminating ability of a diagnostic test. This scale is important for several reasons. First, readers, authors, editors, and manuscript reviewers can use it to identify critical methodologic components of articles and manuscripts addressing diagnostic test evaluations. This could lead to improved peer review processes. Second, investigators can use it to help formulate study designs of diagnostic test evaluations. This could result in more scientifically sound studies. Third, literature reviewers wishing to summarize information on a particular diagnostic test can use it to evaluate and rank the quality of existing pertinent literature. The reviewers as well as clinicians and health policymakers could then use the rankings to incorporate gradations or levels of evidence into their recommendations and practices.
SCALE DEVELOPMENT METHODS

Panel Members

Scale development was accomplished by 14 panel members (Appendix). All had practical experience in using diagnostic tests and nine had formal training in clinical epidemiology. Panel members were recruited from a research section meeting attended by staff from the Divisions of General Internal Medicine at the University of Texas Health Science Center and the Brooke Army Medical Center.
Identification and Weighting of Questions (Fig. 1)

Step 1: An interactive panel of five members who had studied relevant literature8, 13, 15-33 met in a series of three committee meetings and explicitly identified 16 questions that addressed the adequacy of a diagnostic test evaluation.

Step 2: A second, independent, panel of nine members was asked individually to complete an open-ended questionnaire based on guidelines for diagnostic test evaluations previously published in the McMaster series of Clinical Epidemiology Rounds.16 In this questionnaire, members were asked to comment on the relative importance of the eight published McMaster criteria; whether these criteria warranted further clarification or expansion; and whether additional criteria were needed. Two editors compiled answers from
JOURNAL OF GENERAL INTERNAL MEDICINE, Volume 4 (July/August), 1989
these questionnaires into 20 explicitly defined questions.

Step 3: As some questions identified in Step 1 and Step 2 were duplicative, all questions were combined into a single 28-item closed-ended questionnaire. This questionnaire included eight questions originally specified by the interactive panel only, eight questions specified by both panels, and 12 questions specified by the independent panel only.

Step 4: The 28-item questionnaire was sent to all 14 panel members. They were asked to assess whether the questions were appropriate and necessary to consider for reviewing the quality of a diagnostic test evaluation. Question assessments were scored as either paramount (absolutely essential criteria) or on a scale of 0 to 5, with 0 representing not important and 5 representing very important.

Step 5: Scores for each question were collated and returned as feedback to the 14 panel members. They were asked to rescore each of the 28 questions after referring to their own previous answers and the previous answers given by other panel members. In addition, for questions scored as paramount, panel members were now asked to indicate their degree of certainty in assigning that ranking.
Final Scoring and Editing

Scores for each of the 28 questions were averaged. (Because scores were obtained from Likert scales, which are summated rating scales that are commonly treated as interval data,42, 43 it was considered appropriate to average them.) In order to decide which questions should remain in the final questionnaire, two editors applied cutpoints (determined a priori) to the average scores. Questions that had been ranked paramount without reservation by more than half of the panel members were considered paramount in the final questionnaire. Questions receiving an average rating of less than 3 were not included. The remaining questions were retained along with their average scores. (Some of these scores were greater than 5, as some panel members ranked questions paramount even when the majority did not agree with this ranking. In this instance, the paramount ranking was given a score of 10 so that a numerical average for that question could be made.)

The above-described scoring process was computed for the rankings of all 14 panel members as a group, as well as separately for the rankings of the five original interactive panel members and the nine original independent panel members. To be able to compare relative rankings on individual questions, scores were standardized to a scale of 100 using the following formula:

Standardized score = (initial question score x 100) / (sum of initial question scores)

FIGURE 1. Schematic of scale development procedure for identifying and weighting questions: the interactive panel identifies 16 questions and the independent panel identifies 20 questions; the questions are combined into a single 28-item questionnaire; all panel members score the questionnaire; all panel members then rescore the questionnaire, taking into account previous scores.
Standardized scores were determined separately for the interactive panel, the independent panel, and the combined panel.
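To make the averaging and standardization steps concrete, they can be sketched in a few lines of Python. The individual panel ratings below are hypothetical (the article reports only the averaged results); the rules themselves, a "paramount" ranking counted as 10 for averaging and each average standardized as score x 100 divided by the sum of all scores, follow the text above.

```python
# Sketch of the averaging and standardization steps described above.
# Panel ratings here are hypothetical; the article reports only the
# final averaged scores, not individual members' ratings.

PARAMOUNT_VALUE = 10  # a "paramount" ranking is scored as 10 for averaging


def average_score(ratings):
    """Average one question's ratings; 'paramount' counts as 10."""
    numeric = [PARAMOUNT_VALUE if r == "paramount" else r for r in ratings]
    return sum(numeric) / len(numeric)


def standardize(avg_scores):
    """Standardized score = initial score * 100 / sum of initial scores."""
    total = sum(avg_scores.values())
    return {q: s * 100 / total for q, s in avg_scores.items()}


# Hypothetical ratings for three questions from four panel members
ratings = {
    "Q4": ["paramount", 5, 4, 5],
    "Q9": [4, 3, 5, 4],
    "Q13": [5, "paramount", 5, 4],
}
averages = {q: average_score(r) for q, r in ratings.items()}
standardized = standardize(averages)  # standardized values sum to 100
```

By construction, the standardized scores of the questions sum to 100, which is consistent with the combined-panel weights for the 16 non-paramount questions in Table 2.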
RESULTS AND SCALE PRESENTATION

The final standardized quantitative scale for assessing the quality of a diagnostic test evaluation is given in Table 1. This scale has 19 questions. Nine of 28 items in the first closed-ended questionnaire were deleted because they did not receive an overall average score of at least 3. Of the 19 remaining questions, eight had originally been identified by the interactive panel members only, three by the independent panel members only, and eight by members in both panels. The nine items that were deleted had originally been specified by members of the independent panel. More than half the members of the panels were certain that three questions (Qs. 1-3) should be considered paramount when assessing the quality of a diagnostic test evaluation. Results of a diagnostic test evaluation were considered interpretable and valid only if a comparison group was studied, if the diagnostic test was appropriately performed, and if an appropriate comparison or reference standard was used.
Mulrow et al., ASSESSING DIAGNOSTIC TESTS
TABLE 1
Average Quantitative Scores for Questions Assessing the Quality of Diagnostic Test Evaluations Given by Panel Group*

Question | Original Panel Identification | Interactive Panel Score | Independent Panel Score | Combined Panel Score

Paramount questions
1. Were individuals with and without disease included in the evaluation? | Interactive | 100% ranked paramount | 64% ranked paramount | 74% ranked paramount
2. Was the diagnostic test being evaluated appropriately performed in a standardized manner? | Both | 80% ranked paramount | 71% ranked paramount | 74% ranked paramount
3. Was an appropriate reference ("gold") standard used? | Both | 80% ranked paramount | 64% ranked paramount | 68% ranked paramount

Test purpose questions
4. Was the proposed use/purpose of the test described? | Interactive | 7.9 ± 2.9 | 7.6 ± 3.0 | 7.7 ± 3.0

Study population questions
5. Was the study population appropriate for evaluating the proposed use of the diagnostic test? | Interactive | 7.9 ± 2.9 | 7.6 ± 3.0 | 7.7 ± 3.0
6. Were the inclusion and exclusion criteria that were used to select study patients described? | Interactive | 6.3 ± 3.9 | 5.8 ± 1.8 | 6.0 ± 2.5
7. Was a wide spectrum of severity of diseased patients included in the case group? | Both | 5.6 ± 1.1 | 5.2 ± 0.4 | 5.3 ± 0.7
8. Were patients with a wide spectrum of comorbid diseases included in the control (nondiseased) group? | Both | 5.3 ± 0.0 | 5.0 ± 0.5 | 5.1 ± 0.6
9. Was an appropriate sample size considered? | Independent | 5.0 ± 1.1 | 4.0 ± 1.2 | 4.3 ± 1.2
10. Were demographic and clinical characteristics of study patients described? | Interactive | 4.8 ± 2.0 | 5.0 ± 0.5 | 4.8 ± 1.3
11. Were patients with comorbid conditions included in the case (diseased) group? | Interactive | 4.8 ± 0.7 | 4.4 ± 0.9 | 4.6 ± 0.8
12. Was the source of the study population described? | Both | 4.5 ± 1.2 | 4.5 ± 0.8 | 4.5 ± 0.9

Diagnostic test questions
13. Was a normal/abnormal test value adequately defined? | Both | 6.6 ± 0.0 | 8.9 ± 2.7 | 8.2 ± 3.0
14. Was the precision (reproducibility) of the test described? | Both | 5.3 ± 1.3 | 8.6 ± 3.2 | 7.6 ± 3.7

Reference standard questions
15. Were the interpretations of the reference ("gold") standard and of the diagnostic test applied independently (blindly)? | Both | 8.5 ± 4.5 | 7.4 ± 3.2 | 7.7 ± 3.4
16. Was the reference ("gold") standard appropriately performed in a standardized manner in all patients? | Interactive | 7.1 ± 3.4 | 6.5 ± 2.5 | 6.7 ± 2.7
17. Was a normal/abnormal reference ("gold") standard adequately defined? | Interactive | 6.1 ± 0.7 | 6.5 ± 2.5 | 6.4 ± 2.3

Results questions
18. Were data presented in enough detail to calculate appropriate test characteristics? | Independent | 9.0 ± 3.8 | 7.5 ± 3.1 | 8.0 ± 3.2
19. Were uninterpretable results enumerated and described? | Independent | 5.3 ± 1.3 | 5.5 ± 2.2 | 5.4 ± 2.1

*All scores standardized to a scale of 100.
Sixteen additional questions (Qs. 4-19) were weighted by interactive panel members, by independent panel members, and by combined panel members as important criteria to consider when examining the quality of a diagnostic test evaluation. One (Q. 4) addressed the proposed purpose of the test; eight (Qs. 5-12) addressed issues relating to the selected study population; five (Qs. 13-17) addressed descriptions of the diagnostic test being evaluated and the chosen reference standard; and two (Qs. 18, 19) addressed presentation of test characteristics. Average scores were generally consistent for the interactive and independent panel members, except for two questions (Qs. 13, 14) that referred to the precision of the diagnostic test and its definition of normal. In these instances, independent panel members' ratings were higher than interactive panel members' ratings. There was variability in the ratings of almost all questions, indicated by scores with standard deviations of greater than zero. Variability in rating did not appear to differ between interactive panel members and independent panel members.

Questions deleted from the final questionnaire were of two types. The first type (four questions) concerned specification of presentation of results using methods such as sensitivity and specificity, predictive values, likelihood ratios, and receiver operating characteristic curves. Panel members concluded that presentation of data in enough detail to calculate appropriate test characteristics (Q. 18) was more important than the actual summary method of presentation. The second type of questions that were deleted (five questions) concerned issues of utility. Examples of the deleted utility questions were: 1) Did the test demonstrate measurable benefits to the patient's health? 2) Was the discriminating ability of the test in various populations described? and 3) Was the cost-effectiveness of the test evaluated?
While all panel members thought that issues concerning the utility of a diagnostic test were very important, they disagreed on the definitions and interpretations of the utility questions and
on which were most important. In addition, they concluded that utility does not need to be considered in every single diagnostic test evaluation but rather should be considered before recommending widespread dissemination of a particular diagnostic test.
Applicability and Interobserver Reliability of the Scale

Applicability and reliability were evaluated by three reviewers (CM, MG, BL), who used the quantitative scale to rank independently a test set of 16 articles that addressed the captopril-stimulated renin secretion test for the diagnosis of renovascular hypertension.44 The reviewers found the scale's questions to be feasible and applicable for evaluating the quality of these articles. Pearson correlation coefficients between reviewer pairs (CM-MG, CM-BL, MG-BL) for their summary ranking scores on the 16 articles were 0.96, 0.93, and 0.93, respectively. Kappa values of 1.0 (perfect agreement) were obtained for 11 of the 19 questions in the scale; the remaining eight questions had kappa values that ranged from 0.66 to 0.85.45
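As a rough illustration of the two agreement statistics reported above, the following sketch computes a Pearson correlation coefficient and an unweighted Cohen's kappa from scratch. The reviewer answers below are hypothetical: the article does not publish the raw ratings, and it cites Light45 for agreement measures without specifying a kappa variant, so unweighted kappa is an assumption here.

```python
# Minimal sketches of the agreement statistics used above; the reviewer
# data are hypothetical, and unweighted Cohen's kappa is an assumption.
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cohens_kappa(r1, r2, categories=("yes", "no", "unclear")):
    """Unweighted Cohen's kappa for two raters' categorical answers."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


# Hypothetical: two reviewers answering one question across four articles
reviewer1 = ["yes", "no", "yes", "unclear"]
reviewer2 = ["yes", "no", "yes", "unclear"]  # full agreement -> kappa 1.0
```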
Scale Instructions

The scale can be used as a simple checklist of guidelines or to provide a quantitative summary score (Table 2). If one wished to use the scale to assess quantitatively the quality of a diagnostic test evaluation, one would apply the scale in the following manner: Questions would be applied to a diagnostic test evaluation and answered as yes, no, or unclear. A test evaluation that received a no or unclear answer on any of the three paramount questions would be considered an invalid evaluation. For the remaining 16 questions, yes answers would receive the corresponding weighted score made by the combined panel members. No or unclear answers would receive negative scores corresponding to the assigned weight of that question made by the combined panel members. The positive and negative points would then be summed for all 16 questions.
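The scoring procedure just described can be expressed as a short function. The weights are the combined-panel values from Table 2, question texts abbreviated to their numbers; the invalidation rule for the three paramount questions and the positive/negative weighting for questions 4-19 follow the instructions above.

```python
# Sketch of the scoring procedure described above, using the
# combined-panel weights from Table 2 (questions referred to by number).

PARAMOUNT_QS = {1, 2, 3}
WEIGHTS = {  # combined-panel weights for questions 4-19
    4: 7.7, 5: 7.7, 6: 6.0, 7: 5.3, 8: 5.1, 9: 4.3, 10: 4.8, 11: 4.6,
    12: 4.5, 13: 8.2, 14: 7.6, 15: 7.7, 16: 6.7, 17: 6.4, 18: 8.0, 19: 5.4,
}


def score_evaluation(answers):
    """answers: dict mapping question number (1-19) to 'yes'/'no'/'unclear'.

    Returns None if any paramount question fails (invalid evaluation);
    otherwise the summary score, which falls in [-100, +100].
    """
    if any(answers[q] != "yes" for q in PARAMOUNT_QS):
        return None  # invalid: fails a paramount criterion
    return round(
        sum(w if answers[q] == "yes" else -w for q, w in WEIGHTS.items()), 1
    )


# An evaluation answering yes to every question scores +100
perfect = {q: "yes" for q in range(1, 20)}
```

Note that the 16 weights sum to exactly 100, which is why summary scores fall in the stated range of -100 to +100.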
TABLE 2
Final Guidelines for Assessing the Quality of a Diagnostic Test Evaluation*

Question | Score weight if Yes | Score weight if No/Unclear

Paramount questions
1. Diseased and nondiseased patients included? | † | †
2. Test appropriately performed? | † | †
3. Appropriate reference standard? | † | †

Test purpose questions
4. Proposed use described? | +7.7 | -7.7

Study population questions
5. Appropriate population studied? | +7.7 | -7.7
6. Inclusion/exclusion criteria described? | +6.0 | -6.0
7. Wide spectrum of diseased patients included? | +5.3 | -5.3
8. Control (nondiseased) patients with comorbid diseases included? | +5.1 | -5.1
9. Sample size adequate? | +4.3 | -4.3
10. Patient characteristics described? | +4.8 | -4.8
11. Case (diseased) patients with comorbid diseases included? | +4.6 | -4.6
12. Population source described? | +4.5 | -4.5

Diagnostic test questions
13. Normal/abnormal defined? | +8.2 | -8.2
14. Test precision described? | +7.6 | -7.6

Reference standard questions
15. Interpretations of reference standard and test blinded? | +7.7 | -7.7
16. Reference standard appropriately performed? | +6.7 | -6.7
17. Normal/abnormal defined? | +6.4 | -6.4

Results questions
18. Data presented in adequate detail? | +8.0 | -8.0
19. Uninterpretable results enumerated? | +5.4 | -5.4

*To score the questionnaire, circle the weight corresponding to the question. ("Yes" answers should receive positive scores and "No" or unclear answers should receive negative scores.) Sum all of these circled scores to derive the TOTAL SCORE, which should range from -100 to 100.
†If a diagnostic test evaluation receives a "No" or unclear answer for these questions, it should be considered an invalid assessment of the discriminating ability of the test.

Possible summary scores would range from -100 to +100. This scoring mechanism uses zero as a cut-off point whereby a diagnostic test evaluation that receives a positive score has met more criteria correctly than incorrectly, and an evaluation receiving a negative score has met more criteria incorrectly than correctly.

DISCUSSION

Recently, Begg reported finding wide variation in the manner in which studies concerning diagnostic tests are conducted and their results presented.46 Further research on providing guidelines for designing, analyzing, and reporting results was called for to improve the literature as a reliable source of information about diagnostic tests. With this goal in mind, we present a scale for assessing the quality of a diagnostic test evaluation that differs from previously published criteria in several ways. First, it was developed using explicit consensus methods. Second, it can be used to quantify the quality of a diagnostic test evaluation reliably. Third, it represents a synthesis, but not a replication, of many previously published criteria, since it both deletes some old criteria and adds some new criteria.

The consensus process involved participants who had practical experience in using diagnostic tests and formal training in evaluating diagnostic tests. Their consensus ratings allowed expression of variability as well as standardization. Variability was particularly noteworthy when panel members had difficulty in defining and agreeing upon explicit criteria to consider when evaluating the utility of a particular diagnostic test evaluation. Although issues relating to cost-effectiveness and overall patient benefit were considered important, panel members were unable to reach a consensus regarding what actually constitutes utility. We believe this variability has identified an area that warrants future research in the methodology of diagnostic test evaluations.

Because the scale has a quantitative component, it can be used differently than previous methods for evaluating the quality of a diagnostic test evaluation. Literature reviewers who utilize meta-analysis and quantitative techniques in their review processes could use the scale to rate quality numerically.
Health policymakers and clinicians could benefit by incorporating gradations of quality of evidence as identified by the scale into their recommendations.

The content of the scale differs from other proposed guidelines15, 16, 20, 23 in that several new criteria are added and some previously proposed criteria are deleted. (Table 3 summarizes previously proposed criteria.) Questions in the new guidelines can be categorized as addressing issues of paramount importance (Qs. 1-3), the proposed purpose of the test (Q. 4), the selection and description of the study population (Qs. 5-12), the description and performance of the diagnostic test (Qs. 13, 14), the description and performance of the reference or "gold standard" (Qs. 15-17), and presentation of results (Qs. 18, 19).

The question concerning specification of the proposed use of the diagnostic test (Q. 4) has not been previously mentioned as an important guideline. It is
TABLE 3
Summarization of Existing Guidelines for Assessing Diagnostic Test Evaluations

McMaster Department of Clinical Epidemiology and Biostatistics16
1. Independent blind comparison with "gold standard"?
2. Appropriate spectrum of disease and commonly confused disorders included?
3. Study setting and filter described?
4. Test reproducibility and observer variability determined?
5. Normal defined?
6. If test was part of a cluster, was contribution to overall cluster determined?
7. Test tactics exactly replicable?
8. Utility determined?

Riegelman RK15
1. "Gold standard" defined?
2. Positive and negative tests defined?
3. Test results blindly interpreted?
4. Data displayed in tabular form?
5. Sensitivity and specificity used correctly?
6. Predictive values or likelihood ratios used correctly?
7. Recognition of influence of setting, prevalence, or pretest likelihood on clinical utility?

Sheps SB, Schechter MT23
1. Representative clinical population studied?
2. All tests performed on all subjects at same point in clinical course?
3. Independent "gold standard" used?
4. Test performance compared using ROC curve?

Robertson EA, Zweig MH, Van Steirteghem AC20
1-8. Eight questions addressing the inherent properties of a test, including independent and technical reproducibility, interobserver and intraobserver variability, and experimental and clinical accuracy.
9-15. Seven questions addressing the range of normal and variability of the disease-free population.
16-17. Two questions addressing variability of the diseased population.
18-21. Four questions addressing the discriminatory ability of the test.
considered particularly important, as diagnostic tests often serve more than one purpose. The evaluation and interpretation of the performance of the diagnostic test is dependent upon the proposed use of the test.

Several of the questions regarding the study population warrant further explication. The question concerning sample size specification (Q. 9) is a guideline that has not been included in previous criteria. It is important to consider in order precisely to estimate the true discriminating ability of a diagnostic test. Explicit questions addressing the spectrum of disease and comorbidity that is included in the study population (Qs. 7, 8, 11) are necessary to make appropriate generalizations of the data. In some instances, it might be helpful to include a wide spectrum of patients in the study population and to present subset analyses of results that would enable clinicians better to assess how a diagnostic test might perform in their particular types of patients. Another newly proposed criterion is the enumeration of uninterpretable results (Q. 19), which is important to evaluate the potential yield of the diagnostic test. Specific ways for presenting data suggested by previous authors (i.e., sensitivity, specificity, predictive values, receiver operating characteristic curves) are not required in the new guidelines.15, 20, 23 Rather, it is suggested that the style of the presentation of results may vary so long as data are presented in enough detail to calculate test characteristics. Finally, determination of the utility of the test and evaluation of sequential test contributions, which have also been previously proposed as important criteria, are not included.16 More thorough definition of these areas is needed before routinely recommending their use in assessing diagnostic test evaluations.31

Primary limitations of the scale relate to issues of applicability and flexibility. The scale may not be applicable to the early phases of evaluating a diagnostic test when determination of technologic capabilities is the prime concern.29, 31, 34 It is applicable only to a single diagnostic test evaluation that aims to assess how well the test separates diseased from nondiseased patients. Moreover, it is not meant to be used as a criterion for determining whether a diagnostic test should be widely disseminated and used. Before recommending the widespread use of a particular diagnostic test, a series of evaluations, some of which address clinical utility issues, should be conducted.29, 31, 34 Also, the quantitative weights assigned to individual questions may warrant modification in certain circumstances. For example, when considering whether a particular test is useful in ruling in a diagnosis, it may be more important to include a wide spectrum of comorbid diseases in the control group (Q. 8) than it is to include a wide spectrum of severities of disease in the case group (Q. 7). In this instance, it would be reasonable to weight question 8 higher than question 7.

Despite limitations, we believe the scale can be used to help appraise the quality of most diagnostic test evaluations. It is applicable for assessing the many individual studies or groups of studies that are concerned primarily with evaluating the accuracy of a diagnostic test. This use was demonstrated by reliably applying the scale to a group of 16 articles that addressed the accuracy of the captopril-stimulated renin secretion test for the diagnosis of renal vascular hypertension.
In summary, consensus methods have been used to develop guidelines for assessing the quality of a diagnostic test evaluation. These guidelines can be a useful tool for readers, researchers, reviewers, and editors because they represent an updated synthesis of important criteria to consider when evaluating a diagnostic test, and because they can be used to rate quality quantitatively. Application of the guidelines may lead to a greater awareness of the vital elements of a diagnostic test evaluation, more scientifically sound studies addressing diagnostic tests, and improved peer-review processes.

The authors thank Dr. Christine Aguilar and Ms. Liz Pease for technical assistance, and Michael Tuley for statistical assistance.
REFERENCES
1. Vecchio TJ. Predictive value of a single diagnostic test in unselected populations. N Engl J Med 1966;274:1171-3.
2. Feinstein AR. Clinical judgment. New York: Robert E. Krieger Publishing Co., 1967;72-127.
3. Galen RS, Gambino SR. Beyond normality: the predictive value and efficacy of medical diagnoses. New York: John Wiley and Sons, 1975.
4. Koran LM. The reliability of clinical methods, data and judgments. N Engl J Med 1975;293:642-6, 695-701.
5. McNeil BJ, Keeler E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293:211-5.
6. Wulff HR. Rational diagnosis and treatment. 2nd ed. Oxford: Blackwell Scientific Publications, 1976;78-117.
7. Feinstein AR. Clinical biostatistics. St. Louis: C. V. Mosby, 1977.
8. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-9.
9. Swets JA, Pickett RM, Whitehead SF, et al. Assessment of diagnostic technologies. Science 1979;205:753-9.
10. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary-artery disease. N Engl J Med 1979;300:1350-7.
11. Weinstein MC, Fineberg HV. Clinical decision analysis. Philadelphia: W. B. Saunders, 1980.
12. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med 1980;302:1109-17.
13. Philbrick JT, Horwitz RI, Feinstein AR. Methodologic problems of exercise testing for coronary artery disease: groups, analysis and bias. Am J Cardiol 1980;46:807-12.
14. Eddy DM. Screening for cancer: theory, analysis and design. Englewood Cliffs, NJ: Prentice-Hall, 1980;26-96.
15. Riegelman RK. Studying a study and testing a test: how to read the medical literature. Boston: Little, Brown, 1981;93-149.
16. Department of Clinical Epidemiology and Biostatistics, McMaster University Health Science Centre. How to read clinical journals: II. To learn about a diagnostic test. Can Med Assoc J 1981;124:703-10.
17. Griner PF, Mayewski RJ, Mushlin AI, Greenland P. Selection and interpretation of diagnostic tests and procedures: principles and applications. Ann Intern Med 1981;94:557-600.
18. Fletcher RH, Fletcher SW, Wagner EH. Clinical epidemiology: the essentials. Baltimore: Williams and Wilkins, 1982.
19. Philbrick JT, Horwitz RI, Feinstein AR, Langou RA, Chandler JP. The limited spectrum of patients studied in exercise test research. JAMA 1982;248:2467-70.
20. Robertson EA, Zweig MH, Van Steirteghem AC. Evaluating the clinical efficacy of laboratory tests. Am J Clin Pathol 1982;79:78-86.
21. Diamond GA, Forrester JS. Metadiagnosis: an epistemologic model of clinical judgment. Am J Med 1983;75:129-37.
22. Ingelfinger JA, Mosteller F, Thibodeau LA, Ware JH. Biostatistics in clinical medicine. New York: Macmillan, 1983;1-45.
23. Sheps SB, Schechter MT. The assessment of diagnostic tests: a survey of current medical research. JAMA 1984;252:2418-22.
24. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Boston: Little, Brown, 1985;3-155.
25. Feinstein AR. Clinical epidemiology: the architecture of clinical research. Philadelphia: W. B. Saunders, 1985;597-631.
26. Doubilet PM, Cain KC. The superiority of sequential over simultaneous testing. Med Decis Making 1985;5:447-51.
27. Richardson DK, Schwartz JS, Weinbaum PJ, Gabbe SG. Diagnostic tests in obstetrics: a method for improved evaluation. Am J Obstet Gynecol 1985;152:613-8.
28. Rozanski A, Diamond GA, Forrester JS, et al. Should the intent of testing influence its interpretation? J Am Coll Cardiol 1986;7:17-24.
29. Guyatt G, Drummond M, Feeny D, et al. Guidelines for the clinical and economic evaluation of health care technologies. Soc Sci Med 1986;22:393-408.
30. Griner PF, Panzer RJ, Greenland P. Clinical diagnosis and the laboratory: logical strategies for common medical problems. Chicago: Year Book Medical Publishers, 1986;1-44.
31. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB, Drummond M. A framework for clinical evaluation of diagnostic technologies. Can Med Assoc J 1987;134:587-94.
32. Sox HC. Common diagnostic tests: use and interpretation. Philadelphia: American College of Physicians, 1987;1-15.
33. Hlatky MA, Mark DB, Harrell FE, Lee KL, Califf RM, Pryor DB. Rethinking sensitivity and specificity. Am J Cardiol 1987;59:1195-8.
34. Nierenberg AA, Feinstein AR. How to evaluate a diagnostic marker test: lessons from the rise and fall of dexamethasone suppression test. JAMA 1988;259:1699-1702.
35. Detrano R, Janosi A, Lyons KP, Marcondes G, Abbassi N, Froelicher VF. Factors affecting sensitivity and specificity of a diagnostic test: the exercise thallium scintigram. Am J Med 1988;84:699-710.
36. Arroll B, Schechter MT, Sheps SB. The assessment of diagnostic tests: a comparison of medical literature in 1982 and 1985. J Gen Intern Med 1988;3:443-7.
37. Poynard T, Chaput JC, Etienne JP. Relations between effectiveness of a diagnostic test, prevalence of the disease, and percentages of uninterpretable results. Med Decis Making 1982;2:285-302.
38. Begg CB, Greenes RA, Iglewicz B. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis 1986;39:575-84.
39. Begg CB, Greenes RA. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988;167:565-9.
40. Boyko EJ, Alderman BW, Baron AE. Reference test errors bias the evaluation of diagnostic tests for ischemic heart disease. J Gen Intern Med 1988;3:476-81.
41. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-98.
42. Kerlinger FN. Foundations of behavioral research: educational and psychological inquiry. New York: Holt, Rinehart and Winston, 1973.
43. Blalock HM, Blalock AB. Methodology in social research. New York: McGraw-Hill, 1968.
44. Gaul MK, Linn WD, Mulrow CD. Captopril stimulated renin secretion in the diagnosis of renal vascular hypertension. Am J Hypertens 1988;1:73.
45. Light RJ. Measures of response agreement for qualitative data: some generalizations and alternatives. Psych Bull 1971;76:365-77.
46. Begg CB. Methodologic standards for diagnostic test assessment studies. J Gen Intern Med 1988;3:518-20.
JOURNAL OF GENERAL INTERNAL MEDICINE, Volume 4 (July/August), 1989
APPENDIX

Interactive Group

Cynthia D. Mulrow, MD, MSc (editor), Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
William D. Linn, PharmD (editor), Division of Clinical Pharmacy, University of Texas Health Science Center at San Antonio
Mary K. Gaul, PharmD, Division of Clinical Pharmacy, University of Texas Health Science Center at San Antonio
Jacqueline A. Pugh, MD, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Valerie A. Lawrence, MD, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio

Independent Group

Meghan Gerety, MD, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Judy Hill, MD, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Kurt Kroenke, MD, General Medicine Division, Brooke Army Medical Center
John Sawyer, MD, MPH, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
John Simmons, MD, MPH, General Medicine Division, Brooke Army Medical Center
Richard Bauer, MD, MSc, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Margie Sunderland, MD, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Andrew Diehl, MD, MSc, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
Ramon Velez, MD, MSc, Division of General Internal Medicine, University of Texas Health Science Center at San Antonio
In Memoriam

Mack Lipkin, Sr., 1907-1989

WE ARE SAD to announce the death of our eldest Editorial Board member, Mack Lipkin, Sr. Mack died peaceably at home on April 4, 1989, after a few hours' illness. Mack was a founding member of the Journal's Editorial Board, and perhaps its most active contributor. In his usual manner, the day before he died he was in the Journal office, checking on the progress of some manuscripts. Had the Editors agreed, he would have happily contributed to every issue. As it was, he gave of his clinical wisdom and writing talent in many ways: reviewing numerous manuscripts and writing editorials, book reviews, and a particularly perceptive piece on presbycusis. Most of all, he gave the Journal, and everything else he touched, a stamp of excellence. -- The Editors