Eur. Radiol. 8, 484–487 (1998) © Springer-Verlag 1998
European Radiology

European health policy

The meta-analysis of diagnostic test studies

J.-P. Boissel, M. Cucherat

Clinical Pharmacology Department, EA 643, Claude Bernard University, F-69003 Lyon, France
Introduction

Meta-analysis is a recognized means of evaluating the safety and efficacy of therapy in clinical trials [1]. However, the meta-analytic approach has had little application in the field of diagnostic tests, despite the obvious need for a global assessment of the properties of a given test. The main advantage of meta-analysis in clinical trials is to provide overviews with a quantitative and reproducible summary of the efficacy (or safety) of a therapy that has been tested in various settings. The same advantage can, in theory, apply to diagnostic tests. The rationale for meta-analyzing the accuracy of a diagnostic test is therefore: (1) to obtain a reproducible summary estimate of the accuracy; and (2) to explore the factors affecting this estimate.

However, the medical literature on diagnostic tests is characterized by poor design standards and poor reporting of results, which has led to wide variability in the results of similar studies and a lack of credibility of the conclusions drawn [2]. By contrast, the methodology of clinical trials is more standardized and widely accepted. The problems stem from the many potential biases and sources of variation in the results, many of which are subtle issues that can easily escape the attention of investigators and readers without dedicated training in the field. This is certainly one reason for the little attention paid to the meta-analytic approach in the diagnostic test domain. There are, however, other reasons that make application of the method less straightforward [3]. These reasons make it unlikely that test accuracy will be constant across studies. In a series of clinical trials of the same therapy with similar eligibility criteria, it is reasonable to assume that the treatment efficacy is the same, except for sampling fluctuations across trials. A similar assumption does not hold, in general, for a given diagnostic test.

Correspondence to: J.-P. Boissel, 162, Av. Lacassagne, F-69003 Lyon, France
Measures of test efficacy

Introduction

The two most commonly used measures are specificity and sensitivity. They are simple and easily interpreted by users of diagnostic tests. However, there are important drawbacks. The whole setting is described in Table 1. The test usually gives a continuous, not binary, outcome. Thus, there is a more or less arbitrary cut-off value on the continuous scale that dichotomizes the test outcome space into positive and negative results. The classification into diseased and non-diseased subjects relies on another test, assumed to be more appropriate. The situation depicted in Table 1 is shown in Fig. 1 when all the subjects in the population are considered, which is never the case in practice. The cell counts a, b, c and d depend on the threshold, or cut-off value, on the scale of the test outcome. If the cut-off value is shifted to the right, a and c will decrease whereas b and d will increase. Hence, the test results as summarized in Table 1 are not stable for a given diagnostic test.
Table 1. The four-fold table of test results by disease status

                    Disease present    Disease absent
Test positive             a                  c
Test negative             b                  d
Total                 Npr = a + b        Nab = c + d
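The dependence of the cell counts on the cut-off value can be illustrated with a small simulation. The Gaussian outcome distributions below are purely hypothetical, chosen only to mimic the situation of Fig. 1; they are not data from any study.

```python
import random

random.seed(0)
# Hypothetical continuous test outcomes: non-diseased subjects centred at 0,
# diseased subjects centred at 2 (arbitrary values for illustration only).
non_diseased = [random.gauss(0.0, 1.0) for _ in range(1000)]
diseased = [random.gauss(2.0, 1.0) for _ in range(1000)]

def fourfold(cutoff):
    """Cell counts a, b, c, d of Table 1 for a given cut-off value."""
    a = sum(x > cutoff for x in diseased)        # true positives
    b = len(diseased) - a                        # false negatives
    c = sum(x > cutoff for x in non_diseased)    # false positives
    d = len(non_diseased) - c                    # true negatives
    return a, b, c, d

for cutoff in (0.5, 1.0, 1.5):
    a, b, c, d = fourfold(cutoff)
    print(f"cut-off {cutoff}: a={a} b={b} c={c} d={d}")
```

As the cut-off value moves to the right, a and c shrink while b and d grow, exactly as described in the text.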
Definitions

Sensitivity

Sensitivity is the proportion of diseased patients correctly classified, i.e., with a positive test result. Using the symbols from Table 1, the sensitivity is:

Se = a / (a + b) = a / Npr

It is also called the true positive rate (TPR).
Specificity

Specificity is the proportion of non-diseased subjects correctly classified, i.e., with a negative test result. Thus:

Sp = d / (c + d) = d / Nab

The specificity is related to the false-positive rate (FPR), which is the proportion of subjects without the disease who have a positive test result:

Sp = 1 − FPR
Odds ratio

The odds ratio (OR) is defined as the ratio of the odds of a true positive over the odds of a false positive:

OR = [TPR / (1 − TPR)] / [FPR / (1 − FPR)]
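As a numerical sketch of these definitions (the four-fold counts below are hypothetical):

```python
# Summary measures of a diagnostic test from a four-fold table.
# Hypothetical counts, following Table 1:
# a = diseased, test positive; b = diseased, test negative;
# c = non-diseased, test positive; d = non-diseased, test negative.
a, b, c, d = 90, 10, 20, 80

se = a / (a + b)             # sensitivity = true positive rate (TPR)
sp = d / (c + d)             # specificity
fpr = 1 - sp                 # false-positive rate
odds_ratio = (se / (1 - se)) / (fpr / (1 - fpr))

print(f"Se = {se:.2f}, Sp = {sp:.2f}, FPR = {fpr:.2f}, OR = {odds_ratio:.1f}")
# → Se = 0.90, Sp = 0.80, FPR = 0.20, OR = 36.0
```

Note that the odds ratio can equivalently be computed as (a·d)/(b·c) directly from the table cells.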
Problems with these parameters

Dependency of the four-fold table on the disease prevalence

The usual summary indexes that can be computed from the cells of a four-fold table, i.e., the odds ratio or the relative risk, depend on the disease prevalence in the considered population, whereas in clinical trials they can be assumed to be constant across trials. This dependency can be derived from the equations:

Se = a / Pd   and   Sp = d / (N − Pd)

where Pd is the number of diseased subjects (the prevalence of the disease times the total number of subjects N). Thus, two indexes, i.e., Se and Sp, are required to summarize a diagnostic test evaluation, instead of a single one as with clinical trials.

Dependency of Se and Sp on the threshold

Although Se and Sp do not depend on the studied population, they depend heavily on the threshold. The reason is that a binary test outcome such as the one sketched in Table 1 results from the dichotomization of a continuous or ordinal scale, as shown in Fig. 1. Two investigators may choose two different cut-off values, for instance because they try to optimize the observed test efficacy. In situations such as that of Fig. 1, if the threshold increases, the sensitivity will decrease and the odds of a true positive will decrease even more, whereas the specificity will increase, the false-positive rate will decrease, and the odds of a false positive will decrease even more. Thus, depending on the threshold, the sensitivity can be increased, but consequently the specificity will decrease, or the test can be made more specific, but consequently the sensitivity will decrease. The odds ratio, however, will vary less. The receiver operating characteristic (ROC) curve is another way to accommodate alternative classifications: it is a plot of all possible pairs of Se and 1 − Sp.

Fig. 1 [Distributions of the test outcome in non-diseased and diseased subjects, with the cut-off value dichotomizing the results; figure not reproduced]

Bias

The reader interested in the potential biases in the assessment of a diagnostic test will find a sensible account in a paper by Begg [2]. We summarize this account here.

1. Disease status: for a disease with several stages in terms of both severity and clinical presentation, minute variations in the stage mix can induce large changes in the accuracy of a diagnostic test.
2. Selection through verification: the reference diagnosis may require an investigation which, for various reasons, cannot be applied to all patients. This selection can introduce a bias.
3. Exclusion of uninterpretable test results.
4. Interobserver variation: the issue here is the generalization of the results. Both sensitivity and specificity depend on the accuracy of the observer. One cause of interobserver variation is the degree of knowledge of the clinical history of the patients: sensitivity and specificity will differ depending on whether or not the observer is kept blind to the clinical history.
5. Temporal changes due to technical improvements of the diagnostic test.
6. The accuracy of the reference test.

Non-uniqueness of test efficacy parameters

Although sensitivity and specificity are commonly used to describe test accuracy, they by no means summarize all the interesting features and properties of a test. Cost, ease of operation, lack of harmful effects and acceptability are important too. Hence, one cannot compare two tests merely on their ROC curves, let alone on sensitivity and specificity, which are inappropriate for this purpose because of their dependence on the cut-off value. Actually, what matters with a test is how much it improves the efficacy of the treatment of the disease it is supposed to identify. Hence, the comparison of diagnostic tests would better be based on randomized trials of care strategies.

How to get rid of these problems

A potentially useful approach is to use an odds ratio to summarize the four-fold table and to adjust its value for covariates, such as the threshold, the patient population characteristics and the disease status [2].

Principles of the meta-analysis of diagnostic test studies

In order to obtain a reliable estimate of the common test accuracy and of the weight of the covariates, the process of meta-analyzing diagnostic test studies should meet a few basic rules, which are listed below.
Completeness of data collection

A meta-analysis, in the current use of the word in medicine, is assumed to summarize all the available evidence concerning the considered issue. Hence, the authors should have collected all the published and unpublished data. If this is not the case, at best the common estimate will not be as precise as possible at the time the meta-analysis is done; at worst, it will be biased. The practical consequences of a biased estimate of the accuracy of a diagnostic test may be enormous in terms of both undue burden for the patients and unnecessary costs.
Proper and unbiased selection of studies

Since the quality, and hence the strength of evidence, of the material varies tremendously across studies, those with low quality scores should not be incorporated in the summary, or the results would be inaccurate. However, the selection should not depend on the observed test accuracy, or the final results will be biased.
Appropriate review technique

The model underlying the summarizing technique should fit both the data structure and the objectives of the meta-analysis. Unfortunately, the real data structure is mostly unknown. An important issue is the way the adopted technique deals with the proven, or quite likely, heterogeneity of the studies in terms of design, populations, reference tests, cut-off values and disease states.
Steps of a meta-analysis

Formulation of the problem

As in any scientific endeavor, the first step consists of careful consideration of the issue that has been brought up. The process will differ somewhat depending on whether the aim is to answer the question of the reality of the efficacy of the test, or to evaluate the most cost-effective cut-off value (and its covariates).
Writing the protocol

A meta-analysis is a rather complex process, with multiple choices and a series of steps that each require a precise algorithm defined beforehand. Every step should be thought out, decided upon and elaborated prior to the beginning of the practical work.
Data collection

One of the most awkward issues in meta-analysis is certainly data collection for, as said above, it determines the accuracy of the output. The difficulty with data collection has two dimensions. First, not all studies are published. In the clinical domain, it has been shown that unpublished data are more often negative [4, 5]. No similar evidence exists for diagnostic test studies, but a similar trend is likely to exist in this domain as well.
Data extraction

The reports of diagnostic test studies are not always fully informative about what exactly was done. In addition, there is an inter-reader variability in the data extracted from a given report. There is also a possibility of result-dependent extraction. The two steps necessary to minimize errors and bias are to have at least two independent readers, with an adjudication procedure for discrepancies, and to use a form listing all the required variables. Blind reading should be considered. It requires presenting the reports in an even format, with the details thought to influence the reader deleted. However, this means a lot of work, and the benefits of proceeding in this way have not been established.
Study selection

This step is the second difficult one. The selection criteria should have been decided upon prior to the start and clearly stated in the protocol. They essentially cover the quality dimension. The issues here are not to include flawed data and not to bias the whole process. Following the exclusion of studies that do not meet the criteria expressing their appropriateness with respect to the objective of the meta-analysis, it is sensible to categorize the remaining studies into, say, three subgroups: bad quality, uncertain quality and proper quality.

Choice of the statistical pooling approach

For meta-analyzing diagnostic test accuracy data, the number of available techniques is limited (see below), whereas for clinical trials there are several techniques with different underlying properties. However, that does not mean the choice is straightforward. A decision should be made on the parameter(s), e.g., ROC curve or odds ratio, on the model linking the parameter with the covariates, and on the covariates themselves, or on the use of a more conventional technique for pooling odds ratios. In the latter case, the choice between a fixed-effect model, which assumes no heterogeneity between the studies, and a random-effect model should be made (see below).

Exploration of the heterogeneity

Unless a method such as the Mantel-Haenszel or a DerSimonian and Laird-like approach has been used, there is no direct way to assess heterogeneity between studies in a meta-analysis of diagnostic test studies. The indirect way is to explore the statistical significance of each of the covariates in the multivariate model or, if this model has the structure of a multilinear regression, to look at the value of the residual variance.

Sensitivity analysis

If a multiple-category score for study quality has been chosen, the sensitivity analysis consists of exploring the stability of the final estimate across the categories. Indirectly, if a covariate expressing the study quality was incorporated in the model, its weight is given by the statistical significance of its coefficient in the resulting fit. If it is statistically different from zero, it seems wise to exclude the studies of lower quality in a second-step analysis.

Presentation of the results

If the odds ratio is the parameter of interest, a graphical presentation similar to the one popular for clinical trial meta-analyses can be used [1, 7].
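The fixed- versus random-effect choice and the DerSimonian and Laird approach mentioned above can be sketched as follows; the per-study log odds ratios and their variances are hypothetical numbers for illustration only.

```python
import math

# Hypothetical per-study log odds ratios and within-study variances.
log_or = [3.2, 2.5, 2.9, 2.1, 3.6]
var = [0.30, 0.25, 0.40, 0.20, 0.50]

# Fixed-effect (inverse-variance) pooled estimate.
w_fixed = [1 / v for v in var]
sw = sum(w_fixed)
mu_fixed = sum(wi * yi for wi, yi in zip(w_fixed, log_or)) / sw

# Cochran's Q statistic for heterogeneity.
Q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w_fixed, log_or))
k = len(log_or)

# DerSimonian-Laird moment estimate of the between-study variance tau^2.
tau2 = max(0.0, (Q - (k - 1)) / (sw - sum(wi ** 2 for wi in w_fixed) / sw))

# Random-effect weights and pooled estimate.
w_rand = [1 / (v + tau2) for v in var]
mu_rand = sum(wi * yi for wi, yi in zip(w_rand, log_or)) / sum(w_rand)
se_rand = math.sqrt(1 / sum(w_rand))

print(f"tau^2 = {tau2:.3f}")
print(f"pooled OR (random effect) = {math.exp(mu_rand):.1f}")
print(f"95% CI: {math.exp(mu_rand - 1.96 * se_rand):.1f} "
      f"to {math.exp(mu_rand + 1.96 * se_rand):.1f}")
```

When tau^2 is estimated as zero, the random-effect estimate reduces to the fixed-effect one.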
Interpretation

A meta-analysis provides figures: the common estimate of the parameter with its confidence interval, and a model integrating the covariates. The clinician reading the report may be lost with this material if the authors do not translate it into practical terms. Also, the findings from the model fit should be interpreted in terms of the dependence of the test accuracy on the practical setting of the test, e.g., population, disease state or test threshold. Tables may be of some help in that regard.
Statistical methods

Introduction

The potential biases and the sources of accuracy variation outlined above suggest the following specifications for any statistical method for meta-analyzing diagnostic test accuracy:
1. Computation of a summary estimate of test accuracy,
2. Adjusted for factors such as the cut-off value, the patient population (including disease status) and subgroups with different test efficacy.
Binary test results

The current methods applied in the meta-analysis of clinical trial results are not appropriate in this setting, since it is likely that the test efficacy summaries, even the odds ratio, will vary from one study to another. A regression approach is more appropriate. It has been proposed by Moses [6] to use logit(TPR) and logit(FPR), which have interesting properties, and to model D as a linear function of S, with:

D = logit(TPR) − logit(FPR) = log(OR)
S = logit(TPR) + logit(FPR)

The model is:

D = a + bS

The model can be transformed back to the conventional axes of TPR against FPR, and a summary ROC curve can be obtained. The regression coefficient b provides an estimate of the extent to which the logarithm of the odds ratio depends on the threshold used [6]. If the coefficient is near zero, the common odds ratio is given by exp(a) and does not depend on the threshold. In such a case, conventional techniques commonly used for the meta-analysis of clinical trials, such as the Mantel-Haenszel procedure, can be used [7]. Those based on a random-effect model, such as the one proposed by DerSimonian and Laird [8], are considered more appropriate in this setting. The outcome is a single summary value for the test efficacy. In the other cases, the above model can be fitted using conventional least-squares methods, preferably weighted by the inverse of the variance of the logarithm of the odds ratio. Covariates, such as the quality of the studies or the profile of the studied populations, can be added to the model (multilinear regression approach).
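A minimal sketch of the Moses summary ROC approach. The four-fold tables are hypothetical, and the inverse-variance weighting of the regression is one of several options discussed in the original papers, not the only possible choice.

```python
import math

# Hypothetical per-study four-fold tables (a, b, c, d), as in Table 1.
studies = [
    (45, 5, 10, 40),
    (30, 10, 8, 52),
    (60, 15, 20, 55),
    (25, 5, 12, 38),
]

def logit(p):
    return math.log(p / (1 - p))

D, S, w = [], [], []
for a, b, c, d in studies:
    tpr = a / (a + b)
    fpr = c / (c + d)
    D.append(logit(tpr) - logit(fpr))   # log odds ratio
    S.append(logit(tpr) + logit(fpr))   # proxy for the threshold
    # Weight by the inverse variance of log(OR): 1/a + 1/b + 1/c + 1/d.
    w.append(1 / (1 / a + 1 / b + 1 / c + 1 / d))

# Weighted least-squares fit of D = alpha + beta * S.
sw = sum(w)
s_mean = sum(wi * si for wi, si in zip(w, S)) / sw
d_mean = sum(wi * di for wi, di in zip(w, D)) / sw
beta = (sum(wi * (si - s_mean) * (di - d_mean) for wi, si, di in zip(w, S, D))
        / sum(wi * (si - s_mean) ** 2 for wi, si in zip(w, S)))
alpha = d_mean - beta * s_mean

# If beta is near zero, exp(alpha) estimates a threshold-independent common OR.
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, exp(alpha) = {math.exp(alpha):.1f}")

def sroc_tpr(fpr):
    """Summary ROC curve: solve D = alpha + beta*S for TPR at a given FPR."""
    lf = logit(fpr)
    return 1 / (1 + math.exp(-((alpha + (1 + beta) * lf) / (1 - beta))))

print(f"SROC: TPR at FPR = 0.10 -> {sroc_tpr(0.10):.2f}")
```

The back-transformation used in `sroc_tpr` follows from substituting the definitions of D and S into D = a + bS and solving for logit(TPR).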
Other cases

When the test results are given as continuous data, one can dichotomize them and use the approach proposed in the previous section. However, dichotomizing test results causes a loss of information. It is preferable to meta-analyze test results as a continuum whenever studies provide such data. Unfortunately, methods for meta-analyzing multicategory tests are not well established. Irwig et al. have proposed the following approach [3]: the probability of a test result Y falling in a given category j or below (the test scale is assumed to have J ordered categories) is supposed to be a linear function of k explanatory variables (xi):
logit[Pr(Y ≤ j | x1, …, xk)] = vj − Σ ai·xi

where vj is the threshold for category j and x1 codes for the presence or absence of the disease. Exp(a1) is an estimate of the ratio (odds of a positive test result in a diseased subject)/(odds of a positive test result in a non-diseased subject).
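The interpretation of exp(a1) can be checked numerically under this model; the thresholds and coefficient below are hypothetical values chosen for illustration.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1 / (1 + math.exp(-z))

# Hypothetical parameters for a test with J = 4 ordered categories
# (hence 3 thresholds v_j) and a single covariate x1 coding disease
# status (1 = disease present, 0 = absent).
v = [-1.0, 0.5, 2.0]   # thresholds v_1, v_2, v_3
a1 = 1.5               # disease effect on the logit scale

def cum_prob(j, x1):
    """Pr(Y <= category j | x1) under logit[Pr(Y <= j)] = v_j - a1*x1."""
    return expit(v[j] - a1 * x1)

# Call the test "positive" when Y falls above category j.
j = 1
odds_pos_diseased = (1 - cum_prob(j, 1)) / cum_prob(j, 1)
odds_pos_healthy = (1 - cum_prob(j, 0)) / cum_prob(j, 0)
ratio = odds_pos_diseased / odds_pos_healthy

print(f"odds ratio of a positive result: {ratio:.3f}")
print(f"exp(a1) = {math.exp(a1):.3f}")
```

Whichever threshold j is chosen to define positivity, the ratio of the odds equals exp(a1), which is the appeal of this ordinal model.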
Conclusion

Despite the specific difficulties of the meta-analysis of diagnostic test data, not all of which have been resolved, one can expect an increase in the number of rigorous overviews in this domain. There is a real need for the development of such an approach.
References

1. Boissel JP, Blanchard J, Panak E, Peyrieux JC, Sacks H (1989) Considerations for the meta-analysis of randomized clinical trials: summary of a panel discussion. Controlled Clin Trials 10: 254–281
2. Begg CB (1987) Biases in the assessment of diagnostic tests. Statistics Med 6: 411–423
3. Irwig L, Macaskill P, Glasziou P, Fahey M (1995) Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol 48: 119–130
4. Poynard T, Conn HO (1985) The retrieval of randomized clinical trials in liver disease from the medical literature. A comparison of MEDLARS and manual methods. Controlled Clin Trials 6: 271–279
5. Chalmers I, Adams M, Dickersin K, Hetherington J, Tarnow-Mordi W, Meinert CL, Tonascia S, Chalmers TC (1990) A cohort study of summary reports of controlled trials. JAMA 263: 1401–1405
6. Moses LE, Shapiro D (1993) Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Statistics Med 12: 1293–1316
7. Cucherat M, Boissel JP, Leizorovicz A (1997) Méta-analyse des essais thérapeutiques. Masson, Paris
8. DerSimonian R, Laird N (1986) Meta-analysis in clinical trials. Controlled Clin Trials 7: 177–188