

  • Research article
  • Open Access
  • Open Peer Review

Reporting of measures of accuracy in systematic reviews of diagnostic literature

BMC Health Services Research 2002, 2:4

  • Received: 05 September 2001
  • Accepted: 07 March 2002
  • Published:



There are a variety of ways in which accuracy of clinical tests can be summarised in systematic reviews. Variation in reporting of summary measures has only been assessed in a small survey restricted to meta-analyses of screening studies found in a single database. Therefore, we performed this study to assess the measures of accuracy used for reporting results of primary studies as well as their meta-analysis in systematic reviews of test accuracy studies.


Relevant reviews on test accuracy were selected from the Database of Abstracts of Reviews of Effectiveness (1994–2000), which electronically searches seven bibliographic databases and manually searches key resources. The structured abstracts of these reviews were screened and information on accuracy measures was extracted from the full texts of 90 relevant reviews, 60 of which used meta-analysis.


Sensitivity or specificity was used for reporting the results of primary studies in 65/90 (72%) reviews, predictive values in 26/90 (28%), and likelihood ratios in 20/90 (22%). For meta-analysis, pooled sensitivity or specificity was used in 35/60 (58%) reviews, pooled predictive values in 11/60 (18%), pooled likelihood ratios in 13/60 (22%), and pooled diagnostic odds ratio in 5/60 (8%). Summary ROC was used in 44/60 (73%) of the meta-analyses. There were no significant differences in measures of test accuracy among reviews published earlier (1994–97) and those published later (1998–2000).


There is considerable variation in ways of reporting and summarising results of test accuracy studies in systematic reviews. There is a need for consensus about the best ways of reporting results of test accuracy studies in reviews.


  • Accuracy Measure
  • Primary Study
  • Receiver Operating Characteristic
  • Test Accuracy
  • Bibliographic Database


The manner in which accuracy of clinical tests is mathematically summarised in the biomedical literature has important implications for clinicians. Appropriate accuracy measures would be expected to sensibly convey the meaning of the study results with scientifically robust statistics without exaggerating or underestimating the clinical significance of the findings. Lack of use of appropriate measures may lead authors of primary accuracy studies to draw biased conclusions.[1] In systematic reviews of test accuracy literature, there are many ways of synthesising results from several studies, not all of which are considered to be scientifically robust. For example, measures such as sensitivity and specificity commonly used in primary studies are not considered suitable for pooling separately in meta-analysis.[2] Variations in reporting of summary accuracy and use of inappropriate summary statistics may increase the risk of misinterpretation of clinical value of tests.

A recent study evaluated a small sample of meta-analytical reviews of screening tests to demonstrate the variety of approaches used to quantitatively summarise accuracy results.[3] This study confined itself to a limited Medline search. It exclusively examined meta-analytical studies so reviews not using quantitative synthesis were excluded. It did not look at accuracy measures used to report results of primary studies separately from those used for meta-analyses. In order to address these issues, we undertook a comprehensive search to survey systematic reviews (with and without meta-analysis) of test accuracy literature to assess the measures used for reporting results of included primary studies as well as their quantitative synthesis.


We manually searched the Database of Abstracts of Reviews of Effectiveness (DARE)[4] for relevant reviews. To limit the impact of human error inherent in manual searching, we complemented it with electronic searching: DARE was searched electronically with word variants of relevant terms (diagnostic, screening, test, likelihood ratio, sensitivity, specificity, positive and negative predictive value) combined using OR. From 1994 to 2000, DARE[4] identified 1897 reviews of different types by regular electronic searching of several bibliographic databases, hand searching of key major medical journals, and scanning of grey literature (the search strategy and selection criteria are available from DARE). The structured abstracts of these reviews were screened independently by the authors to identify systematic reviews of test accuracy. Full texts were obtained for those abstracts judged to be potentially relevant. Reviews addressing test development, diagnostic effectiveness, or cost effectiveness were excluded. Any disagreements about review selection were resolved by consensus.

Information was extracted from each selected review on the measures of test accuracy used to report the results of the primary studies included in the review. If a meta-analysis was conducted, information was also extracted on the summary accuracy measures. The various accuracy measures are shown in Table 1. For the primary studies we sought the following: sensitivity or specificity, predictive values, likelihood ratios and the diagnostic odds ratio. For meta-analyses, we sought the summary measures pooling the above results and the summary receiver operating characteristic (ROC) plot or values. All extracted data were double-checked. We arbitrarily divided the reviews into two groups according to time of publication: one group covering the period 1994–97 (50 reviews) and another covering 1998–2000 (40 reviews). This allowed us to assess whether there were any significant differences in the measures used to report test accuracy results between reviews published earlier and those published later. As the approaches to summarising results are not mutually exclusive, we evaluated and reported the most commonly used measures and their most common combinations. We used the chi-squared test to compare differences between proportions.
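The chi-squared comparison of two proportions with Yates' continuity correction can be sketched as follows (a minimal standard-library implementation; the example counts are illustrative, not taken from our data, and `scipy.stats.chi2_contingency` with `correction=True` provides the same test):

```python
import math

def chi2_two_proportions(x1, n1, x2, n2):
    """Chi-squared test with Yates' continuity correction for the
    difference between two proportions x1/n1 and x2/n2 (a 2x2 table)."""
    a, b = x1, n1 - x1          # first group: events, non-events
    c, d = x2, n2 - x2          # second group: events, non-events
    n = n1 + n2
    chi2 = n * (abs(a * d - b * c) - n / 2.0) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    # p-value for 1 degree of freedom: P(X^2 > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# illustrative: a measure reported in 35/50 earlier vs 30/40 later reviews
chi2, p = chi2_two_proportions(35, 50, 30, 40)
```

With these counts the difference (70% vs 75%) is far from significant, in line with the period comparisons reported in Table 2.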
Table 1

Measures of accuracy of dichotomous test results

Measures for primary studies

   Sensitivity (true positive rate)

   The proportion of people with disease who are correctly identified as such.

   Specificity (true negative rate)

   The proportion of people without disease who are correctly identified as such.

   Positive predictive value

   The proportion of test positive people who truly have disease.

   Negative predictive value

   The proportion of test negative people who truly do not have disease.

   Likelihood ratios (LR)

   The ratio of the probability of a positive (or negative) test result in the patients with disease to the probability of the same test result in the patients without the disease.

   Diagnostic odds ratio

   The ratio of the odds of a positive test result in patients with disease compared to the odds of the same test result in patients without disease.

Measures for meta-analysis

   Summary sensitivity, specificity, predictive values, likelihood ratios, and diagnostic odds ratio

   Pooling of the above accuracy measures obtained from multiple primary studies (usually averaged and weighted according to size of individual studies).

   Summary receiver operating characteristics curve (ROC)

   A method of summarising the performance of a test as found in multiple primary studies, which takes into account the relationship between sensitivity and specificity.
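As an illustration, every measure in Table 1 for a primary study can be derived from the four cells of its 2×2 table. The sketch below uses hypothetical counts:

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures of Table 1 from the cells of a 2x2 table:
    tp/fp = test-positive people with/without disease,
    fn/tn = test-negative people with/without disease."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "dor": (tp * tn) / (fp * fn),       # diagnostic odds ratio
    }

# hypothetical primary study: 90 true positives, 30 false positives,
# 10 false negatives, 70 true negatives
m = accuracy_measures(90, 30, 10, 70)
```

Note that the diagnostic odds ratio equals LR+ divided by LR−, which is why it condenses both likelihood ratios into a single number.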


Of the abstracts available in DARE, 150 were considered potentially relevant. After excluding reviews that addressed test development, diagnostic effectiveness, or cost, 90 reviews of test accuracy remained for inclusion in our survey. There were 45 reviews of dichotomous test results, 42 reviews of continuous results dichotomised by the original authors, and 3 reviews that contained both result types. Fifty of the 90 reviews were published in 1994–97 and 40 in 1998–2000; meta-analysis was used in 60/90 (67%). (See Additional File: BMC_IncludedRefList_04032002 for a complete listing of the 90 reviews included in our study.)

As shown in Table 2, sensitivity or specificity was used for reporting the results of primary studies in 65/90 (72%) reviews, predictive values in 26/90 (28%), and likelihood ratios in 20/90 (22%). For meta-analysis, independently pooled sensitivity or specificity was used in 35/60 (58%) reviews, pooled predictive values in 11/60 (18%), pooled likelihood ratios in 13/60 (22%), and pooled diagnostic odds ratio in 5/60 (8%). Summary ROC was used in 44/60 (73%) of the meta-analyses. There were no significant differences between reviews published earlier and those published later as shown in Table 2.
Table 2

Measures of test accuracy reported in reviews of diagnostic literature (1994–2000)

| Measures of test accuracy | 1994–97, % (95% CI)+ | 1998–2000, % (95% CI)+ | p-value** |
|---|---|---|---|
| Included primary studies (n) | 50* | 40* | |
| Sensitivity or specificity a | 70 (55–82) | 75 (59–87) | |
| Predictive values b,c | 26 (15–40) | 33 (19–49) | |
| Sensitivity, specificity and predictive values | 24 (13–38) | 30 (17–47) | |
| Likelihood ratios | | 28 (15–44) | |
| Diagnostic odds ratios | 0 (0–7) | | |
| Included meta-analyses (n) | 38* | 22* | |
| Meta-analysis conducted | | 55 (38–71) | |
| Independently pooled sensitivity or specificity d | 58 (41–74) | 62 (36–79) | |
| Pooled predictive values e,f | | | |
| Pooled likelihood ratios | 13 (4–28) | 38 (17–59) | |
| Pooled diagnostic odds ratios | 13 (4–28) | | |
| Summary ROC plot or values | 61 (43–76) | 52 (28–72) | |

* Numbers do not add up to totals because some reviews used more than one measure of accuracy; ** chi-squared test with Yates' correction; + exact (Clopper-Pearson) 95% confidence interval; a includes studies that reported only sensitivity or only specificity; b,c include studies that reported only positive or only negative predictive values; d includes meta-analyses that reported only pooled sensitivity or pooled specificity; e,f include meta-analyses that reported only pooled positive or pooled negative predictive values.
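The exact (Clopper-Pearson) intervals marked + above can be reproduced by inverting the binomial distribution. Below is a minimal standard-library sketch that uses bisection in place of a beta-quantile routine (libraries such as statsmodels provide this directly):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a proportion x/n."""
    def solve(f):
        # bisection on [0, 1] for a decreasing function f with a sign change
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower limit: the p at which P(X >= x | p) = alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    # upper limit: the p at which P(X <= x | p) = alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper

# e.g. 35 of 50 reviews (70%) reporting sensitivity or specificity
lo_ci, hi_ci = clopper_pearson(35, 50)
```

For 35/50 this yields approximately 0.55 to 0.82, matching the "70 (55–82)" entry in Table 2.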


Our study showed that sensitivity and specificity remained in frequent use, both for primary studies and for meta-analyses, over the period surveyed. Sensitivity and specificity are considered inappropriate for meta-analysis because they do not behave independently when pooled separately across primary studies to generate separate averages.[2] In our survey, separate pooling of sensitivities or specificities was frequently used in meta-analyses where a summary ROC would have been more appropriate.[5–7]
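To contrast with separate pooling, the summary ROC method of Moses et al.[7] can be sketched as an ordinary least-squares fit on the D–S transformation of each study's 2×2 table. The study counts below are hypothetical:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def moses_littenberg(studies):
    """Fit D = a + b*S by ordinary least squares, where per study
    D = logit(TPR) - logit(FPR)  (the log diagnostic odds ratio) and
    S = logit(TPR) + logit(FPR)  (a proxy for the diagnostic threshold).
    `studies` is a list of (tp, fp, fn, tn); 0.5 is added to every cell
    as a common continuity correction."""
    ds, ss = [], []
    for tp, fp, fn, tn in studies:
        tp, fp, fn, tn = tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        ds.append(logit(tpr) - logit(fpr))
        ss.append(logit(tpr) + logit(fpr))
    n = len(studies)
    s_mean, d_mean = sum(ss) / n, sum(ds) / n
    b = (sum((s - s_mean) * (d - d_mean) for s, d in zip(ss, ds))
         / sum((s - s_mean) ** 2 for s in ss))
    a = d_mean - b * s_mean
    return a, b

def sroc_tpr(a, b, fpr):
    """Expected sensitivity on the summary ROC curve at a given FPR."""
    v = logit(fpr)
    return expit((a + (1 + b) * v) / (1 - b))

# hypothetical primary studies (tp, fp, fn, tn) at varying thresholds
studies = [(90, 30, 10, 70), (80, 20, 20, 80), (95, 40, 5, 60), (70, 10, 30, 90)]
a, b = moses_littenberg(studies)
```

Unlike separately averaged sensitivities and specificities, the fitted curve `sroc_tpr` respects the trade-off between the two across thresholds; more recent hierarchical approaches[6] refine this further.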

Our findings about the reporting of summary accuracy measures in meta-analyses differ from those reported previously.[3] We found a higher rate of use of summary ROC, though use of independent summaries of sensitivity, specificity and predictive values was similar. These differences may be due to differences in search strategies (databases and time frames) and selection criteria. Our search was more recent and comprehensive, using DARE[4], which has covered seven different databases (Medline, CINAHL, BIOSIS, Allied and Alternative Medicine, ERIC, Current Contents Clinical Medicine and PsycLIT) and has hand-searched 68 peer-reviewed journals and publications from 33 health technology assessment centres around the world since February 1994. Moreover, as we did not restrict our selection to meta-analytical reviews, we were also able to examine reviews that summarised accuracy results of primary studies without quantitative synthesis; these constituted 33% (30/90) of our sample. Therefore, compared with the previous publication on this topic,[3] our survey provides a broader and more up-to-date overview of the state of reporting of accuracy measures in test accuracy reviews.


The use of inappropriate accuracy measures has the potential to bias judgement about the value of tests. Of the various approaches to reporting accuracy of dichotomous test results, likelihood ratios are considered more clinically informative than sensitivities or specificities.[8] Crucially, it has been shown empirically that authors of primary studies may overstate the value of tests in the absence of likelihood ratios.[1] There is also evidence that readers themselves may misinterpret test accuracy measures following publication.[9] It is conceivable that the inconsistent usage of test accuracy measures in published reviews, as found in our survey, contributes to misinterpretation by the clinical readership. The variation in reported accuracy measures may, in part, be attributed to a lack of consensus regarding the best ways to summarise test results. It is worth noting that despite authoritative publications on appropriate summary accuracy measures in the past[5, 7, 10] (we have quoted only a few references), inconsistent and inappropriate use of summary measures remained prevalent in the period 1994–2000. Our paper highlights the need for consensus to support change in this field of research.
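The clinical appeal of likelihood ratios is that they update a pre-test probability directly through Bayes' theorem on the odds scale, as the following sketch with hypothetical numbers shows:

```python
def post_test_probability(pre_test_p, lr):
    """Update a pre-test probability of disease with a likelihood ratio
    via Bayes' theorem on the odds scale: post-odds = pre-odds * LR."""
    pre_odds = pre_test_p / (1 - pre_test_p)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# A positive result with LR+ = 3.0 moves a 25% pre-test probability
# (odds 1:3) to post-test odds 1:1, i.e. a 50% probability.
p_post = post_test_probability(0.25, 3.0)
```

A likelihood ratio of 1 leaves the probability unchanged, and an LR below 1 (a negative result) lowers it, which is what makes these measures directly usable at the bedside.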



We wish to acknowledge J. Dinnes, J. Glanville and F. Song for their contribution to the searches and selection of the reviews for this survey.


RCOG Wellbeing grant number K2/00

Authors’ Affiliations

Academic Dept. of Obstetrics & Gynaecology, Birmingham Women's Hospital, Birmingham, B15 2TG, United Kingdom


  1. Khan KS, Khan SF, Nwosu CR, Chien PFW: Misleading authors' inferences in obstetric diagnostic test literature. Am J Obstet Gynecol. 1999, 181: 112-5.
  2. Shapiro DE: Issues in combining independent estimates of the sensitivity and specificity of a diagnostic test. Academic Radiology. 1995, 2: S37-S47.
  3. Walter SD, Jadad AR: Meta-analysis of screening data: a survey of the literature. Stat Med. 1999, 18: 3409-24.
  4. NHS Centre for Reviews and Dissemination: Database of Abstracts of Reviews of Effectiveness (DARE). York: University of York, NHS Centre for Reviews and Dissemination. 2001.
  5. Irwig L, Macaskill P, Glasziou P, Fahey M: Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol. 1995, 48: 119-30.
  6. Rutter CM, Gatsonis CA: A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med. 2001, 20: 2865-84.
  7. Moses LE, Shapiro D, Littenberg B: Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med. 1993, 12: 1293-316.
  8. Jaeschke R, Guyatt G, Sackett DL, for the Evidence-Based Medicine Working Group: Users' guides to the medical literature, III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA. 1994, 271: 703-7.
  9. Hoffrage U, Lindsey S, Hertwig R, Gigerenzer G: Communicating statistical information. Science. 2000, 290: 2261-2.
  10. Cochrane Methods Working Group on Systematic Reviews of Screening and Diagnostic Tests: Screening and diagnostic tests: recommended methods. 1996.
Pre-publication history

The pre-publication history for this paper can be accessed here:


© Honest and Khan; licensee BioMed Central Ltd. 2002

This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.