- Research article
- Open Access
- Open Peer Review
Systems for grading the quality of evidence and the strength of recommendations I: Critical appraisal of existing approaches
The GRADE Working Group
BMC Health Services Research volume 4, Article number: 38 (2004)
A number of approaches have been used to grade levels of evidence and the strength of recommendations. The use of many different approaches detracts from one of the main reasons for having explicit approaches: to concisely characterise and communicate this information so that it can easily be understood and thereby help people make well-informed decisions. Our objective was to critically appraise six prominent systems for grading levels of evidence and the strength of recommendations as a basis for agreeing on characteristics of a common, sensible approach to grading levels of evidence and the strength of recommendations.
Six prominent systems for grading levels of evidence and strength of recommendations were selected and someone familiar with each system prepared a description of each of these. Twelve assessors independently evaluated each system based on twelve criteria to assess the sensibility of the different approaches. Systems used by 51 organisations were compared with these six approaches.
There was poor agreement about the sensibility of the six systems. Only one of the systems was suitable for all four types of questions we considered (effectiveness, harm, diagnosis and prognosis). None of the systems was considered usable for all of the target groups we considered (professionals, patients and policy makers). The raters found low reproducibility of judgements made using all six systems. Systems used by 51 organisations that sponsor clinical practice guidelines included a number of minor variations of the six systems that we critically appraised.
All of the currently used approaches to grading levels of evidence and the strength of recommendations have important shortcomings.
In 1979 the Canadian Task Force on the Periodic Health Examination published one of the first efforts to explicitly characterise the level of evidence underlying healthcare recommendations and the strength of recommendations [1]. Since then a number of alternative approaches have been proposed and used to classify clinical practice guidelines [2–28].
The original approach used by the Canadian Task Force was based on study design alone, with randomised controlled trials (RCTs) being classified as good (level I) evidence, cohort and case control studies being classified as fair (level II) evidence and expert opinion being classified as poor (level III) evidence. The strength of recommendation was based on the level of evidence with direct correspondence between the two; e.g. a strong recommendation (A) corresponded to there being good evidence. A strength of the original Canadian Task Force approach was that it was simple; the main weakness was that it was too simple. Because of its simplicity, it was easy to understand, apply and present. However, because it was so simple there were many implicit judgements, including judgements about the quality of RCTs, conflicting results of RCTs, and convincing results from non-experimental studies.
Should a small, poorly designed RCT be considered level I evidence?
Should RCTs with conflicting results still be considered level I evidence?
Should observational studies always be considered level II evidence, regardless of how convincing they are?
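The design-only logic of the original approach can be sketched in a few lines of Python. This is an illustration only: the function and dictionary names are hypothetical, and the letter grades B and C are assumed by analogy, since the description above states only that a strong recommendation (A) corresponded to good evidence.

```python
# Illustrative sketch of design-only grading as described above.
# All names are hypothetical; only the pairing of grade A with
# good (level I) evidence is stated in the text.
LEVEL_BY_DESIGN = {
    "rct": ("I", "good"),
    "cohort": ("II", "fair"),
    "case-control": ("II", "fair"),
    "expert opinion": ("III", "poor"),
}

# Assumed one-to-one mapping from level of evidence to recommendation grade.
GRADE_BY_LEVEL = {"I": "A", "II": "B", "III": "C"}


def grade(study_design: str) -> tuple[str, str, str]:
    """Return (level, quality label, recommendation grade) from study design alone."""
    level, label = LEVEL_BY_DESIGN[study_design.lower()]
    return level, label, GRADE_BY_LEVEL[level]


# Sample size, design flaws and consistency of results never enter the function.
print(grade("RCT"))
print(grade("cohort"))
```

Because study design is the only input, a small, poorly designed RCT and a large, well-conducted one receive the same level and grade. Any answer to the three questions above therefore has to come from judgements made outside the scheme.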
The original approach by the Canadian Task Force also did not include explicit judgements about the strength of recommendations, such as how trade-offs between the expected benefits, harms and costs were weighed and taken account of in going from an assessment of how good the evidence is to determining the implications of the results for practice.
The GRADE Working Group is an informal collaboration of people with an interest in addressing shortcomings such as these in systems for grading evidence and recommendations. We describe here a critical appraisal of six prominent systems and the results of the critical appraisal.
We selected systems for grading the level of evidence and the strength of recommendations that we considered prominent and that included features not captured by other prominent systems. These were selected based on the experience and knowledge of the authors through informal discussion. A description of the most recent version (as of summer 2000) of each of these systems (Appendices 1 to 6) was prepared by one of the authors familiar with the system and used in this exercise. The following six systems were appraised: the American College of Chest Physicians (ACCP, [see Additional file 1]), Australian National Health and Medical Research Council (ANHMRC, [see Additional file 2]), Oxford Centre for Evidence-Based Medicine (OCEBM, [see Additional file 3]), Scottish Intercollegiate Guidelines Network (SIGN, [see Additional file 4]), US Preventive Services Task Force (USPSTF, [see Additional file 5]) and US Task Force on Community Preventive Services (USTFCPS, [see Additional file 6]).
These descriptions of the systems were given to the twelve people who independently appraised the six systems. All of the authors except GEV appraised the six systems; three of the authors (DH, SH and DO'C) appraised as a group and reported as one (see contributions). The 12 assessors all had experience with at least one system and most had helped to develop one of the six included systems. Twelve criteria described by Feinstein [29] provided the basis for assessing the sensibility of the six systems.
Criteria used to assess the sensibility of systems for grading evidence and recommendations
To what extent is the approach applicable to different types of questions (effectiveness, harm, diagnosis and prognosis)? (No, Not sure, Yes)
To what extent can the system be used with different audiences (patients, professionals and policy makers)? (Little extent, Some extent, Large extent)
How clear and simple is the system? (Not very clear, Somewhat clear, Very clear)
How often will information not usually available be necessary? (Often, Sometimes, Seldom)
To what extent are subjective decisions needed? (Often, Sometimes, Seldom)
Are dimensions included that are not within the construct (level of evidence or strength of recommendation)? (Yes, Partially, No)
Are there important dimensions that should have been included and are not? (No, Partially, Yes)
Is the way in which the included dimensions are aggregated clear and simple? (No, Partially, Yes)
Is the way in which the included dimensions are aggregated appropriate? (No, Partially, Yes)
Are the categories sufficient to discriminate between different levels of evidence and strengths of recommendations? (No, Partially, Yes)
How likely is the system to be successful in discriminating between high and low levels of evidence or strong and weak recommendations? (Not very likely, Somewhat likely, Highly likely)
Are assessments reproducible? (Probably not, Not sure, Probably)
No training was provided and we did not discuss the 12 criteria prior to applying them to the six systems.
Our independent appraisals of the six systems were summarised and discussed. The discussion focused on differences in the interpretation of the criteria, disagreement about the judgements that we made and sources of these disagreements, the strengths and weaknesses of the six systems, and inferences based on the appraisals and subsequent discussion.
In order to identify important systems that we might have overlooked, following our appraisal of these six systems we also searched the US Agency for Health Care Research and Quality (AHRQ) National Guidelines Clearing House for organisations that have graded two or more guidelines in the Clearing House using an explicit system [30]. These systems were compared with the six systems that we critically appraised.
There was poor agreement among the 12 assessors who independently assessed the six systems. A summary of the assessments of the sensibility of the six approaches to rating levels of evidence and strength of recommendation is shown in Table 1.
The poor agreement among the assessors likely reflects several factors. Some of us had practical experience using one of the systems or used additional background information related to one or more grading systems, and we may have been biased in favour of the system with which we were most familiar. Each criterion was applied to grading both evidence and recommendations. Some systems were better for one of these constructs than the other and we may have handled these discrepancies differently. In addition, each criterion may have been assessed relative to different judgements about the evidence, such as an assessment of the overall quality of evidence for an important outcome (across studies) versus the quality of an individual study. Some of the criteria were not clear and were interpreted or applied inconsistently. For example, a system might be clear and not simple or vice versa. We likely differed in how stringently we applied the criteria. Finally, there was true disagreement.
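As an illustration only (this is not the measure used in the appraisal), agreement on a single criterion for a single system could be summarised as the share of assessors who chose the most common response category. The function name and the ratings below are hypothetical and are not data from the appraisal.

```python
from collections import Counter


def modal_agreement(ratings: list[str]) -> float:
    """Share of assessors who chose the most common category."""
    counts = Counter(ratings)
    return counts.most_common(1)[0][1] / len(ratings)


# Hypothetical ratings from 12 assessors for one system on one criterion
# ("How clear and simple is the system?"); not data from the appraisal.
ratings = (
    ["Very clear"] * 4
    + ["Somewhat clear"] * 5
    + ["Not very clear"] * 3
)

print(f"{modal_agreement(ratings):.2f}")  # 5 of 12 assessors -> 0.42
```

A value of 1.0 would mean that all assessors chose the same category. With three response categories, values near one third mean the ratings are spread almost evenly, which is the kind of pattern that poor agreement describes.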
There was agreement that the OCEBM system works well for all four types of questions. There was disagreement about the extent to which the other systems work well for questions other than effectiveness. It was noted that some systems are not intended to address other types of questions and it is not clear that it is important that a system should address all four types of questions that we considered (effectiveness, harm, diagnosis, prognosis), although criteria for assessing individual studies must take this into account [31, 32].
Most of us did not find that any of the systems are likely to be suitable for use by patients. Almost all agreed that the ACCP system was suitable for professionals and most considered that the USPSTF system was suitable for professionals. There was not much agreement about the suitability of any of the other systems for professionals or about the suitability of any of the systems for policy makers, although most assessed the USTFCPS system to be suitable for policy makers.
There was no agreement that any of the systems are clear and simple, although USPSTF, ACCP and SIGN systems were generally assessed more favourably in this regard. It was generally agreed that the clearer a system was the less simple it was; e.g. the OCEBM system is clear but not simple for categorising the level of evidence. There was some confusion regarding whether we were assessing how clear and simple the system was to guideline developers (as some interpreted this criterion) or how clear and simple the outcome of applying the system was to guideline users (as others interpreted this criterion). Either way, the simpler a system is the less clear it is likely to be.
Most of us judged that for most of the systems necessary information would not be available at least sometimes. The OCEBM system came out somewhat better than the other systems and lack of availability of necessary information was considered to be less of a problem for the USTFCPS system. However, the OCEBM and USTFCPS systems were considered by most to be missing dimensions, which may, in part, explain why missing information was considered to be less of a problem. This would be the case to the extent that the missing dimensions were the ones for which information would often or sometimes not be available. The dimension for which we considered that information would most often be missing was trade-offs; i.e. knowledge of the preferences or utility values of those affected. Additional problems were identified in relation to complex interventions and counselling, particularly with the USTFCPS and USPSTF systems. It was pointed out that the USTFCPS system addressed this problem by including availability of information about the intervention as part of its assessment of the quality of evidence.
Most of the systems were assessed to require subjective decisions at least to some extent. The OCEBM system again stood out as being assessed more favourably, although it may be related to omission of dimensions that require more subjective decisions. Judgement is clearly needed with any system. The aim should be to make judgements transparent and to try to protect against bias in the judgements that are made by being systematic and explicit.
Inclusion of dimensions that are not within the constructs being graded was not considered a problem for most of the systems by most of us. Several people considered that it might be a problem for the USTFCPS and USPSTF systems. On the other hand, all of the systems were evaluated to be missing at least one important dimension by at least one person. Missing dimensions were considered less of a problem for the ACCP and ANHMRC systems. There was no agreement that any of the systems had a clear and simple approach to aggregating the dimensions, although this was considered to be less of a problem for the ACCP, SIGN and USTFCPS systems.
There was also not agreement on the appropriateness of how the dimensions were aggregated. This was considered to be more of a problem for the ANHMRC and USTFCPS systems than the other four systems, all of which were considered to have taken an approach to aggregating the dimensions that was at least partially inappropriate by more than half of us.
Most of us considered that most of the systems had sufficient categories, with the exception of the ANHMRC system. There was near agreement that the USPSTF system has sufficient categories. We agreed that it is possible to have too many categories as well as too few, the OCEBM system being an example of having too many categories.
There was no agreement that any of the systems are likely to discriminate successfully, although everyone thought that the ACCP, SIGN and USPSTF systems are somewhat to highly likely to discriminate. Lastly, we largely agreed that we were not sure how reproducible assessments are using any of the systems, although half of us considered that assessments using the ANHMRC system are unlikely to be reproducible and about one third considered that assessments using the OCEBM and ACCP systems are likely to be reproducible.
We identified 22 additional organisations that have produced 10 or more practice guidelines using an explicit approach to grade the level of evidence or strength of recommendations. Another 29 have produced between two and nine guidelines using an explicit approach. These systems include a number of minor variations of the six systems that we appraised in detail.
There was generally poor agreement between the individual assessors about the scoring of the six approaches using the 12 criteria. However, there was general agreement that none of these six prominent approaches to grading the levels of evidence and strength of recommendations adequately addressed all of the important concepts and dimensions that we thought should be considered. Although we limited our appraisal to six systems, all of the additional approaches to grading levels of evidence and strength of recommendations that we identified were, in essence, variations of the six approaches that we had critically appraised. Therefore we are confident that we did not miss any important grading systems available at the time when these assessments were undertaken.
Based on discussions following the critical appraisal of these six approaches, we agreed on some conclusions:
Separate assessments should be presented for judgements about the quality of the evidence and judgements about the balance of benefits and harms.
Evidence for harms should be assessed in the same way as evidence for benefits, although different evidence may be considered relevant for harms than for benefits; e.g. local evidence of complication rates may be considered more relevant than evidence of complication rates from trials for endarterectomy.
Judgements about the quality of evidence should be based on a systematic review of the relevant research.
Systematic reviews should not be included in a hierarchy of evidence (i.e. as a level or category of evidence). The availability of a well-done systematic review does not correspond to high quality evidence, since a well-done review might include anything from no studies to poor quality studies with inconsistent results to high quality studies with consistent results.
Baseline risk should be taken into consideration in defining the population to whom a recommendation applies. Baseline risk should also be used transparently in making judgements about the balance of benefits and harms. When a recommendation varies in relationship to baseline risk, the evidence for determining baseline risk should be assessed appropriately and explicitly.
Recommendations should not vary in relationship to baseline risk if there is not adequate evidence to guide reliable determinations of baseline risk.
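The role of baseline risk in the balance of benefits and harms can be shown with simple arithmetic. The sketch below assumes, for illustration only, that the relative effect of an intervention is constant across baseline risks, that the absolute risk of harm does not depend on baseline risk, and that benefit and harm are weighted equally. All names and numbers are hypothetical.

```python
def absolute_risk_reduction(baseline_risk: float, relative_risk_reduction: float) -> float:
    """Absolute benefit, assuming a constant relative effect across baseline risks."""
    return baseline_risk * relative_risk_reduction


# Hypothetical inputs: a 25% relative risk reduction and a 1% absolute
# risk of serious harm caused by the intervention.
RRR = 0.25
HARM_RISK = 0.01

for baseline in (0.02, 0.20):
    benefit = absolute_risk_reduction(baseline, RRR)
    net = benefit - HARM_RISK  # equal weighting of benefit and harm is an assumption
    print(f"baseline {baseline:.0%}: benefit {benefit:.1%}, net {net:+.1%}")
```

Under these assumptions the same relative effect gives a net harm at a 2% baseline risk and a net benefit at a 20% baseline risk. A recommendation that differs between such groups is therefore only as reliable as the evidence used to determine baseline risk.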
Based on discussions of the strengths and limitations of current approaches to grading levels of evidence and the strength of recommendations, we agreed to develop an approach that addresses the major limitations that we identified. The approach that the GRADE Working Group has developed is based on the discussions following the critical appraisal reported here and a pilot study of the GRADE approach [33]. Based on the pilot testing and the discussions following the pilot, the GRADE Working Group has further developed the GRADE system to its present format [34].
The GRADE Working Group has continued to grow as an informal collaboration that meets one or two times per year. The group maintains web pages (http://www.gradeworkinggroup.org) and a discussion list.
DA, PAB, ME, SF, GHG, DH, SH, AL, DO'C, ADO, BP, HS, TTTE, GEV & JWW Jr as members of the GRADE Working Group have contributed to the preparation of this manuscript and the development of the ideas contained herein, participated in the critical assessment, and read and commented on drafts of this article. GHG and ADO have led the process. GEV has had primary responsibility for coordinating the process.
Canadian Task Force on the Periodic Health Examination: The periodic health examination. Can Med Assoc J. 1979, 121: 1193-254.
Sackett DL: Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1986, 89 (suppl 2): 2S-3S.
Sackett DL: Rules of evidence and clinical recommendations on the use of antithrombotic agents. Archives Int Med. 1986, 146: 464-465.
Sackett DL: Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1989, 95: 2S-4S.
Cook DJ, Guyatt GH, Laupacis A, Sackett DL: Rules of evidence and clinical recommendations on the use of antithrombotic agents. Antithrombotic Therapy Consensus Conference. Chest. 1992, 102 (suppl 4): 305S-311S.
US Department of Health and Human Services, Public Health Service, Agency Health Care Policy and Research: Acute Pain Management: Operative or Medical Procedures and Trauma. Agency for Health Care Policy and Research Publications, Rockville, MD. (AHCPR Pub 92-0038). 1992
Gyorkos TW, Tannenbaum TN, Abrahamowicz M, Oxman AD, Scott EA, Millson ME, Rasooly I, Frank JW, Riben PD, Mathias RG: An approach to the development of practice guidelines for community health interventions. Can J Public Health. 1994, 85 (suppl 1): S8-S13.
Hadorn DC, Baker D: Development of the AHCPR-sponsored heart failure guideline: methodologic and procedural issues. J Quality Improvement. 1994, 20: 539-54.
Cook DJ, Guyatt GH, Laupacis A, Sackett DL, Goldberg RJ: Clinical recommendations using levels of evidence for antithrombotic agents. Chest. 1995, 108 (4 Suppl): 227S-230S.
Guyatt GH, Sackett DL, Sinclair JC, Hayward R, Cook DJ, Cook RJ, for the Evidence-Based Medicine Working Group: Users' guides to the medical literature. IX. A method for grading health care recommendations. JAMA. 1995, 274: 1800-4. 10.1001/jama.274.22.1800.
Petrie J, Barnwell E, Grimshaw J: Criteria for appraisal for national use. Pilot Edition. Scottish Intercollegiate Guidelines Network. 1995, [http://www.sign.ac.uk/methodology/index.html]
US Preventive Services Task Force: Guide to Clinical Preventive Services. 1996, Baltimore: Williams & Wilkins, xxxix-lv. 2nd edition
Eccles M, Clapp Z, Grimshaw J, Adams PC, Higgins B, Purves I, Russell I: North of England evidence based guidelines development project: methods of guideline development. BMJ. 1996, 312: 760-2.
Centro per la Valutazione della Efficacia della Assistenza Sanitaria (CeVEAS). Linee Guida per il trattamento del tumore della mammella nella provincia di Modena (Luglio 2000). accessed December 29, 2002, [http://www.ceveas.it/ceveas/viewpage.do?idp=3]
Guyatt GH, Cook DJ, Sackett DL, Eckman M, Pauker S: Grades of recommendation for antithrombotic agents. Chest. 1998, 114 (5 Suppl): 441S-4S. [http://www.chestjournal.org/content/vol119/1_suppl/]
Ball C, Sackett D, Phillips B, Straus S, Haynes B: Levels of evidence and grades of recommendations. Last revised 17 September 1998. Centre for Evidence-Based Medicine, [http://www.cebm.net/levels_of_evidence.asp]
National Health and Medical Research Council: How to use the evidence: assessment and application of scientific evidence. Commonwealth of Australia. 2000, [http://www.nhmrc.gov.au/publications/synopses/cp65syn.htm]
Harbour R, Miller J: A new system for grading recommendations in evidence based guidelines. BMJ. 2001, 323: 334-6. 10.1136/bmj.323.7308.334.
Roman SH, Silberzweig SB, Siu AL: Grading the evidence for diabetes performance measures [see comments]. Eff Clin Pract. 2000, 3: 85-91.
Woloshin S: Arguing about grades. Eff Clin Pract. 2000, 3: 94-5.
Guyatt GH, Schünemann H, Cook D, Pauker S, Sinclair J, Bucher H, Jaeschke R: Grades of recommendation for antithrombotic agents. Chest. 2001, 119: 3S-7S. 10.1378/chest.119.1_suppl.3S.
Atkins D, Best D, Shapiro EN: The third U.S. Preventive Services Task Force : background, methods and first recommendations. Am J Preventive Medicine. 2001, 20 (3 (supplement 1)): 1-108.
Woolf SH, Atkins D: The evolving role of prevention in health care: Contributions of the U.S. Preventive Services Task Force. Am J Preventive Medicine. 2001, 20 (3 (supplement 1)): 13-20. 10.1016/S0749-3797(01)00262-8.
Harris RP, Helfand M, Woolf SH, Lohr KN, Mulrow CD, Teutsch SM, Atkins D, for the Methods Work Group of the Third U.S. Preventive Services Task Force: Current methods of the U.S. Preventive Services Task Force: A review of the process. Am J Preventive Medicine. 2001, 20 (3 (Supplement 1)): 21-35. 10.1016/S0749-3797(01)00261-6.
Briss PA, Zaza S, Pappaioanou M, Fielding J, Wright-De Aguero L, Truman BI, Hopkins DP, Mullen PD, Thompson RS, Woolf SH, Carande-Kulis VG, Anderson L, Hinman AR, McQueen DV, Teutsch SM, Harris JR: Developing an evidence-based Guide to Community Preventive Services – methods. The Task Force on Community Preventive Services. Am J Preventive Medicine. 2000, 18: 35-43. 10.1016/S0749-3797(99)00119-1.
Zaza S, Wright-De Aguero LK, Briss PA, Truman BI, Hopkins DP, Hennessy MH, Sosin DM, Anderson L, Carande-Kulis VG, Teutsch SM, Pappaioanou M: Data collection instrument and procedure for systematic reviews in the Guide to Community Preventive Services. Task Force on Community Preventive Services. American Journal of Preventive Medicine. 2000, 18: 44-74. 10.1016/S0749-3797(99)00122-1.
Greer N, Mosser G, Logan G, Halaas GW: A practical approach to evidence grading. Joint Commission J Qual Improv. 2000, 26: 700-12.
West S, King V, Carey TS, Lohr KN, McKoy N, Sutton SF, Lux L: Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment No. 47 (Prepared by the Research Triangle Institute-University of North Carolina Evidence-based Practice Center under Contract No. 290-97-0011). AHRQ Publication No. 02-E016. 2002, Rockville, MD: Agency for Healthcare Research and Quality, 64-88.
Feinstein AR: Clinimetrics. 1987, New Haven, CT: Yale University Press, 141-66.
National Guidelines Clearing House. Accessed April 19, 2001, [http://www.guideline.gov/resources/guideline_index.aspx]
Guyatt G, Drummond R, eds: Users' Guide to the Medical Literature. 2002, Chicago, IL: AMA Press, 55-154.
West S, King V, Carey TS, Lohr KN, McKoy N, Sutton SF, Lux L: Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment No. 47 (Prepared by the Research Triangle Institute-University of North Carolina Evidence-based Practice Center under Contract No. 290-97-0011). AHRQ Publication No. 02-E016. 2002, Rockville, MD: Agency for Healthcare Research and Quality, 51-63.
Atkins D, Briss PA, Eccles M, Flottorp S, Guyatt GH, Harbour RT, Hill S, Jaeschke R, Liberati A, Magrini N, Mason J, O'Connell D, Oxman AD, Phillips B, Schunemann HJ, Edejer TT, Vist GE, Williams JW, GRADE Working Group: Systems for grading the quality of evidence and the strength of recommendations II: Pilot study of a new system. BioMed Central.
Atkins D, Best D, Briss PA, Eccles M, Falck Ytter Y, Flottorp S, Guyatt GH, Harbour RT, Haugh MC, Henry D, Hill S, Jaeschke R, Leng G, Liberati A, Magrini N, Mason J, Middleton P, Mrukowicz J, O'Connell D, Oxman AD, Phillips B, Schunemann HJ, Edejer TT, Varonen H, Vist GE, Williams JW, Zaza S, GRADE Working Group: Grading quality of evidence and strength of recommendations. BMJ. 2004, 328 (7454): 1490.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6963/4/38/prepub
We wish to thank Peter A Briss for participating in the critical assessment and for providing constructive comments on the process. The institutions with which members of the Working Group are affiliated have provided intramural support. Opinions expressed in this paper do not necessarily represent those of the institutions with which the authors are affiliated.
DA has competing interests with the US Preventive Services Task Force (USPSTF), PAB has a competing interest with the US Task Force on Community Preventive Services (USTFCPS), GHG and HS have competing interests with the American College of Chest Physicians (ACCP), DH, SH and DO'C have competing interests with the Australian National Health and Medical Research Council (ANHMRC), BP has competing interests with the Oxford Centre for Evidence-Based Medicine (OCEBM). Most of the other members of the GRADE Working Group have experience with the use of one or more systems of grading evidence and recommendations.