Systems for grading the quality of evidence and the strength of recommendations II: Pilot study of a new system

Background: Systems that are used by different organisations to grade the quality of evidence and the strength of recommendations vary. They have different strengths and weaknesses. The GRADE Working Group has developed an approach that addresses key shortcomings in these systems. The aim of this study was to pilot test and further develop the GRADE approach to grading evidence and recommendations.
Methods: A GRADE evidence profile consists of two tables: a quality assessment and a summary of findings. Twelve evidence profiles were used in this pilot study. Each evidence profile was made based on information available in a systematic review. Seventeen people were given instructions and independently graded the level of evidence and strength of recommendation for each of the 12 evidence profiles. For each example judgements were collected, summarised and discussed in the group with the aim of improving the proposed grading system. Kappas were calculated as a measure of chance-corrected agreement for the quality of evidence for each outcome for each of the twelve evidence profiles. The seventeen judges were also asked about the ease of understanding and the sensibility of the approach. All of the judgements were recorded and disagreements discussed.
Results: There was a varied amount of agreement on the quality of evidence for the outcomes relating to each of the twelve questions (kappa coefficients for agreement beyond chance ranged from 0 to 0.82). However, there was fair agreement about the relative importance of each outcome. There was poor agreement about the balance of benefits and harms and recommendations. Most of the disagreements were easily resolved through discussion. In general we found the GRADE approach to be clear, understandable and sensible. Some modifications were made in the approach and it was agreed that more information was needed in the evidence profiles.
Conclusion: Judgements about evidence and recommendations are complex. Some subjectivity, especially regarding recommendations, is unavoidable. We believe our system for guiding these complex judgements appropriately balances the need for simplicity with the need for full and transparent consideration of all important issues.

Background
Reviewers and users of reviews draw conclusions about the overall quality of the evidence that is reviewed. Similarly, people making recommendations and users of those recommendations draw conclusions about the strength of the recommendations that are made. Systematic approaches to doing this can help protect against errors by both doers and users, and can facilitate critical appraisal and communication of the conclusions that are made.
The GRADE Working Group began as an informal collaboration of people with an interest in addressing shortcomings in systems for grading evidence and recommendations. We report elsewhere a critical appraisal of six prominent systems for grading evidence and recommendations [1]. Based on this critical appraisal and a series of discussions, we reached agreement on the key attributes of a system that would address the major shortcomings that we identified. Drawing on the critical appraisal, on this agreement about the key elements of an approach for grading the level of evidence and strength of recommendations, and on our previous experience, we drafted a grading system. We then applied the suggested system to a series of examples, and discussed and revised the system based on this experience and the consideration of other examples. Examples were selected to challenge our thinking. All of the examples used in this pilot study were questions about interventions. We describe here the pilot study of this system.
The aims of the pilot study were to test whether the approach is sensible when applied to diverse examples of evidence and recommendations, and to agree on necessary changes to the approach, to the decision rules, and to how the evidence profiles used in the pilot study were constructed. The revised approach is described elsewhere [16].

Methods
Seventeen people independently judged the quality of evidence, the balance between benefits and harms, and the formulation of a recommendation for 12 examples. The 17 judges all had experience using other approaches to grade evidence and recommendations.

Evidence profiles
For each example we prepared an evidence profile. Each evidence profile was based on information available in a systematic review and consisted of two tables: one for quality assessment of the available information and one presenting a summary of the findings (Table 1 and Table 2). For the purpose of testing our grading approach in this pilot study, we assumed that the systematic reviews we used were all well conducted. The examples used and presented here were selected to test our new approach, not to make actual recommendations for a specific setting based on up-to-date systematic reviews. The quality assessment table was designed so that the quality of evidence for each outcome was evaluated separately. For each outcome, the table contained the number of studies that had reported the outcome, the study design (RCTs or observational studies) and the quality of the studies that reported on that outcome (were there any limitations in the design or conduct of these studies?). Also included in the quality assessment table was information about the consistency of the results across studies for each outcome and information regarding directness of the study population, outcome measure, intervention and comparison. The summary of findings table was also designed so that each outcome was presented separately. For each outcome, information was presented for both the experimental and the control group patients: for dichotomous outcomes, the number of events and the total number of participants; for continuous outcomes, means (standard deviations) and the number of patients.
Also included in the summary of findings table is information about the effect, relative effect (95% confidence interval) and absolute effect for each outcome.
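As an illustration of how the effect columns in a summary of findings table can be derived for a dichotomous outcome, the following is a minimal sketch. It assumes that the relative effect is expressed as a risk ratio with a Wald-type 95% confidence interval on the log scale and that the absolute effect is a risk difference; the evidence profiles in this study may have used other measures. The function name and the counts in the example are hypothetical.

```python
import math


def relative_and_absolute_effect(events_exp, total_exp, events_ctrl, total_ctrl, z=1.96):
    """Risk ratio with an approximate 95% CI, plus risk difference.

    Assumption: risk ratio and risk difference are the chosen measures;
    the source does not state which measures its profiles reported.
    """
    if events_exp == 0 or events_ctrl == 0:
        raise ValueError("this simple log-scale CI needs at least one event per group")

    risk_exp = events_exp / total_exp
    risk_ctrl = events_ctrl / total_ctrl

    risk_ratio = risk_exp / risk_ctrl
    # Standard error of ln(risk ratio) for two independent groups
    se_log_rr = math.sqrt(
        1 / events_exp - 1 / total_exp + 1 / events_ctrl - 1 / total_ctrl
    )
    ci_low = math.exp(math.log(risk_ratio) - z * se_log_rr)
    ci_high = math.exp(math.log(risk_ratio) + z * se_log_rr)

    return {
        "risk_ratio": risk_ratio,
        "ci_95": (ci_low, ci_high),
        "risk_difference": risk_exp - risk_ctrl,
    }


# Hypothetical counts, for illustration only (not data from the study)
example = relative_and_absolute_effect(10, 100, 20, 100)
```

With counts like these the confidence interval is wide enough to include no effect, which is the kind of situation in which judgements about sparse data arise.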
Instructions and a form for recording each judgement were included with each example [see Additional file 1]. The judges were instructed to apply the approach without second guessing the information presented in the evidence profile or the approach. They were asked to note problems that they encountered and judgements that did not make sense to them when they adhered to the approach as instructed.
The questions addressed by the examples included:
• Should patients with atrial fibrillation be treated with warfarin or aspirin for prevention of stroke? [3]
• Should patients with pain believed to be due to degenerative arthritis be treated with non-steroidal anti-inflammatory drugs (NSAIDs) or paracetamol? [4]
• Should patients who have had a myocardial infarction be given antiplatelet therapy to reduce all cause mortality? [5]
• Should patients who have had a myocardial infarction be offered exercise rehabilitation? [5]
• Should patients with deep venous thrombosis be treated with Low Molecular Weight Heparin (LMWH) or IV unfractionated heparin for prevention of pulmonary embolism? [6]
• Should antibiotics be used to treat acute maxillary sinusitis? [7]
• Should BCG vaccine be used to prevent tuberculosis? [8]
• Should surgical discectomy be recommended for patients with sciatica due to lumbar disc prolapse? [9]
• Should community water fluoridation be used to reduce dental caries? [10,11]
• Should distribution of child safety seats and education programs be used to increase correct use of child safety seats? [12]
• Should hormone replacement therapy be given to prevent cardiovascular heart disease in healthy post menopausal women? [13]
For each example each person made judgements about:
• the quality of evidence for each outcome, scored as high, intermediate, low, or very low;
• the relative importance of each outcome, scored as critical to the decision (7-9), important but not critical to the decision (4-6), or not important to the decision (1-3);
• the overall quality of all the critical outcomes, scored as high, intermediate, low, or very low;
• the balance between benefits and harms, scored as net benefit, trade offs, uncertain net benefit, or not net benefit; and
• the recommendation, scored as do it, probably do it, toss up, probably don't do it, or don't do it.
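The scales used for these judgements can be written down compactly. The sketch below encodes the response options listed above and maps a 1-9 importance score to its category; the constant and function names are hypothetical and are not part of the recording form used in the study.

```python
# Response options as listed in the text; the names of these constants are illustrative.
QUALITY_LEVELS = ("high", "intermediate", "low", "very low")
BALANCE_OPTIONS = ("net benefit", "trade offs", "uncertain net benefit", "not net benefit")
RECOMMENDATION_OPTIONS = (
    "do it",
    "probably do it",
    "toss up",
    "probably don't do it",
    "don't do it",
)


def importance_category(score: int) -> str:
    """Map a 1-9 relative-importance score to the category described in the text."""
    if not 1 <= score <= 9:
        raise ValueError("importance score must be between 1 and 9")
    if score >= 7:
        return "critical to the decision"
    if score >= 4:
        return "important but not critical to the decision"
    return "not important to the decision"
```

For instance, an outcome scored 8 by a judge falls in the "critical to the decision" band, and only outcomes in that band feed into the judgement about overall quality.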
For each example the judgements made by all 17 people were collected and summarised as illustrated in Table 3.
Disagreements were discussed at a meeting attended by 15 of the 17 judges. Because of a lack of time, the last two examples were discussed at another meeting attended by six of the 17 judges, but all 17 raters provided judgements for all of the 12 examples. For each example, kappa was calculated [14] as a measure of agreement among the 17 graders across the four levels of quality of evidence, over the outcomes in that example (the number of outcomes per example ranged from two to seven). Kappa was also calculated across all outcomes (46) and for the judgements about overall quality of the evidence (12).
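For readers unfamiliar with chance-corrected agreement among more than two raters, the following is a minimal sketch of one common statistic, Fleiss' kappa. This is an assumption made for illustration: the text does not state which multi-rater kappa from [14] was used. The function name and the toy ratings are hypothetical.

```python
from collections import Counter

LEVELS = ("high", "intermediate", "low", "very low")


def fleiss_kappa(ratings, categories=LEVELS):
    """Fleiss' kappa for several raters grading several outcomes.

    ratings[i] is the list of grades that all raters gave to outcome i.
    Every outcome must have been graded by the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each outcome needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over outcomes
    per_subject = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Expected agreement by chance, from the overall share of each category
    shares = [sum(c[k] for c in counts) / (n_subjects * n_raters) for k in categories]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Toy ratings, for illustration only (not data from the study)
perfect = fleiss_kappa([["high"] * 3, ["low"] * 3])            # 1.0
opposed = fleiss_kappa([["high", "low"], ["high", "low"]])     # -1.0
```

The second toy case shows how a negative value can arise: the raters agree less often than the marginal frequencies of the grades would predict by chance.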

Sensibility and understandability
After grading all 12 examples, the judges were asked 16 questions regarding the sensibility and understandability of the approach. Each question consisted of a statement and five response options: strongly disagree, disagree, not sure, agree, and strongly agree. Eleven people completed this questionnaire. The questionnaire was adapted from Feinstein [15] and the 16 statements were:
1. The approach is applicable to different types of interventions, including drugs, surgery, counselling, and community-based interventions.
2. The approach is clear and simple to apply.
3. The information that is needed is generally available.
4. Subjective decisions are generally not needed.
5. All of the components included in each of the five types of judgements should be included.
6. There are not important components that are missing for any of the five types of judgements.
7. The ways in which the components are aggregated for each of the five types of judgements are clear and simple.
8. The ways in which the included components are aggregated are appropriate for each of the five types of judgements.
9. The categories are sufficient to discriminate between different grades for each of the five types of judgements.
10. The approach successfully discriminates between different grades of evidence.
11. The approach successfully discriminates between different grades of recommendations.
12. The overall quality of evidence is clear and understandable.
13. The balance between the benefits and harms is clear and understandable.
14. The recommendation is clear and understandable.
15. The way in which the overall quality of evidence was graded is better than other ways of doing this with which I am familiar.
16. The way in which the recommendation was graded is better than other ways of doing this with which I am familiar.

Results

Quality of evidence for each outcome
The quality of evidence for each outcome, as assessed by the 17 graders, is shown in Table 4. Much of the disagreement was due to a lack of information in the evidence summaries that we prepared based on the information available in the chosen examples. We agreed that the evidence summaries should include footnotes explaining the basis for judgements about study quality, consistency and directness. We also agreed that it was necessary to include information about baseline risk and the setting as part of the background information, since different assumptions about these factors also explained some of the disagreement. It was possible to reach a consensus about the quality of evidence for most outcomes when we discussed our judgements. Of the 48 outcomes that were included across the 12 examples, we were not able to reach a consensus regarding five. The lack of consensus resulted from disagreement about whether there was sparse data for three outcomes and from insufficient information for two outcomes.
We found that in addition to study design, quality, consistency and directness, other quality criteria also influenced judgements about evidence. These additional criteria were sparse data, strong associations, publication bias, dose response, and situations where all plausible confounders strengthened rather than weakened our confidence in the direction of the effect. Consequently, the consistency with which we considered these additional issues was affected, and disagreements regarding the quality of evidence for each outcome were reduced.

Relative importance of each outcome
Specification of outcomes in the question that each example addressed resulted in some confusion regarding the relative importance of each outcome and the overall quality of evidence across outcomes. We therefore agreed that outcomes should not be included in the questions and that all important outcomes should be considered. There was good agreement about the relative importance of the 48 outcomes that were considered. We reached a consensus about the relative importance of all but two of the outcomes. This was due to uncertainty and true disagreement about the importance of these two outcomes, dental fluorosis and bone fractures, in relation to the question about water fluoridation.

Overall quality of important outcomes
There was a lack of agreement about the overall quality of evidence across the critical outcomes for each question (Table 5). This poor agreement reflected an accumulation of disagreements about the quality of evidence and importance of the individual outcomes that were considered for each question. In addition, we found that it did not make sense to downgrade the overall quality of evidence because of lower quality evidence for one of several critical outcomes when all of the outcomes showed an effect in the same direction. We therefore agreed that the overall quality of evidence should be based on the higher quality evidence, rather than the lowest quality evidence, when all of the results are in favour of the same option.
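The decision rule agreed here can be sketched as follows. This is a minimal illustration under two assumptions that go beyond the text: "higher quality" is read as the highest grade among the critical outcomes, and the direction of each result is supplied as a label naming the option it favours. The function and variable names are hypothetical.

```python
# Ordered from worst to best so that list position can serve as a rank
QUALITY_ORDER = ("very low", "low", "intermediate", "high")


def overall_quality(critical_outcomes):
    """Overall quality of evidence across the critical outcomes of one question.

    critical_outcomes: list of (quality, favours) pairs, where `quality` is one
    of QUALITY_ORDER and `favours` names the option the result favours.
    """
    if not critical_outcomes:
        raise ValueError("at least one critical outcome is required")

    ranks = [QUALITY_ORDER.index(quality) for quality, _ in critical_outcomes]
    same_direction = len({favours for _, favours in critical_outcomes}) == 1

    # All results favour the same option: do not downgrade for the weaker outcome.
    # Otherwise fall back to the lowest grade among the critical outcomes.
    chosen = max(ranks) if same_direction else min(ranks)
    return QUALITY_ORDER[chosen]


# Hypothetical outcomes, for illustration only
agreeing = overall_quality([("high", "intervention"), ("low", "intervention")])
conflicting = overall_quality([("high", "intervention"), ("low", "control")])
```

In the first call both critical outcomes favour the same option, so the overall grade follows the stronger evidence; in the second the results point in different directions, so the weaker evidence determines the grade.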
The kappa statistics for each question are shown in Table 6. The number of outcomes per example ranged from two to seven and the kappa ranged from 0 to 0.82. In some instances, the agreement among the graders was slightly worse than by chance, as indicated by the negative kappa values seen in Table 6. The kappa across the 46 outcomes included in the calculation was 0.395 (SE 0.008). Kappa for agreement beyond chance for the 12 final judgements about the quality of evidence was 0.270 (SE 0.015).

Balance between benefits and harms
The graders' assessments of the balance between benefits and harms are shown in Table 7. Agreement was visibly poor. This can, in part, be explained by the accumulation of all the previous differences in grading the quality and importance of the evidence. Some of the judges made assumptions or considered information that was not included in the evidence profiles. When we discussed these judgements, we reached a consensus about the balance between benefits and harms for all but three questions. For one question we found we needed more information. For the second judgement we disagreed about the importance of two of the outcomes. For the third judgement we disagreed about the relative values we attached to the benefits and the harms.

Recommendation
The graders' individual judgements about the recommendations are shown in Table 8. During the discussion, we reached a consensus on a recommendation for the nine examples where we agreed on the balance between benefits and harms. We found that first agreeing on the balance between the benefits and harms clarified our judgements about recommendations and facilitated a consensus. There was not a one-to-one correspondence between our judgements about trade-offs and our judgements about recommendations, because the latter took into account additional considerations.

Sensibility and understandability
Eleven raters provided feedback on the sensibility and understandability of the GRADE system for grading evidence and formulating recommendations. Nine of the 11 respondents agreed or strongly agreed that the judgements about the overall quality of evidence were clear and understandable, and that the judgements about the balance between benefits and harms were clear and understandable using the GRADE approach. Everyone agreed or strongly agreed that the judgements about recommendations were clear and understandable. Eight of the judges agreed or strongly agreed that the GRADE approach to judging the overall quality of evidence was better than other grading systems with which they were familiar. Two disagreed and one was not sure. Eight also agreed that the GRADE approach to formulating recommendations was better than approaches with which the raters were familiar. Three raters were not sure whether the GRADE approach was superior to other approaches to formulating recommendations.
Nine of the 11 respondents agreed or strongly agreed that the GRADE approach was applicable to different types of interventions, and that the approach was clear and simple to apply. Five judges disagreed that the information that is needed is generally available, two were not sure and four agreed. Six of the eleven judges disagreed or strongly disagreed that subjective decisions were generally not needed, four were not sure and one agreed. Ten of the eleven judges agreed or strongly agreed that all the components included in each of the four types of judgements should be included; one judge was not sure. Five of the judges were unsure whether there were important components missing from any of the four types of judgements, one disagreed and three agreed or strongly agreed. Eight judges agreed or strongly agreed that the ways in which the components were aggregated for each of the four types of judgements were clear and simple; three were unsure. Seven judges agreed or strongly agreed that the ways in which the included components were aggregated were appropriate for each of the four types of judgements, two were unsure and two disagreed. Ten of the eleven judges agreed or strongly agreed that the categories were sufficient to discriminate between different grades for each of the four types of judgements; one disagreed. All eleven judges agreed or strongly agreed that the GRADE approach successfully discriminated between different grades of evidence, and between different grades of recommendations.

Discussion
This pilot study of the GRADE approach to grading the quality of evidence and strength of recommendations helped to identify problems with the approach and enabled us to address these. We found that it was possible to resolve most of the disagreements we had when making judgements independently and there was agreement that this approach warrants further development and evaluation.
Many of the disagreements were a direct result of a lack of information. We concluded that there is a need for detailed additional information in evidence profiles, and have modified the evidence profiles accordingly. When we have found an empirical basis or compelling arguments, we have also provided precise definitions. For example, we have agreed on a basis for defining strong and very strong associations. However, in many cases we continue to rely on judgement. We have addressed this by always including the rationale for such judgements in footnotes attached to the evidence profile.
The evidence profiles used in the pilot study were based on systematic reviews [2-13]. Much of the information we found lacking was missing in these original systematic reviews, particularly information about harms and side effects. It was outside of the scope of this study to systematically collect this information. However, systematic reviews of evidence of harms, as well as benefits, are essential for guideline development panels. If reviews, such as Cochrane reviews, are going to meet the needs of guideline development panels, and others making decisions about health care, it is essential that evidence of adverse effects is systematically included in these.
An important benefit of the approach to grading evidence and recommendations that we used in this study is that it clarifies the source of true disagreements, as well as helping to resolve disagreements through discussing each type of judgement sequentially. Judgements about the relative importance of different outcomes and about trade-offs, as well as about the quality of evidence, are made explicitly, rather than implicitly. This facilitates discussion and clarification of these judgements. It may be helpful to guideline panels and others to use this approach before making decisions and recommendations.
The most common source of disagreement that we encountered was differences in what we consider to be sparse data. We have not reached a consensus on a definition of sparse data, but have acknowledged that we have different thresholds and now recognize this when we make judgements about the quality of evidence [16].
As a result of this pilot study, we have been able to make considerable improvements to our system for grading the quality of evidence and strength of recommendations. The evidence profiles used in the pilot study have been modified and now include information that was missing and was found to be an important source of disagreement, as illustrated in Table 9 and Table 10. The criteria used for grading the quality of evidence for each important outcome have also been modified, as summarised in Table 11. Guideline generation includes judgement. Individual, residual judgements will affect the agreement we measured in this study. Thus, lower kappa values are expected. We expect that further refinement of the GRADE system and additional instructions will improve agreement.
Judgements about confidence in evidence and recommendations are complex. The GRADE system represents our current thinking about how to reduce errors and improve communication of these complex judgements. Ongoing developments include:
• Exploring the extent to which the same system should be applied to public health and health policy decisions as well as clinical decisions
• Developing guidance for when and how costs (resource utilisation) should be considered
• Developing guidance for judgements regarding sparse data
• Preparing tools to support the application of the GRADE system
Plans for further development include studies of the reliability and sensibility of this approach and a study comparing alternative ways of presenting these judgements [17]. We invite other organisations responsible for systematic reviews of the effects of healthcare or practice guidelines to work with us to further develop and evaluate the system described here.