
A prospective evaluation of inter-rater agreement of routine medical records audits at a large general hospital in São Paulo, Brazil



The quality of patient medical records is intrinsically related to patient safety, clinical decision-making, communication between health providers, and continuity of care. Additionally, their data are widely used in observational studies. However, the reliability of the information extracted from the records is a matter of concern, and audit processes need to ensure adequate inter-rater agreement (IRA). Thus, the objective of this study is to evaluate the IRA among members of the Patient Health Record Review Board (PHRRB) in routine auditing of medical records, and the impact of periodic discussions of results with raters.


A prospective longitudinal study was conducted between July of 2015 and April of 2016 at Hospital Municipal Dr. Moysés Deutsch, a large public hospital in São Paulo. The PHRRB was composed of 12 physicians, 9 nurses, and 3 physiotherapists who audited medical records monthly, with the number of raters changing throughout the study. PHRRB meetings were held to reach a consensus on rating criteria that the members use in the auditing process. A review chart was created for raters to verify the registry of the patient’s secondary diagnosis, chief complaint, history of presenting complaint, past medical history, medication history, physical exam, and diagnostic testing. The IRA was obtained every three months. The Gwet’s AC1 coefficient and Proportion of Agreement (PA) were calculated to evaluate the IRA for each item over time.


The study included 1884 items from 239 records with an overall full agreement among raters of 71.2%. A significant IRA increase of 16.5% (OR = 1.17; 95% CI = 1.03–1.32; p = 0.014) was found in the routine PHRRB auditing, with no significant differences between the PA and the Gwet’s AC1, which showed a similar evolution over time. The PA decreased by 27.1% when at least one of the raters was absent from the review meeting (OR = 0.73; 95% CI = 0.53–1.00; p = 0.048).


Medical record quality has been associated with the quality of care and could be optimized and improved by targeted interventions. The PA and the Gwet’s AC1 are suitable agreement coefficients that are feasible to be incorporated in the routine PHRRB evaluation process.



Adequate medical record keeping is an essential part of good professional health practice that makes it possible to evaluate and improve the quality of health care. The use of medical records should extend beyond the medical management of patients: adequate records allow for improved coordination and continuity of care; serve as a learning tool; and help prevent and evaluate possible adverse events that may compromise patient safety in hospitals [1, 2].

In 2002 the Brazilian Medical Council established that patient record review commissions are mandatory for health services [3]. However, the resolution does not establish criteria or guidelines for the evaluation or for assessing its reliability. When different raters assign a value to each item being observed, it is important to measure the inter-rater reliability (IRR), which is closely related to the inter-rater agreement (IRA) [4, 5]. Some review studies assessing adverse events have been shown to suffer from poor to moderate IRR [6, 7]. In addition, IRR is rarely described or discussed in research papers based on data extracted from medical records, and there are no standard methods for assessing IRR [8]. Moreover, time constraints and work overload are frequent situations faced by health staff performing tasks involving data management, resulting in low data quality that can affect managerial decision-making [2]. Therefore, the evaluation of suitable methods for data extraction from this source is essential [9].

When such studies employ multiple raters it is important to have a strategy to document adequate levels of agreement between them, and the Cohen’s Kappa coefficient (κ) is a well-known measure [10]. However, it is affected by the skewed distributions of categories (the prevalence paradox) and by the degree to which raters disagree (the bias problem) [11, 12].

To correct those limitations, Kilem Li Gwet proposed a new agreement coefficient which can be used with any number of raters and requires a simple categorical rating system [13, 14].

The objective of this study was to evaluate the IRA of routine medical record audits and the impact of periodic discussions among raters in a large general hospital. The study also aimed to compare the estimates of the percent agreement (PA) with the Gwet’s agreement coefficient (AC1), to identify possible factors associated with the PA, and to assess whether agreement among the auditors is associated with the adequacy of the evaluated items.


Population and setting

This was a prospective longitudinal study conducted between July of 2015 and April of 2016 at the Hospital Municipal Dr. Moysés Deutsch (HMMD). HMMD is a large public general hospital (300 beds) located in the southern zone of the city of São Paulo, Brazil, an impoverished region encompassing approximately 600,000 inhabitants. The present study was part of a larger intervention aimed at improving the quality of patient care through a tailored integration strategy among health facilities in its Regional Health Care Network [15].

Audit of medical records and review meetings

The HMMD maintains a routine auditing process that includes 13% of all medical records of patients discharged in the previous month, carried out by the Patient Medical Record Review Board (PMRRB). The PMRRB was composed of 24 nominated health professionals: 12 physicians; 9 nurses; and 3 physiotherapists. Half of them had been staff coordinators for between two and eight years. The auditors had an average of 14 years of professional experience, and 66.7% were women. The audit is a time-consuming procedure because it competes with the patient-care tasks that these professionals are responsible for. Consequently, audits were commonly carried out by each PMRRB member in isolation from other members, without any alignment of criteria for rating the items on the audit chart, which compromised the quality of the entire auditing process. In addition, the patients’ medical charts were selected in a non-random way, so the results lacked adequate representativeness, which compromised the accuracy and generalizability of these data.

The planned intervention used the Lean Six Sigma methodology, which is widely used to add value in several HMMD quality improvement processes and is already part of the work culture among its professionals [15].

The proposed actions included adding at least one team leader from each HMMD clinical department, preferably its coordinator, to the PMRRB, which increased its membership and reduced the number of medical charts to be reviewed by each member. The audit chart was refined by all members through discussions about the relevance of the information that should be registered by their health teams, answering the question: “Which information cannot be missed in the patient’s medical record?” The chosen items were then discussed to define the criteria to determine a rating as adequate, inadequate, incomplete, or not applicable (Table 1). For each item, a consensus was reached about its content as follows: Secondary diagnosis was considered adequate if it was registered at any time during hospitalization. A complete medical history should be rated as adequate only if the chief complaint, history of presenting complaint, past medical history, and medication history were present at the patient’s admission. A physical exam was adequate if registered by a physician and encompassing a general and specific examination. Diagnostic testing was considered adequate if the results were transcribed, not merely checked as done. In the discussion meetings, all members were trained, and medical charts were presented on a screen, allowing all members to rate each chosen item by raising colored cards: green (adequate), red (inadequate), or yellow (not applicable). Disagreements were discussed to reach a consensus. The medical records were written as unstructured text.

Table 1 Audited items

Finally, the patient medical records were randomly selected, weighted by the discharge proportion of each department.
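One way to read the department-weighted random selection is as proportional allocation followed by simple random sampling within each department. The sketch below illustrates that reading only; the department names, discharge counts, total sample size, and the largest-remainder rounding rule are all assumptions and are not reported in the study.

```python
import random


def allocate_sample(discharges, total_sample):
    """Split total_sample across departments in proportion to discharges.

    Uses largest-remainder rounding (an assumption) so the allocations
    always add up to total_sample.
    """
    total = sum(discharges.values())
    exact = {d: total_sample * n / total for d, n in discharges.items()}
    alloc = {d: int(x) for d, x in exact.items()}
    leftover = total_sample - sum(alloc.values())
    # Give the remaining units to the departments with the largest fractions.
    by_fraction = sorted(exact, key=lambda d: exact[d] - alloc[d], reverse=True)
    for d in by_fraction[:leftover]:
        alloc[d] += 1
    return alloc


def draw_records(records_by_department, allocation, seed=None):
    """Draw a simple random sample of record ids within each department."""
    rng = random.Random(seed)
    return {
        d: rng.sample(records_by_department[d], allocation[d])
        for d in allocation
    }


# Hypothetical monthly discharges per department (illustrative only).
example_discharges = {"medicine": 500, "surgery": 300, "maternity": 200}
example_allocation = allocate_sample(example_discharges, 100)
```

With these invented counts, a department contributing half of the discharges receives half of the audited records.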

The number of raters, which varied throughout the study, is shown in Table 2.

Table 2 Number of audited medical records and raters over time

Every three months during the study period, in addition to the routine audits, five to six medical records were randomly allocated to the same two or three independent raters of the same professional category to evaluate the IRA. The study also included review meetings conducted every three months to align assessment criteria based on the results of the IRA evaluation and the auditing processes.

Statistical analysis

The Gwet’s AC1 and PA were calculated to evaluate the IRA for each item over time and were compared through line graphs including 95% confidence intervals (CIs). The 95% CIs for the Gwet’s AC1 were calculated as described by Gwet [14], while the PA was modelled by generalized estimating equations (GEE), without an intercept [16, 17]. The agreement measures were interpreted following the categories proposed by Altman [18].
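To make the two agreement measures concrete, the following minimal sketch computes the proportion of records with full agreement and the usual multi-rater form of Gwet’s AC1 for one audited item. It is not the authors’ R code: the function names and example ratings are invented, and restricting the calculation to records rated by at least two raters is an assumption. For each record, pairwise agreement is the share of rater pairs giving the same category; chance agreement is the sum of π(1 − π) over categories divided by (q − 1), where π is the mean share of ratings in a category and q is the number of categories.

```python
from collections import Counter


def full_agreement_proportion(ratings):
    """Share of records on which every rater gave the same rating."""
    rated = [r for r in ratings if len(r) >= 2]
    return sum(len(set(r)) == 1 for r in rated) / len(rated)


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for any number of raters and nominal categories.

    ratings: one list per record, holding the category chosen by each rater.
    categories: all possible categories (at least two).
    """
    rated = [r for r in ratings if len(r) >= 2]
    n = len(rated)
    q = len(categories)
    pairwise = 0.0
    share = {c: 0.0 for c in categories}
    for r in rated:
        counts = Counter(r)
        m = len(r)
        # Fraction of rater pairs that agree on this record.
        pairwise += sum(k * (k - 1) for k in counts.values()) / (m * (m - 1))
        for c in categories:
            share[c] += counts[c] / m
    p_a = pairwise / n
    # Chance agreement based on the average share of each category.
    p_e = sum((share[c] / n) * (1 - share[c] / n) for c in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Invented ratings of one item on four records by two raters.
example_ratings = [
    ["adequate", "adequate"],
    ["adequate", "inadequate"],
    ["adequate", "adequate"],
    ["inadequate", "inadequate"],
]
example_pa = full_agreement_proportion(example_ratings)
example_ac1 = gwet_ac1(example_ratings, ["adequate", "inadequate"])
```

With two raters, full agreement and pairwise agreement coincide; with three raters they differ, because a record with one dissenting rater still contributes partial pairwise agreement to AC1 but counts as disagreement in the full-agreement proportion.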

Logistic GEE was used to model the PA of all raters over time, using the values of 1 for full agreement and 0 for some disagreement. Two designs were considered: combining all items, to attain global associations; and considering each item individually, to obtain more details. The analyses employed an exchangeable working correlation matrix, and items in a single medical record were considered to be correlated. The model included as independent variables: professional category, review meeting attendance, and time (audits 1 to 4). A forward stepwise approach was used for variable selection employing a p-value less than or equal to 0.200 in the unadjusted model, and less than or equal to 0.050 in the multiple-variable model.
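The model described in this paragraph can be written out as a hedged sketch. All symbols are illustrative assumptions: Y_ij equals 1 when all raters fully agree on item j of record i, nurses are taken as the reference category because the reported comparisons are against nurses, and time is shown as a single linear term for audits 1 to 4, which the text does not state explicitly.

```latex
\operatorname{logit}\,\Pr(Y_{ij} = 1)
  = \beta_0
  + \beta_1\,\mathrm{time}_i
  + \beta_2\,\mathrm{absent}_i
  + \beta_3\,\mathrm{physician}_i
  + \beta_4\,\mathrm{physiotherapist}_i,
\qquad
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \alpha \quad (j \neq k),
\qquad
\mathrm{OR}_m = e^{\beta_m}.
```

Here α is the single within-record correlation implied by the exchangeable working correlation matrix, and each reported odds ratio corresponds to the exponential of one coefficient.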

To measure the association between the agreement and the adequacy of the items the Spearman correlation coefficient was applied. The agreement was measured as PA. The adequacy was measured as the percentage of “adequate” evaluations. Both were considered by item and time.
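The Spearman coefficient used here is the Pearson correlation of the ranks of the two series. The sketch below shows that calculation with average ranks for ties; the function names and the example values are invented for illustration and are not the study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented PA and adequacy percentages, one pair per item and audit.
example_pa = [0.60, 0.70, 0.80, 0.90]
example_adequacy = [50.0, 60.0, 70.0, 90.0]
example_rho = spearman(example_pa, example_adequacy)
```

Because only ranks enter the calculation, the coefficient captures whether items with higher agreement also tend to have higher adequacy, without assuming a linear relationship.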

The analyses were performed with the R software version 3.2.2 [19] with geepack [20].


The study included 1884 items from 239 records with an overall full agreement among raters of 71.2%. The estimated mean PA was larger than the Gwet’s AC1 for all audited items (Fig. 1); however, these differences were not statistically significant, and the two agreement coefficients evolved similarly throughout the study period. Although a positive trend was found in the agreement of almost all items, their CIs did not indicate any statistically significant change over time. Additionally, the two coefficient estimates grew closer as the agreement increased. During the study period, the item with the highest agreement was “chief complaint,” while the one with the lowest was “secondary diagnosis.”

Fig. 1 Estimated percent agreement (PA) and Gwet’s AC1 by audited item. CI: Confidence interval. AC: Agreement coefficient

The logistic GEE model that included all items (Table 3) found a statistically significant increase of 17% over time for the PA, but when at least one of the raters was absent from the review meeting, the PA decreased by 27%. Physiotherapists and physicians showed higher PA when compared to nurses.
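The percentages quoted from the GEE model appear to be changes in the odds of full agreement, obtained as (OR − 1) × 100, rather than changes in percentage points. The short sketch below shows that conversion and, under the purely illustrative assumption that the overall 71.2% full agreement serves as a baseline, what an odds ratio implies for the agreement proportion itself. The function names are invented.

```python
def odds_ratio_to_percent_change(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def apply_odds_ratio(baseline_proportion, odds_ratio):
    """Proportion obtained after multiplying the baseline odds by odds_ratio."""
    odds = baseline_proportion / (1.0 - baseline_proportion) * odds_ratio
    return odds / (1.0 + odds)


# Odds ratios reported for time and for absence from the review meeting.
change_over_time = odds_ratio_to_percent_change(1.17)
change_when_absent = odds_ratio_to_percent_change(0.73)

# Illustrative only: 71.2% is the overall agreement, not a model baseline.
agreement_after_one_step = apply_odds_ratio(0.712, 1.17)
```

Under that illustrative baseline, a 17% increase in the odds moves the agreement proportion from 71.2% to roughly 74%, which is a much smaller shift than 17 percentage points.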

Table 3 Estimated odds ratios (OR) for percent agreement. N = 1884 items from 239 records

In the analysis by item, there was a non-significant positive trend for higher PA for “history of presenting complaint” while physicians presented a significantly higher PA over time when compared to nurses for “secondary diagnosis,” “medication history,” and “diagnostic testing.” Physiotherapists presented a significantly higher PA over time when compared to nurses for “medication history.” Finally, when at least one of the raters was absent from the review meeting the PA decreased by 60.5% for “diagnostic testing” (Table 4).

Table 4 Estimated odds ratios (OR) of percent agreement by item. N = 239 records

The average adequacy of the items assessed in the first audit was 73.3%, increasing to 78.2% in the second audit, 76.3% in the third, and then falling to 72.1% in the fourth audit. Regardless of the time and type of item, when comparing the PA value with the percentage of adequacy, a Spearman correlation coefficient of 0.72 was found (p-value < 0.001).


A significant increase in the IRA among PHRRB members was found over time in routine medical record auditing processes when periodic evaluations of the agreement were performed and discussed by them. Supporting this finding, the absence of a member in a review meeting had a negative impact on the PA. In addition, the PA and the Gwet’s AC1 were comparable and presented a similar evolution over time. Complete medical history was a composite of chief complaint, history of complaint, past medical history, and medication history. It was considered adequate if all of them were complete. Thus, it showed a positive evolution in both PA and Gwet’s AC1 over time from moderate to substantial according to Altman’s categories [18]. Only the IRA of secondary diagnosis remained moderate. These findings may indicate the raters’ learning curve regarding the positive evolution of some variables across agreement ranges. Nevertheless, the degree of agreement is arbitrary, making it impossible to define an acceptable level [5]. Thus, the interpretation of these IRA values follows the main study objective, i.e., the raters’ concordance in a particular category.

The greater IRA among physicians and physiotherapists when compared to nurses may reflect some inconsistency across the evaluations that can be attributed to the raters’ selection, training, and accountability [5], and could be influenced by a misunderstanding about rating the “complete history” item.

The strategy applied to the IRA was feasible in this real-world scenario, adding value to the auditing process and providing more accurate information that can be used by health leadership. The use of PA and Gwet’s AC1 for that purpose was successful because they demand a relatively small sample of patient medical records (PMRs) to be audited by each rater and can provide two data consistency measures [5, 21]. Both indices reached acceptable levels of agreement [18, 22] for the purposes of the study.

Following and evaluating the progress of the agreement among raters of PMRs allows for setting up goals and identifying associated factors to improve the audit processes. Previously proposed models worked with continuous variables [23] or with the Kappa coefficient [24], whereas the use of PA and Gwet’s AC1 made it possible to model the agreement of more than two raters over time.

The increased IRA highlights the need for more careful planning and evaluation of medical record audits since this activity is closely related to health care quality and patient safety improvement efforts [8, 9].

Since the present study was conducted under real-world conditions and included different health providers as raters, this intervention has the potential to be applicable in other similar settings. It should be taken into consideration, however, that it was carried out in only one hospital that has a culture of evidence-based improvement interventions, and during a short-term follow-up. This study did not include an evaluation of the impact on the quality of medical records, although that should be the final goal of any routine audit. There was a strong association between agreement and adequacy of information registered in the patient health records, and although the study was conducted between 2015 and 2016, these results are still relevant given the lack of studies evaluating data quality of medical records auditing.

Furthermore, the literature on the quality of medical record keeping and IRA or IRR is scarce, as reflected by the fact that no reviews on the subject could be identified, which makes the results of this study relevant to improve the body of knowledge in the era of data-driven institutions and big data from patient health records.


Medical record quality has been associated with the quality of care and could be optimized and improved by targeted interventions. The PA and the Gwet’s AC1 are suitable agreement coefficients that are feasible to be incorporated in the routine PHRRB evaluation process.

Availability of data and materials

The dataset supporting the conclusions of this article is available to researchers who want to explore the data. To request, please send an email to



Abbreviations

AC: Agreement coefficient
CI: Confidence interval
GEE: Generalized estimating equations
HMMD: Hospital Municipal Dr. Moysés Deutsch
IRA: Inter-rater agreement
IRR: Inter-rater reliability
OR: Odds ratio
PA: Proportion of agreement
PHRRB: Patient’s health record review board


  1. Pirkle CM, Dumont A, Zunzunegui M-V. Medical recordkeeping, essential but overlooked aspect of quality of care in resource-limited settings. Int J Qual Health Care. 2012;24(6):564–7.

  2. Zegers M, de Bruijne MC, Spreeuwenberg P, Wagner C, Groenewegen PP, van der Wal G. Quality of patient record keeping: an indicator of the quality of care? BMJ Quality Safety. 2011;20(4):314–8.

  3. Conselho Federal de Medicina. Resolução n° 1638. Diário Oficial União n° 153, seção 1, 09/08/2002, p. 184–5. Available: [Accessed 30 Dec 2019].

  4. Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Res Soc Adm Pharm. 2013;9:330–8.

  5. Bajpai S, Bajpai R, Chaturvedi HK. Evaluation of inter-rater agreement and inter-rater reliability for observational data: an overview of concepts and methods. J Indian Academy Applied Psychol. 2015;41(3):20–7.

  6. Lilford R, Edwards A, Girling A, Hofer T, Di Tanna GL, Petty J, et al. Inter-rater reliability of case-note audit: a systematic review. J Health Serv Res Policy. 2007;12(3):173–80.

  7. Thomas EJ, Lipsitz SR, Studdert DM, Brennan TA. The reliability of medical record review for estimating adverse event rates. Ann Intern Med. 2002;136(11):812–6.

  8. Yawn BP, Wollan P. Interrater reliability: completing the methods description in medical records review studies. Am J Epidemiol. 2005;161(10):974–7.

  9. Liddy C, Wiens M, Hogg W. Methods to achieve high interrater reliability in data collection from primary care medical records. Ann Fam Med. 2011;9:57–62.

  10. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.

  11. Zec S, Soriani N, Comoretto R, Baldi I. High Agreement and High Prevalence: The Paradox of Cohen’s Kappa. Open Nurs J. 2017;11(Suppl-1, M5):211–8.

  12. Eugenio BD, Glass M. The kappa statistic: a second look. Computational Linguistics. 2004;30(1):95–101.

  13. Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen’s kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013;13:61.

  14. Gwet KL. Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. 4th ed. Gaithersburg, MD: Advanced Analytics, LLC; 2014.

  15. Bracco MM, Mafra ACCN, Abdo AH, Colugnati FAB, Dalla MDB, Demarzo MMP, et al. Implementation of integration strategies between primary care units and a regional general hospital in Brazil to update and connect health care professionals: a quasi-experimental study protocol. BMC Health Serv Res. 2016;16:380.

  16. Prentice RL, Zhao LP. Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics. 1991;47(3):825–39.

  17. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.

  18. Altman DG. Practical statistics for medical research. 1st ed. London: Chapman and Hall; 1991.

  19. R Core Team (2019). R: a language and environment for statistical computing. R Foundation for Statistical Computing. Vienna. Available: [Accessed 30 Dec 2019].

  20. Højsgaard S, Halekoh U, Yan J. The R Package geepack for Generalized Estimating Equations. J Statistical Software. 2005;15:2.

  21. Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med. 1998;17(1):101–10.

  22. Stemler SE. A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation. 2004;9:4.

  23. Hill EG, Slate EH. A semi-parametric Bayesian model of inter- and intra-examiner agreement for periodontal probing depth. Ann Appl Stat. 2014;8(1):331–51.

  24. Williamson JM, Lipsitz SR, Manatunga AK. Modeling kappa for measuring dependent categorical agreement data. Biostatistics. 2000;1(2):191–202.

  25. PlataformaBrasil. Available: [Accessed 15 Apr 2019].



We thank all the MRRC members who conducted the record audits and supported the process, the members of the archiving sector, as well as the hospital leadership.


This work was funded by the Brazilian Ministry of Health and São Paulo State Research Foundation through the Research Program for the Unified Health System-PPSUS grant 2012/51228–9. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.




ACCNM conceptualized the study design, drafted the initial manuscript, carried out the random sampling, the statistical analysis, and revised the manuscript. RRST and EA elaborated and operationalized the intervention, contributed to the study design and reviewed the manuscript. FABC contributed to the study design and reviewed the manuscript. JLM and GP contributed to the interpretation of data for the work and revised the manuscript critically for important intellectual content. MMB elaborated the study design, operationalized the intervention, drafted and revised the manuscript. All authors approved the final manuscript as submitted.

Corresponding author

Correspondence to Ana Carolina Cintra Nunes Mafra.

Ethics declarations

Ethics approval and consent to participate

Data from medical records were accessed only by PMRRB members as part of the routine audit process. For this research, patient data were not used, only the results of the medical record evaluations, and therefore no consent forms were applied. The hospital director permitted the use of this information for scientific research. The study was approved by the research ethics committee of the São Paulo Municipal Health Department and of the partner institutions (26981514.3.0000.0086) [25].

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Mafra, A.C.C.N., Miraglia, J.L., Colugnati, F.A.B. et al. A prospective evaluation of inter-rater agreement of routine medical records audits at a large general hospital in São Paulo, Brazil. BMC Health Serv Res 20, 638 (2020).
