Improved accuracy of co-morbidity coding over time after the introduction of ICD-10 administrative data

Background: Co-morbidity information derived from administrative data needs to be validated to allow its regular use. We assessed the evolution of the accuracy of coding for Charlson and Elixhauser co-morbidities at three time points over a 5-year period, following the introduction of the International Classification of Diseases, 10th Revision (ICD-10), for coding hospital discharges.

Methods: Cross-sectional time trend evaluation study of coding accuracy, using hospital chart data of 3'499 randomly selected patients who were discharged in 1999, 2001 and 2003 from two teaching and one non-teaching hospital in Switzerland. We measured sensitivity, positive predictive value and Kappa for agreement between administrative data coded with ICD-10 and chart data as the 'reference standard' for recording 36 co-morbidities.

Results: For the 17 Charlson co-morbidities, the median (min-max) sensitivity was 36.5% (17.4-64.1) in 1999, 42.5% (22.2-64.6) in 2001 and 42.8% (8.4-75.6) in 2003. For the 29 Elixhauser co-morbidities, the sensitivity was 34.2% (1.9-64.1) in 1999, 38.6% (10.5-66.5) in 2001 and 41.6% (5.1-76.5) in 2003. Between 1999 and 2003, sensitivity estimates increased for 30 co-morbidities and decreased for six; the increase was statistically significant for six conditions and the decrease for one. Kappa values increased for 29 co-morbidities and decreased for seven.

Conclusions: The accuracy of administrative data in recording clinical conditions improved slightly between 1999 and 2003. These findings are relevant to all jurisdictions introducing new coding systems, because they demonstrate a phenomenon of improved administrative data accuracy that may relate to a coding 'learning curve' with the new coding system.

Studies have shown variable accuracy across jurisdictions. Relatively little is known about trends over time in the accuracy of coding and, in particular, about coding "learning curves" after the introduction of a new coding system. Canada, for example, introduced the International Classification of Diseases, 10th Revision (ICD-10) [17] between 2002 and 2007 [18], with each province doing so on a different schedule, and no information was available on the possible existence of coding learning curve effects on data validity. One study assessed the accuracy one year after the introduction in Alberta (Canada) [19], but did not investigate time trends. ICD-10 was introduced in Switzerland in 1998. This study assesses data accuracy in the five years following the implementation of ICD-10 in the French part of Switzerland, with assessments of administrative data accuracy relative to charts in 1999, 2001 and 2003. Our primary research question was whether there was evidence of improved data accuracy.

Study Population
This time trend evaluation study was based on three cross-sectional analyses of randomly selected administrative discharge records from three Swiss hospitals, collected in 1999, 2001 and 2003. Two teaching hospitals and one non-teaching hospital were included. Each of the two teaching hospitals has more than 1000 beds and over 30'000 discharges per year, while the non-teaching hospital has 300 beds and 12'000 discharges per year. In all three hospitals, the data were used for reimbursement at the start of the study. Professional coders were nurses with at least 5 years of clinical practice and at least one year of training as a coder. Professional coders were introduced before 1999 in one teaching hospital and between 1999 and 2001 in the other teaching hospital and the non-teaching hospital. We randomly selected 500 records among patient discharges in each of 1999, 2001 and 2003 from each hospital, thus collecting 1'500 records per hospital and 4'500 records in total. We included patients 16 years of age or older who stayed in the hospital for at least 24 hours and were discharged from any acute care ward of these hospitals. Of the 4'500 randomly selected patients, we excluded 1'001 for the following reasons: eight because they left the hospital against medical advice, three because they were less than 16 years old, 11 because their length of stay (LOS) was less than 24 hours, and 979 because their charts could not be located.
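As an illustration, the sampling procedure can be sketched as follows; the data frame `discharges` and its columns ('hospital', 'year', 'age', 'los_hours', 'left_against_advice') are hypothetical stand-ins for the hospitals' discharge files, and exclusions are applied after sampling, as in the study.

```python
# Minimal sketch of the stratified random sampling and inclusion criteria
# described above; all column names are hypothetical.
import pandas as pd

def draw_study_sample(discharges: pd.DataFrame,
                      n_per_stratum: int = 500,
                      seed: int = 42) -> pd.DataFrame:
    """Draw 500 discharge records per hospital and year, then apply the
    study's inclusion criteria (age >= 16, LOS >= 24 h, regular discharge)."""
    sampled = (discharges
               .groupby(["hospital", "year"], group_keys=False)
               .apply(lambda g: g.sample(n=n_per_stratum, random_state=seed)))
    return sampled[(sampled["age"] >= 16)
                   & (sampled["los_hours"] >= 24)
                   & (~sampled["left_against_advice"])]
```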
The study was approved by the ethics committees of the three cantons.

Chart Data Abstraction
We identified charts through the chart number recorded in the administrative data. Chart abstraction was performed by two trained research nurses who read the entire chart, including admission and transfer notes, physician daily progress notes and orders, consultation notes, operative notes, diagnostic imaging and exam reports, discharge and narrative summaries, and pathology reports. One research nurse undertook the data abstraction for each chart; the detailed chart review took approximately 45 minutes per chart on average. The reviewers extracted information on patient age and sex, length of stay, death, and Charlson [20] and Elixhauser [21] co-morbidities. The presence or absence of these co-morbidities was determined using the definitions described by Charlson et al. [20] and a chart abstraction instrument developed by a Canadian research team [19] for determining the Elixhauser co-morbidities [21]. All diagnoses corresponding to the definition of each non-redundant co-morbidity from the Charlson and Elixhauser indices were identified by the two research nurses in the medical records. We excluded the condition 'Other neurological disorders' from the chart review because its definition is too broad and vague. The inter-rater agreement between the two reviewers was assessed on 50 charts before the chart abstraction process; Kappa values ranged from -0.04 to 1.00. Of the 30 co-morbidities assessed, 16 had substantial agreement (Kappa: 0.60-0.79), 10 had moderate agreement (Kappa: 0.40-0.59), and 4 had fair or poor agreement (Kappa < 0.40) [22]. We further trained the reviewers through discussion of the results of this agreement test. Chart data abstraction was performed between 2005 and 2007.
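For illustration, the inter-rater agreement check corresponds to a standard unweighted Cohen's Kappa per co-morbidity. The sketch below uses scikit-learn (the paper does not state which software was used for this step) and illustrative ratings rather than the actual 50-chart data.

```python
# Minimal sketch of the inter-rater agreement test for one co-morbidity:
# two reviewers' present/absent judgements on the same charts.
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # illustrative values only
reviewer_b = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's Kappa: {kappa:.2f}")
```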

Defining Co-morbidities in Administrative Data
Charlson [20] and Elixhauser [21] indices were derived using ICD-10 coding algorithms. Each administrative hospital discharge record contains a unique personal identification number, a patient chart number, and up to 30 diagnoses in one teaching hospital or up to 10 diagnoses in the other teaching hospital and in the non-teaching hospital. We used a recently developed ICD-10 coding algorithm [23] to define the 36 non-redundant Charlson and Elixhauser co-morbidities in these administrative data.
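Conceptually, the coding algorithm maps each record's list of ICD-10 diagnosis codes to a set of binary co-morbidity flags. The sketch below shows the idea with a few illustrative code prefixes; it is not the full algorithm of reference [23].

```python
# Minimal sketch of deriving co-morbidity flags from ICD-10 codes.
# The prefixes are illustrative excerpts, not the complete algorithm [23].
ICD10_PREFIXES = {
    "myocardial_infarction": ("I21", "I22", "I25.2"),
    "congestive_heart_failure": ("I50",),
    "diabetes_without_complication": ("E10.9", "E11.9"),
}

def comorbidity_flags(diagnosis_codes):
    """Return binary co-morbidity indicators for one discharge record."""
    return {name: any(code.startswith(p)
                      for code in diagnosis_codes for p in prefixes)
            for name, prefixes in ICD10_PREFIXES.items()}

flags = comorbidity_flags(["I21.0", "E11.9", "J45"])
# -> {'myocardial_infarction': True, 'congestive_heart_failure': False,
#     'diabetes_without_complication': True}
```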

Matching Comparisons between ICD-10 Data and Chart Review Data
Comparisons were based on matching ICD-10 data and chart review data for the same hospital discharges, within each hospital sample and study year, using the unique personal identification number. As explained above, we developed two comparative databases, one from ICD-10 data and the other from chart review. For each hospital discharge, we matched the chart review data with the non-redundant Charlson and Elixhauser co-morbidities identified using ICD-10. We then performed pairwise analyses of ICD-10 data and chart review data for all hospital discharges in our study and for each non-redundant Charlson and Elixhauser co-morbidity.
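In code, this matching step amounts to an inner join of the two databases on the unique identifier. The sketch below assumes one row per discharge, a hypothetical 'patient_id' key, and toy data for a single co-morbidity.

```python
# Minimal sketch of pairing ICD-10-derived and chart-derived indicators.
import pandas as pd

icd10 = pd.DataFrame({"patient_id": [1, 2, 3], "diabetes": [True, False, False]})
chart = pd.DataFrame({"patient_id": [1, 2, 3], "diabetes": [True, True, False]})

paired = icd10.merge(chart, on="patient_id", how="inner",
                     suffixes=("_icd", "_chart"))
# Each row now holds both indicators for one discharge, which is the
# input to the pairwise (2x2) analyses described in the next section.
```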

Statistical Analysis
The prevalence of the Charlson and Elixhauser co-morbidities in administrative data and chart review, as well as their difference (i.e. the prevalence difference, Δ_Chart-ICD), were assessed for each year and co-morbidity. Ninety-five percent confidence intervals for Δ_Chart-ICD were calculated using the formula for the comparison of two binomial proportions, accounting for sampling weights. The heterogeneity in prevalence differences across the three years was assessed with the Cochran Q statistic for each co-morbidity [24].
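As a simplified illustration of the prevalence difference, the sketch below uses the standard Wald formula for two binomial proportions; it omits the sampling weights used in the actual analysis, treats the two proportions as independent (although they come from the same patients), and the counts shown are invented.

```python
# Minimal sketch of Delta_Chart-ICD with an approximate 95% CI
# (unweighted Wald formula for two binomial proportions).
import math

def prevalence_difference(x_chart: int, x_icd: int, n: int):
    p_chart, p_icd = x_chart / n, x_icd / n
    delta = p_chart - p_icd
    se = math.sqrt(p_chart * (1 - p_chart) / n + p_icd * (1 - p_icd) / n)
    return delta, (delta - 1.96 * se, delta + 1.96 * se)

# Illustrative counts: a condition found in 466 charts but only 291
# ICD-10 records among the same n = 1166 discharges.
delta, ci = prevalence_difference(x_chart=466, x_icd=291, n=1166)
```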
The accuracy of the administrative data was determined using chart data as the 'reference standard'. We calculated the sensitivity (Se) and positive predictive value (PPV) for each co-morbidity and year. We assessed the impact of study year using logistic regression analysis for survey sampling and a global Wald test. The dependent variable was sensitivity, and study year was a covariate (the reference year was 1999, and odds ratios were estimated for 2001 and 2003, respectively). A P-value < 0.05 was considered significant.
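With chart data as the reference standard, sensitivity and PPV for one co-morbidity follow directly from the paired indicators; a minimal sketch:

```python
# Minimal sketch of Se and PPV for one co-morbidity, with chart review
# as the reference standard. `chart` and `icd` are paired booleans,
# one element per discharge.
def sensitivity_ppv(chart, icd):
    tp = sum(c and i for c, i in zip(chart, icd))       # in chart, coded
    fn = sum(c and not i for c, i in zip(chart, icd))   # in chart, not coded
    fp = sum(i and not c for c, i in zip(chart, icd))   # coded, not in chart
    se = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return se, ppv

se, ppv = sensitivity_ppv(chart=[1, 1, 0, 1, 0], icd=[1, 0, 0, 1, 1])
# se = 2/3 and ppv = 2/3 for these illustrative values
```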
As a complementary perspective, we also calculated Cohen's Kappa, along with its 95% confidence interval, to assess the agreement between ICD-10 and chart review data for each co-morbidity and year. These 95% confidence intervals were used to assess pairwise differences between the indices [25,26]. No multiple testing adjustments were performed [27]. Finally, the co-morbidity counts, determined for both Charlson and Elixhauser items, for the ICD-10 and chart review data were grouped into four ordinal categories: 0, 1, 2 or 3, and 4 or higher for the Charlson index, and 0, 1 or 2, 3-5, and 6 or higher for the Elixhauser index. A weighted Kappa between ICD-10 and chart review data was calculated for each year, with agreement weights 1 − |i−j|/(4−1), where i and j index the rows and columns of the four categories.
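For the ordinal count categories, the weighting 1 − |i−j|/(4−1) corresponds to the standard linearly weighted Kappa. The sketch below uses scikit-learn as an illustration (the paper's own computation was done in SAS), with invented category labels.

```python
# Minimal sketch of the weighted Kappa on the four ordinal categories
# (labelled 0-3); weights="linear" applies the 1 - |i-j|/(k-1) scheme, k = 4.
from sklearn.metrics import cohen_kappa_score

icd_category = [0, 1, 1, 2, 3, 0, 2]    # illustrative labels only
chart_category = [0, 1, 2, 2, 3, 1, 2]

kw = cohen_kappa_score(icd_category, chart_category, weights="linear")
print(f"Weighted Kappa: {kw:.2f}")
```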
All statistical analyses were performed using SAS software, version 9.2 (SAS Institute Inc., Cary, NC, USA).

Results
We analyzed 3'499 patient records (77.8% of 4'500). The mean age was 58 years. The mean (standard deviation) number of diagnoses coded was 4.01 (2.8) across all hospitals. The mean number of diagnoses differed between the three hospitals (p < 0.001): 4.99 (2.8) in the first teaching hospital, where up to 30 diagnoses could be coded, 3.29 (2.7) in the second teaching hospital and 3.33 (2.4) in the non-teaching hospital, each of which could code up to 10 diagnoses. Table 1 presents the prevalence of the 36 co-morbidities by data source across study years. In general, administrative data underreported 33 conditions but reported levels of moderate and severe liver disease, any tumour and lymphoma similar to chart review data. The prevalence difference, Δ_Chart-ICD, ranged from -0.5% to 15.1% in 1999, from -0.6% to 16.1% in 2001, and from -1.7% to 12.2% in 2003. The trend in Δ_Chart-ICD across the three years was significant for 33 co-morbidities. No significant differences were observed for myocardial infarction, any tumour and metastatic cancer.
Indicators of the accuracy of administrative data and of the agreement between chart review data and administrative data by study year are presented in Table 2. Out of 36 conditions, sensitivity increased between 1999 and 2003 for the following 30 conditions: cardiac arrhythmias*, myocardial infarction, peripheral vascular disease, pulmonary circulation disorders, valvular disease*, cerebrovascular disease, hemiplegia or paraplegia, hypertension*, diabetes without complication, hypothyroidism, peptic ulcer, peptic ulcer excluding bleeding, liver disease, mild liver disease, moderate and severe liver disease, any tumour, metastatic cancer*, solid tumour without metastasis, blood loss anaemia, deficiency anaemia, fluid electrolytic disorder*, weight loss, obesity*, alcohol abuse, drug abuse, dementia, psychosis and depression (the increase was statistically significant for the six conditions marked with an asterisk). A decrease was observed for the following six conditions: diabetes with complications, lymphoma, renal failure, rheumatic disease*, AIDS/HIV and coagulopathy (statistically significant only for rheumatic disease).
Kappa values ranged from 0.05 to 0.78 in 1999, from 0.17 to 0.71 in 2001 and from 0.06 to 0.77 in 2003 across the 36 non-redundant conditions from the Charlson and Elixhauser indices (see Table 2 and Figure 1). Between 1999 and 2003, Kappa values increased for 29 co-morbidities and decreased for seven (Table 2).

Discussion
Our study indicates that the accuracy of administrative data coded with ICD-10 improved slightly between 1999 and 2003. That improvement was evidenced by the increase in sensitivity for most co-morbidities across the five-year period. However, we also found that, for some conditions (e.g. lymphoma, renal failure, rheumatic diseases, AIDS/HIV and coagulopathy), the accuracy of ICD-10 administrative data decreased somewhat over the period. These findings are of relevance to all jurisdictions interested in studying data quality trends as new coding systems are introduced.
There are several possible explanations for these results. First, professional coders were increasingly employed to code charts in Switzerland during the later years of the study period. However, there are few training programs for coders in Switzerland; lay persons or clinically trained nurses and physicians coded charts after only a short training period, so inter-coder variation in the quality of coding is hard to avoid. Second, an APDRG reimbursement system was introduced in some hospitals; this financial incentive may have prompted coders to code more conditions than before. In our sample, the average number of diagnosis codes was 2.99 in 1999, 4.25 in 2001 and 4.91 in 2003. Third, coders' knowledge and skills in using ICD-10 coding methods and guidelines may have improved with time (i.e., a coding 'learning curve'), contributing to improved adherence to coding guidelines. Fourth, administrative data quality improvement initiatives have been implemented in Switzerland with the creation of a coding unit at the Swiss Federal Statistical Office, which has developed and disseminated national coding rules for standardizing coding methods. Fifth, coders might have become more aware of the importance of certain conditions and paid more attention to coding them. For example, we found that Kappa for obesity was 0.18 in 1999 and increased to 0.52 in 2001 and 0.57 in 2003, although this condition is not considered important for coding unless it directly contributes to the hospital stay. Sixth, it is possible that physicians documented clinical information better, so that coders could translate it into electronic codes more easily.

We also found that the accuracy of six conditions decreased over the study period. The accuracy of AIDS/HIV dropped dramatically: the sensitivity decreased from 64.1% in 1999 to 16.8% in 2003. While this drop appears substantial at first glance, we suspect that random error is a major contributor, given the very small sample size available for judging the accuracy of this variable. In addition, the PPV was rarely equal to 100%. There are several possible explanations for false positives: the medical records used in this study may have been incomplete, and the research nurses could have missed some clinical diagnoses during chart review. Indeed, chart review is not a perfect reference standard.
Since the introduction of the ICD-10 coding system, only a few accuracy studies have been conducted. Our 2003 results for sensitivity and Kappa values were similar to those obtained in 2003 in a Canadian study [19]; both studies employed the same methodology, including study design, data collection process and definition of study variables. In Europe, Gibson and Bridgman [28] compared the accuracy of ICD-10 primary diagnosis coding in hospital administrative data against charts, studying 298 general surgery records from the North Staffordshire Health Authority, United Kingdom, in 1996-1997. Coding errors occurred in 8% of records at the first character level, 9% at the second, 24% at the third, and 29% at the fourth. A recent Australian study [29] demonstrated that the validity of ICD-10 administrative data was high in 2000 and 2001, two years after the introduction of ICD-10 in that country, with sensitivities ranging from 0.58 to 0.97. In the future, more and more countries will be using ICD-10: the USA plans to introduce ICD-10-CM in 2013 [30], and the use of administrative data and their validation will become increasingly important internationally. The World Health Organisation (WHO) is continuously working on ICD-10 revisions, and the production of ICD-11 is planned for 2015 [31]. Swiss administrative data have previously been studied, especially since the introduction of ICD-10 [17] in 1998. To date, however, only coding reliability studies have been performed in the country [31,32]; these showed that about two thirds of primary diagnosis codes already had all five characters correctly coded in 1998 [33] and that major improvements in the quality of coding occurred between 1998 and 2003 [34].

Administrative data are commonly used for trend analyses of conditions and for quality of care assessment over time [19,28,29]. Such studies should be interpreted with caution. Indeed, we observed in our study that the accuracy of several conditions improved or decreased over time, and such variations in accuracy could result in better or worse identification by time period. For example, the prevalence of AIDS/HIV as a co-morbidity apparently decreased from 0.6% in 1999 to 0.2% in 2003 when based on administrative data, whereas it was stable, at 1.0% in 1999 and 0.9% in 2003, when based on chart review data. The decrease of the positive predictive value for AIDS/HIV from 100% in 1999 to 66.7% in 2003 could mislead quality of care studies focusing on this co-morbidity, for example a study of anti-HIV drug utilization. In such a study, apparent utilization would be likely to decrease over the years owing to higher misclassification of non-AIDS/HIV cases (33.3% in 2003 compared with 0% in 1999). Therefore, trends in accuracy over the study period should, when possible, be considered in the interpretation of such studies.
Our study has several limitations. First, we used chart data as the reference standard to evaluate the validity of ICD-10 administrative data. Ideally, validity should be assessed against whether the condition is truly present in the patient; in practice, this standard depends on the quality of the medical charts, and the extent of clinical information missing from the charts cannot be determined. Second, although our sample size was relatively large, it was limited in view of the low prevalence of most Charlson and Elixhauser co-morbidities in acute care patients. We chose sample sizes similar to those of our Canadian colleagues [19]. Thus, estimates of accuracy parameters for some rare conditions lacked precision, and the observed changes in indicator values were significant in only a few cases, although most significant changes corresponded to improvement over time. Third, the accuracy of administrative data may vary across hospitals [35] and from country to country. Therefore, the generalizability of our findings to other jurisdictions is not certain and should be assessed through similar studies of coding accuracy over time in other countries. Fourth, the use of only one research nurse to abstract the data from each chart, and two abstractors in total, was another limitation; we examined inter-rater agreement and further trained the research nurses. Fifth, there were differences in the number of diagnosis codes recorded between the three hospitals. Moreover, the number of patients excluded from the non-teaching hospital was proportionately much larger than in the teaching hospitals. Thus, the type of hospital might influence our results. In addition, we used a convenience sample, not a representative sample of all hospitals in the country; the external validity of our results is thus limited. It is difficult to extend our specific results to other countries because coding rules are potentially different. For instance, the number of coded diagnoses varies between countries and could bias international comparisons [36]. However, comparisons between countries could be possible using a selection of co-morbidities such as the Charlson co-morbidities [37]. Nevertheless, beyond the specific figures, the phenomenon is worth noting and is of relevance to all jurisdictions introducing new coding systems: improved administrative data accuracy may relate to a coding 'learning curve' with the new coding system. In addition, as coding rules are constantly being adapted, e.g. in relation to DRG reimbursement schemes, one cannot take for granted that the measurement of co-morbidity is stable over the years.
Our study also has strengths. To our knowledge, this is the first study assessing the evolution of the accuracy of co-morbidity information derived from administrative data, measured at three time points over a five-year period shortly after the introduction of ICD-10.
In an attempt to represent national hospital discharge data, we included both teaching and non-teaching hospitals, because the validity of administrative data can vary by type of hospital [34]. Further investigations could use the patient, rather than each co-morbidity independently, as the unit of analysis; this would constitute a different approach to the validity of administrative data.

Conclusions
Our study demonstrates that the accuracy and reliability of co-morbidity information from ICD-10 administrative data improved slightly between 1999 and 2003. This improvement may be related to higher adherence to coding standards and systematic use of professional coders in Swiss hospitals over these years. This finding indicates that other countries should consider similar data accuracy assessments over time as new coding systems are introduced.