Comprehensive review of ICD-9 code accuracies to measure multimorbidity in administrative data

Background Quantifying the burden of multimorbidity for healthcare research using administrative data has been constrained. Existing measures incompletely capture chronic conditions of relevance and are narrowly focused on risk-adjustment for mortality, healthcare cost or utilization. Moreover, the measures have not undergone a rigorous review for how accurately the components, specifically the International Classification of Diseases, Ninth Revision (ICD-9) codes, represent the chronic conditions that comprise the measures. We performed a comprehensive, structured literature review of research studies on the accuracy of ICD-9 codes validated using external sources across an inventory of 81 chronic conditions. The conditions as a weighted measure set have previously been demonstrated to impact not only mortality but also physical and mental health-related quality of life.
Methods For each of 81 conditions we performed a structured literature search with the goal of identifying 1) studies that validate ICD-9 codes mapped to each chronic condition against an external source of data, and 2) the accuracy of ICD-9 codes reported in the identified validation studies. The primary measure of accuracy was the positive predictive value (PPV). We also reported negative predictive value (NPV), sensitivity, specificity, and kappa statistics when available. We searched PubMed and Google Scholar for studies published before June 2019.
Results We identified studies with validation statistics of ICD-9 codes for 51 (64%) of 81 conditions. Most of the studies (47/51 or 92%) used medical chart review as the external reference standard. Of the conditions validated using medical chart review, the median (range) of mean PPVs was 85% (39–100%) and of mean NPVs was 91% (41–100%). Most conditions had at least one validation study reporting PPV ≥70%.
Conclusions To help facilitate the use of patient-centered measures of multimorbidity in administrative data, this review provides the accuracy of ICD-9 codes for chronic conditions that impact a universally valued patient-centered outcome: health-related quality of life. These findings will assist health services studies that measure chronic disease burden and risk-adjust for comorbidity and multimorbidity using patient-centered outcomes in administrative data.


Background
Health system tools for quantifying the burden of multimorbidity (multiple coexisting chronic conditions) have frequently been limited to mortality-based measures such as the Charlson-Deyo comorbidity index and Elixhauser comorbidity score [1][2][3][4]. These measures select from a narrow inventory of conditions based on inpatient diagnoses that may not be suitable for capturing the full breadth and depth of multimorbidity across the total healthcare system, including community-dwelling adults receiving ambulatory care. Further, these measures have not undergone a rigorous review for how accurately their components, specifically the International Classification of Diseases, Ninth Revision (ICD-9) codes, represent the clinical presence or absence of a chronic condition.
Modern administrative data-based measures of multimorbidity are needed, along with a firm understanding of the accuracy of the ICD-9 code components used in measures. Among the most comprehensive of multimorbidity measures is an index consisting of 81 chronic conditions assessed longitudinally with repeated measures of self-reported chronic conditions among community-dwelling adults [5]. This multimorbidity-weighted index was developed from three large cohorts of community-dwelling adults with repeated measures of highly reliable self-reported physician-diagnosed chronic conditions and physical health-related quality of life (HRQOL). The index was externally validated in the nationally-representative Health and Retirement Study, with additional physical and cognitive functioning outcomes assessed to demonstrate construct validity [6]. Finally, the conditions were mapped to ICD-9 codes and validated in HRS-Medicare data to facilitate use in claims data [7]. These 81 conditions also predict mortality, HRQOL, social participation, and disability [6, 8-12]. Compared with existing indices, this measure captures a wider breadth of chronic conditions that are prevalent among patients with multimorbidity and spans the widest distribution of multimorbidity at both the low and high extremes of disease burden [6, 8-11].
We aimed to perform a comprehensive array of literature searches to quantify the accuracy of ICD-9 codes used to identify the presence or absence of chronic conditions that impact HRQOL for use in measuring multimorbidity. For each condition, we reviewed the literature to identify: 1) studies that externally validate ICD-9 codes for each condition against an external source of clinical data such as patient interview or medical chart review, and 2) the range of accuracies reported in these validation studies.

Candidate condition selection
We examined 81 chronic conditions in the multimorbidity-weighted index that predicts HRQOL [5] and created a comprehensive mapping of ICD-9 codes for each of these conditions [7] (Additional File, Table 1).

Study selection
Our primary goal was to conduct a literature search to identify studies that provided validation statistics, i.e., positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, or kappa, for ICD-9 codes corresponding to chronic conditions that impact HRQOL. We used PubMed as the primary database, using search terms "ICD-9" and "algorithm" or "validation" combined with each condition (e.g., "ICD-9" AND "atrial fibrillation" AND "validation" OR "algorithm") through June 2019 (Fig. 1). If no articles were found in PubMed, we searched Google Scholar using the same search terms.
For the remaining conditions whose searches failed to identify validation studies, we ran a last-tier search (using only "ICD-9" and the condition in PubMed) to identify studies that used algorithms of ICD-9 codes to identify the chronic condition in administrative data. If a study used the codes to find associations between the condition and clinical outcomes, we considered that study as evidence supporting construct validity for that ICD-9 algorithm. We reviewed article titles and abstracts to determine whether ICD-9 codes for each respective condition might be included in the article and, if applicable, reviewed the full text of the article.
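The tiered search strings described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the grouping of "validation" OR "algorithm" in parentheses is an assumption about how the Boolean operators were intended to combine, since the text does not state the precedence.

```python
def primary_query(condition: str) -> str:
    """First-tier query (PubMed, then Google Scholar if PubMed returns nothing).

    Assumption: the OR terms are grouped so that "ICD-9" and the condition
    are always required.
    """
    return f'"ICD-9" AND "{condition}" AND ("validation" OR "algorithm")'


def last_tier_query(condition: str) -> str:
    """Last-tier PubMed query using only "ICD-9" and the condition."""
    return f'"ICD-9" AND "{condition}"'


def search_tiers(condition: str) -> list[tuple[str, str]]:
    """Return (database, query) pairs in the order they would be tried."""
    return [
        ("PubMed", primary_query(condition)),
        ("Google Scholar", primary_query(condition)),
        ("PubMed", last_tier_query(condition)),
    ]


if __name__ == "__main__":
    for database, query in search_tiers("atrial fibrillation"):
        print(f"{database}: {query}")
```

Each later tier would be tried only if the earlier one returned no usable articles; that stopping logic depends on manual review of titles and abstracts and is not modeled here.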

Inclusion criteria
Included studies had to meet two criteria. The first concerned the validation of ICD-9 codes (test of interest), and the second concerned the reference standard used for the validation (gold standard of interest) (Additional file, Table 2).

Test of interest
Although some conditions can be further confirmed using other types of administrative billing data (e.g., Current Procedural Terminology codes or pharmacy fill data), these sources of data beyond ICD-9 diagnostic codes are not relevant for all conditions and are therefore beyond the scope of this review. Accordingly, if all algorithms in a study required information beyond ICD-9 codes, we excluded that study.

Gold standard of interest
We identified studies that used medical chart review (i.e., physician review of nursing, physician, and consultation notes; admission and discharge reports; laboratory and diagnostic test reports; surgical reports; and other clinical and administrative documentation) as the external reference standard because it is the most thorough assessment of several sources of information to confirm the diagnosis of a chronic condition and its associated ICD-9 code in a person. If validation with chart review was unavailable for a given condition, we used studies validated against other reference standards. In order of priority, the secondary reference standards were: 1) self-report, 2) disease registry, and 3) disease screening. For a few conditions we found systematic reviews of validation studies. To include these systematic reviews, we incorporated the aggregate results reported by the review, such as means and ranges.
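The priority ordering of reference standards can be expressed as a small selection routine. The sketch below is illustrative only: the dictionary layout, the label strings, and the function name are assumptions rather than part of the review protocol, and systematic reviews (whose aggregate results were incorporated separately) are not modeled.

```python
# Reference standards in descending priority, as described in the text.
REFERENCE_PRIORITY = [
    "chart review",
    "self-report",
    "disease registry",
    "disease screening",
]


def select_by_reference_standard(studies):
    """Return (standard, studies) for the highest-priority standard available.

    `studies` is a list of dicts with a hypothetical "reference_standard" key.
    Returns (None, []) when no study uses a recognized standard.
    """
    for standard in REFERENCE_PRIORITY:
        matched = [s for s in studies if s["reference_standard"] == standard]
        if matched:
            return standard, matched
    return None, []


if __name__ == "__main__":
    # Illustrative records, not data from the review.
    example = [
        {"id": "A", "reference_standard": "disease registry"},
        {"id": "B", "reference_standard": "self-report"},
    ]
    print(select_by_reference_standard(example))
```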

Validation statistics
We reported the median and range of the PPV, NPV, sensitivity, specificity, and kappa for each condition for validation studies that provided these metrics. Condition-specific medians were computed by calculating the median of each respective validation statistic of all studies found for a specific condition. The range was computed by reporting the minimum and maximum values of each respective validation statistic of all studies found for a specific condition. The PPV was computed as the probability of being a case (true disease using medical chart review) among those who had a positive screening test (based on ICD-9 codes of interest). The NPV was computed as the probability of not being a case among those with a negative screening test. We plotted the range of PPVs and sensitivities for each condition in three separate graph panels: rare conditions (< 5% prevalence), common conditions (5-20% prevalence), and highly prevalent conditions (> 20% prevalence) (Figs. 2, 3). We used 70% to denote a moderately accurate PPV or sensitivity, for the purposes of displaying the PPVs in Fig. 2 [13,14] and the sensitivities in Fig. 3.
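As a worked illustration of these definitions, the sketch below computes the four accuracy measures from a two-by-two table (ICD-9 code positive or negative versus reference standard positive or negative) and then summarizes a set of study-level values by median and range. The function names and the example counts are hypothetical and are not taken from any study in the review.

```python
from statistics import median


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def validation_stats(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table.

    tp: code positive, reference positive   fp: code positive, reference negative
    fn: code negative, reference positive   tn: code negative, reference negative
    """
    return {
        "ppv": _ratio(tp, tp + fp),          # P(case | code positive)
        "npv": _ratio(tn, tn + fn),          # P(not a case | code negative)
        "sensitivity": _ratio(tp, tp + fn),  # P(code positive | case)
        "specificity": _ratio(tn, tn + fp),  # P(code negative | not a case)
    }


def median_and_range(values):
    """Condition-specific summary: (median, minimum, maximum) across studies."""
    return median(values), min(values), max(values)


if __name__ == "__main__":
    # Hypothetical counts for a single validation study.
    print(validation_stats(tp=90, fp=10, fn=30, tn=870))
    # Hypothetical PPVs reported by three studies of the same condition.
    print(median_and_range([0.72, 0.69, 0.95]))
```

Kappa is omitted from the sketch because it requires the expected agreement under chance in addition to the observed cell counts.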
Prevalence was determined using any ICD-9 codes available for each condition [7] from 8933 Health and Retirement Study participants who provided access to their Medicare outpatient, inpatient, or skilled nursing facility claims in 2014. ICD-9 codes for the 10 most prevalent conditions are summarized in Table 1.
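The assignment of conditions to the three prevalence panels can be sketched as below. The function name and the example counts are hypothetical; the default denominator is the 8933 participants mentioned above, and treating exactly 5% and exactly 20% as "common" is an assumption about how the boundaries were handled.

```python
def prevalence_panel(n_with_condition: int, n_total: int = 8933) -> str:
    """Classify a condition by its claims-based prevalence.

    rare: < 5%; common: 5-20% (boundaries assumed inclusive);
    highly prevalent: > 20%.
    """
    prevalence = n_with_condition / n_total
    if prevalence < 0.05:
        return "rare"
    if prevalence <= 0.20:
        return "common"
    return "highly prevalent"


if __name__ == "__main__":
    # Hypothetical counts of participants with any qualifying ICD-9 code.
    for count in (100, 1000, 3000):
        print(count, prevalence_panel(count))
```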

Results
We considered 81 conditions mapped to ICD-9 codes (Additional File, Table 1). We combined two conditions (basal cell and squamous cell carcinoma) into one group during the literature review once we found no publications validating codes for these conditions separately. This resulted in a final set of 80 conditions for validation. After mapping all conditions to corresponding ICD-9 codes, we found articles providing validation statistics for codes for 51 (64%) of the 80 conditions. We also found articles supporting validation through construct validity for 27 (34%) of the 80 conditions. We did not find articles validating codes for 2 (2.5%) of the 80 conditions. Medical chart review was the most common method of ICD-9 code validation (47 of 51 conditions with validation statistics, 92%), followed by systematic review (5 conditions, 10%), self-report of condition (1 condition, 2%), disease registry (1 condition, 2%), and diagnostic screening (1 condition, 2%); these categories are not mutually exclusive. Of the 51 conditions reporting validation statistics, the median and range for the accuracies were as follows: sensitivity 83% (3-100%, n = 142 values), specificity 97% (0-100%, n = 76), PPV 84% (0-100%, n = 175), NPV 90% (32-100%, n = 52), and kappa statistic 0.85 (0.45-0.92, n = 18) (Additional file, Table 1). The most common validation measure reported was PPV (available for 46 conditions), followed by sensitivity (available for 43 conditions).
Fig. 1 Structured Literature Review Flow Diagram. *We prioritized the articles in the following order: 1) chart abstraction as gold standard; 2) self-report, disease registry, or disease screening as gold standard; 3) systematic review. We excluded articles with an algorithm including criteria other than ICD-9 codes, such as ICD-10 codes, ICD-8 codes, Current Procedural Terminology (CPT) codes, lab results, or medications, and those validating ICD-9 codes for multiple conditions in a comorbidity index (e.g., Charlson comorbidity index)
Most ICD-9 coded conditions had moderate to high mean PPVs and NPVs of at least 70% (37/46, 80%, among conditions with studies that provided PPVs, and 19/24, 79%, among conditions with studies that provided NPVs). We observed variation in the reported accuracies, with the highest mean PPV for wrist fracture (100%) compared with the lowest mean PPV for depression (39%). The highest mean NPV was for chronic hepatitis/hepatocellular disease (100%), and the lowest was for Parkinson disease (41%). We plotted the ranges of PPVs and sensitivities for chronic conditions mapped to ICD codes from least to greatest disease prevalence (Figs. 2, 3). Of the 46 conditions that provided PPV as a validation metric, 42 (91%) had at least one publication reporting a PPV ≥70%. Myocardial infarction had the widest range of PPVs (9-100%). Of the 44 conditions that provided sensitivity, 33 (75%) had at least one publication reporting a sensitivity of ≥70%. Myocardial infarction had the widest range of sensitivities (6-94%). For conditions that provided codes validated through construct validity, we present the respective ICD-9 codes based on mapping these conditions from ICD code mappings with CMS fiscal year 2015 (October 1, 2014 to September 30, 2015) ICD-9-CM, a comprehensive list of ICD-9 codes and corresponding conditions. Codes were identified and checked independently for agreement by four individuals (including authors MYW, JEL, CC). To validate the accuracy of these mapped conditions without validation studies and the overall mapping of ICD-9 codes to conditions in the multimorbidity-weighted index, we examined the construct validity and conducted direct comparisons of the ICD-coded multimorbidity-weighted index with traditional metrics [7].
Table excerpt. Algorithms: (1) any claim in 12-month window; (2) any claim in 24-month window. Sensitivity: (1) 90%; (2) 90%. Specificity: (1) 95%; (2) 94%. PPV: (1) 72%; (2) 69%. NPV: (1) 99%; (2) 98% [31]. The number within the bracket following the accuracy values indicates the citation number for the reference.

Discussion
This literature review provides evidence to support the accuracy and validity of using ICD-9 diagnostic codes to classify the presence or absence of 81 chronic conditions to measure multimorbidity in administrative data. The ICD-9 codes we studied had overall moderate to high PPVs and NPVs (≥70%) based on external standards for presence versus absence of each individual condition. The variation in reported accuracies may be attributed to several factors such as different population samples, coding artifact [36,37], and different methodologic approaches and comprehensiveness with mapping ICD-9 codes to conditions. The highest priority reference standard, medical chart review, was available for 47 of 81 conditions. This study provides researchers with a tool to code many existing indices, as well as one using all 81 conditions that has previously been demonstrated to predict HRQOL [10]. We also feature an innovative approach to the challenging task of synthesizing the results from 81 separate literature reviews by presenting results graphically by prevalence, for added context and face validity. To our knowledge, a comprehensive review for this scope of chronic conditions has not been performed previously. In addition, prior reviews [38] focus largely on inpatient conditions that predict mortality, whereas this review captures chronic conditions prevalent among community-dwelling adults with multimorbidity that impact physical and mental HRQOL.
Administrative and claims-based studies using ICD-9 codes have traditionally relied upon a few commonly used measures for comorbidity adjustment that weight conditions to mortality risk, healthcare cost and utilization. Measures such as the Charlson-Deyo and Elixhauser comorbidity measures [1][2][3] have been readily available in datasets, facilitating and perpetuating their use. However, a possible unintended consequence of this convenience has been their use beyond the intended application, for example in risk-adjustment for other outcomes such as HRQOL.

Strengths and limitations
Although existing comorbidity indices are available for use in administrative data and have frequently been extrapolated to measure multimorbidity, this research provides a practical method to operationalize a modern measure of multimorbidity that predicts a universal health outcome, physical functioning. Previous comorbidity indices have been limited in the scope of included conditions and are calibrated to outcomes of limited relevance to disease survivors, such as inpatient mortality risk, healthcare cost and utilization [2-4, 39, 40]. This inventory of 81 chronic conditions is one of the most comprehensive multimorbidity indices and is a validated, patient-centered measure of multimorbidity that assigns disease severity based on the impact conditions have on physical functioning, an outcome of particular relevance for older adults. Additional strengths over existing indices include a broad distribution, greater precision in quantifying multimorbidity, and rigorous validation for predicting several downstream consequences of multimorbidity [6][7][8][9][10]. As patient-centered health outcomes persist in importance and relevance for research and policy, this review offers an additional tool to measure and target outcomes for quality improvement. For example, the ICD-9 codes presented in this study could be used as a proxy for physical functioning, which is absent in administrative data.
Our study has limitations. First, we focused only on chronic condition diagnostic codes to define test variables. However, the accuracy of some conditions such as diabetes might be further improved by adding laboratory values available in an electronic health record. Other administrative data, such as procedure codes and durable medical equipment, could also potentially augment the ICD codes, but were beyond the scope of this review. Second, our literature search did not include articles published in languages other than English or articles not accessible through PubMed or Google Scholar. Google Scholar can access 87% of all scholarly documents online, including journal and conference papers, dissertations, books, technical reports, and working papers and identified more documents than PubMed, Web of Science, and Microsoft Academic Search [41]. Third, these validated ICD-9 codes apply to retrospective quality improvement efforts prior to 2016 using existing data available through the Centers for Medicare and Medicaid Services. Application to future prospective research requires a cross-walk of the conditions to ICD-10 codes and validation, which are underway. Finally, the overall value of any measure of multimorbidity in administrative data will depend on the completeness and accuracy of documentation by providers, which is a limitation inherent to all claims data. For example, financial incentives have been suspected as the cause of biased coding practices over time [37].

Implications for further research
Our findings demonstrate that there is variation in the quality and accuracy of ICD code mappings for chronic conditions, including some conditions that lack external validity. Future studies are needed to validate the accuracy of ICD codes for conditions that lacked these measures of accuracy, in order to inform study design, analysis, interpretation of results, and applications to clinical care and health policy. To increase the accuracy of diagnoses with ICD codes, one future direction for researchers would be to include data beyond ICD codes such as medications, labs, and imaging studies when consistently available for specific chronic conditions.

Conclusion
Modern measures of multimorbidity that weight disease severity by patient-centered outcomes have emerged and are available for use in administrative studies. We provide a comprehensive inventory of diagnostic codes mapped to chronic conditions that impact HRQOL. This research demonstrates moderate to high accuracies and validation for most but not all diagnostic codes. Based on our comprehensive literature reviews, researchers can apply these codes with a fuller understanding of the validity of diagnostic codes for specific chronic diseases to identify populations with multimorbidity or risk-adjust for comorbidity using a comprehensive and patient-centered measure.
Additional file 1: Additional File Table 1. Multimorbidity-Weighted Index Conditions and the Accuracy, Source, and Type of Validation for Respective ICD-9 Codes. Additional File Table 2. Two by Two Table for the Association Between ICD Codes (Test) to Indicate a Chronic Condition and an External Reference Standard (Gold Standard) to Verify a Chronic Condition.