Claims-based algorithms for identifying Medicare beneficiaries at high estimated risk for coronary heart disease events: a cross-sectional study

Background: Databases of medical claims can be valuable resources for cardiovascular research, such as comparative effectiveness and pharmacovigilance studies of cardiovascular medications. However, claims data do not include all of the factors used for risk stratification in clinical care. We sought to develop claims-based algorithms to identify individuals at high estimated risk for coronary heart disease (CHD) events, and to identify uncontrolled low-density lipoprotein (LDL) cholesterol among statin users at high risk for CHD events.
Methods: We conducted a cross-sectional analysis of 6,615 participants ≥66 years old using data from the REasons for Geographic And Racial Differences in Stroke (REGARDS) study baseline visit in 2003–2007 linked to Medicare claims data. Using REGARDS data we defined high risk for CHD events as having a history of CHD, at least 1 risk equivalent, or Framingham CHD risk score >20%. Among statin users at high risk for CHD events we defined uncontrolled LDL cholesterol as LDL cholesterol ≥100 mg/dL. Using Medicare claims-based variables for diagnoses, procedures, and healthcare utilization, we developed algorithms for high CHD event risk and uncontrolled LDL cholesterol.
Results: REGARDS data indicated that 49% of participants were at high risk for CHD events. A claims-based algorithm identified high risk for CHD events with a positive predictive value of 87% (95% CI: 85%, 88%), sensitivity of 69% (95% CI: 67%, 70%), and specificity of 90% (95% CI: 89%, 91%). Among statin users at high risk for CHD events, 30% had LDL cholesterol ≥100 mg/dL. A claims-based algorithm identified LDL cholesterol ≥100 mg/dL with a positive predictive value of 43% (95% CI: 38%, 49%), sensitivity of 19% (95% CI: 15%, 22%), and specificity of 89% (95% CI: 86%, 90%).
Conclusions: Although the sensitivity was low, the high positive predictive value of our algorithm for high risk for CHD events supports the use of claims to identify Medicare beneficiaries at high risk for CHD events.


Background
National Cholesterol Education Program Adult Treatment Panel III (ATP III) guidelines and the newly released 2013 American College of Cardiology/American Heart Association Guideline on the Treatment of Blood Cholesterol to Reduce Atherosclerotic Cardiovascular Risk in Adults recommend lipid treatment according to estimated risk for future coronary heart disease (CHD) events such as nonfatal myocardial infarction or CHD death [1][2][3]. Pharmacotherapy initiation is guided by low-density lipoprotein (LDL) cholesterol levels and risk for future CHD or other atherosclerotic events. Risk is assessed by history of CHD, risk equivalents such as stroke and diabetes, risk factors such as hypertension and current smoking, and predicted risk calculated from risk equations including CHD risk factors [1][2][3]. Despite these guidelines, many people eligible for lipid lowering therapy are untreated or undertreated [4,5].
Novel LDL cholesterol lowering medications are being evaluated in clinical trials [6,7]. If these medications obtain regulatory approval, healthcare claims data could be used for comparative effectiveness and pharmacovigilance studies [8]. Identifying people using medications is feasible with pharmacy claims. Finding a comparison group with a similar CHD risk profile is more challenging. One potential barrier is that claims data do not include clinical or laboratory values that are often used to estimate CHD event risk. Some data show that claims-based algorithms can be useful in identifying high risk groups, for example people at high risk for osteoporotic fracture [9]. However, whether claims-based algorithms can be used to identify individuals at high risk for CHD events is not known.
In addition to identifying high risk groups, it would be of interest in some studies to identify people who may warrant more intensive treatment based on laboratory tests; one such group is individuals with elevated LDL cholesterol levels despite statin treatment. So far there is not much evidence that healthcare claims can be used effectively to estimate laboratory values [10]. Advances in this area could increase the value of healthcare claims for comparative effectiveness and pharmacovigilance research.
In this paper we describe the development of claims-based algorithms to identify individuals at high risk for CHD events according to ATP III guidelines, using Medicare claims data on diagnoses, procedures, and healthcare utilization. We expanded upon existing claims-based definitions of specific cardiovascular conditions and procedures [11][12][13][14][15][16][17][18][19] by bringing them into a broader framework with the goal of identifying the more general concept of "high risk." We also describe claims-based algorithms to identify uncontrolled LDL cholesterol according to ATP III guidelines among statin users at high risk for CHD events. In evaluating these algorithms we considered positive predictive value as the most important measure of model performance because our aim was to identify high-risk groups and exclude lower-risk individuals.

Design, setting, and participants
We conducted a cross-sectional analysis of baseline data from REasons for Geographic And Racial Differences in Stroke (REGARDS) study participants linked to Medicare data. REGARDS is a population-based cohort of 30,239 adults ≥45 years of age enrolled in 2003-2007 in the continental United States [20]. Linkage of REGARDS data with Medicare enrollment data was based on social security number, which is a unique identifier that was required to match exactly on all digits; sex, which was required to match; and birth date, which was required to match on year and month, year and day, or month and day allowing a difference of one year. We included 6,615 REGARDS participants who provided study data free of anomalies such as missing all baseline data collection forms, were ≥66 years of age at their REGARDS in-home visit, linked to Medicare, provided a fasting blood sample, had complete REGARDS data for calculating CHD event risk according to ATP III guidelines (described below), and had been living in the United States, continuously enrolled in Medicare parts A and B, but not in a Medicare Advantage plan, for at least one year immediately prior to their REGARDS in-home visit (Figure 1). This research was conducted in accordance with the Declaration of Helsinki. Institutional review boards of the collaborating institutions (University of Alabama at Birmingham, University of Vermont, Wake Forest University, and University of Cincinnati) approved the REGARDS protocol. Participants gave informed consent. The REGARDS-Medicare linkage was approved by the institutional review board at the University of Alabama at Birmingham.
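The linkage rule described above can be illustrated with a minimal sketch. The function names are hypothetical, and the birth date rule is read here as: year and month agree, or year and day agree, or month and day agree with the years differing by at most one. That reading is an assumption about the sentence above, not a statement of the exact procedure used in the study.

```python
from datetime import date


def birth_dates_agree(a: date, b: date) -> bool:
    """Assumed reading of the birth date rule in the linkage description."""
    same_year = a.year == b.year
    same_month = a.month == b.month
    same_day = a.day == b.day
    # Year and month agree, or year and day agree.
    if same_year and (same_month or same_day):
        return True
    # Month and day agree, allowing the year to differ by at most one.
    return same_month and same_day and abs(a.year - b.year) <= 1


def records_link(ssn_a: str, sex_a: str, dob_a: date,
                 ssn_b: str, sex_b: str, dob_b: date) -> bool:
    """Exact match on identifier and sex, plus the relaxed birth date rule."""
    return ssn_a == ssn_b and sex_a == sex_b and birth_dates_agree(dob_a, dob_b)
```

The relaxed date rule tolerates a single transcription error in one date component while still requiring the identifier to match on all digits.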

REGARDS variables
REGARDS baseline data collection included a structured telephone interview and an in-home visit. The telephone interview included questions about demographics and medical history. The in-home visit included two blood pressure measurements, an electrocardiogram, blood sample collection, and a medication inventory to assess current medication use including statins (see Text, Additional file 1).
Using REGARDS data we categorized each participant's CHD event risk according to the ATP III guidelines 2004 update [1,2]. High risk for CHD events was defined as (1) having a history of CHD, including myocardial infarction (MI) or coronary revascularization; (2) having a history of at least one of the following risk equivalents: peripheral arterial disease, abdominal aortic aneurysm, carotid artery disease, stroke, or diabetes; or (3) in the absence of CHD or risk equivalents, having ≥2 CHD risk factors with Framingham 10-year CHD risk score >20%. CHD risk factors included age ≥45 years (men) or ≥55 years (women), family history of premature MI, current smoking, hypertension, and high-density lipoprotein (HDL) cholesterol <40 mg/dL. HDL cholesterol ≥60 mg/dL reduced the risk factor count by one. Framingham 10-year CHD risk score >20% was defined using age, sex, total cholesterol, HDL cholesterol, current smoking, and systolic blood pressure [1]. Very high risk for CHD events was defined as having CHD and at least one of the following: acute MI in the prior year, diabetes, current smoking, or metabolic syndrome.
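The classification rules above can be written out as a short sketch. All function and parameter names are hypothetical. The Framingham 10-year risk is taken as an input on a 0 to 1 scale rather than computed, because the risk equation itself is not reproduced in this paper.

```python
def count_risk_factors(age: int, sex: str, family_history_premature_mi: bool,
                       current_smoker: bool, hypertension: bool,
                       hdl_mg_dl: float) -> int:
    """Count CHD risk factors as listed in the text (sex is 'M' or 'F')."""
    count = 0
    age_cutoff = 45 if sex == "M" else 55
    if age >= age_cutoff:
        count += 1
    if family_history_premature_mi:
        count += 1
    if current_smoker:
        count += 1
    if hypertension:
        count += 1
    if hdl_mg_dl < 40:
        count += 1
    if hdl_mg_dl >= 60:
        count -= 1  # high HDL cholesterol removes one risk factor
    return count


def is_high_risk(history_chd: bool, has_risk_equivalent: bool,
                 risk_factor_count: int, framingham_10yr_risk: float) -> bool:
    """High risk: CHD, a risk equivalent, or >=2 risk factors with risk >20%."""
    if history_chd or has_risk_equivalent:
        return True
    return risk_factor_count >= 2 and framingham_10yr_risk > 0.20


def is_very_high_risk(history_chd: bool, acute_mi_prior_year: bool,
                      diabetes: bool, current_smoker: bool,
                      metabolic_syndrome: bool) -> bool:
    """Very high risk: CHD plus at least one of the listed conditions."""
    return history_chd and (acute_mi_prior_year or diabetes
                            or current_smoker or metabolic_syndrome)
```

For example, a 70-year-old woman who smokes, has hypertension, and has HDL cholesterol of 65 mg/dL has a risk factor count of 2 under these rules (three factors, less one for high HDL cholesterol).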
Using the above definitions and REGARDS study data, we defined the presence of two high risk conditions among REGARDS-Medicare linked participants (N = 6,615):
Condition 1: High risk for CHD events
Condition 2: Very high risk for CHD events
We defined a third high risk condition among REGARDS-Medicare linked participants who did not have a history of CHD or risk equivalents (N = 3,720):
Condition 3: Framingham 10-year CHD risk score >20%
Also using REGARDS study data, we defined the presence of uncontrolled LDL cholesterol, using two cutpoints, among REGARDS-Medicare linked participants at high risk for CHD events who were using statins according to the REGARDS in-home visit medication inventory (N = 1,583):
Condition 4: LDL cholesterol ≥100 mg/dL
Condition 5: LDL cholesterol ≥70 mg/dL
Several pre-specified variables were based on published claims-based definitions (see Text, Additional file 1) [11][12][13][14][15][16][17][18][19]. The same 25 pre-specified variables were used in algorithms for all conditions of interest. In addition, through a data mining procedure we identified additional Medicare variables, including diagnosis and procedure codes. The data mining procedure was adapted from a previously described algorithm for high-dimensional propensity scores (for detail see Text, Additional file 1) [21]. The procedure had four steps: (1) identify diagnosis and procedure codes appearing in REGARDS participants' linked Medicare claims data, (2) calculate the prevalence of each code, (3) calculate the odds ratio of each code with high risk as defined using REGARDS data, and (4) rank the codes as a function of their prevalence and their odds ratio with high risk, and select the highest-ranked codes, excluding collinear variables. The data mining variables differed for each condition. We did not use Medicare Part D pharmacy claims data because few participants had Part D coverage prior to the REGARDS in-home visit.
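The four data mining steps can be sketched as follows. This is an illustration only: the text does not give the exact ranking function, so the score used here (prevalence multiplied by the absolute log odds ratio) and the 0.5 continuity correction are assumptions, the function name is hypothetical, and the exclusion of collinear variables is omitted.

```python
import math


def rank_codes(claims_codes, high_risk, top_k):
    """Rank codes by an assumed score combining prevalence and odds ratio.

    claims_codes: list of sets, one set of codes per participant.
    high_risk: list of booleans, observed high risk status per participant.
    Returns the top_k entries as (code, prevalence, odds_ratio) tuples.
    """
    n = len(claims_codes)
    # Step 1: identify every code appearing in the linked claims.
    all_codes = set().union(*claims_codes)
    scored = []
    for code in all_codes:
        a = b = c = d = 0  # 2x2 table: code present/absent by high risk yes/no
        for codes, is_high in zip(claims_codes, high_risk):
            present = code in codes
            if present and is_high:
                a += 1
            elif present:
                b += 1
            elif is_high:
                c += 1
            else:
                d += 1
        # Step 2: prevalence of the code among all participants.
        prevalence = (a + b) / n
        # Step 3: odds ratio with high risk; adding 0.5 to each cell avoids
        # division by zero (an assumption, not taken from the paper).
        odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
        # Step 4: illustrative ranking score (assumed form).
        score = prevalence * abs(math.log(odds_ratio))
        scored.append((score, code, prevalence, odds_ratio))
    # Highest score first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda row: (-row[0], row[1]))
    return [(code, prev, oratio) for _, code, prev, oratio in scored[:top_k]]
```

A code that is both common and strongly associated with high risk ranks highest under this score, which is the general intent of step 4 whatever the exact function.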

Statistical analysis
We used logistic regression and Medicare claims-based variables to develop algorithms identifying each of three high risk conditions and uncontrolled LDL cholesterol defined in REGARDS data. The analysis is described below for Condition 1, high risk for CHD events. An identical approach was used for the other conditions. We calculated beta coefficients and standard errors for high risk associated with each pre-specified Medicare variable from a multivariable logistic regression model. We included interaction terms for sex with other variables if they had a P value <0.1. We calculated the predicted probability of being at high risk for each participant from the beta coefficients, and plotted distributions of predicted probabilities for people at high risk and for people not at high risk according to REGARDS data. We calculated sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) across the range of predicted probability thresholds from 0 to 1, and we calculated a c-statistic. We calculated 95% confidence intervals for the model performance characteristics by bootstrapping. To report model performance characteristics in tables we chose the predicted probability threshold that resulted in 90% specificity before correcting for optimism (see below). We built a second model adding variables from the data mining procedure. We corrected model performance characteristics for optimism using bootstrap resampling, which has been recommended as a better method for internal validation than a split-sample approach [22]. We cross-classified participants by model-predicted and observed high risk status to compare characteristics of true positives, false positives, false negatives, and true negatives. We conducted four sensitivity analyses. 
First, for identifying high risk, we assigned a predicted probability of 1 for each participant who had evidence in their claims of a history of CHD, peripheral arterial disease, abdominal aortic aneurysm, carotid artery disease, stroke, or diabetes; and assigned a predicted probability of 0 otherwise. Second, for identifying high risk, we assigned a predicted probability of 1 as in the first sensitivity analysis; and assigned a predicted probability based on a logistic regression model otherwise. Third, for identifying very high risk, we assigned a predicted probability of 1 for each participant who had evidence in their claims of a history of CHD and either acute MI in the prior year or diabetes; and assigned a predicted probability based on a logistic regression model otherwise. Fourth, for all five conditions, we used claims data for only one year prior to the REGARDS in-home visit to define pre-specified and data mining Medicare variables, instead of using all available claims.
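The performance measures and the choice of a threshold at 90% specificity can be illustrated with a minimal sketch. The analyses in this study were run in SAS; the Python below is only an illustration with hypothetical function names, and it omits the bootstrap confidence intervals and the optimism correction.

```python
def performance(probs, observed, threshold):
    """Sensitivity, specificity, PPV, and NPV at one probability threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, observed):
        predicted = p >= threshold  # flagged as having the condition
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def threshold_for_specificity(probs, observed, target=0.90):
    """Smallest observed probability whose specificity reaches the target.

    Specificity does not decrease as the threshold rises, so the smallest
    qualifying threshold keeps sensitivity as high as possible. Returns None
    if no observed probability reaches the target.
    """
    for t in sorted(set(probs)):
        if performance(probs, observed, t)["specificity"] >= target:
            return t
    return None
```

Fixing specificity first and then reading off PPV and sensitivity at that threshold mirrors the reporting approach described above, in which PPV is the primary measure of interest.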
We used SAS software version 9.3 (Cary, NC) for all statistical analyses.

Results
High risk for CHD events
Among 6,615 REGARDS participants, 49% were at high risk for CHD events based on REGARDS data. High risk participants were more likely to be men, have less education, be using statins, have lower cholesterol, and have higher levels of other cardiovascular risk factors including hypertension and metabolic syndrome, compared with participants not at high risk (Table 1). Predicted probabilities using pre-specified variables tended to be higher for participants at high risk and lower for participants not at high risk (Figure 2, Panel A). In the model that included pre-specified variables, a predicted probability threshold of 0.55 yielded a PPV of 87% (95% CI: 85%, 88%) for identifying high risk for CHD, and a sensitivity of 69% (95% CI: 67%, 70%); results were similar after adding data mining variables (Table 2 and see Additional file 1: Figure S1, Panel A). High risk participants not identified by the algorithm (false negatives) were less likely to be men, to be using statins, and to have metabolic syndrome, diabetes, and CHD, and had higher LDL cholesterol, higher blood pressure, and lower blood glucose, compared with high risk participants correctly identified by the algorithm (true positives) (Table 3). Non-high risk participants identified as high risk by the algorithm (false positives) were less likely to be using statins or to have metabolic syndrome, and had higher total and LDL cholesterol but lower triglycerides, compared with true positives.

Very high risk for CHD events
Among 6,615 participants, 14% were at very high risk for CHD events based on REGARDS data. Predicted probabilities were uniformly distributed for participants at very high risk and tended to be low for participants not at very high risk (Figure 2, Panel B). In the model using pre-specified variables, a predicted probability threshold of 0.28 yielded a PPV of 52% (95% CI: 49%, 54%) for identifying very high risk for CHD events and a sensitivity of 63% (95% CI: 59%, 66%); results were similar after adding data mining variables (Table 2 and see Additional file 1: Figure S1, Panel B). Participant characteristics by observed and model-predicted very high risk status are in Additional file 1: Table S1.
Framingham CHD risk score >20%
Among 3,720 participants who did not have a history of CHD or risk equivalents according to REGARDS data, 9% had Framingham 10-year CHD risk scores >20%. Predicted probabilities were uniformly distributed for participants with risk scores >20% and tended to be low for participants with risk scores ≤20% (Figure 2, Panel C). In the model using pre-specified variables, a predicted probability threshold of 0.20 yielded a PPV of 31% (95% CI: 27%, 36%) for identifying Framingham CHD risk score >20% and a sensitivity of 47% (95% CI: 43%, 54%); results were similar after adding data mining variables (Table 2 and see Additional file 1: Figure S1, Panel C). Participant characteristics by observed and model-predicted Framingham CHD risk score >20% status are in Additional file 1: Table S2.
Uncontrolled LDL cholesterol among statin users at high risk for CHD events
Among 1,583 participants at high risk for CHD events who were using statins according to the REGARDS medication inventory, 30% had LDL cholesterol ≥100 mg/dL, and 80% had LDL cholesterol ≥70 mg/dL. The predicted probability distributions overlapped for participants with and without uncontrolled LDL cholesterol (Figure 2). In the model using pre-specified variables, a predicted probability threshold of 0.43 yielded a PPV of 43% (95% CI: 38%, 49%) for identifying LDL cholesterol ≥100 mg/dL and a sensitivity of 19% (95% CI: 15%, 22%) (Table 2 and see Additional file 1: Figure S2, Panel A). For identifying LDL cholesterol ≥70 mg/dL, the PPV was 86% (95% CI: 83%, 89%) in the model using pre-specified variables, and increased to 91% (95% CI: 85%, 91%) when adding data mining variables (Table 2 and Additional file 1: Figure S2, Panel B). Participant characteristics by observed and model-predicted uncontrolled LDL cholesterol status are in Additional file 1: Tables S3 and S4.

Model parameters
Beta coefficients and standard errors for models using pre-specified Medicare variables are in Additional file 1: Table S5.

Sensitivity analyses
When we assigned a predicted probability of 1 for participants who met a pre-specified claims-based definition of high risk for CHD events, and a predicted probability of 0 otherwise, the PPV was 80% (95% CI: 79%, 81%) and sensitivity was 75% (95% CI: 73%, 77%). Model characteristics were similar when we assigned a predicted probability of 1 for participants who met a pre-specified claims-based definition of high risk for CHD events, and assigned model-based predicted probabilities otherwise. When we assigned a predicted probability of 1 for participants who met a pre-specified claims-based definition of very high risk for CHD events, and assigned model-based predicted probabilities otherwise, the PPV was 51% (95% CI: 48%, 54%) and sensitivity was 63% (95% CI: 58%, 66%) (see Additional file 1: Table S6, and Additional file 1: Figure S3).
When we used only one year of Medicare claims data prior to the REGARDS in-home visit, the sensitivity decreased for identifying high risk for CHD events, very high risk for CHD events, and Framingham 10-year CHD risk score >20% (see Additional file 1: Table S7). Results for uncontrolled LDL cholesterol among statin users at high risk for CHD events were similar to the main results when we used only one year of Medicare claims data (see Additional file 1: Table S8).
Table footnotes: [...] Medicare data and met all eligibility criteria for this analysis (see Figure 1). †Numbers are column percentages or means (standard deviations). Income was missing for 914 participants, education for 4, body mass index for 16, and C-reactive protein for 164. ‡By definition, participants not at high risk for CHD events did not have a history of CHD or risk equivalents according to REGARDS study data.

Discussion
In this population of REGARDS study participants, an algorithm using 25 pre-specified Medicare claims variables had a PPV of 87% for identifying people at high risk for CHD events. Additional claims variables identified through data mining did not substantially improve the algorithm performance. The high PPV of our algorithm supports the use of claims to identify Medicare beneficiaries at high risk for CHD events. Our algorithm could be applied in comparative effectiveness or pharmacovigilance studies of the outcomes of cardiovascular medication use among Medicare beneficiaries. For example, if novel LDL cholesterol lowering drugs currently in development [6,7] come to market, Medicare may be a setting in which to evaluate the effectiveness and safety of these drugs. Comparison groups in such studies should have comparable proportions of people at high risk for CHD events, and our algorithm could be used to identify appropriate comparison cohorts. However, the algorithm had a sensitivity of 69% and missed 31% of participants at high risk for CHD events. Along with this low sensitivity, the group identified was not representative of all participants at high risk for CHD events. People at high risk for CHD events whom our algorithm missed were less likely to have CHD and diabetes and to be using statins, and had higher lipid levels and blood pressure compared with people at high risk for CHD events whom the algorithm correctly identified. This pattern is consistent with claims data being more sensitive for identifying people with diagnosed conditions than for identifying people with abnormal laboratory values. 
Therefore, future use of this algorithm to identify people at high risk for CHD events should be accompanied by careful consideration of generalizability.
Figure caption (panels A, B, C): In each panel the solid curve represents the distribution of predicted probabilities of having the condition among those observed to have the condition in REGARDS, the dashed curve represents the distribution of predicted probabilities of having the condition among those not observed to have the condition in REGARDS, and the vertical dotted line represents the predicted probability threshold corresponding to 90% specificity for identifying the condition. All results are from models using pre-specified Medicare variables only. Results from models using pre-specified variables plus data mining variables were similar (data not shown).
Table footnote: Corrected for optimism using bootstrap resampling [22].
Our algorithms did not perform as well for identifying those at very high risk for CHD events and those without CHD or risk equivalents but with a Framingham 10-year CHD risk score >20%. Very high risk for CHD events and Framingham 10-year CHD risk score >20% had low prevalence. Because PPV depends on the prevalence of the condition, it is not surprising that PPVs for identifying these subgroups were low. A prior study reported that claims data on diagnoses, procedures, and healthcare utilization may not be good proxies for clinical and laboratory values [10]. The current analysis extends this prior finding and indicates that claims data have limited usefulness for identifying individuals with a Framingham 10-year CHD risk score >20%.
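The dependence of PPV on prevalence follows from Bayes' theorem and can be checked against the figures reported in the Results. The sketch below uses a hypothetical function name and plugs in the reported sensitivities and prevalences with specificity fixed at 90%. Because the published values are rounded and optimism-corrected, the computed PPVs only approximate the reported ones.

```python
def ppv_from_prevalence(sensitivity: float, specificity: float,
                        prevalence: float) -> float:
    """PPV = sens*prev / (sens*prev + (1 - spec)*(1 - prev))."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# High risk: prevalence 49%, sensitivity 69% (reported PPV 87%).
print(round(ppv_from_prevalence(0.69, 0.90, 0.49), 2))  # 0.87
# Very high risk: prevalence 14%, sensitivity 63% (reported PPV 52%).
print(round(ppv_from_prevalence(0.63, 0.90, 0.14), 2))  # 0.51
# Framingham risk score >20%: prevalence 9%, sensitivity 47% (reported PPV 31%).
print(round(ppv_from_prevalence(0.47, 0.90, 0.09), 2))  # 0.32
```

At the same 90% specificity, lowering prevalence from 49% to 9% increases the share of false positives among those flagged, which accounts for most of the drop in PPV between the conditions.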
Among participants at high risk for CHD events who were using statins, algorithms identified those with LDL cholesterol ≥100 mg/dL with a PPV of 43% using pre-specified variables. As expected, the PPV for identifying LDL cholesterol ≥70 mg/dL was higher (86%), due to higher prevalence of the condition. However, true positives, false positives, and false negatives differed on several characteristics. The gain in PPV by using a claims-based algorithm to identify individuals with LDL cholesterol ≥70 mg/dL may not outweigh the potential loss in representativeness as compared with all statin users at high risk for CHD events.
Table footnote: ‡The Centers for Medicare and Medicaid Services (CMS) requires the figure be redacted because the cell contained fewer than 11 participants, or would allow a number fewer than 11 participants to be deduced in another cell.
In comparison with using all available Medicare claims prior to the REGARDS baseline study visit for each participant, limiting Medicare claims to the one year period prior to REGARDS baseline decreased the sensitivity for identifying participants at high estimated risk for CHD events. Evidently some Medicare beneficiaries with a history of cardiovascular conditions or risk factors do not have sufficient evidence of those conditions or risk factors in recent diagnosis and procedure codes. Therefore, studies in which participants' Medicare history is limited to a certain time period may tend toward underestimating the prevalence of high risk for CHD events.
To our knowledge, there are no prior reports of claims-based algorithms for identifying high risk for CHD events according to the ATP III definition. Published claims-based definitions of several cardiovascular conditions and procedures are available, and we incorporated these into our algorithms as pre-specified variables [11][12][13][14][15][16][17][18][19]. In this study we attempted to go beyond specific disease diagnoses to identify a more broadly-defined group at high risk for CHD events according to ATP III guidelines. As in the ATP III guidelines, the newly published cholesterol treatment guidelines recommend that treatment decisions be guided by risk for future events as determined by medical history and estimated risk based on measured risk factors [3]. Claims-based algorithms to approximate clinical risk stratification may be useful in future studies of outcomes of treatment with novel LDL cholesterol lowering medications, in which comparison groups would need to be identified.
Schneeweiss and colleagues found that the ability of claims data to predict LDL cholesterol values was poor. They concluded that in settings where LDL cholesterol is a potential confounder, estimating missing LDL cholesterol values using claims may not substantially improve confounding control [10]. Similarly, we found that our algorithms did not identify representative groups of statin users at high risk for CHD events who had uncontrolled LDL cholesterol. The new guidelines have less focus on LDL treatment targets, but retain recommendations to monitor LDL treatment response [3]. This suggests that identifying comparison groups for pharmacovigilance studies that are comparable in their LDL cholesterol status and other characteristics may be difficult to accomplish using only claims data. It is also important in pharmacovigilance studies to identify comparison groups that would be similar in their risk for non-CHD adverse events. Further work is required in this area.
This study has limitations. First, our classifications of high risk groups were based partly on self-reported medical history in REGARDS. Participants over-reporting or under-reporting medical conditions may have resulted in misclassification of true high risk status, diluting the ability of algorithms to correctly identify target groups. Second, we did not incorporate Medicare Part D pharmacy claims into our algorithms. Part D was implemented in 2006, with high penetration in the Medicare population by 2007. However, most REGARDS in-home visits occurred before 2007, and few REGARDS participants had Part D coverage prior to their REGARDS in-home visit. Including Part D pharmacy claims may improve the ability to identify groups at high cardiovascular risk or with uncontrolled LDL cholesterol. Third, the sample of REGARDS participants linked to Medicare may not be representative of the overall Medicare population enrolled in Parts A and B, where we would hope to apply these algorithms. Also, Medicare Advantage plan enrollees were not included in our study, so our algorithms may not be generalizable to that subset of Medicare beneficiaries. Fourth, use of claims data for identifying health-related variables is limited by potential inaccuracies in the claims. For example, administrative coding of diagnoses and procedures may be affected by reimbursement incentives, random or systematic coding errors, or mismatches in diagnostic resolution of available codes versus diagnostic resolution in clinical practice; these potential biases may also fluctuate over time. This could be a particular problem for documenting behavioral risk factors like smoking. To maximize the accuracy of our algorithms we used previously validated claims-based definitions of diagnoses and procedures when such were available. Fifth, we used an area-level income variable from Census data as a pre-specified variable in our algorithms. 
Incorporating individual-level income data for Medicare beneficiaries could strengthen the algorithms.

Conclusions
In summary, we have demonstrated that claims-based algorithms can be used to identify Medicare beneficiaries at high risk for CHD events. Despite not having clinical or laboratory data, Medicare claims have potential as a data source for pharmacovigilance studies when groups at high risk for CHD events need to be identified. Improving algorithms for identifying subsets of high risk or groups with uncontrolled LDL cholesterol will require further work. Representativeness and generalizability will need to be considered in interpreting the findings of future studies conducted in cohorts identified by these algorithms.