Identifying and understanding determinants of high healthcare costs for breast cancer: a quantile regression machine learning approach

Background To identify and rank the key determinants of high medical expenses among breast cancer patients and to understand the underlying effects of these determinants. Methods Data from the Oncology Care Model (OCM), developed by the Center for Medicare & Medicaid Innovation, were used. The OCM data provided to Mount Sinai on 2938 breast cancer episodes included both baseline periods and three performance periods between Jan 1, 2012 and Jan 1, 2018. We included 11 variables representing information on treatment, demography and socioeconomic status, in addition to episode expenditures. OCM data were collected from participating practices and payers. We applied a principled variable selection algorithm using a flexible tree-based machine learning technique, Quantile Regression Forests. Results We found that the use of chemotherapy drugs (versus hormonal therapy) and the interval of days without chemotherapy predominantly affected medical expenses among high-cost breast cancer patients. The second-tier major determinants were comorbidities and age. Receipt of surgery or radiation, geographically adjusted relative cost and insurance type were also identified as important high-cost drivers. These factors had disproportionately larger effects on high-cost patients. Conclusions Data-driven machine learning methods provide insights into the underlying web of factors driving up the costs of breast cancer care management. Results from our study may help inform population health management initiatives and allow policymakers to develop tailored interventions to meet the needs of high-cost patients and to avoid waste of scarce resources.


Background
It is well known that healthcare costs are concentrated among a small group of 'high-cost' patients [1]. Although they receive substantial care, many have unmet critical healthcare needs and receive unnecessary and ineffective treatments [2][3][4][5]. This suggests that 'high-need, high-cost' patients are a natural target for healthcare quality improvement and cost reduction. In the US, providers and insurance plans have sought to develop care coordination and disease management programs to reduce hospital use and costs [6]. Research has shown that these programs are more effective when they are targeted to the patients who are most likely to benefit [2,7,8]. Studies have looked into developing predictive models to identify high-cost patients prospectively [9]. Little is known, however, about the relative importance of clinical characteristics and demographic and socioeconomic status to the distribution of health expenditures. Identifying the major underlying drivers of high healthcare costs and understanding how they are linked to different percentiles of the cost distribution, especially the upper tail where medical expenditures are concentrated, will provide insights into designing effective and tailored interventions to meet the needs of high-cost patients and reduce costs.
Breast cancer is the most commonly diagnosed cancer among women in the US, accounting for 29% of all newly diagnosed female cancers each year [10]. The costs of breast cancer treatment and follow-up care put a strain on both the healthcare system and patients. Cost of care in the first year after diagnosis varies from $54,664 to $127,444 depending on the stage at which breast cancer was diagnosed, based on claims data from private insurers from 2003 to 2010 [11]. If measured by episode as defined by the Oncology Care Model (OCM), a payment model developed by the Center for Medicare & Medicaid Innovation (CMMI), the total Medicare expenditure for breast cancer is $20,887 per episode on average, with chemotherapy, the largest component, accounting for 25.9% of total spending [12].
The OCM is a new payment and delivery model that began on July 1, 2016 and runs through Jun 30, 2021. It is designed to improve the effectiveness and efficacy of specialty care. It aims to encourage participating practices to improve care and lower costs for Medicare fee-for-service beneficiaries with cancer through an episode-based payment model that financially incentivizes high-quality, coordinated care. The OCM collects rich information on episodes and patients from nearly 200 practices and 17 payers, including the Centers for Medicare & Medicaid Services (CMS), and is well suited for health services research. Because the main goal of the OCM is to set a target price against which the actual costs of participating providers can be compared, current research using the OCM data generally focuses on expense prediction [13]. How the underlying drivers of high costs for cancer care affect high-cost patients remains largely unexplored [14]. In this article, we leverage the large number of breast cancer episodes captured in the OCM data to establish the role of key drivers of high costs for breast cancer patients. We believe this is the first study to use the OCM data to clarify the underlying drivers of high costs for cancer management.
Expenditure data are typically skewed and heteroscedastic. Figure 1 shows a histogram of OCM episode expenditures for breast cancer. The skewness measure is 1.67, indicating that the expenditure distribution is highly right-skewed. Quantile regression (QR) methods are well suited to estimating how specified quantiles (percentiles) of the outcome distribution vary with covariates; they are robust against outliers and more informative for a skewed distribution than mean-based regression [15]. We demonstrate the value of a highly flexible machine learning based quantile regression method in studying healthcare expenditures.
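To make the contrast between mean-based and quantile-based summaries concrete, the following minimal Python sketch computes a moment-based skewness and a few sample quantiles for a small set of hypothetical episode costs. The data are invented for illustration and are not OCM data, and the skewness formula (third central moment over the 1.5 power of the second) is an assumption, since the text does not state which skewness estimator was used.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def empirical_quantile(values, tau):
    """tau-th sample quantile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = tau * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# Hypothetical episode costs in thousands of dollars (illustration only).
costs = [4, 5, 5, 6, 7, 8, 9, 12, 30, 64]

mean_cost = sum(costs) / len(costs)
median_cost = empirical_quantile(costs, 0.5)
p90_cost = empirical_quantile(costs, 0.9)

print("skewness:", round(sample_skewness(costs), 2))
print("mean:", mean_cost, "median:", median_cost, "90th percentile:", round(p90_cost, 1))
```

In this toy sample two expensive episodes pull the mean well above the median, which is the situation in which a summary of the upper quantiles is more informative than the mean alone.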
We used episode-based expenditure data on breast cancer, drawn from the OCM, and included 11 variables representing information on treatment, demography and socioeconomic status. We then exploited quantile regression forests (QRFs), a machine learning modeling technique, to rank the relative importance of the covariates, and proposed and implemented a principled algorithm to identify a set of major determinants of high episode costs. We further quantified the effects of the identified major determinants on different quantiles of episode expenditures and emphasized the new insights that can be gained about high-cost patients.

Methods
We extracted the cost and episode/patient related information from the data that OCM provided to Mount Sinai Hospital, which is a participating institution. The OCM is a voluntary 5-year episode-based payment program developed by the CMMI, which started in 2016 among 194 US oncology provider groups with the baseline period between January 2012 and June 2015. It was set to continue for 5 years, with the goal of improving care coordination and lowering care costs through episode-based cost performance and quality measures [16,17].
The cost is arranged at the episode level. Each episode is triggered by either an outpatient chemotherapy claim with a corresponding cancer diagnosis on the claim, or the filling of a prescription for Part D covered chemotherapy [18]. An episode lasts 6 months from the triggering event, or until the patient's death if that occurs earlier. The eligibility criteria for a beneficiary's episode to be included in OCM are: 1) beneficiary is enrolled in Medicare Parts A and B; 2) beneficiary does not receive the Medicare End Stage Renal Disease benefit; 3) beneficiary has Medicare as his or her primary payer; 4) beneficiary is not covered under Medicare Advantage or any other group health program; 5) beneficiary received chemotherapy treatment for cancer; 6) beneficiary has at least one qualifying Evaluation & Management visit during the 6 months of the episode. Episodes in which a beneficiary dies or elects hospice care before the end of 6 months are considered eligible; death is the only case in which an episode is shorter than 6 months [13]. The Mount Sinai OCM data included 2938 breast cancer episodes from 1333 patients in both the baseline periods and three performance periods between Jan 1, 2012 and Jan 1, 2018, with the last episode ending on June 30, 2018. All episodes were included in our analysis, and there were no missing values.
We defined the actual cost associated with each episode as the outcome. It is the Medicare fee-for-service (FFS) expenditures incurred during each episode, which include all Medicare Part A and Part B FFS expenditures (including the OCM Monthly Enhanced Oncology Services payments), certain Part D expenditures, and payments resulting from overlapping participation in other Centers for Medicare & Medicaid Services models. We included 11 covariates used in the OCM risk adjustment model [13]:
(1) Age.
(2) Sex.
(3) Chemotherapy drugs taken or administered during the episode, grouped into two categories: Part D (only Part D chemotherapy or long-term oral endocrine chemotherapy), such as tamoxifen and an aromatase inhibitor, and Part B (Part B chemotherapy or other therapies), such as antineo and cetuximab. The drugs included in each category can be found in the OCM therapy drug list provided by CMS [19]. Breast cancer episodes involving only Part D or long-term oral endocrine chemotherapy tend to be much less costly than episodes that involve other therapies.
(4) Receipt of cancer-related surgery.
(5) Part D eligibility and dual eligibility for Medicare and Medicaid.
(6) Receipt of radiation therapy.
(7) Clinical trial participation.
(8) Comorbidities, measured through a subset of the CMS Hierarchical Condition Category (HCC) flags. These flags are created by CMS on a calendar year basis and indicate treatment for 70 different conditions in the prior calendar year. The number of HCC flags that are "turned on" reflects the number of pre-existing comorbidities, and episode expenditures tend to increase with this number. Based on the number of HCC flags, we classified comorbidity into 6 categories: 0 flags, 1 flag, 2 flags, 3 flags, 4 or more flags, and new enrollee.
(9) History of prior chemotherapy use, denoted by "clean period". The clean period is calculated as the episode start date minus the date of the most recent chemotherapy claim before the episode start date, and is grouped into three categories as in OCM: 1 to 61 days; 62 to 730 days; and more than 730 days or no prior chemotherapy claims.
(10) Institutional status, indicating whether the beneficiary had been institutionalized in a long-term care facility for more than 90 days as of the month in which the episode started.
(11) Hospital Referral Region (HRR) relative cost, which captures the percentage difference in average episode costs between a given HRR and all HRRs. It is formulated as: HRR relative cost = [(Average episode cost for the HRR / Average episode cost across all HRRs) - 1] * 100. Based on this, a geographic adjustment is made to distinguish episodes occurring in high- and low-cost areas.
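The derived covariates described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the category labels and the example numbers are assumptions introduced here, not part of the OCM specification or the authors' code, and the boundary of the longest clean-period category is read as "more than 730 days".

```python
from datetime import date


def hrr_relative_cost(hrr_mean_cost, all_hrr_mean_cost):
    """Percentage difference between one HRR's mean episode cost and the all-HRR mean."""
    return (hrr_mean_cost / all_hrr_mean_cost - 1.0) * 100.0


def clean_period_days(episode_start, last_chemo_claim):
    """Days from the most recent prior chemotherapy claim to the episode start.

    Returns None when there is no prior chemotherapy claim.
    """
    if last_chemo_claim is None:
        return None
    return (episode_start - last_chemo_claim).days


def clean_period_category(days):
    """Map a clean period in days to the three categories used in the text."""
    if days is None or days > 730:
        return "more than 730 days or no prior claim"
    if days >= 62:
        return "62-730 days"
    if days >= 1:
        return "1-61 days"
    raise ValueError("the prior claim must precede the episode start date")


def comorbidity_category(hcc_flag_count, new_enrollee=False):
    """Map the number of HCC flags to the six categories used in the text."""
    if new_enrollee:
        return "new enrollee"
    if hcc_flag_count >= 4:
        return "4 or more flags"
    return "1 flag" if hcc_flag_count == 1 else f"{hcc_flag_count} flags"


# Hypothetical values for illustration.
print(hrr_relative_cost(24000.0, 20000.0))  # an HRR about 20% costlier than average
days = clean_period_days(date(2017, 1, 1), date(2016, 12, 1))
print(days, clean_period_category(days))
print(comorbidity_category(5), "|", comorbidity_category(0, new_enrollee=True))
```

A positive HRR relative cost marks a region that is more expensive than the all-HRR average, and a negative value a cheaper one.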
The distribution of episode costs for each factor variable is summarized in Table 1, and scatterplots of episode costs for two continuous variables, age and HRR relative cost, are presented in Fig. 2. Our final analytical dataset included 2938 breast cancer episodes.
We applied a nonparametric machine learning technique, QRFs, to the OCM expenditure data. QRFs extend the framework of random forests (RFs). An RF consists of an ensemble of classification and regression trees, each of which is learned from a bootstrapped sample via binary recursive splitting. RFs are adept at capturing interactions and nonlinearities [20]. Owing to their high prediction accuracy and adaptability, RFs have gained popularity in medical research [20][21][22][23][24][25][26]. QRFs build on RFs and provide an accurate way of estimating conditional quantiles (rather than the mean) for multivariate covariates [27]. QRFs grow an ensemble of regression trees as in the standard RF algorithm, but for each node in each tree, QRFs keep the values of all observations in the node instead of just their mean as in RFs. Using the entire distribution of the observations, QRFs can examine the effects of exposure on different quantiles and provide a fuller picture of the exposure-response relationship than mean-based RFs. For model validation, because the QRFs model performs prediction using the out-of-bag (OOB) observations, i.e., the samples left out as testing data in each decision tree construction, it can provide its own internal estimate of predictive performance that correlates well with either cross-validation estimates or test set estimates [28]. We also assessed the goodness of fit of our QRFs model using the metric R1, defined as 1 minus the ratio between the sum of absolute deviations in our QRFs model and the sum of absolute deviations in the null (non-conditional) quantile model [29]. We implemented a backward stepwise variable selection algorithm, which we previously developed, based on the variable importance scores generated by QRFs to determine the key factors for the 90th percentile of the episode expenditures [24].
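The idea that QRFs retain all observations in each leaf, rather than only the leaf mean, can be illustrated with a small Python sketch. It assumes the trees have already been grown and that we only know which leaf each training observation and the query point fall into; the leaf assignments and outcome values below are hypothetical, and the code follows the general weighting idea behind quantile regression forests rather than the internals of the "quantregForest" package.

```python
def qrf_weights(train_leaf_ids, query_leaf_ids):
    """Weight of each training observation for one query point.

    train_leaf_ids[t][i] is the leaf of training observation i in tree t;
    query_leaf_ids[t] is the leaf the query point falls into in tree t.
    Each tree spreads a total weight of 1 / n_trees evenly over the training
    observations that share the query point's leaf.
    """
    n_trees = len(train_leaf_ids)
    weights = [0.0] * len(train_leaf_ids[0])
    for leaves, query_leaf in zip(train_leaf_ids, query_leaf_ids):
        members = [i for i, leaf in enumerate(leaves) if leaf == query_leaf]
        for i in members:
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def weighted_quantile(y, weights, tau):
    """Smallest outcome value whose cumulative weight reaches tau."""
    pairs = sorted(zip(y, weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= tau - 1e-12:  # tolerance for floating-point sums
            return value
    return pairs[-1][0]


# Hypothetical training outcomes (thousands of dollars) and leaf memberships.
y_train = [5, 8, 12, 40, 90]
train_leaf_ids = [
    [0, 0, 1, 1, 1],  # tree 1
    [0, 1, 1, 1, 2],  # tree 2
]
query_leaf_ids = [1, 1]  # the query point lands in leaf 1 of both trees

w = qrf_weights(train_leaf_ids, query_leaf_ids)
conditional_mean = sum(wi * yi for wi, yi in zip(w, y_train))
print("weights:", [round(wi, 3) for wi in w])
print("mean-type prediction:", round(conditional_mean, 2))
print("median:", weighted_quantile(y_train, w, 0.5))
print("90th percentile:", weighted_quantile(y_train, w, 0.9))
```

The same weights yield both the usual RF mean prediction and any conditional quantile, which is why a single forest can be queried at the 10th, 50th and 90th percentiles.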
The 90th percentile is commonly used in practice as the threshold for high-cost patients because the 10% of the population above the 90th percentile represents the group that incurred a disproportionately large share of all expenditures [9,30]. The algorithm is summarized in Fig. 3. Details of the algorithm have been described elsewhere [24]. To obtain a reduced set of informative clinical characteristics associated with the upper tail of the episode costs, we implemented a backward stepwise QRFs procedure. At each step, we removed the least important variable, rebuilt a QRFs model with the remaining variables, and recorded the OOB average quantile loss (AQL), until no variable was left. AQL assesses the prediction error of the τ-th (e.g., τ = 0.9) conditional quantile by averaging the quantile loss function over all observations [31,32]. We identified the key determinants of the 90th percentile of the episode costs for breast cancer as the set of covariates corresponding to the QRFs model with the smallest AQL. Furthermore, we evaluated the relative importance of a variable by the reduction in AQL induced by the inclusion of that variable in the QRFs model. Finally, to "unblackbox" machine learning, we included the major factors selected by QRFs in a classical linear QR model to quantify the effects of each factor on different quantiles of the episode expenditures. We used natural cubic splines with three degrees of freedom to model the smoothed effects of two continuous variables, age and HRR relative cost. All statistical analyses were performed using R version 3.6.1. QRFs models were built using the "quantregForest" R package. Figure 4 shows, for the 90th percentile, the estimated OOB AQL for each QRFs model built at each iteration of our stepwise backward algorithm.
The "optimal" QRFs model with the smallest prediction error suggests eight determinants for the upper tail of the cost distribution: chemotherapy drugs used or administered, chemotherapy clean period, radiation therapy, eligibility for Medicare and Medicaid, age, comorbidities, HRR relative cost and surgery. The goodness-of-fit measure R1 of our QRFs model for the 90th percentile was 0.78, indicating a reasonably good model fit.
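The quantile loss, the AQL, the R1 measure and the backward stepwise search described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: "absolute deviations" in R1 is read here as the τ-weighted (check function) deviations, `fit_and_score` is a hypothetical callback standing in for refitting a QRFs model and returning its OOB AQL and variable importance scores, and `TOY_GAIN` is an invented stand-in that makes the example runnable.

```python
def quantile_loss(y, pred, tau):
    """Check (pinball) loss for a single observation at quantile level tau."""
    diff = y - pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def average_quantile_loss(ys, preds, tau):
    """AQL: the quantile loss averaged over all observations."""
    return sum(quantile_loss(y, p, tau) for y, p in zip(ys, preds)) / len(ys)


def r1(ys, preds, null_quantile, tau):
    """1 minus the ratio of the fitted model's loss to the null model's loss."""
    fitted = sum(quantile_loss(y, p, tau) for y, p in zip(ys, preds))
    null = sum(quantile_loss(y, null_quantile, tau) for y in ys)
    return 1.0 - fitted / null


def backward_select(variables, fit_and_score):
    """Drop the least important variable at each step; keep the lowest-AQL set.

    fit_and_score(vars) must return (aql, importance), where importance maps
    each variable name in vars to its importance score.
    """
    remaining = list(variables)
    path = []
    while remaining:
        aql, importance = fit_and_score(remaining)
        path.append((list(remaining), aql))
        least_important = min(remaining, key=lambda v: importance[v])
        remaining.remove(least_important)
    best_vars, best_aql = min(path, key=lambda step: step[1])
    return best_vars, best_aql, path


# Hypothetical stand-in for refitting a model: each variable changes the AQL
# by a fixed amount, and two of them are pure noise that makes it worse.
TOY_GAIN = {"chemo_drug": 3.0, "clean_period": 1.5, "age": 0.5,
            "sex": -0.1, "trial": -0.2}


def toy_fit_and_score(variables):
    aql = 10.0 - sum(TOY_GAIN[v] for v in variables)
    return aql, {v: TOY_GAIN[v] for v in variables}


best_vars, best_aql, path = backward_select(list(TOY_GAIN), toy_fit_and_score)
print("selected:", best_vars, "AQL:", best_aql)
```

With τ = 0.9 the loss penalizes under-prediction nine times as heavily as over-prediction, so the AQL rewards models that track the upper tail of the cost distribution.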

Results
The relative importance of each variable is also implied in Fig. 4. Higher numbered variables were removed from the QRFs model earlier than lower numbered variables. The drop in AQL induced by the inclusion of a variable implies the importance of that variable to the outcome. Taken together, chemotherapy drugs used or administered during episodes and chemotherapy clean period were the two predominant factors of the 90th percentile of the episode expenditure; they jointly accounted for 77% of the total reduction in AQL from the null model (with no covariates) to the optimal model (with eight key determinants).
Fig. 4 Estimated out-of-bag average quantile loss for the 90th percentile of episode expenditures corresponding to each QRFs model, which includes the remaining k variables (numbered by 1, 2, …, k) after sequentially removing variables (numbered by k + 1, …, 11) with lower importance scores, where k = 1, 2, …, 11. The null model is the intercept only model
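The share of the total AQL reduction attributed to each variable can be computed from the sequence of AQL values of the nested models. The Python sketch below uses invented AQL values and placeholder variable names purely for illustration; they are not the values behind Fig. 4.

```python
def aql_reduction_shares(aql_path, names):
    """Share of the total AQL reduction contributed by each added variable.

    aql_path[0] is the AQL of the null (intercept only) model and aql_path[j]
    the AQL of the model containing the j most important variables, so
    len(aql_path) == len(names) + 1.
    """
    total_reduction = aql_path[0] - aql_path[-1]
    return {
        name: (aql_path[j] - aql_path[j + 1]) / total_reduction
        for j, name in enumerate(names)
    }


# Hypothetical AQL values: null model, then models with 1, 2, 3 and 4 variables.
aql_path = [10.0, 6.0, 4.0, 3.0, 2.0]
names = ["var_1", "var_2", "var_3", "var_4"]

shares = aql_reduction_shares(aql_path, names)
top_two_share = shares["var_1"] + shares["var_2"]
print(shares)
print("joint share of the two most important variables:", top_two_share)
```

In this toy sequence the first two variables account for three quarters of the reduction from the null model to the final model, which is the kind of statement made about the two chemotherapy-related factors above.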
We further provided an "unblackboxing" analysis to quantify the effects of the identified key factors on the episode expenditures. To demonstrate that a variable may have different effects across quantiles of the outcome distribution, we examined the respective effects on the 90th (upper tail), 75th, 50th (median), 25th and 10th (lower tail) quantiles. To explore possible nonlinear age effects, we also fitted a separate model using natural cubic splines with three degrees of freedom to capture the smoothed effects of age. Table 2 summarizes the point estimates and 95% confidence intervals for each of the eight major determinants. First, compared to long-term hormone therapy, other non-chemotherapy drugs, Part B drugs and Part D drugs were all associated with higher costs across all percentiles of the cost distribution. With the largest effect estimates, Part B drugs were the most expensive drugs for breast cancer. Both short (1-61 days) and long (> 730 days) periods of no chemotherapy were linked to higher costs among high-cost patients compared to the quiescent phase of treatment (clean period of 62-730 days), suggesting a "U" shape with the highest costs at onset of disease and at death [33]. Radiation, surgery and multimorbidity were all associated with higher costs across different quantiles. While full medication insurance in general incurred higher costs than other partial insurance types, eligibility for Medicare and Medicaid was associated only with median costs, with inconclusive effects on other percentiles of the cost distribution. There was a strong association between HRR relative cost and the episode expenditures among high-cost patients: every 30-unit increase in HRR relative cost was associated with $1800 (95% CI, $1000 to $2600) higher expenditure at the 90th quantile, compared with $100 (95% CI, $0 to $200) higher expenditure at the 10th quantile.
This finding is consistent with previous findings that individuals living in high-cost areas go on to use more hospital resources [34]. With age, on average, there was a decreasing trend: older patients were associated with lower episode expenditures, and this trend was more evident among high-cost patients (e.g., 90th percentile) than among low-cost patients (e.g., 10th percentile). The fitted splines of age in Fig. 5 suggest a nonlinear effect of age on costs. The 90th percentile of the costs was highest among patients aged 50-55, then gradually decreased through age 80 before turning up towards the end of life.
Table 2 The effects (point estimate [95% confidence interval]) of eight major factor variables on episode expenditures varied across the 10th, 25th, 50th, 75th and 90th quantile of the expenditure distribution. Effects are measured in thousands of dollars
Second, our results demonstrate that the effects of the eight determinants on episode-based expenditures are not uniform, but are in general disproportionately larger in the right tail of the cost distribution, i.e., among those who already have the highest expenditures. For example, compared to long-term hormone therapy, Part B chemotherapy drugs cost $53,800 (95% CI, $49,900 to $56,800) more among high-cost patients (in the right tail) and $10,200 more among low-cost patients (in the left tail). Compared to the quiescent period with a chemotherapy clean interval of 62-730 days, a clean interval of 1-61 days (e.g., around the onset of disease) cost $6300 more among high-cost patients and only $100 more among low-cost patients. These findings suggest that our QR based analyses provide a fuller picture of the effects of exposures. For example, the effect of HRR relative cost is negligible among low-cost patients (10th percentile) but markedly evident among high-cost patients (90th percentile).
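How a single factor can have very different effects at different quantiles can be seen in the simplest possible case. For a quantile regression with an intercept and one binary covariate, the coefficient at quantile τ equals the difference between the two groups' τ-th sample quantiles (up to the convention used for sample quantiles). The Python sketch below uses that special case with invented costs; it is not the eight-covariate linear QR model fitted in this study, and the group labels are placeholders.

```python
def empirical_quantile(values, tau):
    """tau-th sample quantile, interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = tau * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def quantile_effects(reference, exposed, taus):
    """Difference in group quantiles (exposed minus reference) at each tau."""
    return {
        tau: empirical_quantile(exposed, tau) - empirical_quantile(reference, tau)
        for tau in taus
    }


# Hypothetical episode costs in thousands of dollars (illustration only).
reference_group = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8]        # e.g., hormonal therapy only
exposed_group = [10, 12, 15, 18, 22, 28, 35, 45, 60, 80]  # e.g., Part B drugs

effects = quantile_effects(reference_group, exposed_group, [0.1, 0.5, 0.9])
mean_effect = (sum(exposed_group) / len(exposed_group)
               - sum(reference_group) / len(reference_group))

for tau, effect in effects.items():
    print(f"quantile {tau}: effect = {effect:.1f}")
print(f"difference in means: {mean_effect:.1f}")
```

A single difference in means would report one number for this comparison, whereas the quantile-specific differences show that the gap widens towards the upper tail.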

Discussion
In this study, we applied a robust and reproducible machine learning based approach to identify major factors for high-cost breast cancer patients when the cost distribution was highly skewed, and investigated the underlying effect mechanisms of the major factors, leveraging a high-performance nonparametric quantile regression technique, QRFs. We exploited a Mount Sinai OCM cost data set of nearly 3000 breast cancer episodes with episode-based clinical information and demographic and socioeconomic status.
Our results provided insights into drivers of high medical costs for breast cancer. Our approach identified eight determinants that jointly impact episode-based expenditures for breast cancer among high-cost patients. Among these factors, chemotherapy drugs and chemotherapy clean period were the two predominantly influential variables, followed by the number of comorbidities and age. These determinants did not affect expenditures uniformly, but disproportionately affected the high-cost patients, and their effects on low-cost patients may be negligible. Mean-based methods would have ignored this disproportionality in the effect estimates, leading to limited and biased conclusions. Our approach offered a "higher-resolution" analysis that can be used to expand and deepen the existing quantitative evidence on clinical risk factors for episode expenditures.
Results from our study may help inform population health management initiatives. Establishing key determinants for high-cost cancer patients allows policymakers to develop tailored interventions to meet the needs of those high-cost patients and to reduce high cancer costs. For example, among those who are already high-cost patients, the age cohort 50-55 was found to be associated with the highest costs. Developing strategies to reduce care spending tailored to this age cohort may help avoid waste of scarce resources. Part B chemotherapy drugs, a chemotherapy clean interval of 1-61 days and multimorbidity were all drivers of high costs among those who already had the highest spending. These findings may provide insights into strategies for expanding the scope of care management programs investigating preventable spending. Currently such programs are relatively narrow and could include broader measures of preventable or wasteful spending [6]. Our results may assist in developing algorithms targeted at subgroups defined by the identified underlying high-cost drivers to avoid preventable costs through interventions such as reducing duplicate services, contraindicated care, unnecessary laboratory testing or prolonged hospitalizations [6].
Fig. 5 Effect estimates of age on the 10th, 50th and 90th quantile of the episode cost distribution, using natural cubic splines. To obtain sufficient legibility, we did not plot results for the 25th and 75th quantile
There are several limitations in this study. First, the relationships between clinical and health characteristics and medical costs do not bear a causal interpretation, owing to the cross-sectional nature of the data [35][36][37]. However, our results identified important factors of high costs for breast cancer and can potentially stimulate future causal inference research in cost analysis. Second, the cost data for this study, made available by CMS, have both strengths and weaknesses. These single-payer data provide a comprehensive, consistent dataset that includes all of the health care services provided to a patient. However, the dataset is limited to an elderly population and may not reflect cost drivers for commercially insured members [38]. Also, the Medicare dataset included Medicare payments only, and did not incorporate out-of-pocket expenses, which can be significant for medications in Part D. Third, our data are from a single institution. Despite the lack of national representation, because the Mount Sinai Hospital is one of the nation's largest hospitals, we were able to include a large number of episodes in our analysis. Our methods are highly flexible and reproducible, and can be applied to a larger set of OCM data for breast cancer or to similar data sets for other kinds of cancer. Finally, there could be other important variables that were not included in our study, either unmeasured or not collected in our data, such as an accurate measure of disease progression [39]. Though the type of drugs at some level reflects disease severity, cancer stage is not collected in the OCM data. CMS is working to expand the factors of the OCM to consider disease progression. Developing a sensitivity analysis strategy to evaluate the impact of unobserved data could be a worthwhile contribution [40].
Despite the potential omitted variables, by using an innovative and principled machine learning approach on a high-quality dataset with a sufficiently large sample size, we believe the scope and depth of our analysis can provide important insights for policymaking and lead to more innovative investigations in the area of breast cancer health services research.
Uncovering true underlying determinants and their relative importance is challenging, especially when the exposure-outcome relationship may be nonlinear and the outcome is heavily skewed.
In public health research, determinants are often selected a priori or using test procedures based on some arbitrary threshold value. On the other hand, many cost analyses focus on building predictive models to identify high-cost patients. It remains unclear how the underlying complex web of factors drives up the costs of breast cancer care. Our method is highly agnostic, leverages flexible machine learning, and provides a "higher-resolution" analysis that yields specific insights into important drivers of high costs and their detailed effect mechanisms among patients with varying levels of costs. In conjunction with the relative importance of determinants, our method can provide valuable guidance for tailored and effective high-cost prevention interventions.

Conclusions
High-performance and data-driven machine learning methods provide insights into the underlying web of factors driving up the costs of breast cancer care management. Results from our study may help inform population health management initiatives and allow policymakers to develop tailored interventions to meet the needs of high-cost patients and to avoid waste of scarce resources.