
Identifying and understanding determinants of high healthcare costs for breast cancer: a quantile regression machine learning approach



To identify and rank the importance of key determinants of high medical expenses among breast cancer patients and to understand the underlying effects of these determinants.


We used data from the Oncology Care Model (OCM), developed by the Center for Medicare & Medicaid Innovation. The OCM data provided to Mount Sinai comprised 2938 breast cancer episodes and covered both the baseline periods and three performance periods between Jan 1, 2012 and Jan 1, 2018. We included 11 variables representing information on treatment, demography and socioeconomic status, in addition to episode expenditures. OCM data were collected from participating practices and payers. We applied a principled variable selection algorithm using a flexible tree-based machine learning technique, Quantile Regression Forests.


We found that the use of chemotherapy drugs (versus hormonal therapy) and interval of days without chemotherapy predominantly affected medical expenses among high-cost breast cancer patients. The second-tier major determinants were comorbidities and age. Receipt of surgery or radiation, geographically adjusted relative cost and insurance type were also identified as important high-cost drivers. These factors had disproportionally larger effects upon the high-cost patients.


Data-driven machine learning methods provide insights into the underlying web of factors driving up the costs of breast cancer care management. Results from our study may help inform population health management initiatives and allow policymakers to develop tailored interventions to meet the needs of high-cost patients and to avoid wasting scarce resources.



It is well known that healthcare costs are concentrated among a small group of ‘high-cost’ patients [1]. Although these patients receive substantial care, many have unmet critical healthcare needs and receive unnecessary and ineffective treatments [2,3,4,5]. This suggests that ‘high-need, high-cost’ patients are a natural target for healthcare quality improvement and cost reduction. In the US, providers and insurance plans have sought to develop care coordination and disease management programs to reduce hospital use and costs [6]. Research has shown that these programs are more effective when they are targeted to the patients most likely to benefit [2, 7, 8]. Studies have looked into developing predictive models to identify high-cost patients prospectively [9]. Little is known, however, about the relative importance of clinical characteristics and demographic and socioeconomic status to the distribution of health expenditures. Identifying the major underlying drivers of high healthcare costs and understanding how they are linked to different percentiles of the cost distribution, especially the upper tail where medical expenditures are concentrated, will provide insights into designing effective and tailored interventions to meet the needs of high-cost patients and reduce costs.

Breast cancer is the most common cancer diagnosis among women in the US, accounting for 29% of all newly diagnosed female cancers each year [10]. The costs of breast cancer treatment and follow-up care put a strain on both the healthcare system and patients. The cost of care in the first year after diagnosis varies from $54,664 to $127,444 depending on the stage at which breast cancer was diagnosed, based on claims data from private insurers from 2003 to 2010 [11]. If measured by episode as defined by the Oncology Care Model (OCM), a payment model developed by the Center for Medicare & Medicaid Innovation (CMMI), the total Medicare expenditure for breast cancer is $20,887 per episode on average, with the largest component, chemotherapy, accounting for 25.9% of the total spending [12].

The OCM is a new payment and delivery model that began on July 1, 2016 and runs through Jun 30, 2021. It is designed to improve the effectiveness and efficiency of specialty care. It aims to encourage participating practices to improve care and lower costs for Medicare fee-for-service beneficiaries with cancer through an episode-based payment model that financially incentivizes high-quality, coordinated care. The OCM collects rich information on episodes and patients from nearly 200 practices and 17 payers, including the Centers for Medicare & Medicaid Services (CMS), and is well suited for health services research. Since the main goal of the OCM is to set the target price so that the performance of participating providers can be measured by comparing the actual cost to the target price, current research utilizing the OCM data generally focuses on expense prediction [13]. Investigating the underlying drivers of high costs for cancer care and how they affect high-cost patients is a largely untapped area [14]. In this article, we leverage the large number of breast cancer episodes captured in the OCM data and establish the role of key drivers of high costs for breast cancer patients. We believe this is the first study to use the OCM data to clarify the underlying drivers of high costs for cancer management.

Expenditure data are typically skewed and heteroscedastic. Figure 1 shows a histogram of OCM episode expenditures for breast cancer. The skewness measure is 1.67, indicating that the expenditure distribution is highly skewed. Quantile regression (QR) methods are well suited to estimating how specified quantiles, or percentiles, of the distribution of the outcome variable vary with covariates. They are robust against outliers and more informative for a skewed distribution than mean-based regression [15]. We demonstrate the value of a highly flexible machine learning based quantile regression method in studying healthcare expenditures.
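As a rough illustration of the skewness measure quoted above, the sketch below computes a moment-based sample skewness in Python. The article does not state which skewness estimator was used, so the formula is an assumption, and the cost values are invented for illustration only.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical episode costs in dollars: most are modest, one is very large.
toy_costs = [4000, 5000, 6000, 7000, 90000]
print(sample_skewness(toy_costs))  # positive when the right tail is long
```

A symmetric sample gives a value of zero, while a long right tail, as is typical for expenditures, gives a positive value.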

Fig. 1

Histogram of episode based expenditures

We used episode-based expenditure data on breast cancer, drawn from the OCM, and included 11 variables representing information on treatment, demography and socioeconomic status. We then used quantile regression random forests (QRFs), a machine learning modeling technique, to rank the relative importance of the covariates, and proposed and implemented a principled algorithm to identify a set of major determinants of high episode costs. We further quantified the effects of the identified major determinants on different quantiles of episode expenditures and emphasized the new insights that can be gained about high-cost patients.


We extracted the cost and episode/patient related information from the data that OCM provided to Mount Sinai Hospital, which is a participating institution. The OCM is a voluntary 5-year episode-based payment program developed by the CMMI, which started in 2016 among 194 US oncology provider groups with the baseline period between January 2012 and June 2015. It was set to continue for 5 years, with the goal of improving care coordination and lowering care costs through episode-based cost performance and quality measures [16, 17].

The cost is arranged at the episode level. Each episode is triggered by either an outpatient chemotherapy claim with a corresponding cancer diagnosis on the claim, or the filling of a prescription for Part D covered chemotherapy [18]. An episode lasts 6 months from the triggering event, or until the patient’s death if that occurs earlier. The eligibility criteria for a beneficiary’s episode to be included in the OCM are: 1) beneficiary is enrolled in Medicare Parts A and B; 2) beneficiary does not receive the Medicare End Stage Renal Disease benefit; 3) beneficiary has Medicare as his or her primary payer; 4) beneficiary is not covered under Medicare Advantage or any other group health program; 5) beneficiary received chemotherapy treatment for cancer; 6) beneficiary has at least one qualifying Evaluation & Management visit during the 6 months of the episode. Episodes in which a beneficiary dies or elects hospice care before the end of 6 months are considered eligible; death is the only case in which an episode is shorter than 6 months [13]. The Mount Sinai OCM data included 2938 breast cancer episodes from 1333 patients in both the baseline periods and three performance periods between Jan 1, 2012 and Jan 1, 2018, with the last episode ending on June 30, 2018. All episodes were included in our analysis, and there were no missing values.

We defined the actual cost associated with each episode as the outcome. It is the Medicare fee-for-service (FFS) expenditures incurred during each episode, which include all Medicare Part A and Part B FFS expenditures (including the OCM Monthly Enhanced Oncology Services payments), certain Part D expenditures, and payments resulting from overlapping participation in other Centers for Medicare & Medicaid Services models. We included 11 covariates used in the OCM risk adjustment model [13]. They were: (1) Age; (2) Sex; (3) Chemotherapy drugs taken or administered during the episode, grouped into two categories: Part D (only Part D chemotherapy or long-term oral endocrine chemotherapy), such as tamoxifen and aromatase inhibitors, and Part B (Part B chemotherapy or other therapies), such as antineo and cetuximab. The drugs included in each category can be found in the OCM therapy drug list provided by CMS [19]. Breast cancer episodes involving only Part D or long-term oral endocrine chemotherapy tend to be much less costly than episodes that involve other therapies; (4) Receipt of cancer-related surgery; (5) Part D eligibility and dual eligibility for Medicare and Medicaid; (6) Receipt of radiation therapy; (7) Clinical trial participation; (8) Comorbidities, measured through a subset of the CMS Hierarchical Condition Category (HCC) flags. These flags are created by CMS on a calendar year basis and indicate treatment for 70 different conditions in the prior calendar year. Episode expenditures tend to increase with the number of HCC flags that are “turned on”, that is, with the number of pre-existing comorbidities. Based on the number of HCC flags, we classified episodes into six categories: 0 flags, 1 flag, 2 flags, 3 flags, 4 or more flags, and new enrollee; (9) History of prior chemotherapy use, denoted by “clean period”.
The clean period is calculated as the episode start date minus the date of the most recent chemotherapy claim before the episode start date, and is grouped into three categories as in the OCM: 1 to 61 days; 62 to 730 days; and more than 730 days or no prior chemotherapy claims; (10) Institutional status, indicating whether the beneficiary had been institutionalized in a long-term care facility for more than 90 days as of the month in which the episode started; and (11) Hospital Referral Region (HRR) relative cost, which captures the percentage difference in average episode costs between a given HRR and all HRRs. It is formulated as: HRR relative cost = [(Average episode cost for the HRR / Average episode cost across all HRRs) – 1] × 100. Based on this, a geographic adjustment is made to distinguish episodes occurring in high- and low-cost areas.
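The two derived covariates described above, HRR relative cost and the clean period category, can be sketched in a few lines of Python. This is only an illustration of the definitions in the text: the function names, category labels and dollar amounts are hypothetical, and treating exactly 730 days as part of the middle category is an assumption, since the text does not settle that boundary.

```python
def hrr_relative_cost(avg_cost_hrr, avg_cost_all_hrrs):
    """Percentage difference between one HRR's average episode cost
    and the average episode cost across all HRRs."""
    return (avg_cost_hrr / avg_cost_all_hrrs - 1) * 100


def clean_period_category(days_since_last_chemo):
    """Map days since the most recent prior chemotherapy claim to a category.
    None means no prior chemotherapy claim was found."""
    if days_since_last_chemo is None or days_since_last_chemo > 730:
        return "more than 730 days or no prior chemotherapy claim"
    if days_since_last_chemo >= 62:
        return "62-730 days"  # assumption: exactly 730 days falls here
    if days_since_last_chemo >= 1:
        return "1-61 days"
    raise ValueError("days_since_last_chemo must be at least 1")


# Hypothetical figures: an HRR averaging $24,000 per episode against a
# $20,000 average across all HRRs is about 20% more costly.
print(hrr_relative_cost(24000, 20000))
print(clean_period_category(45))
```

A positive HRR relative cost marks a region that is more expensive than the all-HRR average, and a negative value marks a cheaper one.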

The distribution of episode costs for each factor variable is summarized in Table 1, and scatterplots of episode costs for two continuous variables, age and HRR relative cost, are presented in Fig. 2. Our final analytical dataset included 2938 breast cancer episodes.

Table 1 Distribution of episode costs (in dollars) for each factor variable
Fig. 2

Scatterplots of episode expenditures versus age (A) and HRR relative cost (B)

We applied a nonparametric machine learning technique, QRFs, to the OCM expenditure data. QRFs extend the framework of random forests (RFs). An RF consists of an ensemble of classification and regression trees, each of which is learned from a bootstrapped sample via binary recursive splitting. RFs are adept at capturing interactions and nonlinearities [20]. Owing to their high prediction accuracy and adaptability, RFs have gained popularity in medical research [20,21,22,23,24,25,26]. QRFs build on RFs and provide an accurate way of estimating conditional quantiles (rather than the conditional mean) given multivariate covariates [27]. QRFs grow an ensemble of regression trees as in the standard RF algorithm, but for each node in each tree, QRFs keep the values of all observations in the node instead of just their mean as in RFs. Using the entire distribution of the observations, QRFs can examine the effects of exposure on different quantiles and provide a fuller picture of the exposure-response relationship than mean-based RFs. For model validation, because the QRFs model performs prediction using the out-of-bag (OOB) observations (samples left out as testing data when each decision tree is constructed), it provides its own internal estimate of predictive performance, which correlates well with either cross-validation estimates or test set estimates [28]. We also assessed the goodness of fit of our QRFs model using the metric R1, defined as 1 minus the ratio between the sum of absolute deviations in our QRFs model and the sum of absolute deviations in the null (non-conditional) quantile model [29].
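The difference between a mean-based RF prediction and a QRFs prediction can be illustrated with a toy Python sketch. It follows the weighting idea described by Meinshausen [27] in simplified form and is not the implementation in the “quantregForest” package. The leaf memberships and cost values are invented, and all function names are hypothetical.

```python
from collections import defaultdict


def forest_weights(leaves):
    """leaves: one list per tree, holding the indices of the training
    observations that share a leaf with the new observation."""
    weights = defaultdict(float)
    for leaf in leaves:
        for i in leaf:
            # Each tree spreads a total weight of 1/n_trees over its leaf.
            weights[i] += 1.0 / (len(leaf) * len(leaves))
    return dict(weights)


def predict_mean(y, weights):
    """Standard RF prediction: weighted mean of the retained outcomes."""
    return sum(w * y[i] for i, w in weights.items())


def predict_quantile(y, weights, tau):
    """QRF-style prediction: smallest outcome whose cumulative weight
    reaches tau in the weighted empirical distribution."""
    cumulative = 0.0
    for i in sorted(weights, key=lambda i: y[i]):
        cumulative += weights[i]
        if cumulative >= tau - 1e-12:  # small tolerance for rounding
            return y[i]
    return max(y[i] for i in weights)


# Toy training costs and the leaves reached in a two-tree forest.
y_train = [1000, 2000, 3000, 50000]
w = forest_weights([[0, 1, 3], [1, 2, 3]])
print(predict_mean(y_train, w))
print(predict_quantile(y_train, w, 0.5))
print(predict_quantile(y_train, w, 0.9))
```

Because the retained outcomes include one very large value, the weighted mean is pulled far above the median, while the quantile predictions describe the lower and upper parts of the distribution separately.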

To obtain a reduced set of informative characteristics associated with the upper tail of the episode costs, we implemented a backward stepwise variable selection algorithm, which we previously developed, based on the variable importance scores generated by QRFs to determine the key factors for the 90th percentile of the episode expenditures [24]. The 90th percentile is commonly used in practice as the threshold for high-cost patients because the 10% of the population above the 90th percentile represents the group that incurs a disproportionately large share of all expenditures [9, 30]. The algorithm is summarized in Fig. 3, and its details have been described elsewhere [24]. At each step, we removed the least important variable, rebuilt a QRFs model with the remaining variables and recorded the OOB average quantile loss (AQL), until no variable was left. AQL assesses the prediction error of the τ-th (e.g., τ = 0.9) conditional quantile by averaging the quantile loss function over all observations [31, 32]. We identified the key determinants of the 90th percentile of the episode costs for breast cancer as the set of covariates corresponding to the QRFs model with the smallest AQL. Furthermore, we evaluated the relative importance of a variable by the reduction in AQL induced by the inclusion of that variable in the QRFs model.
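The selection loop described above can be sketched as follows. This is a simplified illustration and not the authors' code: `oob_aql` and `least_important` are hypothetical callables standing in for refitting a QRFs model and reading its variable importance scores, the quantile loss is written in its usual check-function form, and the toy gains are invented.

```python
def pinball_loss(y, q, tau):
    """Quantile (check) loss for one observation and one predicted quantile."""
    u = y - q
    return u * tau if u >= 0 else u * (tau - 1)


def average_quantile_loss(y_true, q_pred, tau):
    """AQL: quantile loss averaged over all observations."""
    losses = [pinball_loss(y, q, tau) for y, q in zip(y_true, q_pred)]
    return sum(losses) / len(losses)


def backward_stepwise(variables, oob_aql, least_important):
    """Drop the least important variable one at a time and keep the
    variable set whose model has the smallest out-of-bag AQL."""
    current = list(variables)
    path = []
    while current:
        path.append((tuple(current), oob_aql(current)))
        current.remove(least_important(current))
    best_vars, best_loss = min(path, key=lambda step: step[1])
    return best_vars, best_loss, path


# Toy stand-ins for model fitting: each variable changes the loss by a fixed
# amount, and a pure-noise variable makes the loss slightly worse.
toy_gain = {"drug_type": 5.0, "clean_period": 2.0, "noise": -0.5}


def toy_oob_aql(variables):
    return 10.0 - sum(toy_gain[v] for v in variables)


def toy_least_important(variables):
    return min(variables, key=toy_gain.get)


best_vars, best_loss, path = backward_stepwise(
    list(toy_gain), toy_oob_aql, toy_least_important
)
print(best_vars, best_loss)
```

In the toy run the noise variable is removed first, and the model with the two informative variables has the smallest loss, which mirrors how the optimal variable set is chosen from the full removal path.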

Fig. 3

Quantile regression forests variable selection algorithm

Finally, to “unblackbox” machine learning, we included the major factors selected by QRFs in a classical linear QR model to quantify the effects of each factor on different quantiles of the episode expenditures. We used natural cubic splines with three degrees of freedom to model the smoothed effects of two continuous variables, age and HRR relative cost. All statistical analyses were performed using R version 3.6.1. QRFs models were built using the “quantregForest” R package.


Figure 4 shows, for the 90th percentile, the estimated OOB AQL for each QRFs model built at each iteration of our stepwise backward algorithm. The “optimal” QRFs model with the smallest prediction error suggests eight determinants for the upper tail of the cost distribution: chemotherapy drugs used or administered, chemotherapy clean period, radiation therapy, eligibility for Medicare and Medicaid, age, comorbidities, HRR relative cost and surgery. The goodness-of-fit measure R1 of our QRFs model for the 90th percentile was 0.78, indicating a reasonably good model fit.

Fig. 4

Estimated out-of-bag average quantile loss for the 90th percentile of episode expenditures corresponding to each QRFs model, which includes the remaining k variables (numbered by 1, 2, …, k) after sequentially removing variables (numbered by k + 1, …, 11) with lower importance scores, where k = 1, 2, …, 11. The null model is the intercept only model

The relative importance of each variable is also implied in Fig. 4. Higher numbered variables were removed from the QRFs model earlier than lower numbered variables. The drop in AQL induced by the inclusion of a variable indicates the importance of that variable to the outcome. Taken together, chemotherapy drugs used or administered during episodes and chemotherapy clean period were the two predominant factors for the 90th percentile of the episode expenditure; they jointly accounted for 77% of the total reduction in AQL between the null model (with no covariates) and the optimal model (with eight key determinants).
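The share of AQL reduction attributed to a subset of variables can be written as a one-line calculation. The sketch below is one reading of the statement above, with a hypothetical function name and invented AQL values; the 77% figure itself comes from the fitted models and is not reproduced here.

```python
def share_of_aql_reduction(aql_null, aql_subset, aql_optimal):
    """Fraction of the total AQL reduction (null model to optimal model)
    already achieved by a model that uses only a subset of variables."""
    total_reduction = aql_null - aql_optimal
    if total_reduction <= 0:
        raise ValueError("the optimal model must improve on the null model")
    return (aql_null - aql_subset) / total_reduction


# Hypothetical AQL values: null model 10.0, two-variable model 6.0,
# optimal model 5.0.
print(share_of_aql_reduction(10.0, 6.0, 5.0))
```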

We further provided an “unblackboxing” analysis to quantify the effects of the identified key factors on the episode expenditures. To demonstrate that a variable may have different effects across quantiles of the outcome distribution, we examined the respective effects on the 90th (upper tail), 75th, 50th (median), 25th and 10th (lower tail) quantiles. To explore possible nonlinear age effects, we also fitted a separate model using natural cubic splines with three degrees of freedom to capture the smoothed effects of age.

Table 2 summarizes the point estimates and 95% confidence intervals for each of the eight major determinants. First, compared to long-term hormone therapy, other non-chemotherapy drugs, Part B drugs and Part D drugs were all associated with higher costs across all percentiles of the cost distribution. With the largest effect estimates, Part B drugs were the most expensive drugs for breast cancer. Both short (1–61 days) and long (> 730 days) periods of no chemotherapy were linked to higher costs among high-cost patients compared to the quiescent phase of treatment (clean period of 62–730 days), suggesting a “U” shape with the highest costs at onset of disease and at death [33]. Radiation, surgery and multimorbidity were all associated with higher costs across different quantiles. While full medication insurance in general incurred higher costs than other partial insurance types, eligibility for Medicare and Medicaid was associated only with median costs, with an inconclusive effect on other percentiles of the cost distribution. There was a strong association between HRR relative cost and the episode expenditures among high-cost patients, shown by a much larger effect at the 90th quantile than at the 10th quantile: every 30-unit increase in HRR relative cost was associated with $1800 (95% CI, $1000, $2600) higher expenditure at the 90th quantile, compared with $100 (95% CI, 0, $200) higher expenditure at the 10th quantile. This finding is consistent with previous findings that individuals living in high-cost areas go on to use more hospital resources [34]. With age, on average, there was a decreasing trend, with older patients associated with lower episode expenditures; this trend was more evident among high-cost patients (e.g., 90th quantile) than among low-cost patients (e.g., 10th quantile).

Table 2 The effects (point estimate [95% confidence interval]) of eight major factor variables on episode expenditures varied across the 10th, 25th, 50th, 75th and 90th quantile of the expenditure distribution. Effects are measured in thousands of dollars

The fitted splines of age in Fig. 5 suggest a nonlinear effect of age on the costs. The 90th percentile of the costs was highest among patients aged 50–55, then gradually decreased through age 80 before turning up towards the end of life.

Fig. 5

Effect estimates of age on the 10th, 50th and 90th quantile of the episode cost distribution, using natural cubic splines. To obtain sufficient legibility, we did not plot results for the 25th and 75th quantile

Second, our results demonstrate that the effects of the eight determinants upon episode-based expenditures are not uniform, but are in general disproportionally larger on the right tail of the cost distribution, i.e., among those who already have the highest expenditures. For example, compared to long-term hormone therapy, Part B chemotherapy drugs cost $53,800 (95% CI, $49,900 – $56,800) more among high-cost patients (sitting at the right tail) and $10,200 more among low-cost patients (sitting at the left tail). Compared to the quiescent period with a chemotherapy clean interval of 62–730 days, a clean interval of 61 days or fewer (e.g., around the onset of disease) cost $6300 more among high-cost patients and only $100 more among low-cost patients. These findings suggest that our QR based analyses provide a full picture of the effects of exposures. For example, the effect of HRR relative cost is negligible among low-cost patients (10th percentile) but markedly evident among high-cost patients (90th percentile).


In this study, we applied a robust and reproducible machine learning based approach to identify major factors for high-cost breast cancer patients when the cost distribution was highly skewed, and investigated the underlying effect mechanisms of the major factors, leveraging a high-performance nonparametric quantile regression technique, QRFs. We used a Mount Sinai OCM cost data set of nearly 3000 breast cancer episodes with episode-based clinical information and demographic and socioeconomic status.

Our results provided insights into drivers of high medical costs for breast cancer. Our approach identified eight determinants that jointly impact episode-based expenditures for breast cancer among high-cost patients. Among these factors, chemotherapy drugs and chemotherapy clean period were the two predominantly influential variables, followed by the number of comorbidities and age. These determinants did not affect expenditures uniformly, but disproportionally affected the high-cost patients, and their effects on low-cost patients may be negligible. Using mean-based methods would have ignored the disproportionality in the effect estimates, leading to a limited and biased conclusion. Our approach offered a “higher-resolution” analysis that can be used to expand and deepen the existing quantitative evidence on clinical risk factors for episode expenditures.

Results from our study may help inform population health management initiatives. Establishing key determinants for high-cost cancer patients allows policymakers to develop tailored interventions to meet the needs of those high-cost patients and to reduce high cancer costs. For example, among those who are already high-cost patients, the age cohort 50–55 was found to be associated with the highest costs. Developing strategies to reduce care spending tailored to this age cohort may help avoid wasting scarce resources. Part B chemotherapy drugs, a chemotherapy clean interval of 61 days or fewer and multimorbidity were all drivers of high costs among those who already had the highest spending. These findings may provide insights into strategies for expanding the scope of care management programs investigating preventable spending. Currently such programs are relatively narrow and could include broader measures of preventable or wasteful spending [6]. Our results may assist in developing algorithms targeted at subgroups defined by the identified underlying high-cost drivers to avoid preventable costs through interventions such as reducing duplicate services, contraindicated care, unnecessary laboratory testing or prolonged hospitalizations [6].

There are several limitations in this study. First, the relationships between clinical and health characteristics and medical costs do not bear a causal interpretation due to the nature of the cross-sectional data [35,36,37]. However, our results identified important factors of high costs for breast cancer and can potentially stimulate future causal inference research in cost analysis. Second, the cost data for this study, made available by CMS, have both pros and cons. These single-payer data provide a comprehensive, consistent dataset that includes all of the health care services provisioned for a patient. However, the dataset is limited to an elderly population and may not reflect spending drivers for commercial members [38]. Also, the Medicare dataset included Medicare payments only, and did not incorporate out-of-pocket expenses, which can be significant for medications in Part D. Third, our data are from a single institution. Despite the lack of national representation, because the Mount Sinai Hospital is one of the nation’s largest hospitals, we were able to include a large number of episodes in our analysis. Our methods are highly flexible and reproducible, and can be applied to a larger set of OCM data for breast cancer or to similar data sets for other kinds of cancer. Finally, there could be other important variables that were not included in our study, either unmeasured or not collected in our data, such as the accurate capture of disease progression [39]. Though the type of drugs at some level reflects the disease severity, cancer stage is not collected in the OCM data. CMS is working to expand the factors of the OCM to consider disease progression. Developing a sensitivity analysis strategy to evaluate the impact of unobserved data could be a worthwhile contribution [40].
Despite the potential omitted variables, by using an innovative and principled machine learning approach on a high-quality dataset with sufficiently large sample size, we believe the scope and depth of our analysis can provide important insights on policymaking and lead to more innovative investigations in the area of breast cancer health services research.

Uncovering true underlying determinants and their relative importance is challenging, especially when the exposure-outcome relationship may be nonlinear and the outcome is heavily skewed.

In public health research, determinants are often selected a priori or using test procedures based on some arbitrary threshold value. On the other hand, many cost analyses focus on building predictive models to identify high-cost patients. It remains unclear how the underlying complex web of factors drives up the costs of breast cancer care. Our method is highly agnostic, leverages flexible machine learning, and provides a “higher-resolution” analysis that offers specific insights into important drivers of high costs and their detailed effect mechanisms among patients with varying levels of cost. In conjunction with the relative importance of determinants, our method can provide valuable guidance for tailored and effective high-cost prevention interventions.


High-performance and data-driven machine learning methods provide insights into the underlying web of factors driving up the costs of breast cancer care management. Results from our study may help inform population health management initiatives and allow policymakers to develop tailored interventions to meet the needs of high-cost patients and to avoid wasting scarce resources.

Availability of data and materials

The OCM data that support the findings of this study are available from the Center for Medicare & Medicaid Innovation, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of the Center for Medicare & Medicaid Innovation. The R code used for the analysis is available from the corresponding author.



Abbreviations

OCM: Oncology Care Model
CMS: Centers for Medicare & Medicaid Services
CMMI: Center for Medicare & Medicaid Innovation
QRFs: Quantile regression random forests
RFs: Random forests
QR: Quantile regression
AQL: Average quantile loss
OOB: Out of bag
HRR: Hospital referral region
HCC: Hierarchical condition category


  1. 1.

    Zook CJ, Moore FD. High-cost users of medical care. N Engl J Med. 1980;302(18):996–1002.

    CAS  Article  Google Scholar 

  2. 2.

    Blumenthal D, Chernof B, Fulmer T, Lumpkin J, Selberg J. Caring for high-need, high-cost patients — an urgent priority. N Engl J Med. 2016;375(10):909–11.

    Article  Google Scholar 

  3. 3.

    Wennberg JE, Bronner K, Skinner JS, Fisher ES, Goodman DC. Inpatient care intensity and patients' ratings of their hospital experiences. Health Aff (Millwood). 2009;28(1):103–12.

    Article  Google Scholar 

  4. 4.

    Colla CH, Lewis VA, Kao L-S, O'Malley AJ, Chang C-H, Fisher ES. Association between Medicare accountable care organization implementation and spending among clinically vulnerable beneficiaries. JAMA Intern Med. 2016;176(8):1167–75.

    Article  Google Scholar 

  5. 5.

    Bodenheimer T, Fernandez A. High and rising health care costs. Part 4: can costs be controlled while preserving quality? Ann Intern Med. 2005;143(1):26–31.

    Article  Google Scholar 

  6. 6.

    Wammes JJG, van der Wees PJ, Tanke MAC, Westert GP, Jeurissen PPT. Systematic review of high-cost patients' characteristics and healthcare utilisation. BMJ open. 2018;8(9):e023113.

    Article  Google Scholar 

  7. 7.

    Anderson GF, Ballreich J, Bleich S, Boyd C, DuGoff E, Leff B, et al. Attributes common to programs that successfully treat high-need, high-cost individuals. Am J Manag Care. 2015;21(11):e597–600.

    PubMed  Google Scholar 

  8. 8.

    Brown RS, Peikes D, Peterson G, Schore J, Razafindrakoto CM. Six features of Medicare coordinated care demonstration programs that cut hospital admissions of high-risk patients. Health Aff. 2012;31(6):1156–66.

    Article  Google Scholar 

  9. 9.

    Maidman A, Wang L. New semiparametric method for predicting high-cost patients. Biometrics. 2018;74(3):1104–11.

    Article  Google Scholar 

  10. 10.

    Siegel RL, Miller KD, Jemal A. Cancer statistics, 2015. CA Cancer J Clin. 2015;65(1):5–29.

    Article  Google Scholar 

  11. 11.

    Allaire BT, Ekwueme DU, Poehler D, Thomas CC, Guy GP Jr, Subramanian S, et al. Breast cancer treatment costs in younger, privately insured women. Breast Cancer Res Treat. 2017;164(2):429–36.

    Article  Google Scholar 

  12. 12.

    Baumgardner J, Shahabi A, Zacker C, Lakdawalla D. Cost variation and savings opportunities in the oncology care model. Am J Manag Care. 2018;24(12):618–23.

    PubMed  Google Scholar 

  13. 13.

    RTI International, Actuarial Research Corporation. OCM performance-based payment methodology [Available from:

  14. 14.

    Saunders C. The oncology care model: performance period 4 results and the next phase with two-sided risk. J Clin Pathways. 2019;5(10):45–7.

    Google Scholar 

  15. 15.

    Wei Y, Kehm RD, Goldberg M, Terry MB. Applications for Quantile regression in epidemiology. Current Epidemiology Reports. 2019;6(2):191–9.

    Article  Google Scholar 

  16. 16.

    Davidoff AJ, Prasad S, Patel K, Polite B. What Is The Oncology Care Model, And Why Is The Evaluation Important? [Available from:

  17. 17.

    Center for Medicare & Medicaid Innovation. Oncology Care Model [Available from:

  18. 18.

    Center for Medicare & Medicaid Innovation. Appendix D: preliminary list of chemotherapy drugs. In: Oncology Care Model (OCM): Request for Applications (RFA): February 2015 [Available from:

  19. 19.

    Oncology Care Models Initiating Therapies List [Internet]. Center for Medicare & Medicaid Innovation. [cited September 17, 2020]. Available from:

  20. 20.

    Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

    Article  Google Scholar 

  21. 21.

    Genuer R, Poggi J-M, Tuleau-Malot C. Variable selection using random forests. Pattern Recogn Lett. 2010;31(14):2225–36.

    Article  Google Scholar 

  22. 22.

    Mazumdar M, Lin J-YJ, Zhang W, Li L, Liu M, Dharmarajan K, et al. Comparison of statistical and machine learning models for healthcare cost data: a simulation study motivated by oncology care model (OCM) data. BMC Health Serv Res. 2020;20(1):350.

  23.

    Hu L, Ji J, Liu B, Li Y. Tree-based machine learning to identify and understand major determinants for stroke at the neighborhood level. J Am Heart Assoc. 2020;0(0):e016745.

  24.

    Hu L, Ji J, Li Y, Liu B, Zhang Y. Quantile regression forests to identify determinants of neighborhood stroke prevalence in 500 cities in the USA: implications for neighborhoods with high prevalence. J Urban Health. 2020;0(0):1–12. DOI:

  25.

    Ji J, Hu L, Liu B, Li Y. Identifying and assessing the impact of key neighborhood-level determinants on geographic variation in stroke: a machine learning and multilevel modeling approach. BMC Public Health. 2020;20(1):1666.

  26.

    Hu L, Liu B, Li Y. Ranking sociodemographic, health behavior, prevention, and environmental factors in predicting neighborhood cardiovascular health: a Bayesian machine learning approach. Prev Med. 2020;141:106240.

  27.

    Meinshausen N. Quantile regression forests. J Mach Learn Res. 2006;7:983–99.

  28.

    Kuhn M, Johnson K. Applied predictive modeling. 2nd ed. New York: Springer; 2018.

  29.

    Koenker R, Machado JAF. Goodness of fit and related inference processes for quantile regression. J Am Stat Assoc. 1999;94(448):1296–310.

  30.

    Lee JY, Muratov S, Tarride J-E, Holbrook AM. Managing high-cost healthcare users: the international search for effective evidence-supported strategies. J Am Geriatr Soc. 2018;66(5):1002–8.

  31.

    Wang L, Wu Y, Li R. Quantile regression for analyzing heterogeneity in ultra-high dimension. J Am Stat Assoc. 2012;107(497):214–22.

  32.

    Fang Y, Xu P, Yang J, Qin Y. A quantile regression forest based method to predict drug response and assess prediction reliability. PLoS One. 2018;13(10):e0205155.

  33.

    Brown ML, Riley GF, Schussler N, Etzioni R. Estimating health care costs related to cancer treatment from SEER-Medicare data. Med Care. 2002;40(8):IV104–IV17.

  34.

    Fleishman JA, Cohen JW. Using information on clinical conditions to predict high-cost patients. Health Serv Res. 2010;45(2):532–52.

  35.

    Hu L, Hogan JW. Causal comparative effectiveness analysis of dynamic continuous-time treatment initiation rules with sparsely measured outcomes and death. Biometrics. 2019;75(2):695–707.

  36.

    Hu L, Hogan JW, Mwangi AW, Siika A. Modeling the causal effect of treatment initiation time on survival: application to HIV/TB co-infection. Biometrics. 2018;74(2):703–13.

  37.

    Hu L, Gu C, Lopez M, Ji J, Wisnivesky J. Estimation of causal effects of multiple treatments in observational studies with a binary outcome. Stat Methods Med Res. 2020;29(11):3218–34.

  38.

    Sagar B, Lin YS, Castel LD. Cost drivers for breast, lung, and colorectal cancer care in a commercially insured population over a 6-month episode: an economic analysis from a health plan perspective. J Med Econ. 2017;20(10):1018–23.

  39.

    Ennis RD, Parikh AB, Sanderson M, Liu M, Isola L. Interpreting oncology care model data to drive value-based care: a prostate cancer analysis. J Oncol Pract. 2019;15(3):e238–e46.

  40.

    Hogan JW, Daniels MJ, Hu L. A Bayesian perspective on assessing sensitivity to assumptions about unobserved data. In: Molenberghs G, Fitzmaurice G, Kenward MG, Tsiatis A, Verbeke G, editors. Handbook of missing data methodology. Boca Raton, FL: CRC Press; 2014. p. 405–34.



Acknowledgements

Not applicable.


Funding

This study was supported in part by methodology award ME 2017C3 9041 from the Patient-Centered Outcomes Research Institute, and by grants R21CA245855 and P30CA196521 from the National Cancer Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information




Contributions

LH: Conceptualization, study design, methodology development, supervision of statistical analysis, funding acquisition, and writing and editing of the original draft. LL: Data curation, results interpretation, manuscript reviewing and editing. JJ: Formal statistical analysis. MS: Results interpretation, manuscript editing. All authors contributed to and have approved the final manuscript.

Corresponding author

Correspondence to Liangyuan Hu.

Ethics declarations

Ethics approval and consent to participate

Ethical approval for the OCM data analysis study was obtained from the Icahn School of Medicine at Mount Sinai Program for the Protection of Human Subjects, Institutional Review Boards (reference number HS#17–00291). Because the Mount Sinai OCM data contain no personal identifiers and are retrospective in nature, the need for consent was waived by the IRB. No other administrative permissions were required to access and use the data.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Hu, L., Li, L., Ji, J. et al. Identifying and understanding determinants of high healthcare costs for breast cancer: a quantile regression machine learning approach. BMC Health Serv Res 20, 1066 (2020).


Keywords

  • Medical care costs
  • Cancer
  • Machine learning
  • Quantile regression