
Prediction of health care expenditure increase: how does pharmacotherapy contribute?



Rising health care costs are a major public health issue. Thus, accurately predicting future costs and understanding which factors contribute to increases in health care expenditures are important. The objective of this project was to predict the development of patients’ health care costs in the subsequent year and to identify factors contributing to this prediction, with a particular focus on the role of pharmacotherapy.


We used 2014–2015 Swiss health insurance claims data on 373′264 adult patients to classify individuals’ changes in health care costs. We performed extensive feature generation and developed predictive models using logistic regression, boosted decision trees and neural networks. Based on the decision tree model, we performed a detailed feature importance analysis and subgroup analysis, with an emphasis on drug classes.


The boosted decision tree model achieved an overall accuracy of 67.6% and an area under the curve (AUC) score of 0.74; the neural network and logistic regression models performed 0.4% and 1.9% worse, respectively. Feature engineering played a key role in capturing temporal patterns in the data. The number of features was reduced from 747 to 36 with only a 0.5% loss in accuracy. In addition to hospitalisation and outpatient physician visits, 6 drug classes and the mode of drug administration were among the most important features. Patient subgroups with a high probability of increase (up to 88%) and decrease (up to 92%) were identified.


Pharmacotherapy provides important information for predicting cost increases in the total population. Moreover, its relative importance increases in combination with other features, including health care utilisation.



Rising health care costs are a major economic and public health issue worldwide [1, 2]. According to the World Health Organization, health care accounted for 7.9% of Europe’s gross domestic product (GDP) in 2015 [3]. In Switzerland, the health care sector contributes substantially to the national GDP, and its share increased from 10.7% to 12.1% between 2010 and 2015 [3]. Moreover, because health care utilisation costs may serve as a surrogate for an individual’s health status [4], understanding which factors contribute to increases in health expenditures may provide insight into risk factors and potential starting points for preventive measures.

Several studies [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21] have addressed the prediction of health care costs, approaching the issue as either a regression problem or a classification problem (classifying costs into predefined “buckets”). Morid et al. [22] conducted a literature review summarising and comparing the existing models. As far as the annual difference in costs is concerned, we are aware of only 1 study [23], which classified the development of health care costs into two classes (binary classification). Previous studies also examined a broad variety of features. The most commonly used features include different sets of demographic features, health care utilisation parameters (e.g., hospitalisation or outpatient visits), drug codes, diagnosis codes, procedure codes, various chronic disease scores and cost features.

In this study, we aimed to predict changes in patients’ health care costs in the subsequent year and to identify factors contributing substantially to this prediction. In particular, we focused on the role of pharmacotherapy and other medical features such as hospitalisations and outpatient physician visits. We approached the problem as a binary classification task, predicting whether a patient’s total costs would increase or decrease in 2015, based on their characteristics in 2014. We compared the performance of 3 different models: feedforward neural networks (FNN), boosted decision trees (BDT) and logistic regression (LR). To capture different patterns in the data, we performed extensive feature engineering and introduced new domain-specific features, such as the drug administration mode. Finally, we performed a detailed feature importance analysis and subgroup analysis, based on the decision tree model.


Study data

We used anonymised claims data provided by the Helsana Group, one of the largest health insurance companies in Switzerland, which covers about 15% of the population across all regions of the country [24]. Basic health insurance coverage is mandatory in Switzerland. All residents are free to choose their preferred insurance providers, which are privately owned. Insurance coverage is financed by a premium and includes co-payments and deductibles [25]. The amount of the deductible can be chosen by the patient and changed every year. All health care invoices submitted for reimbursement are recorded in Helsana’s claims database [24]. The full dataset comprised information on adults (aged ≥18 years) without additional private insurance. All patients were insured by Helsana throughout the study period (2014–2015), allowing for complete records for both years. Furthermore, we required that all patients had at least 5 drug prescriptions in both calendar years and complete records on all demographic variables. In total, 373′264 patients met these requirements. Our dataset comprised demographic parameters, information on health insurance status, prescribed drugs, claimed health care utilisation, and total costs for each patient. Total costs were defined as gross costs for all invoices submitted for reimbursement, thus not taking co-payments and deductibles into account. Prescribed drugs are displayed using the Global Trade Item Number (GTIN). Additionally, the active component (5th-level Anatomical Therapeutic Chemical (ATC) code [26]) is available for every drug. Diagnoses are not available in our dataset because of legal regulations in Switzerland.

Introduction of features

Feature engineering plays an important role in most machine learning models and can greatly improve prediction accuracy.

Our exploratory linear regression analysis revealed that, compared with the prediction of total costs, the variance of the difference in costs is harder to explain using basic features such as demographics [5, 6, 13,14,15,16, 19] or simple count measures [17] described in the literature (Additional file 1: Table S1). Therefore, we performed extensive feature generation to include additional predictors in our models. We assigned names to the feature sets, which we later use to discuss their relative importance for the overall accuracy.

Basic features

The included demographic features were age, gender, deductible amount, insurance model and area of residence. We also included the simple count measures of numbers of hospitalisations, outpatient physician office visits, different drugs, and the number of individual prescriptions (GTINs). Because our dataset lacks diagnosis codes, we approximated chronic conditions following the ATC classification proposed by Huber et al. [25] and computed the number of prescribed ATC codes corresponding to each group.

Features representing pharmacotherapy

In addition to the derived chronic conditions, we included explicit drug information. To reduce sparsity, we chose 4th-level ATC [26] codes (e.g., C10AA, statins) over the 8′705 unique GTINs or the 1′027 5th-level ATC codes. For each of the resulting 449 categories, we computed the number of corresponding prescriptions.
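
The aggregation from substances to drug classes can be illustrated with a minimal sketch. It assumes that each prescription carries a 7-character 5th-level ATC code, so that its first five characters identify the 4th-level class; the function name and the sample codes are hypothetical and are not taken from the study data.

```python
from collections import Counter

def count_atc_level4(atc5_codes):
    """Count prescriptions per 4th-level ATC class (first 5 characters)."""
    # Codes shorter than 5 characters cannot be mapped and are skipped.
    return Counter(code[:5] for code in atc5_codes if len(code) >= 5)

# Hypothetical prescriptions of one patient (5th-level ATC codes).
prescriptions = ["C10AA05", "C10AA01", "B03BB01", "C10AA05"]
counts = count_atc_level4(prescriptions)
print(counts["C10AA"], counts["B03BB"])
```

Each resulting count would correspond to one of the drug-class features described above.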

Additional features

We included the following additional features: Hospitalisation was identified using Swiss diagnosis-related group (DRG) codes [27]. We generated features displaying the major diagnostic categories derived from DRG codes (e.g., hospitalisation for diseases of the respiratory system), the type of hospital, and the type of harm (e.g., accident, disease), as well as the overall length of hospital stay. To capture temporal patterns [28], we computed the frequencies of outpatient office and bedside visits per month and per quarter of the year. We also included the physician’s specialisation, the institution dispensing the drug and the number of visits on weekends (which might indicate acuteness) as features. Additionally, we computed the frequencies of prescriptions for different fine-grained periods of time and the number of prescribed products with certain modes of administration (e.g., intravenous) for each patient. The number of different drug classes and prescriptions (defined as different purchase dates), as well as features representing psychiatric treatment, rehabilitation, nursing home stays, and home care were also included. Finally, to capture regularity patterns, we generated a number of descriptive statistics (median, mean, standard deviation, minimum, and maximum) for the intervals between events such as prescriptions, visits and home care. Our expectation was that the more regular these events are, the more continuous the treatment is, and that irregularity might point to a more acute condition.
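
The interval statistics can be sketched as follows. This is an illustration only: the function name and the example dates are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from datetime import date
from statistics import mean, median, pstdev

def interval_stats(event_dates):
    """Summarise the gaps (in days) between consecutive events."""
    days = sorted(event_dates)
    gaps = [(b - a).days for a, b in zip(days, days[1:])]
    if not gaps:
        return None  # fewer than two events: no interval is defined
    return {
        "median": median(gaps),
        "mean": mean(gaps),
        "std": pstdev(gaps),  # assumption: population standard deviation
        "min": min(gaps),
        "max": max(gaps),
    }

# Hypothetical prescription dates: a regular and an irregular pattern.
regular = [date(2014, 1, 1), date(2014, 1, 31), date(2014, 3, 2), date(2014, 4, 1)]
irregular = [date(2014, 1, 1), date(2014, 1, 3), date(2014, 6, 1), date(2014, 6, 4)]
print(interval_stats(regular)["std"], interval_stats(irregular)["max"])
```

A standard deviation of zero indicates perfectly regular events, whereas a large spread between minimum and maximum points to irregular utilisation.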

Costs feature

Total health care costs in 2014 were included only to assess the overall accuracy and to determine whether the medical features provided complementary information.

Data split

Using random assignment, we divided the dataset into 3 parts: training set (80%), validation set (10%), and test set (10%). The training set was used to develop the prediction models, and the validation set was used for assessing the performance of various methods and for subsequent tuning of the hyperparameters. The test set was reserved for reporting the performance of the final models. We report the basic descriptive statistics in Table 1.
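
A random 80/10/10 split of this kind can be sketched as follows. The function name, the fixed seed and the use of integer patient identifiers are assumptions made for illustration; the text does not describe how the random assignment was implemented.

```python
import random

def split_ids(patient_ids, seed=0, train=0.8, valid=0.1):
    """Randomly assign patient IDs to training, validation and test sets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = int(len(ids) * train)
    n_valid = int(len(ids) * valid)
    # Whatever remains after the first two slices forms the test set.
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

train_ids, valid_ids, test_ids = split_ids(range(1000))
print(len(train_ids), len(valid_ids), len(test_ids))
```

Splitting on patient identifiers rather than on individual claims keeps all records of one patient in the same subset.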

Table 1 Study population characteristics (2014)


We used 3 different methods to develop models for our analysis. As a reference model, we used LR and contrasted its performance to FNN and BDT. All models were developed starting with a set of demographic features. Additional feature sets were added in a stepwise manner, resulting in a total of 747 different features in the complete model (Table 2).

Table 2 Comparison of prediction performance of logistic regression (LR), boosted decision tree (BDT) and feedforward neural network (FNN) using different sets of features

Because we use BDT (in particular the XGBoost [29] library) extensively for the subsequent analyses, a short overview is in order: BDT is a variant of decision tree methods with a gradient boosting algorithm governing the learning process. In decision trees, the input is mapped to a target label by a recursive creation of decision rules [30], which can be represented as nodes in a graphical tree model. The gradient boosting method produces a prediction model in the form of a weighted sum of several weak predictors (decision trees).

Feature importance analysis using BDT

We used BDT to conduct detailed feature and drug-importance analyses. Using BDT, decision rules can be mapped into respective cuts in our feature space, generating subgroups of patients with a high probability of an increase in costs. In particular, we were interested in medically relevant subgroups, with a particular emphasis on pharmacotherapy.

General feature importance

We used backward deletion to assess the general feature importance. Backward deletion begins with all candidate features (here, the complete model), and the deletion of each feature is tested using a chosen model fit criterion. The feature that contributes least to the model fit quality is deleted. The process is repeated until no further features can be deleted without a large loss in accuracy. This process is displayed in Additional file 1: Figure S1 in the supplement.
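
The procedure can be sketched in a few lines. This is a generic illustration, not the implementation used in the study: the stopping rule (a maximum tolerated loss relative to the complete model), the function names and the toy scoring function, which stands in for the validation accuracy of a retrained model, are all assumptions.

```python
def backward_deletion(features, score, max_loss=0.005):
    """Greedily drop features while the score stays within max_loss of the full model."""
    selected = list(features)
    baseline = score(selected)
    while len(selected) > 1:
        # Score every candidate deletion and keep the least harmful one.
        candidates = [(score([f for f in selected if f != drop]), drop) for drop in selected]
        best_score, drop = max(candidates)
        if baseline - best_score > max_loss:
            break  # any further deletion would cost too much accuracy
        selected.remove(drop)
    return selected

# Hypothetical contributions of informative features; all others add nothing.
USEFUL = {"length_of_stay": 0.08, "total_costs": 0.05, "iv_administration": 0.03}

def toy_score(subset):
    """Toy stand-in for the validation accuracy of a model trained on subset."""
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset)

all_features = ["length_of_stay", "noise_a", "total_costs", "noise_b", "iv_administration"]
kept = backward_deletion(all_features, toy_score)
print(kept)
```

In practice, each call to the scoring function would involve refitting the model, which makes the procedure expensive for several hundred features.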

Drug importance analysis

Conditional drug probabilities

The feature importance analysis based on backward deletion selects features according to their overall contribution to the total accuracy. As the latter depends on the feature’s frequency in the dataset and its relative discriminative contribution, more frequently prescribed drugs have an advantage over those that are prescribed less frequently, even if they discriminate less efficiently. To gain additional insight into drug importance, we computed the probability of increase, conditioned on the drug classes and stratified by hospitalisation.
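
These conditional probabilities can be estimated as relative frequencies, as in the following sketch. The record layout, the function name and the example patients are hypothetical.

```python
from collections import defaultdict

def conditional_increase_probability(patients):
    """Estimate P(increase | drug class, hospitalised) as a relative frequency."""
    counts = defaultdict(lambda: [0, 0])  # key -> [number of increases, group size]
    for p in patients:
        for drug_class in p["drug_classes"]:
            key = (drug_class, p["hospitalised"])
            counts[key][0] += p["increase"]
            counts[key][1] += 1
    return {key: inc / total for key, (inc, total) in counts.items()}

# Hypothetical patients; "increase" is 1 if costs rose in the following year.
patients = [
    {"drug_classes": {"B03BB"}, "hospitalised": False, "increase": 1},
    {"drug_classes": {"B03BB", "N01AH"}, "hospitalised": False, "increase": 1},
    {"drug_classes": {"N01AH"}, "hospitalised": True, "increase": 0},
    {"drug_classes": {"N01AH"}, "hospitalised": False, "increase": 0},
]
probs = conditional_increase_probability(patients)
print(probs[("B03BB", False)], probs[("N01AH", False)])
```

Because a patient can receive several drug classes, the same patient contributes to more than one stratum.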

Weight analysis

Although conditional drug probabilities provide an important overview, interactions of the drug classes with other features (except for hospitalisation) could not be assessed. Therefore, we performed a weight analysis to investigate the decision tree model predictions using the test set. To understand the concept of weight analysis, it is important to clarify how the BDT prediction is generated during the inference stage. For a given input sample, the BDT maps every feature in the sample to learned weights or scores. An individual score can be either positive or negative, depending on whether the feature contributes to the prediction of an increase or a decrease, respectively. The final prediction is an increase if the sum of all scores is positive; otherwise, it is a decrease. Thus, by analysing the weights of particular features using a sample of inputs, one can understand how often and how strongly these features contribute [31]. Using this intuition, we selected the drug classes that contributed a high proportion of the score for increases or decreases (at least 5% of the overall positive or negative score).
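
The logic of the weight analysis can be illustrated with a small sketch. It assumes that per-feature scores for each test sample are already available as a dictionary; how these scores are extracted from the trained model is not shown, and the function names and score values are hypothetical.

```python
def predict_from_scores(scores):
    """Return 'increase' if the summed feature scores are positive, else 'decrease'."""
    return "increase" if sum(scores.values()) > 0 else "decrease"

def positive_share(samples, feature):
    """Share of the overall positive score across samples contributed by one feature."""
    total = sum(s for sample in samples for s in sample.values() if s > 0)
    own = sum(sample[feature] for sample in samples if sample.get(feature, 0.0) > 0)
    return own / total if total else 0.0

# Hypothetical per-feature scores for two test samples.
samples = [
    {"B03BB": 0.5, "n_visits": 0.25, "length_of_stay": -0.25},
    {"B03BB": 0.25, "n_visits": -0.5, "length_of_stay": -0.25},
]
print([predict_from_scores(s) for s in samples])
print(positive_share(samples, "B03BB"))
```

A drug class would pass the 5% criterion described above if its share of the overall positive (or, analogously, negative) score reached 0.05.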

Subgroup analysis

BDT produces a prediction model in the form of a weighted sum of several weak predictors. To find examples of highly predictive subgroups involving drug classes, we employed the following strategy: First, we extracted all decision paths in all trees where a particular drug class was used. More precisely, we considered only the paths where the prescription of the drug contributed. Next, we measured the conditional probability of increase for the cuts given by the extracted paths. We denote this probability by P(increase | cut). For every such cut, we also computed the conditional probability without the drug class cut, P(increase | cut without drug class). We defined the gain as the difference |P(increase | cut) - P(increase | cut without drug class)|. Lastly, we chose the subgroups with high values of gain.
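
The gain defined above can be computed as in the following sketch. A cut is represented here as a list of predicate functions, which is an assumption made for illustration; the function names, the record layout and the example patients are hypothetical.

```python
def p_increase(patients, conditions):
    """Relative frequency of a cost increase among patients meeting all conditions."""
    group = [p for p in patients if all(cond(p) for cond in conditions)]
    return sum(p["increase"] for p in group) / len(group) if group else None

def gain(patients, other_conditions, drug_condition):
    """|P(increase | cut) - P(increase | cut without drug class)|."""
    with_drug = p_increase(patients, other_conditions + [drug_condition])
    without_drug = p_increase(patients, other_conditions)
    return abs(with_drug - without_drug)

# Hypothetical patients: number of outpatient visits, folic acid flag, outcome.
patients = [
    {"visits": 2, "folic_acid": True, "increase": 1},
    {"visits": 1, "folic_acid": True, "increase": 1},
    {"visits": 3, "folic_acid": False, "increase": 1},
    {"visits": 2, "folic_acid": False, "increase": 0},
    {"visits": 12, "folic_acid": True, "increase": 0},
]
few_visits = lambda p: p["visits"] <= 3
has_folic_acid = lambda p: p["folic_acid"]
print(gain(patients, [few_visits], has_folic_acid))
```

Note that the comparison group is the cut with the drug condition dropped, that is, all patients satisfying the remaining conditions, not only those without the drug.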


In Table 1 we show the basic descriptive statistics for the total dataset, as well as for the three subsets. As one can see from the table, the training, validation and test datasets follow the same distribution over all parameters. In particular, it is important that the variation of the annual cost difference and the proportion of cost increase/decrease is small (within ±0.6% for the cost increase).

Performance of models

The BDT model performed the best, leading to 67.6% accuracy and an area under the curve (AUC) score of 0.74, indicating good discrimination between the classes. The receiver operating characteristic curves of all 3 models are presented in Fig. 1. Table 2 indicates the performance of the models on different sets of features. Whereas demographic features alone were not predictive at all, adding simple count measures — especially the number of outpatient office visits and the number of hospitalisations — substantially improved prediction accuracy. The effects of additional features (n = 264), total costs, and pharmacotherapy (n = 449) were about the same (2–3%), depending on the chosen model. Once combined, the overall accuracy further improved by more than 1%, indicating that these features contain complementary information. As for the model comparison, FNN and BDT consistently outperformed LR by about 2%. Moreover, BDT generalised better on the unseen samples, outperforming the FNN in accuracy by about 0.4%.
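
For readers less familiar with the two reported metrics, the following sketch shows how accuracy and AUC can be computed from predicted probabilities. The decision threshold of 0.5 and the example values are assumptions, and the pairwise formulation of the AUC is chosen for clarity rather than efficiency.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = cost increase) and predicted probabilities.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.2]
print(accuracy(y_true, y_prob), auc(y_true, y_prob))
```

Unlike accuracy, the AUC does not depend on the choice of threshold, which is why both figures are informative together.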

Fig. 1

Area under the receiver operating characteristic curve (AUC): Comparison of prediction performance. LR = logistic regression, BDT = boosted decision tree, FNN = feedforward neural network

General feature importance

Gradually adding feature sets already provides some intuition about their relative importance, but decision tree models can be further utilised for the systematic analysis of feature importance. Using backward deletion, we found that the number of features could be reduced to 36, with only a 0.5% loss in accuracy (Table 2, Additional file 1: Figure S1). We identified the length of hospital stay, total costs, and intravenous mode of drug administration as the most important features. The full list of 36 features is presented in Additional file 1: Table S2. The list comprises both demographic and various medical features such as the number of individual prescriptions, the temporal pattern of outpatient visits, and diabetes as a chronic condition. Interestingly, the following 6 drug classes remained in the model: A03BA (belladonna alkaloids), B03BB (folic acid), N01AH (opioid anaesthetics), N01AX (other general anaesthetics), S01BC (ophthalmologic non-steroidal anti-inflammatory agents) and S01CA (ophthalmologic corticosteroids and anti-infectives in combination).

Drug importance analysis

Conditional drug probabilities

For the total study population, irrespective of prescribed drugs, the probability of cost increase was 51.9%. Conditioned on hospitalisation, the probabilities for increase were 23.1 and 58.1% with and without hospitalisation in 2014, respectively. We subsequently computed the probabilities of increase or decrease in costs conditioned on the 449 drug classes and on hospitalisation. The results are presented in Table 3. In particular, we present the drug classes with the highest probabilities for cost increase or decrease and with frequent prescriptions. All 6 drug classes identified in the previous section are included in this table, with only folic acid (B03BB) being an indicator for an increase in costs.

Table 3 Probabilities of cost increase and decrease for patient groups, conditioned on drug groups and hospitalisation

Weight analysis

Through the weight analysis, we identified additional drug classes that contributed to the accuracy of prediction. Many of them were found to contribute to predictions of both increases and decreases (Table 4). For instance, magnesium is among the drug groups with a high accuracy for increase (71.4%), but also an important feature for decreases among the patients without hospitalisation (78.6%).

Table 4 Weight analysis: Contribution of drug classes to the prediction

Subgroup analysis

We present examples of the subgroup analysis in Table 5. We found small (100–600 people) but highly predictive subgroups for cost increases (as high as 88%). Moreover, the gain attributable to the drug class was high, reaching up to 23% for folic acid (Example #1) and 21% for oral iron supplements (Example #3). In addition to drug classes, subgroups were further characterised by a variety of features, including outpatient visits, drug prescription information (both counts and temporal information), information on the deductible, home care, and hospitalisation. Example #7 represents a rather large subgroup of patients without hospitalisation that has a high fraction of decrease (fraction of decrease 0.74, gain 18%).

Table 5 Examples of subgroups derived from the decision tree


Our models classify patients according to their probability of an increase in costs, with a few features in particular contributing substantially to the prediction. Pharmacotherapy provides important information for predicting cost increases, and its relative importance increases in interaction with other features, including health care utilisation. We identified patient subgroups with very high probabilities of increase and decrease.

Performance of models

Our models predict whether patients’ total health care costs will increase in the subsequent year, with an accuracy of up to 67.6% (AUC 0.74). Lahiri et al. [23] reported a higher accuracy (77.6%) when investigating increases in inpatient claims costs using Medicare data. Although this study is the closest to ours in terms of setting, some major differences should be emphasised: First, Lahiri et al. predicted inpatient expenditures using both inpatient and outpatient information, whereas we consider the change in total health care costs using only outpatient claims and whether or not a patient was hospitalised. Moreover, they found diagnoses and features indicating the development of a new chronic condition to be the most important features. Diagnoses are not available in our dataset because of legal regulations in Switzerland, and the derivation of features indicating the development of a new chronic condition requires information from the year for which predictions are made. Because these data are typically not available in a prospective scenario, our study was designed so that all the features could be generated without any information from the subsequent year. We found that, for the prediction of a cost increase, medical and cost features contained complementary information. Additionally, the inclusion of medical features facilitates the identification of potential targets for preventive measures [32, 33].

General feature importance

In general, we found that high health care utilisation in the first year was an indicator for a decrease in the following year. Using backward deletion, we identified the 36 most important features, including, for example, length of hospital stay, home care, and count measures for outpatient visits and drug prescriptions. Simple count measures accurately capture the intensity of health care utilisation and therefore may reflect the severity of the disease state [17]. Additionally, when they are generated for multiple timeframes, these measures can be used to introduce valuable temporal information, which was highlighted in a recent study by Morid et al. [28]. Interestingly, the counts of drug prescriptions and outpatient visits in the last quarter and the last month of the year are among the most important features, which indicates that the model assigns a risk of therapy continuation in the next year. Intravenously administered drugs are typically associated with severe conditions, which explains why the intravenous mode of administration was an important feature in our study. Likewise, Pritchard et al. [1] reported that physician-administered injectable or infusible treatments account for a comparably higher fraction of expenditures in high-resource patients. We identified diabetes as an important chronic condition for the prediction of a cost increase, which is consistent with diagnoses identified as important in other studies [23]. In general, chronic conditions [2, 34] and multimorbidity [35] are well-described risk factors for high health care utilisation.

Drug importance analysis

We found that high probabilities of increase are mainly associated with drug groups used to treat chronic conditions that have a higher likelihood of worsening over time (e.g., anticholinesterases and dopa derivatives for treating dementia or Parkinson’s disease). In contrast, drug groups associated with a higher probability of decrease are predominantly used for severe acute conditions requiring extensive treatment (e.g., adrenergic and dopaminergic agents) or are proxies for expensive procedures, such as (local) anaesthetics used in day surgery. Evaluating the contribution of drug classes to the prediction using a weight analysis, we found that many drug groups contribute to the prediction of both increases and decreases. This finding indicates that the contribution of pharmacotherapy depends on other features and can vary greatly across subgroups.

Subgroup analysis

When evaluating several example drug groups in more detail, their contribution becomes even clearer. We identified subgroups with a high probability of increase (up to 88%). Although there may be even more, we can derive at least 3 higher-level groups from our examples: (1) potentially pregnant patients who have not yet delivered; (2) healthy patients; and (3) patients suffering from chronic conditions with low use of health care resources. Pregnancy without delivery is considered an important condition for predicting future resource use [36] and is therefore included as a feature in some diagnosis-based comorbidity scores. Lacking diagnosis codes, our model identifies combinations of ATC codes (e.g., folic acid, magnesium), outpatient specialist visits for gynaecology, and few outpatient visits at the beginning of the year as patterns indicating potential pregnancy. For a subgroup of patients hospitalised for delivery, the model predicted a decrease in costs, with as much as 92% accuracy. The “healthy patients” group was characterised by few prescriptions (including at least 1 prescription for oral iron supplements) and a high deductible that did not change in the next year, indicating a self-assessment of very good health status. Self-reported general health has been found to be an important indicator of future health care utilisation in previous studies [18, 37]. Claims data do not include information on self-reported health, so changes in the deductible may serve as an indicator of patients’ individual expectations regarding upcoming health expenditures. Tamang et al. [21] found that patients with a large increase in costs were younger and less likely to have hospitalisation costs and chronic conditions, compared with persistent high-cost patients, which is consistent with our subgroup findings. The final group represents elderly patients suffering from chronic or worsening conditions, with low use of health care resources, yet having a higher likelihood of an increase in the latter in the following year. Subgroups of patients with a high probability of a cost decrease were characterised by chronic conditions, with intensive health-related claims (hospitalisation, home care), or expensive diagnostic procedures or day surgery.


Change in health care costs is a very broad outcome, and our data represent a whole population, without restrictions on underlying diseases or demographic groups. We therefore found multiple reasons for the increase and decrease in costs, many of which are not predictable or preventable (e.g., accidents). Diagnoses might have provided additional patient information, but they were not available in Swiss claims data. Expensive claims such as hospitalisation in the first year may mask less expensive changes such as new drug prescriptions or additional physician visits in the following year, making the development of costs unsuitable for the evaluation of causal drug-related risk factors. Model-wise, the main limitation was associated with the sparsity in representing the prescriptions. We think that learning distributed embeddings via techniques similar to skip-gram [38] might mitigate this problem. Moreover, applying recurrent neural networks to learn representations of medical codes and patients is an active research area [39,40,41,42,43]. In this context, the findings of our study can provide a good starting point for interpreting the results of such advanced models.


This research focused on cost increase on the population level covering two subsequent years. Future research should cover multiple subsequent years. In a recent Danish study, Tamang et al. [21] reported that over the course of eight years, the majority of high-cost patients showed only one high-cost year. Among those with multiple high-cost years, many did not experience them consecutively. In the light of high fluctuation of individual annual costs, the evaluation of an increase in costs using a longer study period may provide insight into long-term effects.

Our project was designed to evaluate the risk factors for cost increase for the total population. While this approach allows for a broad investigation, it naturally reduces the impact of rare drug classes on the overall accuracy. However, such drug classes, including chemotherapeutics or biologicals, would be of special interest due to their contribution to the overall cost increase in health care. To evaluate the impact of rare but high-cost treatments in more detail, future studies should focus on specific subgroups. This approach would reduce sparsity in the data and would allow the use of individual substances instead of drug classes. Additionally, temporal information on treatment induction, duration and intensity should be included in future analyses.

Our results identify subgroups with a high probability of cost increase. This information can help decision makers to optimise health care services for these subgroups through improved resource allocation planning. For instance, we identified a subgroup of healthy patients who are likely to develop a cost increase. This group may be further investigated with respect to the causes, amount and preventability of the cost increase. For patients suffering from chronic conditions with low use of health care resources, preventive measures such as disease management programs could be established. Additionally, patients may better choose their deductibles for the next year based on the prediction of the future cost development.


The development of costs can be predicted using a binary classification. Our results indicate that the contribution of pharmacotherapy depends strongly on other features and can vary across subgroups. Therefore, further studies may focus on the development of models for predefined and therefore less heterogeneous subgroups. The detailed understanding of such subgroups may help to identify potential starting points for improving patient management.

Availability of data and materials

The datasets analysed during the current study are not publicly available as they are part of the confidential Helsana health insurance claims database. Additional information not included in the paper is available from the corresponding author on reasonable request.



ATC: Anatomical Therapeutic Chemical Classification

AUC: Area under the curve

BDT: Boosted decision tree

DRG: Diagnosis-related group

FNN: Feedforward neural networks

GDP: Gross domestic product

GTIN: Global Trade Item Number

LR: Logistic regression

  1. 1.

    Pritchard D, Petrilla A, Hallinan S, et al. What contributes Most to high health care costs? Health care spending in high resource patients. JMCP. 2016;22(2):102–9.

    Article  Google Scholar 

  2. 2.

    Hu Z, Hao S, Jin B, et al. Online prediction of health care utilization in the next six months based on electronic health record information: a cohort and validation study. J Med Internet Res. 2015;17(9):e219.

    Article  Google Scholar 

  3. 3.

    World Health Organisation Global Health Observatory data repository 2019 [Available from:] Accessed 2 Feb. 2019.

  4. 4.

    Bertsimas D, Bjarnadóttir MV, Kane MA, et al. Algorithmic prediction of health-care costs. Oper Res. 2008;56(6):1382–92.

    Article  Google Scholar 

  5. 5.

    Powers CA, Meyer CM, Roebuck MC, et al. Predictive modeling of Total healthcare costs using pharmacy claims data: a comparison of alternative econometric cost modeling techniques. Med Care. 2005;43(11):1065–72.

    Article  Google Scholar 

  6. 6.

    Kuo RN, Dong Y-H, Liu J-P, et al. Predicting healthcare utilization using a pharmacy-based metric with the WHO’s anatomic therapeutic chemical algorithm. Med Care. 2011;49(11):1031–9.

    Article  Google Scholar 

  7. 7.

    Yang C, Delcher C, Shenkman E, et al. Machine learning approaches for predicting high cost high need patient expenditures in health care. Biomed Eng Online. 2018;17(Suppl 1):131.

    Article  Google Scholar 

  8. 8.

    König HH, Leicht H, Bickel H, et al. Effects of multiple chronic conditions on health care costs: an analysis based on an advanced tree-based regression model. BMC Health Serv Res. 2013;13:219.

    Article  Google Scholar 

  9. Lee S-M, Kang J-O, Suh Y-M. Comparison of hospital charge prediction models for colorectal cancer patients: neural network vs. decision tree models. J Korean Med Sci. 2004;19:677–81.

  10. Guo X, Gandy W, Coberley C, et al. Predicting health care cost transitions using a multidimensional adaptive prediction process. Popul Health Manag. 2015;18(4):290–9.

  11. Sushmita S, Newman S, Marquardt J, et al. Population cost prediction on public healthcare datasets. In: DH '15 Proceedings of the 5th International Conference on Digital Health; 2015. p. 87–94.

  12. Duncan I, Loginov M, Ludkovski M. Testing alternative regression frameworks for predictive modeling of health care costs. North American Actuarial Journal. 2016;20(1):65–87.

  13. Huber CA, Schneeweiss S, Signorell A, et al. Improved prediction of medical expenditures and health care utilization using an updated chronic disease score and claims data. J Clin Epidemiol. 2013;66(10):1118–27.

  14. Sales AE, Liu C-F, Sloan KL, et al. Predicting costs of care using a pharmacy-based measure risk adjustment in a veteran population. Med Care. 2003;41(6):753–60.

  15. Zhao Y, Ellis RP, Ash AS, et al. Measuring population health risks using inpatient diagnoses and outpatient pharmacy data. Health Serv Res. 2001;36(6):180–93.

  16. Kuo RN, Lai MS. Comparison of Rx-defined morbidity groups and diagnosis-based risk adjusters for predicting healthcare costs in Taiwan. BMC Health Serv Res. 2010;10:126.

  17. Farley JF, Harley CR, Devine JW. A comparison of comorbidity measurements to predict healthcare expenditures. Am J Manag Care. 2006;12(2):110–7.

  18. Frees EW, Jin X, Lin X. Actuarial applications of multivariate two-part regression models. Annals of Actuarial Science. 2013;7(2):258–87.

  19. Fishman PA, Goodman MJ, Hornbrook MC, et al. Risk adjustment using automated ambulatory pharmacy data. Med Care. 2003;41(1):84–99.

  20. Dove HG, Duncan I, Robb A. A prediction model for targeting low-cost, high-risk members of managed care organizations. Am J Manag Care. 2003;9(5):381–9.

  21. Tamang S, Milstein A, Sørensen HT, et al. Predicting patient 'cost blooms' in Denmark: a longitudinal population-based study. BMJ Open. 2017;7(1):e011580.

  22. Morid MA, Kawamoto K, Ault T, et al. Supervised learning methods for predicting healthcare costs: systematic literature review and empirical evaluation. AMIA Annu Symp Proc. 2017:1312–21.

  23. Lahiri B, Agarwal N. Predicting healthcare expenditure increase for an individual from Medicare data. In: Proceedings of the ACM SIGKDD Workshop on Health Informatics; 2014. [Available from]. Accessed 19 Feb 2019.

  24. Reich O, Rosemann T, Rapold R, et al. Potentially inappropriate medication use in older patients in Swiss managed care plans: prevalence, determinants and association with hospitalization. PLoS One. 2014;9(8):e105425.

  25. Huber CA, Szucs TD, Rapold R, et al. Identifying patients with chronic conditions using pharmacy data in Switzerland: an updated mapping approach to the classification of medications. BMC Public Health. 2013;13:1030.

  26. World Health Organisation Collaborating Centre for Drug Statistics Methodology. ATC structure and principles [Available from:]. Accessed 22 Jan. 2018.

  27. SwissDRG. Online Definitionshandbuch SwissDRG 3.0 Abrechnungsversion 2013. [Available from:]. Accessed 5 Dec 2017.

  28. Morid MA, Liu Sheng OR, Kawamoto K, et al. Healthcare cost prediction: leveraging fine-grain temporal patterns. J Biomed Inform. 2019;91:103113.

  29. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2016. p. 785–94.

  30. Schapire RE. The boosting approach to machine learning: an overview. In: Denison DD, Hansen MH, Holmes CC, Mallick B, Yu B, editors. Nonlinear estimation and classification. Lecture notes in statistics. New York: Springer; 2003. p. 171.

  31. ELI5 [Available from:]. Accessed 3 Nov. 2018.

  32. Forrest CB, Lemke KW, Bodycombe DP, et al. Medication, diagnostic, and cost information as predictors of high-risk patients in need of care management. Am J Manag Care. 2009;15(1):41–8.

  33. Ash AS, Zhao Y, Ellis RP, et al. Finding future high-cost cases: comparing prior cost versus diagnosis-based methods. Health Serv Res. 2001;36(6):194–206.

  34. Hartmann J, Jacobs S, Eberhard S, et al. Analysing predictors for future high-cost patients using German SHI data to identify starting points for prevention. Eur J Pub Health. 2016;26(4):549–55.

  35. Bähler C, Huber CA, Brüngger B, et al. Multimorbidity, health care utilization and costs in an elderly community-dwelling population: a claims data based observational study. BMC Health Serv Res. 2015;15:23.

  36. Johns Hopkins University Bloomberg School of Public Health. The Johns Hopkins ACG System Technical Reference Guide. 2011.

  37. Rosella LC, Kornas K, Yao Z, et al. Predicting high health care resource utilization in a single-payer public health care system. Med Care. 2018;56(10):e61–169.

  38. Le Q, Mikolov T. Distributed representations of sentences and documents. In: Proceedings of ICML 2014. [Available from]. Accessed 11 Mar 2019.

  39. Choi E, Bahadori MT, Schuetz A, et al. Doctor AI: predicting clinical events via recurrent neural networks. arXiv:1511.05942v11. 2016.

  40. Choi E, Bahadori MT, Song L, et al. GRAM: graph-based attention model for healthcare representation learning. arXiv:1611.07012v3. 2017.

  41. Choi E, Schuetz A, Stewart WF, et al. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv:1602.03686v2. 2017.

  42. Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. arXiv:1310.4546v1. 2013.

  43. Miotto R, Li L, Kidd BA, et al. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6:26094.



The authors thank the Helsana Group for providing the data.



Author information




ME, AJ, HS, TN, IC, MR and GK designed the study. IT and UZ extracted and organised the data according to the study needs. AJ, TN and HS analysed the data, generated features and developed the models. MR provided statistical advice. AJ, IC and ME medically interpreted the results. AJ and HS drafted the manuscript. All authors critically reviewed and approved the final version of the manuscript.

Corresponding author

Correspondence to Marco Egbring.

Ethics declarations

Ethics approval and consent to participate

The harmlessness of the study was confirmed by the Cantonal Ethics Committee of Zurich, although no formal ethical approval was required under Swiss law.

Consent for publication

Not applicable.

Competing interests

UZ and IT are employed by the Helsana Group. The other authors declare that they have no competing interests.


Supplementary information

Additional file 1.

Supplementary information: variance of cost difference explained by basic features using multiple linear regression analysis (Table S1: multiple linear regression models using features observed in 2014) and backward deletion (Table S2: features included in the small model derived from backward deletion; Figure S1: backward deletion, number of features included in the complete model and corresponding accuracy levels).

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Jödicke, A.M., Zellweger, U., Tomka, I.T. et al. Prediction of health care expenditure increase: how does pharmacotherapy contribute? BMC Health Serv Res 19, 953 (2019).



  • Machine learning
  • Health care utilisation
  • Health care costs
  • Boosted decision tree
  • Neural network
  • Pharmacology
  • Claims data