
Predicting hospital length of stay using machine learning on a large open health dataset

Abstract

Background

Governments worldwide are facing growing pressure to increase transparency, as citizens demand greater insight into decision-making processes and public spending. An example is the release of open healthcare data to researchers, as healthcare is one of the top economic sectors. Significant information systems development and computational experimentation are required to extract meaning and value from these datasets. We use a large open health dataset provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) containing 2.3 million de-identified patient records. One of the fields in these records is a patient’s length of stay (LoS) in a hospital, which is crucial in estimating healthcare costs and planning hospital capacity for future needs. Hence it would be very beneficial for hospitals to be able to predict the LoS early. The area of machine learning offers a potential solution, which is the focus of the current paper.

Methods

We investigated multiple machine learning techniques including feature engineering, regression, and classification trees to predict the length of stay (LoS) of all the hospital procedures currently available in the dataset. Whereas many researchers focus on LoS prediction for a specific disease, a unique feature of our model is its ability to simultaneously handle 285 diagnosis codes from the Clinical Classification System (CCS). We focused on the interpretability and explainability of input features and the resulting models. We developed separate models for newborns and non-newborns.

Results

The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS. The best R2 scores achieved are noteworthy: 0.82 for newborns using linear regression and 0.43 for non-newborns using catboost regression. Focusing on cardiovascular disease refines the predictive capability, achieving an improved R2 score of 0.62. The models not only demonstrate high performance but also provide understandable insights. For instance, birth weight is employed for predicting LoS in newborns, while the diagnostic-related group classification proves valuable for non-newborns.

Conclusion

Our study showcases the practical utility of machine learning models in predicting LoS during patient admittance. The emphasis on interpretability ensures that the models can be easily comprehended and replicated by other researchers. Healthcare stakeholders, including providers, administrators, and patients, stand to benefit significantly. The findings offer valuable insights for cost estimation and capacity planning, contributing to the overall enhancement of healthcare management and delivery.

Peer Review reports

Introduction

Democratic governments worldwide are placing increasing importance on transparency, as this leads to better governance, market efficiency, and improved acceptance of government policies. This is highlighted by reports from the Organization for Economic Co-operation and Development (OECD), an international organization whose mission is to shape policies that foster prosperity, equality, opportunity and well-being for all [1]. Openness and transparency have been recognized as pillars of democracy, and also as enablers of the sustainable development goals [2], which are a major focus of the United Nations (https://sustainabledevelopment.un.org/sdg16).

An important government function is to provide for the healthcare needs of its citizens. The U.S. spends about $3.6 trillion a year on healthcare, which represents 18% of its GDP [3]. Other developed nations spend around 10% of their GDP on healthcare. The percentage of GDP spent on healthcare is rising as populations age. Consequently, research on healthcare expenditure and patient outcomes is crucial to maintain viable national economies. It is advantageous for nations to combine investigations by the private sector, government sector, non-profit agencies, and universities to find the best solutions. A promising path is to make health data open, which allows investigators from all sectors to participate and contribute their expertise. Though there are obvious patient privacy concerns, open health data has been made available by organizations such as New York State Statewide Planning and Research Cooperative System (SPARCS) [4].

Once the data is made available, it needs to be suitably processed to extract meaning and insights that will help healthcare providers and patients. We favor the creation and use of an open-source analytics system so that the entire research community can benefit from the effort [5,6,7]. As a concrete demonstration of the utility of our system and approach, we revealed that there is a growing incidence of mental health issues amongst adolescents in specific counties in New York State [8]. This has resulted in targeted interventions to address these problems in these communities [8]. Knowing where the problems lie allows policymakers and funding agencies to direct resources where needed.

Healthcare in the U.S. is largely provided through private insurance companies and it is difficult for patients to reliably understand what their expected healthcare costs are [9, 10]. It is ironic that consumers can readily find prices of electronics items, books, clothes etc. online, but cannot find information about healthcare as easily. The availability of healthcare information including costs, incidence of diseases, and the expected length of stay for different procedures will allow consumers and patients to make better and more informed choices. For instance, in the U.S., patients can budget pre-tax contributions to health savings accounts, or decide when to opt for an elective surgery based on the expected duration of that procedure.

To achieve this capability, it is essential to have the underlying data and models that interpret the data. Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients. Significant advances have been made recently in the fields of data mining, machine-learning and artificial intelligence, with growing applications in healthcare [11]. To make our work concrete, we use our machine-learning system to predict the length of stay (LoS) in hospitals given the patient information in the open healthcare data released by New York State SPARCS [4].

The LoS is an important variable in determining healthcare costs, as costs directly increase for longer stays. The analysis by Jones [12] shows that the trends in LoS, hospital bed capacity and population growth have to be carefully analyzed for capacity planning and to ensure that adequate healthcare can be provided in the future. With certain health conditions such as cardiovascular disease, the hospital LoS is expected to increase due to the aging of the population in many countries worldwide [13]. During the COVID-19 pandemic, hospital bed capacity became a critical issue [14], and many regions in the world experienced a shortage of healthcare resources. Hence it is desirable to have models that can predict the LoS for a variety of diseases from available patient data.

The LoS is usually unknown at the time a patient is admitted. Hence, the objective of our research is to investigate whether we can predict the patient LoS from variables collected at the time of admission. By building a predictive model through machine learning techniques, we demonstrate that it is possible to predict the LoS from data that includes the Clinical Classifications Software (CCS) diagnosis code, severity of illness, and the need for surgery. We investigate several analytics techniques including feature selection, feature encoding, feature engineering, model selection, and model training in order to thoroughly explore the choices that affect eventual model performance. By using a linear regression model, we obtain an R2 value of 0.42 when we predict the LoS from a set of 23 patient features. The success of our model will be beneficial to healthcare providers and policymakers for capacity planning purposes and to understand how to control healthcare costs. Patients and consumers can also use our model to estimate the LoS for procedures they are undergoing or for planning elective surgeries.

Background

Stone et al. [15] present a survey of techniques used to predict the LoS, which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [16] surveyed methods for LoS prediction.

The main gap in the literature is that most methods focus on analyzing trends in the LoS, predict the LoS only for specific conditions, or restrict their analysis to data from specific hospitals. For instance, Sridhar et al. [17] created a model to predict the LoS for joint replacements in rural hospitals in the state of Montana by using a training set with 127 patients and a test set with 31 patients. In contrast, we have developed our model to predict the LoS for 285 different CCS diagnosis codes, over a set of 2.3 million patients across all hospitals in New York state. The CCS diagnosis code refers to the code used by the Clinical Classifications Software system, which encompasses 285 possible diagnosis and procedure categories [18]. Since the CCS diagnosis codes are too numerous to list, we give a few examples that we analyzed, including but not limited to abdominal hernia, acute myocardial infarction, acute renal failure, behavioral disorders, bladder cancer, Hodgkin's disease, multiple sclerosis, multiple myeloma, schizophrenia, septicemia, and varicose veins. To the best of our knowledge, there are no existing models that predict the LoS on such a variety of diagnosis codes, with a patient sample greater than 2 million records, and with freely available open data. Hence, our investigation is unique from this point of view.

Sotodeh et al. [19] developed a Markov model to predict the LoS in intensive care unit patients. Ma et al. [20] used decision tree methods to predict LoS in 11,206 patients with respiratory disease.

Burn et al. examined trends in the LoS for patients undergoing hip replacement and knee replacement in the U.K. [21]. Their study demonstrated a steady decline in the LoS from 1997 to 2012. The purpose of their study was to determine factors that contributed to this decline, and they identified improved surgical techniques such as fast-track arthroplasty. However, they did not develop any machine-learning models to predict the LoS.

Hachesu et al. examined the LoS for cardiac disease patients [22] and found that blood pressure is an important predictor of LoS. Garcia et al. determined factors influencing the LoS for patients undergoing treatment for hip fracture [23]. Vekaria et al. analyzed the variability of LoS for COVID-19 patients [24]. Arjannikov et al. [25] used positive-unlabeled learning to develop a predictive model for LoS.

Gupta et al. [26] conducted a meta-analysis of previously published papers on the role of nutrition in the LoS of cancer patients, and found that nutrition status is especially important in predicting LoS for gastrointestinal cancer. Similarly, Almashrafi et al. [27] performed a meta-analysis of the existing literature on cardiac patients and reviewed factors affecting their LoS. However, they did not develop quantitative models in their work. Kalgotra et al. [28] used recurrent neural networks to build a prediction model for LoS.

Daghistani et al. [13] developed a machine learning model to predict the length of stay for cardiac patients. They used a database of 16,414 patient records and predicted the length of stay in three classes: short LoS (< 3 days), intermediate LoS (3–5 days), and long LoS (> 5 days). They used detailed patient information, including blood test results, blood pressure, and patient history such as smoking habits. Such detailed information is not available in the much larger SPARCS dataset that we utilized in our study.

Awad et al. [29] provide a comprehensive review of various techniques to predict the LoS. Though simple statistical methods have been used in the past, they make assumptions that the LoS is normally distributed, whereas the LoS has an exponential distribution [29]. Consequently, it is preferable to use techniques that do not make assumptions about the distribution of the data. Candidate techniques include regression, classification and regression trees, random forests, and neural networks. Rather than using statistical parametric techniques that fit parameters to specific statistical distributions, we favor data-driven techniques that apply machine-learning.

In 2020, during the height of the COVID-19 pandemic, The Lancet, a premier medical journal, drew widespread rebuke [30,31,32] for publishing a paper based on questionable data. Many medical journals published expressions of concern [33, 34]. The Lancet itself retracted the questionable paper [35], which is available at [36] with the stamp “retracted” placed on all pages. One possible solution to prevent such incidents is for top medical journals to require authors to make their data available for verification by the scientific community. Patient privacy concerns can be mitigated by de-identifying the records made available, as is already done by the New York State SPARCS effort [4]. Our methodology and analytics system design will become more relevant in the future, as there is a desire to prevent a repetition of the Lancet debacle. Even before the Lancet incident, there was declining public trust in medicine and healthcare policy [37]. This situation continues today, with multiple factors at play, including biased news reporting in mainstream media [38]. A desirable solution is to make these fields more transparent, by releasing data to the public and explaining the various decisions in terms that the public can understand. The research in this paper demonstrates how such a solution can be developed.

Requirements

We describe the following three requirements of an ideal system for processing open healthcare data:

  1. Utilize open-source platforms to permit easy replicability and reproducibility.

  2. Create interpretable and explainable models.

  3. Demonstrate an understanding of how the input features determine the outcomes of interest.

The first requirement captures the need for research to be easily reproduced by peers in the field. There is growing concern that scientific results are becoming hard for researchers to reproduce [39,40,41]. This undermines the validity of the research and ultimately hurts the field. Baker termed this the “reproducibility crisis”, and performed an analysis of the top factors that lead to irreproducibility of research [39]. Two of the top factors are the unavailability of raw data and code.

The second requirement addresses the need for machine-learning models to produce explanations of their results. Though deep-learning models are popular today, they have been criticized for functioning as black boxes whose precise workings are hard to discern. In the field of healthcare, it is more desirable to have models that can be explained easily [42]. Unless healthcare providers understand how a model works, they will be reluctant to apply it in their practice. For instance, Reyes et al. determined that interpretable artificial intelligence systems can be better verified, trusted, and adopted in radiology practice [43].

The third requirement captures the need to record relevant patient features that can be related to the outcomes of interest, such as the LoS, total cost, and mortality rate. Furthermore, healthcare providers should be able to understand the influence of these features on the performance of the model [44]. This is especially critical when feature engineering methods are used to combine existing features and create new ones.

In the subsequent sections, we present our design for a healthcare analytics system that satisfies these requirements. We apply this methodology to the specific problem of predicting the LoS.

Methods

We have designed the overall system architecture as shown in Fig. 1. This system is built to handle any open data source; we show New York SPARCS as one of the data sources for the sake of specificity. Our framework can be applied to data from multiple sources, such as the Center for Medicare and Medicaid Services (CMS) in the U.S., as shown in our previous work [6]. We chose a Python-based framework that utilizes Pandas [45] and scikit-learn [46]. Python is currently the most popular programming language for engineering and system design applications [47].

Fig. 1
figure 1

Shows the system architecture. We use Python-based open-source tools such as Pandas and Scikit-Learn to implement the system

In Fig. 2, we provide a detailed overview of the necessary processing stages. The specific algorithms used in each stage are described in the following sections.

Fig. 2
figure 2

Shows the processing stages in our analytics pipeline

Recent research has shown that it is highly desirable for machine learning models used in the healthcare domain to be explainable to healthcare providers and professionals [48]. Hence, we focused on the interpretability and explainability of input features in our dataset and the models we chose to explore. We restricted our investigation to models that are explainable, including regression models, multinomial logistic regression, random forests, and decision trees. We also developed separate models for newborns and non-newborns.

Brief description of the dataset

During our investigation, we utilized open-health data provided by the New York State SPARCS system. The data we accessed was from the year 2016, which was the most recent year available at the time. It was provided as a CSV file containing 2,343,429 rows and 34 columns, where each row contains de-identified in-patient discharge information. The columns contained several types of information: geographic descriptors related to the hospital where care was provided; demographic descriptors such as patient race, ethnicity, and age; medical descriptors such as the CCS diagnosis code, APR DRG code, severity of illness, and length of stay; and payment descriptors, which include the type of insurance, total charges, and total cost of the procedure.
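For illustration, the raw file can be loaded and inspected as follows; this is a minimal sketch, in which the file name is an assumption and the column names follow the SPARCS data dictionary:

```python
import pandas as pd

# Load the 2016 SPARCS in-patient discharge file (file name is an assumption).
df = pd.read_csv("sparcs_inpatient_discharges_2016.csv", low_memory=False)

print(df.shape)                     # expected: (2343429, 34)
print(df["Length of Stay"].head())  # LoS is reported in days, truncated at 120
```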

Detailed descriptions of all the elements in the data can be found in [49]. The CCS diagnosis code has been described earlier. The term “DRG” stands for Diagnostic Related Group [49], which is used by the Center for Medicare and Medicaid services in the U.S. for reimbursement purposes [50].

The data includes all patients who underwent inpatient procedures at all New York State Hospitals [51]. The payment for the care can come from multiple sources: Department of Corrections, Federal/State/Local/Veterans Administration, Managed Care, Medicare, Medicaid, Miscellaneous, Private Health Insurance, and Self-Pay. The dataset sourced from the New York State SPARCS system, encompassing a wider patient population beyond Medicare/Medicaid, holds greater value compared to datasets exclusively composed of Medicare/Medicaid patients. For instance, Gilmore et al. analyzed only Medicare patients [52].

We examine the distribution of the LoS in the dataset, as shown in Fig. 3. We note that the providers of the data have truncated the length of stay to 120 days. This explains the peak we see at the tail of the distribution.

Fig. 3
figure 3

Distribution of the length of stay in the dataset

Data pre-processing and cleaning

We identified 36,280 samples (1.55% of the data) with missing values, which were discarded from further analysis. We also removed samples with Type of Admission = ‘Unknown’ (0.02% of samples). The final dataset therefore has 2,306,668 samples. The columns ‘Payment Typology 2’ and ‘Payment Typology 3’ have missing values in at least 50% of samples; these missing values were replaced by a ‘None’ string.

We note that approximately 10% of the dataset consists of rows representing newborns, and we treat this group as a separate category. The ‘Birth Weight’ feature has a value of zero for non-newborn samples. Accordingly, to make better use of this feature, we partitioned the data into two classes, newborns and non-newborns, resulting in two classes of models: one for newborns and one for all other patients. We removed the ‘Birth Weight’ feature from the input for the non-newborn samples, as its value was zero for those samples.

The columns ‘Total Costs’ and ‘Total Charges’ are usually proportional to the LoS, and it would not be fair to use these variables to predict the LoS; hence, we removed them. The columns ‘Discharge Year’ and ‘Abortion Edit Indicator’ are redundant for LoS prediction models, and we removed them as well. We also removed the columns ‘CCS Diagnosis Description’, ‘CCS Procedure Description’, ‘APR DRG Description’, ‘APR MDC Description’, and ‘APR Severity of Illness Description’, as their corresponding numerical codes were already available as features.
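The pre-processing steps above can be summarized in a short pandas function. This is a sketch under the assumption that column names match the SPARCS data dictionary and that newborns are flagged via the Type of Admission field; it is not a verbatim excerpt of our code:

```python
import pandas as pd

def clean_sparcs(df):
    """Apply the pre-processing steps described above and split newborns from non-newborns."""
    # Sparsely populated payment columns are filled before rows with missing values are dropped.
    for col in ["Payment Typology 2", "Payment Typology 3"]:
        df[col] = df[col].fillna("None")

    # Discard rows with missing values and unknown admission types.
    df = df.dropna()
    df = df[df["Type of Admission"] != "Unknown"]

    # Drop columns that leak the target (costs/charges) or duplicate coded features.
    df = df.drop(columns=[
        "Total Costs", "Total Charges", "Discharge Year", "Abortion Edit Indicator",
        "CCS Diagnosis Description", "CCS Procedure Description", "APR DRG Description",
        "APR MDC Description", "APR Severity of Illness Description",
    ])

    # Partition into newborns and non-newborns; how newborns are flagged here is an assumption.
    is_newborn = df["Type of Admission"] == "Newborn"
    newborns = df[is_newborn]
    non_newborns = df[~is_newborn].drop(columns=["Birth Weight"])
    return newborns, non_newborns
```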

Since the focus of this paper is on the prediction of the LoS, we analyzed the distribution of LoS values in the dataset.

We developed regression models using all the LoS values, from 1 to 120 days. We also developed classification models in which we discretized the LoS into specific bins. Since the distribution of LoS values is not uniform, and is heavily clustered around smaller values, we discretized the LoS into a small number of bins, e.g. 6 to 8 bins.

We utilized 10% of the data as a holdout test-set, which was not seen during the training phase. For the remaining 90% of the data, we used tenfold cross-validation in order to train the model and determine the best parameters to use.
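A minimal sketch of this split using scikit-learn (the random seed is illustrative, and the cross-validation object is passed to the model-selection utilities used later):

```python
from sklearn.model_selection import train_test_split, KFold

# Hold out 10% of the data as an unseen test set.
train_df, test_df = train_test_split(non_newborns, test_size=0.10, random_state=42)

# Ten-fold cross-validation on the remaining 90% for model training and parameter selection.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
```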

Feature encoding

Many variables in the dataset are categorical; e.g., the variable “APR Severity of Illness Description” takes values in the set {Minor, Moderate, Major, Extreme}. We used distribution-dependent target encoding and one-hot encoding techniques to improve model performance [53]. For target encoding, we replaced each categorical value with the product of the mean LoS and the median LoS for that category, so that the encoded feature better captures how the distribution of the LoS depends on the value of the categorical feature.
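A minimal sketch of this target encoding; in practice the mapping would be learned on the training split and then applied to held-out data, and the column names shown here follow the SPARCS data dictionary:

```python
def target_encode(train, column, target="Length of Stay"):
    """Map each category to the product of the mean and median LoS observed for that category."""
    stats = train.groupby(column)[target].agg(["mean", "median"])
    return train[column].map(stats["mean"] * stats["median"])

train_df["APR Severity of Illness Code (enc)"] = target_encode(train_df, "APR Severity of Illness Code")
```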

For the linear regression model [54], we sampled a set of categorical features (‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, and ‘APR MDC Code’), which we target encoded with the mean and the median of the LoS. We then one-hot encoded every feature (all features are categorical) and, for each one-hot encoded feature, created a new feature for each member of the sampled set by replacing the ones in the one-hot encoded column with the target-encoded value of the corresponding sampled feature. For example, we one-hot encoded ‘Operating Certificate Number’; for samples where ‘Operating Certificate Number’ was 3, we created one new feature per sampled feature, in which those samples were assigned the target-encoded value of that sampled feature and all other samples were assigned zero. We used such techniques to exploit the linear relation between the LoS and each feature.
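A sketch of this interaction-feature construction is given below; the helper name and data frame layout are illustrative rather than a verbatim excerpt of our pipeline:

```python
import pandas as pd

def one_hot_times_target(df, onehot_column, target_encoded):
    """Cross a one-hot encoded column with a frame of target-encoded features.

    For each level of `onehot_column` and each target-encoded feature, a new column is created
    that holds the target-encoded value where the indicator is 1 and zero everywhere else.
    """
    dummies = pd.get_dummies(df[onehot_column], prefix=onehot_column).astype(float)
    new_cols = {}
    for dummy in dummies.columns:
        for feat in target_encoded.columns:
            new_cols[f"{dummy}__{feat}"] = dummies[dummy].to_numpy() * target_encoded[feat].to_numpy()
    return pd.DataFrame(new_cols, index=df.index)
```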

According to the sklearn documentation [55], a random forest regressor is “a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting”. The random forest regressor leverages ensemble learning based on many randomized decision trees to make accurate and robust predictions for regression problems. The averaging of many trees protects against single trees overfitting the training data.

The random forest classifier is also an ensemble learning technique and uses many randomized decision trees to make predictions for classification problems. The ‘wisdom of crowds’ concept suggests that the decision made by a large group of people is typically better than that of an individual. The random forest classifier uses this intuition and allows each decision tree to make a prediction; the most popular predicted class is chosen as the overall classification.

For the Random Forest Regressor [56, 57] and Random Forest Classifier [58], we used only the distribution-dependent target encoding described above, as random forest classifiers and regressors are unsuitable for sparse one-hot encoded columns.

Multinomial logistic regression is a type of regression analysis that predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. It allows for more than two discrete outcomes, extending binomial logistic regression for binary classification to models with multiple class membership. For the multinomial logistic regression model [59], we used only one-hot encoding, and not target encoding, as the target value was categorical.

Finally, we experimented with combinations of target encoding and one-hot encoding. We can either use target encoding, or one-hot encoding, or both. When both encodings are employed, the dimensionality of the data increases to accommodate the one-hot encoded features. For each combination of encodings, we also experimented with different regression models including linear regression and random forest regression.
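A sketch of the combined-encoding comparison, assuming X_onehot and X_target are the design matrices produced by the encodings above, y is the LoS, and train_idx/test_idx come from the holdout split:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Concatenate the one-hot and target-encoded design matrices, then compare models on held-out data.
X_both = pd.concat([X_onehot, X_target], axis=1)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=100, n_jobs=-1))]:
    model.fit(X_both.loc[train_idx], y.loc[train_idx])
    print(name, model.score(X_both.loc[test_idx], y.loc[test_idx]))  # R2 on the held-out set
```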

Feature importance, selection, and feature engineering

We experimented with different feature selection methods. Since the focus of our work is on developing interpretable and explainable models, we used SHAP analysis to determine relevant features.

To examine the importance of the different features in the dataset, we used the SHAP value (Shapley Additive Explanations), a popular measure of feature importance [60]. Intuitively, the SHAP value measures the difference in model predictions when a feature is included versus omitted. It is given by the following formula.

$$\phi_i(p)=\sum\limits_{S\subseteq N\setminus \{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\left(p(S\cup \{i\})-p(S)\right)$$

where \(\phi_i\) is the SHAP value of feature \(i\), \(p\) is the prediction by the model, \(n\) is the number of features, \(N\) is the set of all features, and \(S\) is any subset of features that does not include feature \(i\). The specific model we used for the prediction was the random forest regressor, with all features target encoded using the product of the mean and the median of the LoS, since most of the features are categorical.
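A sketch of how these SHAP values can be computed with the shap package for the tree-based model described above (the subsample size and hyperparameters shown are illustrative):

```python
import shap
from sklearn.ensemble import RandomForestRegressor

# Fit the regressor on the target-encoded features and explain its predictions.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X_encoded, y)
explainer = shap.TreeExplainer(model)

sample = X_encoded.sample(10_000, random_state=0)  # subsample for speed
shap_values = explainer.shap_values(sample)
shap.summary_plot(shap_values, sample)             # beeswarm plot as in Figs. 7 and 8
```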

Classification models

One approach to the problem is to bin the LoS into different classes, and train a classifier to predict which class an input sample falls in. We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs. 3 and 4.

Fig. 4
figure 4

A density plot of the distribution of the length of stay. The area under the curve is 1. We used a kernel density estimation with a Gaussian kernel [61] to generate the plot
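The binning described above can be expressed concisely with pandas; a minimal sketch, in which the bin edges follow the classes listed above and the column names are illustrative:

```python
import pandas as pd

# Discretize the LoS into the roughly balanced classes listed above.
bins = [0, 1, 2, 3, 6, 120]  # right-inclusive edges
labels = ["1 day", "2 days", "3 days", "4-6 days", "> 6 days"]
df["LoS class"] = pd.cut(df["Length of Stay"], bins=bins, labels=labels)
print(df["LoS class"].value_counts(normalize=True))
```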

We used three different classification models, comprising the following:

  1. Multinomial Logistic Regression

  2. Random Forest Classifier

  3. CatBoost classifier [62].

We used a Multinomial Logistic Regression model [59], trained and tested using tenfold cross validation, to classify the LoS into one of the bins. The multinomial logistic regression model is capable of providing explainable results, which is one of our requirements. We used the feature engineering techniques described in the previous section.

We used a Random Forest Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins. We used a maximum depth of 10 so as to get explainable insights into the model.

Finally, we used a CatBoost Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins.
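The three classifiers can be evaluated under a common cross-validation protocol, as in the sketch below; the hyperparameter values shown are illustrative rather than the tuned values reported in Table 16, and X and y_class denote the encoded feature matrix and binned LoS labels:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

classifiers = {
    "multinomial logistic regression": LogisticRegression(multi_class="multinomial", max_iter=1000),
    "random forest": RandomForestClassifier(max_depth=10, n_estimators=200, n_jobs=-1),
    "catboost": CatBoostClassifier(verbose=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y_class, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```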

Regression models

We used three different regression models with the feature engineering techniques mentioned above (Feature encoding section). These comprise:

  1. Linear regression

  2. Catboost regression

  3. Random forest regression

The linear regression was implemented using the nn.Linear() module in the open-source library PyTorch [63]. We used the Adam optimization algorithm [64] in a mini-batch setting to train the model weights for linear regression.
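A minimal sketch of this formulation, assuming the features have already been encoded into a numeric matrix X_train with target vector y_train; the batch size, learning rate, and epoch count shown are illustrative:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X_t = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
y_t = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)

model = nn.Linear(X_t.shape[1], 1)  # a single linear layer is linear regression
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

loader = DataLoader(TensorDataset(X_t, y_t), batch_size=1024, shuffle=True)
for epoch in range(10):  # the number of epochs shown is illustrative
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```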

We investigated CatBoost regression in order to create models with minimal feature sets, i.e., models with a small number of input features that still provide adequate results. Accordingly, we trained a CatBoost Regressor [65] to determine the relationship between combinations of features and the prediction accuracy, as measured by the R2 correlation score.
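A sketch of this feature-subset experiment is shown below; the particular subsets listed are examples, and Fig. 20 reports the full set of combinations we evaluated:

```python
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score

feature_sets = [
    ["APR DRG Code"],
    ["APR DRG Code", "APR Severity of Illness Code"],
    ["APR DRG Code", "APR Severity of Illness Code", "Patient Disposition"],
    ["APR DRG Code", "APR Severity of Illness Code", "Patient Disposition", "APR MDC Code"],
]

for features in feature_sets:
    model = CatBoostRegressor(verbose=0, cat_features=features)
    model.fit(train_df[features], train_df["Length of Stay"])
    preds = model.predict(test_df[features])
    print(features, round(r2_score(test_df["Length of Stay"], preds), 3))
```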

The random forest regression was implemented using the function RandomForestRegressor() in scikit learn [55].

Model performance measures

For the regression models, we used the following metrics to compare the model performance.

  1. The R2 score and the p-value. We use a significance level of α = 0.05 (5%) for our statistical tests. If the p-value is less than α = 0.05, then the R2 score is statistically significant.
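A minimal sketch of how this score can be computed from held-out predictions, taking R2 as the squared Pearson correlation between actual and predicted LoS, with the p-value of that correlation (variable names are illustrative):

```python
from scipy.stats import pearsonr

# R2 computed as the squared Pearson correlation between actual and predicted LoS.
r, p_value = pearsonr(y_test, y_pred)
print(f"R2 = {r**2:.3f}, p-value = {p_value:.3g}")
```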

For classifier models, we used the following metrics to compare the model performance.

  1. True positive rate, false negative rate, and F1 score [66].

  2. The Brier score, computed using Brier’s original formulation [67]. In this formulation, for R classes the Brier score B can vary between 0 and R, with 0 being the best possible score (a sketch of this computation appears after this list):

     $$B= \frac{1}{N}\sum_{i}\sum_{c}\left( {\widehat{y}}_{i,c} - {I}_{i,c}\right)^{2}$$

     where \({\widehat{y}}_{i,c}\) is the class probability predicted by the model, and \({I}_{i,c}=1\) if the i-th sample belongs to class c and \({I}_{i,c}=0\) otherwise.

  3. The Delong test [68], used to compare the AUC for different classifiers.
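A minimal sketch of the Brier score computation above, applied to the class probabilities returned by a fitted classifier (variable names are illustrative):

```python
import numpy as np

def multiclass_brier(prob, y_true):
    """Brier's original multi-class score: 0 (best) to R (worst) for R classes.

    `prob` is an (N, R) array of predicted class probabilities and `y_true` holds
    integer class labels in the range 0..R-1.
    """
    indicator = np.zeros_like(prob)
    indicator[np.arange(prob.shape[0]), y_true] = 1.0
    return float(np.mean(np.sum((prob - indicator) ** 2, axis=1)))

# e.g. multiclass_brier(clf.predict_proba(X_test), y_test_encoded)
```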

These metrics will allow other researchers to replicate our study and provide benchmarks for future improvements.

Results

In this section we present the results of applying the techniques in the Methods section.

Descriptive statistics

We provide descriptive statistics that help the reader understand the distributions of the variables of interest.

Table 1 summarizes basic statistical properties of the LoS variable.

Table 1 Descriptive statistics regarding the LoS variable

Figure 5 shows the distribution of the LoS variable for newborns.

Fig. 5
figure 5

This figure depicts the distribution of the LoS variable for newborns

Table 2 shows the top 20 APR DRG descriptions based on their frequency of occurrence in the dataset.

Table 2 This table depicts the frequency of occurrence of the top 20 APR DRG descriptions in the dataset

Figure 6 shows the distribution of the LoS variable for the top 20 most frequently occurring APR DRG descriptions shown in Table 2.

Fig. 6
figure 6

A 3-D plot showing the distribution of the LoS for the top 20 most frequently occurring APR DRG descriptions. The x-axis (horizontal) depicts the LoS, the y-axis shows the APR DRG codes, and the z-axis shows the density or frequency of occurrence of the LoS

Feature encoding

We experimented with different encoding schemes for the categorical variables and for each encoding we examined different regression techniques. Our results are shown in Table 3. We experimented with the three encoding schemes shown in the first column. The last row in the table shows a combination of one-hot encoding and target encoding, where the number of columns in the dataset are increased to accommodate one-hot encoded feature values for categorical variables.

Table 3 The regression results produced by varying the encoding scheme and the model. This data is for non-newborns

Feature importance, selection and feature engineering

We obtained the SHAP plots using a Random Forest Regressor trained with target-encoded features.

Figures 7 and 8 show the SHAP value plots obtained for the features in the newborn partition of the dataset. We find that the features “APR DRG Code”, “APR Severity of Illness Code”, “Patient Disposition”, and “CCS Procedure Code” are very useful in predicting the LoS. For instance, high feature values for “APR Severity of Illness Code”, encoded by red dots, have higher SHAP values than the blue dots, which correspond to low feature values.

Fig. 7
figure 7

SHAP Value plot for newborns

Fig. 8
figure 8

1-D SHAP plot, in order of decreasing feature importance: top to bottom (for non-newborns)

A similar interpretation can be applied to the features in the non-newborn partition of the dataset. We note that “Operating Certificate Number” is among the top-10 most important features in both the newborn and non-newborn partitions. This finding is discussed in the Discussion section.

From Fig. 9, we observe that as the severity of illness code increases from 1 to 4, there is a corresponding increase in the SHAP values.

Fig. 9
figure 9

A 2-D plot showing the relationship between SHAP values for one feature, “APR Severity of Illness Code”, and the feature values themselves (non-newborns)

To further understand the relationship between the APR Severity of Illness code and the LoS, we created the plot in Fig. 10. This shows that the most frequently occurring APR Severity of Illness code is 1 (Minor), and that the most frequently occurring LoS is 2 days. We provide this 2-D projection of the overall distribution of the multi-dimensional data as a way of understanding the relationship between the input features and the target variable, LoS.

Fig. 10
figure 10

A density plot showing the relationship between APR Severity of Illness Code and the LoS. The color scale on the right determines the interpretation of colors in the plot. We used a kernel density estimation with a Gaussian kernel [61] to generate the plot

Similarly, Fig. 11 shows the relationship between the birth weight and the length of stay. The most common length of stay is two days.

Fig. 11
figure 11

A density plot showing the distribution of the birth weight values (in grams) versus the LoS. The colorbar on the right shows the interpretation of color values shown in the plot. We used a kernel density estimation with a Gaussian kernel [61] to generate the plot

Classification

We obtained a classification accuracy of 46.98% using Multinomial Logistic Regression with tenfold cross-validation in the 5-class classification task for non-newborn cases. The confusion matrix in Fig. 12 shows that the highest density of correctly classified samples is in or close to the diagonal region. The regions where our model fails occur between adjacent classes, as can be inferred from the confusion matrix.

Fig. 12
figure 12

Confusion matrix for classification of non-newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

For the newborn cases, we obtained a classification accuracy of 60.08% using a Random Forest Classifier with tenfold cross-validation in the 5-class classification task. The confusion matrix in Fig. 13 shows that the majority of data samples lie in or close to the diagonal region. The regions where our model does not do well occur between adjacent classes, as can be inferred from the confusion matrix.

Fig. 13
figure 13

Confusion matrix for classification of newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers

The density plot in Fig. 14 shows the relationship between the actual LoS and the predicted LoS. For a LoS of 2 days, the centroid of the predicted LoS cluster is between 2 and 3 days.

Fig. 14
figure 14

Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [61] to generate the plot

A quantitative depiction of our model errors is shown in Fig. 15 and is interpreted as follows. Referring to the column for LoS = 2, the top row shows that 51% of the predicted LoS values for an actual stay of 2 days are also 2 days (zero error), and 23% of the predicted values for an actual LoS of 2 days have an error of 1 day, and so on. The relatively high values in the top row indicate that the model is performing well, with an error of less than 1 day. There are relatively few instances of errors between 2 and 3 days (typically less than 10% of the values show up in this row). The only exception is the class corresponding to LoS greater than 8 days; the truncation of the data to produce this class results in larger model errors for this class specifically.

Fig. 15
figure 15

Shows the distribution of correctly predicted LoS values for each class used in our model (non-newborns). Along the columns, we depict the different classes used in the model, consisting of LoS equal to 1, 2, 3, …, 8 days, and more than 8 days. Each row depicts a different prediction error. For instance, the top row depicts an error of less than or equal to one day between the actual LoS and the predicted LoS, the second row from the top depicts an error greater than 1 day and less than or equal to 2 days, and so on for the other rows

Regression

Figures 16 and 17 show the scatter plots for the linear regression models. The exact line represents a line with slope 1, and a perfect model would be one that produced all points lying on this line.

Fig. 16
figure 16

Scatter plot showing an instance of a linear regression fit to the data (newborns). The R2 score is 0.82. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

Fig. 17
figure 17

Scatter plot for linear regression. (non-newborns). The R2 score is 0.42. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)

Figure 18 shows a density plot depicting the relationship between the predicted length of stay and the actual length of stay.

Fig. 18
figure 18

Shows the density plot of the predicted length of stay versus actual length of stay for the regression model for non-newborns. We used a kernel density estimation with a Gaussian kernel [61] to generate the plot. The best fit regression line to our predictions is shown in green, whereas the blue line represents the ideal fit (line of slope 1, where actual LoS and predicted LoS are equal)

Most of the existing literature on LoS prediction is based on data for specific disease conditions such as cancer or cardiac disease. Hence, in order to understand which CCS diagnosis codes produce good model fits, we produced the plot in Fig. 19.

Fig. 19
figure 19

This figure shows the three CCS diagnosis codes that produced the top three R2 scores using linear regression. These are 101, 100 and 109. The three CCS Diagnosis codes that produced the lowest R2 scores are 159, 657, and 659

Table 4 provides descriptions and R2 scores for the 3 CCS Diagnosis codes in Fig. 19 with the highest R2 scores using linear regression.

Table 4 CCS Diagnosis codes, descriptions and R2 Scores for the top 3 CCS codes in Fig. 19
Table 5 CCS Diagnosis codes, descriptions and R2 Scores for the lowest 3 CCS codes in Fig. 19

Similarly, Table 5 shows the 3 CCS Diagnosis codes in Fig. 19 with the lowest R2 scores using linear regression.

Models with minimal feature sets

We trained a CatBoost Regressor [65] on the complete dataset in order to determine the relationship between combinations of features and the prediction accuracy, as measured by the R2 correlation score. This is shown in Fig. 20.

Fig. 20
figure 20

The labels for each row on the left show combinations of different input features. A CatBoost regression model was developed using each selected combination of features. The R2 correlation score for each model is shown in the bar graph

We can infer from Fig. 20 that only four features (‘APR MDC Code’, ‘APR Severity of Illness Code’, ‘APR DRG Code’, and ‘Patient Disposition’) are sufficient for the model to come very close to its maximum performance. We obtained similar, concurring results when using other regression models for the same experiment.

Classification trees

We used a random forest tree approach to generate the trees in Figs. 21 and 22.

Fig. 21
figure 21

A random forest tree that represents a best-fit model to the data for newborns. With 4 levels of the decision tree, the R2 score is 0.65

Fig. 22
figure 22

A random forest tree using only a tree of depth 3 that represents a best-fit model to the data for non-newborns. The R2 score is 0.28. We can generate trees with greater depth that better fit the data, but we have shown only a depth of 3 for the sake of readability in the printed version of this paper. Otherwise, the tree would be too large to be legible on this page. The main point in this figure is to showcase the ease of interpretation of the working of the model through rules

Model performance measures

Regression

We used tenfold cross validation to determine the regression scores. The results are summarized in Tables 6 and 7.

Table 6 This table summarizes the R2 scores for three different regression models we investigated. This computation is for non-newborns
Table 7 This table summarizes the R2 scores for three different regression models we investigated. This computation is for newborns

Classification

We computed the multi-class classifier metrics for logistic regression, using one-hot encoding, for non-newborns. The results are presented in Table 8. The first row represents the accuracy of the classifier when Class 0 is compared against the rest of the classes; a similar one-versus-rest interpretation applies to the other rows in the table. The macro average gives the balanced recall and precision, and the resulting F1 score. The weighted average gives a support-weighted (number of samples) average of the individual class metrics. The overall accuracy is computed by dividing the total number of correct predictions (49,686) by the total number of samples (105,932), which yields a value of 0.47.

Table 8 Evaluation of multi-class classifier metrics for logistic regression for non-newborns. The macro-averaged scores are computed using the arithmetic mean of all the per-class scores. The weighted average scores are computed by using the support values as the weights

For the category of non-newborns, Fig. 23 provides a graphical plot that visualizes the ROC curves for the different multiclass classifiers we developed.

Fig. 23
figure 23

This figure applies to data concerning non-newborns. We show the multiclass ROC curves for the performance of the catboost classifier for the different classes shown. The area under the ROC curve is 0.7844

In Table 9 we compare the performance of our multiclass classifier using logistic regression developed on 2016 SPARCS data against 2017 SPARCS data.

Table 9 In the first scenario, we developed a multiclass classifier using logistic regression with the 2016 SPARCS dataset. The performance of the classifier is shown for the year 2016. In the second scenario, we used this trained classifier against the 2017 SPARCS dataset. This table compares the performance of the classifier for the categories of newborns and non-newborns in these two scenarios

In order to compare the performance of the different classifiers, we computed the AUC measures reported in Table 10. Figure 24 visualizes the data in Table 10 and Fig. 25 visualizes the data in Table 11. In Tables 12 and 13 we report the results of computing the Delong test for non-newborns and newborns respectively. In Tables 14 and 15 we report the Brier scores for non-newborns and newborns respectively.

Table 10 We report the AUC scores for the three different classifiers we used, logistic regression, random forest and catboost. This is for the case of non-newborns. The last column computes the average AUC over the previous columns
Fig. 24
figure 24

A bar chart that depicts the data in Table 10 for non-newborns

Fig. 25
figure 25

A bar chart that depicts the data in Table 11

Table 11 We report the AUC scores for the three different classifiers we used, logistic regression, random forest and catboost. This is for the case of newborns. The last column computes the average AUC over the previous columns
Table 12 This table uses data for non-newborns. We report the results of using the Delong test to conduct a pairwise comparison of the AUCs generated by two models at a time. For each model, we measured the performance of binary classifiers, designated as “One vs. rest for Class 0”, “One vs. rest for Class 1” and so on. A positive value for the Delong test statistic indicates that the AUC for the first model is larger than the AUC for the second model
Table 13 This table uses data for newborns. We report the results of using the Delong test to conduct a pairwise comparison of the AUCs generated by two models at a time. For each model, we measured the performance of binary classifiers, designated as “One vs. rest for Class 0”, “One vs. rest for Class 1” and so on. A positive value for the Delong test statistic indicates that the AUC for the first model is larger than the AUC for the second model
Table 14 We report the Brier scores computed for the performance of the different classifier models we developed. This table uses data from non-newborns
Table 15 We report the Brier scores computed for the performance of the different classifier models we developed. This table uses data from newborns

Model parameters

In Table 16 we present the parameter and hyperparameter values used in the different models.

Table 16 Model parameter and hyperparameter values used

Additional results shown in the Appendix/Supplementary material

Due to space restrictions, we show additional results in the Appendix/Supplementary Material. These results are in tabular form and describe the R2 scores for different segmentations of the variables in the dataset, e.g. according to age group, severity of illness code, etc.

Discussion

The most significant result we obtain is shown in Figs. 21 and 22, which provide an interpretable view of the working of the decision trees produced by random forest modeling. Figure 21, for newborns, shows that the birth weight features prominently in the decision tree, occurring at the root node. Low birth weights are represented on the left side of the tree and are typically associated with longer hospital stays. Higher birth weights occur on the right side of the tree, and the node in the bottom row with 189,574 samples shows that the most frequently occurring predicted stay is 2.66 days. Figure 22, for non-newborns, shows that the features “APR DRG Code”, “APR Severity of Illness Code” and “Patient Disposition” are the most important top-level features for predicting the LoS. This provides a relatively simple rule-based model, which can be easily interpreted by healthcare providers as well as patients. For instance, the right-most branch of the tree classifies the input data into a relatively high LoS (46 days) when the branch conditions APR DRG Code greater than 813.55 and APR Severity of Illness Code less than 91 are met.

The results in Fig. 19 and Table 4 show that if we restrict our model to specific CCS Diagnosis descriptions such as “coronary atherosclerosis and other heart disease”, we obtain a good R2 Score of 0.62. The objective of our work is not to cherry-pick CCS Diagnosis codes that produce good results, but rather to develop a single model for the entire SPARCS dataset to obtain a birds-eye perspective. For future work, we can explicitly build separate models for each CCS Diagnosis code, and that could have relevance to specific medical specialties, such as cardiovascular care.

Similarly, the results in Fig. 19 and Table 5 show that there are CCS Diagnosis codes corresponding to schizophrenia and mood disorders that produce a poor model fit. Factors that contribute to this include the type of data in the SPARCS dataset, where information about patient vitals, medications, or a patient’s income level is not provided, and the inherent variability in treating schizophrenia and mood disorders. Baeza et al. [69] identified several variables that affect the LoS in psychiatric patients, which include psychiatric admissions in the previous years, psychiatric rating scale scores, history of attempted suicide, and not having sufficient income. Such variables are not provided in the SPARCS dataset. Hence a policy implication is to collect and make such data available, perhaps as a separate dataset focused on mental health issues, which have proven challenging to treat.

Figures 16 and 17 show that a better regression fit is obtained when a specific CCS Diagnosis code is used to build the model, such as “Newborn” in Fig. 16. To put these results in context, we note that it is difficult to obtain a high R2 value for healthcare datasets in general, and especially for large numbers of patient samples that span multiple hospitals. For instance, Bertsimas [70] reported an R2 value of 0.2 and Kshirsagar [71] reported an R2 value of 0.33 for similar types of prediction problems as studied in this paper.

Further details for a segmentation of R2 scores by the different variable categories are shown in the Appendix/Supplementary Material section. For instance, the table corresponding to Age Groups shows that there is close agreement between the mean of the predicted LoS from our model and the actual LoS. Furthermore, the mean LoS increases steadily from 4.8 days for Age group 0–17 to 6.4 days for ages 70 or older. A discussion of these tables is outside the scope of this paper. However, they are being provided to help other researchers form hypotheses for further investigations or to find supporting evidence for ongoing research.

Table 3 shows that the best encoding scheme is to combine target encoding with one-hot encoding and then apply linear regression. This produces an R2 score of 0.42 for the non-newborn data, which is the best fit we could obtain. This table also shows that significant improvements can be obtained by exploring the search space which consists of different strategies of feature encoding and regression methods. There is no theoretical framework which determines the optimum choice, and the best method is to conduct an experimental search. An important contribution of the current paper is to explore this search space so that other researchers can use and build upon our methodology.

The distribution of errors in Fig. 15 shows that the truncation we employed at a LoS of 8 days produces artifacts in the prediction model, as all stays of greater than 8 days are lumped into one class. Nevertheless, the distribution of LoS values in Fig. 4 shows that a relatively small number of data samples have a LoS greater than 8 days. Investigating different truncation levels is outside the scope of the current paper and is left for future work. By using our methodology, the truncation level can also be tuned by practitioners in the field, including hospital administrators and other researchers.

Our results in Fig. 7 show that certain features are not useful in predicting the LoS. The SHAP plot shows that features such as race, gender, and ethnicity are among them. Had this not been the case, it would imply a systemic bias based on race, gender or ethnicity; for instance, a person of a given race might have a smaller LoS based purely on their demographic identity, which would be unacceptable in the medical field. It is satisfying to see that a large and detailed healthcare dataset does not show evidence of such bias.

To place this finding in context, racial bias is an important area of research in the U.S., especially in fields such as criminology and access to financial services such as loans. In the U.S., it is well known that there is a disproportional imprisonment of black and Hispanic males [72]. Researchers working on criminal justice have determined that there is racial bias in the process of sentencing and granting parole, with blacks being adversely affected [73]. This bias is reinforced through any algorithms that are trained on the underlying data. There is evidence that banks discriminate against applicants for loans based on their race or gender [74].

This does not appear to be the case in our analysis of the SPARCS data. Though we did not specifically investigate the issue of racial bias in the LoS, the feature analysis we conducted automatically provides relevant answers. Other researchers including those in the U.K [21] have also determined that gender does not have an effect on LoS or costs. Hence the results in the current paper are consistent with the findings of other researchers in other countries working on entirely different datasets.

From Table 6 we see that in the case of data concerning non-newborns, the catboost regression performs the best, with an R2 score of 0.432. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through catboost regression is statistically significant. Similarly, the p-values for linear regression and random forest regression indicate that these models produce predictions that are statistically significant, i.e. they did not occur by random chance.

Table 7, which refers to data from newborns, shows that linear regression performs the best, with an R2 score of 0.82. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through linear regression is statistically significant. Similarly, the p-values for random forest regression and catboost regression indicate that these models produce predictions that are statistically significant.

We examine the performance of classifiers on non-newborn data, as shown in Tables 10 and 12. The Delong test conducted in Table 12 shows that there is a statistically significant difference between the AUCs of the pairwise comparisons of the models. Hence, we conclude that the catboost classifier performs the best with an average AUC of 0.7844. We also note that there is a marginal improvement in performance when we use the catboost classifier instead of the random forest classifier. Both the catboost classifier and the random forest classifier perform better than logistic regression. We conclude that the best performing model for non-newborns is the catboost classifier, followed by the random forest classifier, and then logistic regression.

In the case of newborn data, we examine the performance of the classifiers as shown in Tables 11 and 13. From Table 13, we note that the p-values in all the rows are less than 0.05, except for the binary class “one vs. rest for class 3”, random forests vs. catboost. Hence, for this particular comparison between the random forest classifier and the catboost classifier for “one vs. rest for class 3”, we cannot conclude that there is a statistically significant difference between the performance of these two classifiers. From Table 11 we observe that the AUCs of these two classifiers are very similar. We also note that only about 10% of the dataset consists of newborn cases.

From Table 14 we note that the Brier score for the catboost classifier is the lowest. A lower Brier score indicates better performance. According to the Brier scores for the non-newborn data, the catboost classifier performs the best, followed by the random forest classifier and then logistic regression. Table 15 shows that for newborns, the random forest classifier performs the best, followed by the catboost classifier and logistic regression. The performance of the random forest classifier and catboost classifier are very similar.

From a practical perspective, it may make sense to use a catboost classifier on both newborn and non-newborn data as it simplifies the processing pipeline. The ultimate decision rests with the administrators and implementers of these decision systems in the hospital environment.

Burn et al. observe [21] that though the U.S. has reported similar declines in LoS as in the U.K, the overall costs of joint replacement have risen. The U.K. government created policies to encourage the formation of specialist centers for joint replacement, which have resulted in reduction in the LoS as well as delivering cost reductions. The results and analysis presented in our current paper can help educate patients and healthcare consumers about trends in healthcare costs and how they can be reduced. An informed and educated electorate can press their elected representatives to make changes to the healthcare system to benefit the populace.

Hachesu et al. [22] examined the LoS for cardiac disease patients, using data from around 5,000 patients and 35 input variables to build a predictive model. They found that the LoS was longer in patients with high blood pressure. In contrast, our method uses data from 2.3 million patients and considers multiple disease conditions simultaneously. We also do not have access to patient vitals such as blood pressure measurements, owing to the limitations of the existing New York State SPARCS data.

Garcia et al. [23] conducted a study of elderly patients (age greater than 60) to understand factors governing the LoS for hip fracture treatment. Using 660 patient records, they determined that the most significant variable was the American Society of Anesthesiologists (ASA) classification. The ASA score ranges from 1 to 5 and captures the anesthesiologist’s impression of a patient’s health and comorbidities at the time of surgery. Garcia et al. showed a monotonically increasing relationship between the ASA score and the LoS, but did not build a specific predictive model. Their work shows that it is possible to find single variables with significant information content for estimating the LoS. The New York SPARCS dataset that we used does not contain the ASA score. Hence, a policy implication of our research is to alert healthcare authorities to include variables such as the ASA score, where relevant, in future data releases. The additional storage required is very small (one additional byte per patient record).

Arjannikov et al. [25] developed predictive models by binarizing the LoS into two categories, e.g., LoS ≤ 2 days vs. LoS > 2 days. We did not employ such a coarse discretization; instead, we used continuous regression techniques as well as classification into more than two bins, since it is preferable to stay as close to the actual data as possible.
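
The difference between these target formulations can be shown with a short pandas sketch; the LoS values and bin edges below are illustrative and are not the exact bins used in our pipeline.

import numpy as np
import pandas as pd

los_days = pd.Series([1, 2, 3, 5, 8, 14, 30])

# Binary target in the style of [25]: LoS <= 2 days vs. LoS > 2 days.
binary_target = (los_days > 2).astype(int)

# Multi-bin target: several LoS ranges rather than a single cut-point.
multi_bin_target = pd.cut(los_days, bins=[0, 2, 5, 10, np.inf],
                          labels=["0-2", "3-5", "6-10", ">10"])

# Continuous target: the LoS itself, used directly by the regression models.
print(pd.DataFrame({"LoS": los_days, "binary": binary_target, "bins": multi_bin_target}))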

Almashrafi et al. [27] and Cots et al. [75] observed that larger hospitals tended to have longer LoS for patients undergoing cardiac surgery. Though we did not specifically examine cardiac surgery outcomes, our feature analysis indicated that the hospital operating certificate number had lower relevance than features such as the DRG code. Nevertheless, the SHAP plots in Figs. 7 and 8 show that the hospital operating certificate number is among the top 10 features ranked by SHAP value. We will investigate this relationship in more detail in future research, as it requires determining hospital size from the operating certificate number and creating an appropriate machine-learning model. The Appendix contains results showing certain operating certificate numbers for which the model fits the data well.
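
As a hedged sketch of how SHAP values can surface a feature such as the operating certificate number, the snippet below fits a CatBoost regressor on synthetic data and ranks features by mean absolute SHAP value; the feature names and data are placeholders, not the SPARCS columns or our exact model.

import numpy as np
import pandas as pd
import shap
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["drg_code", "severity", "age_group", "operating_cert_no"])
y = 3 + 2 * X["drg_code"] + 0.3 * X["operating_cert_no"] + rng.normal(size=500)

model = CatBoostRegressor(iterations=200, verbose=False).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# Rank features by mean absolute SHAP value (largest = most influential).
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False))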

A major focus of our research is on building interpretable and explainable models. Based on the principle of parsimony, it is preferable to utilize models which involve fewer features. This will provide simpler explanations to healthcare professionals as well as patients. We have shown through Fig. 20 that a model with five features performs just as well as a model with seven features. These features also make intuitive sense and the model’s operation can be understood by both patients and healthcare providers.
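
One simple way to carry out such a parsimony check is to compare cross-validated R2 scores for a smaller and a larger feature subset, as sketched below on synthetic data with illustrative feature names.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]
X = pd.DataFrame(rng.normal(size=(1000, 7)), columns=cols)
y = X[["f1", "f2", "f3"]].sum(axis=1) + rng.normal(scale=0.5, size=1000)

for subset in (cols[:5], cols[:7]):
    scores = cross_val_score(RandomForestRegressor(random_state=0),
                             X[subset], y, cv=5, scoring="r2")
    print(f"{len(subset)} features: mean cross-validated R2 = {scores.mean():.3f}")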

Patients in the U.S. increasingly have to pay for medical procedures out-of-pocket, as insurance payments do not cover all the expenses, leading to unexpectedly large bills [76]. Many patients in the U.S. also do not have health insurance and are consequently charged the highest prices [77]. Kullgren et al. [78] observe that patients in the U.S. need to be discerning healthcare consumers, as they can optimize the value they receive from out-of-pocket spending. In addition to estimating the cost of medical procedures, patients will also benefit from estimating the expected duration of a procedure such as joint replacement. This will allow them to budget adequate time for their medical procedures. Patients and consumers will benefit from obtaining estimates from an unbiased open data source such as New York State SPARCS and from the use of our model.

Other researchers have developed LoS models for particular health conditions, such as cardiac disease [22], hip replacement [21], cancer [26], or COVID-19 [24]. In addition, researchers typically assume a prior statistical distribution for the outcomes, such as a Weibull distribution [24]. We have not assumed any specific prior statistical distribution, nor have we restricted our analysis to specific diseases. Consequently, our model and techniques should be more widely applicable, especially in the face of rapidly changing disease trajectories worldwide.

Our study is based exclusively on freely available open health data. Consequently, we cannot control the granularity of the data and must use the data as-is. We are unable to obtain more detailed patient information, such as physiological variables (e.g., blood pressure and heart-rate variability) at the time of admission and during the stay. Hospitals, healthcare providers, and insurers have access to such data, but there is no mandate for them to make it available to researchers outside their own organizations. Sometimes they sell de-identified data to interested parties such as pharmaceutical companies [79]. Due to the high costs involved in purchasing this data, researchers worldwide, especially in developing countries, are at a disadvantage in developing AI algorithms for healthcare.

There is growing recognition that medical researchers need to standardize the data formats and tools used for their analyses, and share them openly. One such effort is the Observational Health Data Sciences and Informatics (OHDSI) collaborative, described in [80].

Twitter has demonstrated an interesting path forward, where a small percentage of its data was made available freely to all users for non-commercial purposes through an API [81]. Recently, Twitter has made a larger proportion of its data available to qualified academic researchers [82]. In the future, the profit motives of companies need to be balanced with considerations for the greater public good. An advantage of using the Twitter model is that it spurs more academic research and allows universities to train students and the workforce of the future on real-world and relevant datasets.

In the U.S., a new law went into effect in January 2021 requiring hospitals to make pricing data available publicly. The premise is that this data would provide better transparency into the workings of the U.S. healthcare system and lead to cost efficiencies. However, most hospitals are not in compliance with this law [83]. Concerted efforts by government officials, as well as pressure from the public, will be necessary to achieve compliance. If the eventual release of such data is not accompanied by a corresponding interest from academicians, healthcare researchers, policymakers, and the public, the very premise of the utility of this data is likely to be called into question. Furthermore, merely dumping large quantities of data into the public domain is unlikely to benefit anyone. Hence, research efforts such as the one presented in this paper will be valuable in demonstrating the utility of this data to all stakeholders.

Our machine-learning pipeline can easily be applied to new data that will be released periodically by New York SPARCS, and also to hospital pricing data [83]. Due to our open-source methodology, other researchers can easily extend our work and apply it to extract meaning from open health data. This improves reproducibility, which is an essential aspect of science. We will make our code available on GitHub to interested researchers for non-commercial purposes.

Limitations of our models

Our models are restricted to the data available through New York State SPARCS, which does not provide detailed information about patient vitals. More detailed physiological data is available through the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) framework [84], though for a smaller number of patients. We plan to extend our methodology to handle such data in the future. Another limitation of our study is that it does not account for patient co-morbidities. This arises from the de-identification process used to release the SPARCS data, in which identifying patient information is removed. Hence we are unable to analyze multiple hospital admissions for a given patient, possibly for different conditions. The main advantage of our approach is that it uses large-scale population data (2.3 million patients), albeit at a coarse level of granularity where physiological data is not available. Nevertheless, our approach offers a high-level view of the operation of the healthcare system, which provides valuable insights.

Conclusion

There is growing interest in using data analytics to increase government transparency and inform policymaking. It is expected that the meaning and insights gained from such evidence-based analysis will translate to better policies and optimal usage of the available infrastructure. This requires cooperation between computer scientists, domain experts, and policy makers. Open healthcare data is especially valuable in this context due to its economic significance. This paper presents an open-source analytics system to conduct evidence-based analysis on openly available healthcare data.

The goal is to develop interpretable machine learning models that identify key drivers and make accurate predictions related to healthcare costs and utilization. Such models can provide actionable insights to guide healthcare administrators and policy makers. A specific illustration is provided via a robust machine learning pipeline that predicts hospital length of stay across 285 disease categories based on 2.3 million de-identified patient records. The length of stay is directly related to costs.

We focused on the interpretability and explainability of input features and the resulting models. Hence, we developed separate models for newborns and non-newborns, given differences in input features. The best performing model for non-newborn data was catboost regression, which achieved an R2 score of 0.43. The best performing model for newborn data was linear regression, which achieved an R2 score of 0.82. Key newborn predictors included birth weight, while non-newborn models relied heavily on the diagnostic-related group classification. This demonstrates model interpretability, which is important for adoption. There is an opportunity to further improve performance for specific diseases. If we restrict our analysis to cardiovascular disease, we obtain an improved R2 score of 0.62.

The presented approach has several desirable qualities. Firstly, transparency and reproducibility are enabled through the open-source methodology. Secondly, the model generalizability facilitates insights across numerous disease states. Thirdly, the technical framework can easily integrate new data while allowing modular extensions by the research community. Lastly, the evidence generated can readily inform multiple key stakeholders including healthcare administrators planning capacity, policy makers optimizing delivery, and patients making medical decisions.

Availability of data and materials

Data is publicly available at the website mentioned in the paper, https://www.health.ny.gov/statistics/sparcs/

The website contains an “About Us” tab with contact details. The authors are not affiliated with this website, which is maintained by New York State.

References

  1. Gurría A. Openness and Transparency - Pillars for Democracy, Trust and Progress. OECD.org. Available: https://www.oecd.org/unitedstates/opennessandtransparency-pillarsfordemocracytrustandprogress.htm. Accessed 28 June 2024.

  2. Jetzek T. The Sustainable Value of Open Government Data: Uncovering the Generative Mechanisms of Open Data through a Mixed Methods Approach. Copenhagen Business School, Department of IT Management; 2015.

  3. Move fast and heal things: How health care is turning into a consumer product. The Economist. 2022. https://www.economist.com/business/how-health-care-is-turning-into-a-consumer-product/21807114. Accessed 28 June 2024.

  4. New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS). https://www.health.ny.gov/statistics/sparcs/. Accessed 5 Oct 2022.

  5. Rao AR, Chhabra A, Das R, Ruhil V. A framework for analyzing publicly available healthcare data. In 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom). 2015: IEEE, pp. 653–656.

  6. Rao AR, Clarke D. A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications. In IEEE International Conference on Healthcare Informatics ICHI, Chicago. 2016: IEEE, pp. 255–261.

  7. Rao AR, Garai S, Dey S, Peng H. PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data. SN Computer Science. 2021;2(6):1–22.

  8. Rao AR, Rao S, Chhabra R. Rising mental health incidence among adolescents in Westchester, NY. Community Ment Health J. 2021:1–1. 

  9. Boylan J F. My $145,000 Surprise Medical Bill. New York Times. 2020. https://www.nytimes.com/2020/02/19/opinion/surprise-medical-bill.html. Accessed 28 June 2024.

  10. Peterson K, Bykowicz J. Congress Debates Push to End Surprise Medical Billing. Wall Street J. 2020. https://www.wsj.com/articles/congress-debates-push-to-end-surprise-medical-billing-11589448603. Accessed 28 June 2024.

  11. Wang S, Zhang J, Fu Y, Li Y. ACM TIST Special Issue on Deep Learning for Spatio-Temporal Data: Part 1. 12th ed. NY: ACM New York; 2021. p. 1–3.

  12. Jones R. Declining length of stay and future bed numbers. BJHCM. 2015;21(9):440–1.

  13. Daghistani TA, Elshawi R, Sakr S, Ahmed AM, Al-Thwayee A, Al-Mallah MH. Predictors of in-hospital length of stay among cardiac patients: a machine learning approach. Int J Cardiol. 2019;288:140–7.

  14. Sen-Crowe B, Sutherland M, McKenney M, Elkbuli A. A closer look into global hospital beds capacity and resource shortages during the COVID-19 pandemic. J Surg Res. 2021;260:56–63.

  15. Stone K, Zwiggelaar R, Jones P, Mac Parthaláin N. A systematic review of the prediction of hospital length of stay: Towards a unified framework. PLOS Digital Health. 2022;1(4):e0000017.

  16. Lequertier V, Wang T, Fondrevelle J, Augusto V, Duclos A. Hospital length of stay prediction methods: a systematic review. Med Care. 2021;59(10):929–38.

  17. Sridhar S, Whitaker B, Mouat-Hunter A, McCrory B. Predicting Length of Stay using machine learning for total joint replacements performed at a rural community hospital. PLoS ONE. 2022;17(11);e0277479.

  18. CCS (Clinical Classifications Software) - Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CCS/index.html. Accessed 13 Jan 2022.

  19. Sotoodeh M, Ho JC. Improving length of stay prediction using a hidden Markov model. AMIA Summits on Translational Science Proceedings. 2019;2019:425.

  20. Ma F, Yu L, Ye L, Yao DD, Zhuang W. Length-of-stay prediction for pediatric patients with respiratory diseases using decision tree methods. IEEE J Biomed Health Inform. 2020;24(9):2651–62.

  21. Burn E, et al. Trends and determinants of length of stay and hospital reimbursement following knee and hip replacement: evidence from linked primary care and NHS hospital records from 1997 to 2014. BMJ Open. 2018;8(1);e019146.

  22. Hachesu PR, Ahmadi M, Alizadeh S, Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare informatics research. 2013;19(2):121–9.

  23. Garcia AE, et al. Patient variables which may predict length of stay and hospital costs in elderly patients with hip fracture. J Orthop Trauma. 2012;26(11):620–3.

  24. Vekaria B, et al. Hospital length of stay for COVID-19 patients: Data-driven methods for forward planning. BMC Infect Dis. 2021;21(1):1–15.

  25. Arjannikov T, Tzanetakis G. An empirical investigation of PU learning for predicting length of stay. In 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). 2021: IEEE, pp. 41–47.

  26. Gupta D, Vashi PG, Lammersfeld CA, Braun DP. Role of nutritional status in predicting the length of stay in cancer: a systematic review of the epidemiological literature. Ann Nutr Metab. 2011;59(2–4):96–106.

  27. Almashrafi A, Elmontsri M, Aylin P. Systematic review of factors influencing length of stay in ICU after adult cardiac surgery. BMC Health Serv Res. 2016;16(1):318.

  28. Kalgotra P, Sharda R. When will I get out of the hospital? Modeling Length of Stay using Comorbidity Networks. J Manag Inf Syst. 2021;38(4):1150–84.

  29. Awad A, Bader-El-Den M, McNicholas J. Patient length of stay and mortality prediction: a survey. Health Serv Manage Res. 2017;30(2):105–20.

  30. Editorial-Board. The Lancet, HCL and Trump. Wall Street J. 2020. https://www.wsj.com/articles/the-lancet-hcl-and-trump-11591226880. Accessed 28 June 2024.

  31. Servick  K, Enserink M. A mysterious company’s coronavirus papers in top medical journals may be unraveling. Science. 2020. https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-top-medical-journals-may-be-unraveling. Accessed 28 June 2024.

  32. Gabler E, Rabin RC. The Doctor Behind the Disputed Covid Data. New York Times. 2020. https://www.nytimes.com/2020/07/27/science/coronavirus-retracted-studies-data.html. Accessed 28 June 2024.

  33. Lancet Editors. Expression of concern: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. Lancet. 2020;395(10240). https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-topmedical-journals-may-be-unraveling. Accessed 28 June 2024.

  34. Editorial-Board. Expression of Concern: Mehra MR et al. Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. 2020. https://www.nejm.org/doi/full/10.1056/NEJMoa2007621. Accessed 28 June 2024.

  35. Hopkins JS, Gold R. Authors Retract Studies That Found Risks of Using Antimalaria Drugs Against Covid-19. Wall Street J. 2020. https://www.wsj.com/articles/authors-retract-study-that-found-risks-of-using-antimalaria-drug-against-covid-19-11591299329. Accessed 28 June 2024.

  36. https://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736(20)31180-6.pdf. Accessed 9 Jan 2022.

  37. Wolfensberger M, Wrigley A. Trust in Medicine. Cambridge University Press. 2019. ISBN-13: 978-1108487191.

  38. Bhattacharya J, Nicholson T. A Deceptive Covid Study, Unmasked. Wall Street J. 2022. https://www.wsj.com/articles/deceptive-covid-study-unmasked-abc-misleading-omicron-north-carolina-students-duke-mask-test-to-stay-11641933613. Accessed 28 June 2024.

  39. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.

  40. Begley CG, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116–26.

  41. Eisner D. Reproducibility of science: Fraud, impact factors and carelessness. J Mol Cell Cardiol. 2018;114:364–8.

  42. Wang F, Kaushal R, Khullar D. Should health care demand interpretable artificial intelligence or accept “black box” medicine? Am College Phys. 2020;172:59–60.

  43. Reyes M, et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Art Intell. 2020;2(3):e190043.

  44. Savadjiev P, et al. Demystification of AI-driven medical image interpretation: past, present and future. Eur Radiol. 2019;29(3):1616–24.

  45. McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.; 2012.

  46. Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Machine Learn Res. 2011;12:2825–30.

  47. Cass S. The top programming languages: Our latest rankings put Python on top-again-[Careers]. IEEE Spectr. 2020;57(8):22–22.

  48. Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Transactions on Neural Networks and Learning Systems. 2020.

  49. https://www.health.ny.gov/statistics/sparcs/docs/sparcs_data_dictionary.xlsx. Accessed 28 June 2024.

  50. Design and development of the Diagnosis Related Group (DRG). https://www.cms.gov/icd10m/version37-fullcode-cms/fullcode_cms/Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf. Accessed 5 Oct 2022.

  51. ARTICLE 28, Hospitals, Public Health (PBH) CHAPTER 45. 2023. Available: https://www.nysenate.gov/legislation/laws/PBH/A28. Accessed 28 June 2024.

  52. Gilmore‐Bykovskyi A, et al. Disparities in 30‐day readmission rates among Medicare enrollees with dementia. J Am Geriatr Soc. 2023.

  53. Rodríguez P, Bautista MA, Gonzalez J, Escalera S. Beyond one-hot encoding: Lower dimensional target embedding. Image Vis Comput. 2018;75:21–31.

  54. Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis. 6th ed. John Wiley & Sons; 2021. ISBN-13 978-1119578727.

  55. Random forest regressor in sklearn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html. Accessed 28 June 2024.

  56. Breiman L. Random forests. Mach Learn. 2001;45:5–32.

  57. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.

  58. Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.

  59. Böhning D. Multinomial logistic regression algorithm. Ann Inst Stat Math. 1992;44(1):197–200.

  60. Vaid A, et al. Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation. J Med Internet Res. 2020;22(11);e24018.

  61. Density Estimation. https://scikit-learn.org/stable/modules/density.html. Accessed 5 Oct 2022.

  62. CatBoost, a high-performance open source library for gradient boosting on decision trees. Available: https://catboost.ai/ and https://catboost.ai/en/docs/concepts/python-usages-examples. Accessed 28 June 2024.

  63. PyTorch documentation for torch.nn, the basic building blocks for graphs. Available: https://pytorch.org/docs/stable/nn.html. Accessed 28 June 2024.

  64. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

  65. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. arXiv preprint arXiv:1706.09516. 2017.

  66. Tharwat A. Classification assessment methods. Applied computing and informatics. 2020;17(1):168–92.

  67. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.

  68. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–45.

  69. Baeza FL, da Rocha NS, Fleck MP. Predictors of length of stay in an acute psychiatric inpatient facility in a general hospital: a prospective study. Brazilian Journal of Psychiatry. 2017;40:89–96.

  70. Bertsimas D, et al. Algorithmic prediction of health-care costs. Oper Res. 2008;56(6):1382–92.

  71. Kshirsagar R. Accurate and interpretable machine learning for transparent pricing of health insurance plans. Presented at the AAAI 2021 Conference; 2021.

  72. Ulmer J, Painter-Davis N, Tinik L. Disproportional imprisonment of Black and Hispanic males: Sentencing discretion, processing outcomes, and policy structures. Justice Q. 2016;33(4):642–81.

  73. Angwin J, Larson J, Mattu S, Kirchner L. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. 2016.

  74. Steil JP, Albright L, Rugh JS, Massey DS. The social structure of mortgage discrimination. Hous Stud. 2018;33(5):759–76.

  75. Cots F, Mercadé L, Castells X, Salvador X. Relationship between hospital structural level and length of stay outliers: Implications for hospital payment systems. Health Policy. 2004;68(2):159–68.

  76. Evans M, McGinty T. Hospital Prices Are Arbitrary. Just Look at the Kingsburys’ $100,000 Bill. Wall Street J. 2021. https://www.wsj.com/articles/hospital-prices-arbitrary-healthcare-medical-bills-insurance-11635428943. Accessed 28 June 2024.

  77. Evans M. Hospitals Often Charge Uninsured People the Highest Prices, New Data Show. Wall Street J. 2021. https://www.wsj.com/articles/hospitals-often-charge-uninsured-people-the-highest-prices-new-data-show-11625584448. Accessed 28 June 2024.

  78. Kullgren JT, et al. A survey of Americans with high-deductible health plans identifies opportunities to enhance consumer behaviors. Health Aff. 2019;38(3):416–24.

  79. Wetsman N. Hospitals are selling treasure troves of medical data — what could go wrong? The Verge. 2021. Available: https://www.theverge.com/2021/6/23/22547397/medical-records-health-data-hospitals-research. Accessed 28 June 2024.

  80. Hripcsak G, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–8.

  81. Gabarron E, Dorronzoro E, Rivera-Romero O, Wynn R. Diabetes on Twitter: a sentiment analysis. J Diabetes Sci Technol. 2019;13(3):439–44.

  82. Statt N. Twitter is opening up its full tweet archive to academic researchers for free. The Verge. 2021. Available: https://www.theverge.com/2021/1/26/22250203/twitter-academic-research-public-tweet-archive-free-access. Accessed 28 June 2024. 

  83. Evans M, Mathews AW, McGinty T. Hospitals Still Not Fully Complying With Federal Price-Disclosure Rules. Wall Street J. 2021. https://www.wsj.com/articles/hospital-price-public-biden-11640882507.

  84. Johnson AE, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3(1):1–9.

Acknowledgements

We are grateful to the New York State SPARCS program for making the data available freely to the public. We greatly appreciate the feedback provided by the anonymous reviewers which helped in improving the quality of this manuscript.

Funding

No external funding was available for this research.

Author information

Authors and Affiliations

Authors

Contributions

Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg contributed equally to all stages of preparation of the manuscript.

Corresponding author

Correspondence to A. Ravishankar Rao.

Ethics declarations

Ethics approval and consent to participate

Not applicable as no human subjects were used in our study.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Jain, R., Singh, M., Rao, A.R. et al. Predicting hospital length of stay using machine learning on a large open health dataset. BMC Health Serv Res 24, 860 (2024). https://doi.org/10.1186/s12913-024-11238-y

Keywords