Comparison of hospital charge prediction models for gastric cancer patients: neural network vs. decision tree models
© Wang et al; licensee BioMed Central Ltd. 2009
Received: 11 December 2008
Accepted: 14 September 2009
Published: 14 September 2009
In recent years, artificial neural networks (ANN) have been advocated for modeling complex multivariable relationships because of their fault tolerance, while decision trees, a data mining technique, have been recommended for their rich classification rules and ease of visualization. The aim of our research was to compare the performance of ANN and decision tree models in predicting hospital charges for gastric cancer patients.
Data on hospital charges for 1008 gastric cancer patients, together with related demographic information, were collected from the First Affiliated Hospital of Anhui Medical University for 2005 to 2007 and were first preprocessed to select pertinent input variables. ANN and decision tree models, using the same hospital charge output variable and the same input variables, were then compared in terms of mean absolute error and linear correlation coefficient on the training and test datasets. The ANN model used a sigmoid transfer function with one hidden layer and three hidden nodes.
After preprocessing, 12 variables were selected and used as input variables in both types of model. For both the training and the test datasets, the mean absolute errors of the ANN model were lower than those of the decision tree model (1819.197 vs. 2782.423 and 1162.279 vs. 3424.608, respectively) and the linear correlation coefficients of the former were higher than those of the latter (0.955 vs. 0.866 and 0.987 vs. 0.806, respectively). The predictive ability and adaptive capacity of the ANN model were better than those of the decision tree model.
The ANN model performed better than the decision tree model in predicting hospital charges of gastric cancer patients in China.
Gastric cancer is a leading cause of cancer death. Ninety-five percent of gastric cancers are adenocarcinomas, derived from the epithelium. The incidence of gastric carcinoma varies substantially between countries, and it is among the most common cancers in China. With changing diet and lifestyle and the development of new health policies, more and more gastric cancer patients are being discovered in routine medical check-ups. Consequently, the total hospital charge for gastric cancer patients and its composition are changing. It is therefore important to develop appropriate methodologies to model and predict hospital charges for gastric cancer patients and their relations with other factors.
Standard regression methods have commonly been applied to predict hospital charges in previous studies [3–5]. However, hospital charge data are characterized by a non-normal distribution, many related factors and substantial interactions between those factors. These properties undermine the assumptions of standard regression analysis. In recent years, the use of ANN models for mining complex information in medical fields has been increasing. ANN is advantageous in that it has no particular requirements about data distribution and is fault tolerant. These properties make ANN well suited to dealing with complex multivariable relationships. Examples of such applications include identification of prostate cancer, prediction of risk factors for coronary heart disease, determination of drug dosage and so on. Some researchers have found ANN models to be more accurate in predicting hospital charges [9–12]. Although the "black box" nature of ANN makes modeled results difficult to interpret, we retrieved 28 articles from the literature documenting the application of ANN in the medical field. Comparing the predictive performance of ANN models and standard regression methods, these studies all concluded that ANN models are superior to standard regression methods.
Decision trees, a data mining technique, have been recommended by some researchers for their rich classification rules and ease of visualization. In one study, Seung-Mi Lee compared the predictive performance of ANN and decision tree models for hospital charges of colon cancer patients. In other studies [15–18], researchers compared these two kinds of methods in other fields. The findings of these comparisons were mixed, and researchers have not reached agreement on whether the ANN model is superior to the decision tree model or vice versa.
This study aims to apply ANN and decision tree models to predict hospital charges for gastric cancer patients and to compare their predictive abilities, so as to shed new light on methodology for predicting hospital charges for gastric cancer patients.
1. Human rights protection
Our research complied with the Declaration of Helsinki and was approved by the ethics committee of Anhui Medical University.
2. Data preprocessing
First, we performed variable selection with the help of medical experts and administrators of the hospital. For example, the experts advised us that "type of operations" is certainly related to charges and that a patient's "dwelling area" may reflect his or her income, so we included these two variables. In addition, the original dataset contained variables with many null values. For example, it included 'emergency treatments _ 1' through to 'emergency treatment _10', used to record whether a patient had been rescued on up to ten occasions. However, a patient rarely receives many emergency treatments during one admission, which left many of these fields blank. We therefore created 'emergency treatments' to store the number of emergency treatments of each critical patient and 'being rescued' to store the number of successful emergency treatments. We also performed variable selection by univariate analyses, such as the t-test, chi-square test and analysis of variance, with an inclusion criterion of 0.05. As a result, 12 variables were selected: age, sex, marriage, dwelling area, operation or not, type of operations, chemotherapy or not, radiotherapy treatment or not, emergency treatments, times of being rescued, length of stay (LOS) and ways of payment.
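The collapsing of the ten sparse emergency-treatment fields into two counts can be sketched as follows. This is only an illustration: the encoding of each field (None for a blank field, True for a successful rescue, False otherwise) and the function name summarize_emergency are assumptions, not the layout of the original hospital records.

```python
from typing import Optional, Sequence, Tuple


def summarize_emergency(outcomes: Sequence[Optional[bool]]) -> Tuple[int, int]:
    """Collapse per-event fields into (emergency treatments, being rescued).

    Assumed encoding (hypothetical): None = blank field,
    True = successful emergency treatment, False = unsuccessful.
    """
    recorded = [o for o in outcomes if o is not None]
    emergency_treatments = len(recorded)            # number of treatments
    being_rescued = sum(1 for o in recorded if o)   # number of successes
    return emergency_treatments, being_rescued


# A hypothetical patient with two emergency treatments, one of them successful;
# the remaining eight fields are blank.
row = [True, False] + [None] * 8
print(summarize_emergency(row))
```

Two dense count variables replace ten mostly blank ones, which is the effect described in the paragraph above.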
3. Artificial neural networks (ANN)
Neural networks, with neurons as the basic building blocks, are computer systems that attempt to model the way the human brain works. A well-known type is the feed-forward back-propagation (BP) network, so called because the data used to train the network are presented at the first (input) layer and then fed forward through the hidden layer(s) to produce a response at the output layer. The connections between the artificial neurons are adjustable parameters with a sign and a magnitude, and training involves adjusting these connection strengths (weights) until the desired output (target) signals are produced. A common form of training starts the network with random values for the connection weights, presents the data and calculates the output signals. The output values are then compared with the targets and the weights are adjusted backwards through the network to the input layer.
In this study, a BP network was used. The input variables were the twelve factors selected above and the output variable was the hospital charge for gastric cancer patients. The ANN model had one hidden layer with three hidden nodes and used the sigmoid transfer function f(x) = 1/[1 + exp(-x)].
The sensitivity was calculated as follows. First, the output value Y was rescaled to the range 0 to 1. Then, for a given variable X, the value of X in each case in the sample was changed in turn (if the variable was categorical, all the possible category combinations were tested; if it was numerical, the minimum, lower quartile, median, upper quartile and maximum values were tested). The system recorded the maximum output Ymax and the minimum output Ymin for that case and calculated the proportion (Ymax-Ymin)/Ymax. The sensitivity of variable X was the mean of these proportions over all cases. The accuracy was equal to Σ(1-|ti-oi|)/n, in which ti and oi are the measured and predicted values of the ith case, respectively, and n is the sample size.
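A forward pass through a network of this shape (12 inputs, one hidden layer of 3 nodes, 1 output) can be sketched as below. This is an illustrative sketch, not the Clementine implementation: the random weights, the input vector and the use of a sigmoid on the output node are assumptions, and no training step is shown.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Feed the inputs forward through one hidden layer to a single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Assumption: the output node also uses the sigmoid, so the
    # prediction lies in (0, 1) like the rescaled charge.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_inputs, n_hidden = 12, 3      # 12 input variables, 3 hidden nodes

# Random starting weights, as in the usual start of BP training.
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
            for _ in range(n_hidden)]
b_hidden = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = rng.uniform(-0.5, 0.5)

x = [rng.random() for _ in range(n_inputs)]   # one hypothetical scaled case
y = forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)
```

Training would then compare such outputs with the targets and adjust the weights backwards through the network.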
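The sensitivity and accuracy definitions above can be written out as a short sketch. The toy model `predict`, the two cases and the probe values are hypothetical; outputs are assumed to be already rescaled to the range 0 to 1 and Ymax is assumed to be positive.

```python
def accuracy(targets, outputs):
    """Accuracy = sum(1 - |t_i - o_i|) / n on values scaled to [0, 1]."""
    n = len(targets)
    return sum(1 - abs(t - o) for t, o in zip(targets, outputs)) / n


def sensitivity(predict, cases, index, test_values):
    """Mean over cases of (Ymax - Ymin) / Ymax when variable `index` is varied."""
    proportions = []
    for case in cases:
        outputs = []
        for value in test_values:
            probe = list(case)       # copy so the original case is untouched
            probe[index] = value     # change only the variable under study
            outputs.append(predict(probe))
        y_max, y_min = max(outputs), min(outputs)
        proportions.append((y_max - y_min) / y_max)
    return sum(proportions) / len(proportions)


def predict(x):
    """Hypothetical fitted model: depends on x[0] only, ignores x[1]."""
    return 0.2 + 0.6 * x[0]


cases = [[0.1, 0.3], [0.9, 0.7]]
# For a numerical variable: minimum, quartiles, median and maximum.
probe_values = [0.0, 0.25, 0.5, 0.75, 1.0]

print(sensitivity(predict, cases, 0, probe_values))
print(sensitivity(predict, cases, 1, probe_values))
print(accuracy([0.5, 0.2], [0.4, 0.3]))
```

In this toy setting the variable that drives the output receives a high sensitivity and the ignored variable receives zero, which is how the measure ranks input importance.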
4. Decision tree
A decision tree is a tree structure similar to a flow chart, in which each internal node represents a test on an input attribute, each branch represents an outcome of the test, and each leaf node represents a category or a distribution of categories. The tree is built from top to bottom by recursive partitioning. Once a decision tree has been built, a new sample is classified by following the category rules from the root node to a leaf node according to the values of its attributes. The classification and regression tree algorithm is commonly used. Rules are directly observable through decision tree induction; that is to say, the classification rules for the hospital charge of gastric cancer patients can be read from the decision tree.
The data were split into training (70% of the sample, 706 cases) and test (30% of the sample, 302 cases) datasets by stratified sampling. The hospital charge was used as the output variable, with the 12 variables (age, sex, etc.) as input variables. The fit of the BP ANN model and the classification and regression tree (C&T) decision tree model was analyzed using SPSS Clementine 11.0. Mean absolute error and linear correlation coefficient were used to evaluate and compare the goodness of fit. Moreover, sensitivity analysis of the fitted BP ANN model was performed on the input variables in order to estimate their relative importance. The larger the sensitivity of a variable, the greater its importance for the hospital charge.
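One step of recursive partitioning for a numerical output can be sketched as a search for the threshold on a single variable that minimizes the summed squared error of the two child nodes. The least-squares criterion is an assumption based on the usual regression-tree formulation, and the LOS and charge values are invented for illustration; a full tree would repeat this search within each child node.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total SSE) of the best binary split on one variable.

    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(xs, ys))
    best_threshold, best_total = None, sse(ys)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total


# Hypothetical cases: length of stay in days and hospital charge in RMB.
los = [2, 3, 4, 9, 10, 12]
charge = [1000, 1200, 1100, 9000, 9500, 9800]
threshold, total = best_split(los, charge)
print(threshold, total)
```

Taking midpoints between adjacent observed values is one way thresholds such as half-day or half-year cut points can arise in tree rules.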
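The two evaluation measures named above, mean absolute error and the linear (Pearson) correlation coefficient, can be computed as in this minimal sketch. The measured and predicted charges are made-up numbers used only to exercise the functions.

```python
import math


def mean_absolute_error(targets, predictions):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient between two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical measured and predicted hospital charges (RMB).
measured = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(mean_absolute_error(measured, predicted))
print(pearson_r(measured, predicted))
```

A lower mean absolute error and a correlation coefficient closer to 1 both indicate a better fit, which is the basis of the model comparison in this study.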
The dataset was derived from the digitized records of gastric cancer patients who had been treated in the First Affiliated Hospital of Anhui Medical University from January 1, 2005 to December 31, 2007. A total of 1008 patients met our selection criteria, i.e. a diagnosis of gastric cancer by either radiography or endoscopy. They comprised 405 men and 603 women, aged from 22 to 85 years with an average age of 56.75 years. Their average length of stay was 11.36 days, ranging from 1 to 51 days. Among them, 20 patients were married and the other 988 unmarried. Patients from counties, suburban areas and urban areas numbered 230, 478 and 300, respectively. Thirteen patients were allergic to penicillin and 3 patients were allergic to procaine. A total of 573 patients were treated with various types of operations, including subtotal gastrectomy, total gastrectomy, gastrectomy with extended lymphadenectomy and palliative operations. A total of 368 patients received chemotherapy and 8 patients underwent radiotherapy. Three patients had been rescued twice with success. Five patients had been rescued once, of whom 2 survived. In all, 190 patients paid the charges themselves and the charges of 818 patients were paid by public health insurance.
The median hospital charge for these 1008 gastric cancer patients was 13803.84 RMB, with a lowest charge of 149 RMB and a highest charge of 70606.3 RMB. The total charge consisted of medicine, operation, treatment, bed and other charges, of which the medicine charge accounted for the highest proportion, followed by the treatment charge.
1. Univariate analyses
Results of univariate analyses on the total charge
11957.63 ± 8733.77
13696.08 ± 10388.73
13066.63 ± 10277.53
12372.86 ± 9520.96
12473.81 ± 9708.01
13361.71 ± 10629.09
19806.29 ± 5137.35
11384.77 ± 9114.22
14022.70 ± 9665.03
3292.30 ± 2137.66
19666.21 ± 6765.87
14268.60 ± 10227.06
19526.35 ± 6246.40
9759.44 ± 7930.49
19954.17 ± 5404.27
12693.31 ± 9688.01
22856.43 ± 2105.23
2833.26 ± 11.823
13302.72 ± 2483.79
20703.42 ± 12433.20
2. ANN model
After fitting the ANN model, the factors were found to differ in their importance for the hospital charge of gastric cancer patients. Based on the sensitivity analysis, the five most important factors were LOS (0.248396), followed by operation or not (0.196829), emergency treatments (0.176399), type of operations (0.163112) and times of being rescued (0.141685); the numbers in brackets are sensitivities. The linear correlation coefficient and accuracy of the ANN model on the test dataset were 0.987 and 98.35% respectively, which were greater than those on the training dataset (0.955 and 97.418%). The mean absolute error on the test dataset was 1162.279, smaller than the 1819.197 on the training dataset. All these comparisons showed that the predictive ability of the ANN model for the hospital charge of gastric cancer patients was better than its adaptive capacity.
3. Decision tree model
Classification rules of decision tree of the hospital charge on gastric cancer patients
without operation and chemotherapy, LOS ≤ 3.5 days
without operation and chemotherapy, 3.5 < LOS ≤ 5.5 days, age ≤ 42.5 years
without operation and chemotherapy, LOS > 5.5 days, age ≤ 42.5 years
without operation and chemotherapy, age > 42.5 years, without radiotherapy
without operation and chemotherapy, age > 42.5 years, with radiotherapy
without operation, with chemotherapy, LOS ≤ 8.5 days, age ≤ 59.5 years
without operation, with chemotherapy, LOS ≤ 8.5 days, 59.5 < age ≤ 70.5 years
without operation, with chemotherapy, LOS ≤ 8.5 days, age > 70.5 years
without operation, with chemotherapy, LOS > 8.5 days
4. Comparison of the two predictive models
Comparison of ANN model with decision tree model of the hospital charge on gastric cancer patients
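Mean absolute error: ANN 1819.197 (training), 1162.279 (test); decision tree 2782.423 (training), 3424.608 (test)
Linear correlation coefficient: ANN 0.955 (training), 0.987 (test); decision tree 0.866 (training), 0.806 (test)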
In our study, the ANN and decision tree models were compared on two datasets in terms of mean absolute error and linear correlation coefficient. For both the training and the test datasets, the mean absolute errors of the ANN model were lower than those of the decision tree model and the linear correlation coefficients of the former were higher than those of the latter. The predictive ability and adaptive capacity of the ANN model were better than those of the decision tree model.
Seung-Mi Lee conducted a similar study, in which the predictive performance of ANN and decision tree models for the hospital charge of colon cancer patients in Korea was compared. Lee compared the two models using training and test datasets generated from two groups of patients with different payment schemes, i.e. payment by the patients themselves and payment by public insurance. Lee's study revealed that the predictive performance of the ANN model was superior to that of the decision tree model for patients whose hospital charge was paid by public insurance, but that it was difficult to determine which model was better for patients who paid the hospital charge themselves. We think the reason may be as follows: if hospital charges are covered by public insurance, treatment can follow the needs arising from the progress of the disease and thus lead to more "reasonable" hospital charges; on the contrary, if hospital charges are paid by the patients themselves, treatment may be influenced by subjective factors of the patients and thus lead to an uneven composition of hospital charges. Despite this, Lee still concluded that the predictive performance of the ANN model for the hospital charge of colon cancer patients was superior to that of the decision tree model.
In our study, 18.85% of the patients paid the charges themselves and 81.15% were covered by public insurance. The payment system was not found to have a significant effect on the hospital charges of gastric cancer patients in the fit of the two models. This is inconsistent with the report by Lim JH. We therefore did not compare the two models for these two groups of patients.
The results of fitting the ANN model showed that length of stay was the most important factor for the hospital charge of gastric cancer patients. One explanation may be that inpatients' medication is not interrupted and the bed charge is unavoidable; at the same time, medicine and treatment charges were the major components of the hospital charge. Furthermore, operation, emergency treatments, type of operation and times of being rescued were also important factors for the hospital charge of gastric cancer patients.
Although the predictive performance of the decision tree model was found to be inferior to that of the ANN model in our study, the method provides directly visible classification rules. Acting on these classification rules, the hospital charge of gastric cancer patients could be more easily controlled so as to avoid inappropriate services.
Additionally, we should consider that, as a "black box", the BP ANN may hide the effects of possible interactions, which is a limitation of this method.
The gastric cancer patients in this study were drawn from only one hospital in Anhui province of China. If information on the hospital charges of gastric cancer patients were collected from every region of China, the results would be more reliable. Moreover, given the algorithmic characteristics of ANN and decision tree models, the choice of predictive model could depend on the research emphasis, or the two models could be used in combination.
The ANN model performed better than the decision tree model in predicting hospital charges of gastric cancer patients in China.
This research was supported by the Humane and Social Sciences Research Fund, Education Department of Anhui Province (Reference ID: 2009sk192zd). We also thank Dr. De bin Wang for his help in revising our manuscript.
- Ferlay J, Bray F, Pisani P, Parkin DM: Cancer incidence, mortality and prevalence worldwide. 2001, version 1.0. IARC Cancer Base No 5. Lyon: IARC Press
- Townsend CM, Beauchamp RD, Evers BM, Mattox KL: Sabiston Textbook of Surgery. 2008, Saunders, an imprint of Elsevier, Philadelphia, 18th edition
- Beekmann SE, Diekema DJ, Chapin KC, Doern GV: Effects of rapid detection of bloodstream infections on length of hospitalization and hospital charges. J Clin Microbiol. 2003, 41: 3119-3125. 10.1128/JCM.41.7.3119-3125.2003.
- Chang KC, Tseng MC: Costs of acute care of first-ever ischemic stroke in Taiwan. Stroke. 2003, 34: e219-221. 10.1161/01.STR.0000095565.12945.18.
- Rosenman M, Madsen K, Hui S, Breitfeld PP: Modeling administrative outcomes in fever and neutropenia: clinical variables significantly influence length of stay and hospital charges. J Pediatr Hematol Oncol. 2002, 24: 263-268. 10.1097/00043426-200205000-00009.
- Patel JL, Goyal RK: Applications of artificial neural networks in medical science. Curr Clin Pharmacol. 2007, 2: 217-226. 10.2174/157488407781668811.
- Kelly DG: Stability in contractive nonlinear neural networks. IEEE Trans Biomed Eng. 1990, 37: 231-242. 10.1109/10.52325.
- Dayhoff JE, DeLeo JM: Artificial neural networks: opening the black box. Cancer. 2001, 91 (8 Suppl): 1615-1635. 10.1002/1097-0142(20010415)91:8+<1615::AID-CNCR1175>3.0.CO;2-L.
- Marshall AH, McClean SI, Shapcott CM, Millard PH: Modelling patient duration of stay to facilitate resource management of geriatric hospitals. Health Care Manag Sci. 2002, 5: 313-319. 10.1023/A:1020394525938.
- Walczak S, Scharf JE: Transfusion cost containment for abdominal surgery with neural networks. Neural Processing Letters. 2000, 11: 229-238. 10.1023/A:1009667711423.
- Burke HB, Goodman PH, Rosen DB, Henson DE, Weinstein JN, Harrell FE, Marks JR, Winchester DP, Bostwick DG: Artificial neural networks improve the accuracy of cancer survival prediction. Cancer. 1997, 79: 857-862. 10.1002/(SICI)1097-0142(19970215)79:4<857::AID-CNCR24>3.0.CO;2-Y.
- Stephan C, Büker N, Cammann H: Artificial neural network (ANN) velocity better identifies benign prostatic hyperplasia but not prostate cancer compared with PSA velocity. BMC Urology. 2008, 8: 10. 10.1186/1471-2490-8-10.
- Sargent DJ: Comparison of artificial neural networks with other statistical approaches: results from medical data sets. Cancer. 2001, 91 (Suppl 8): 1636-1642. 10.1002/1097-0142(20010415)91:8+<1636::AID-CNCR1176>3.0.CO;2-D.
- Lee Seung-Mi, Kang Jin-Oh, Suh Yong-Moo: Comparison of hospital charge prediction models for colorectal cancer patients: neural network vs. decision tree models. J Korean Med Sci. 2004, 19: 677-681.
- Biddiss EA, Chau TT: Multivariate prediction of upper limb prosthesis acceptance or rejection. Disabil Rehabil Assist Technol. 2008, 10: 1-12.
- Gortzis LG, Sakellaropoulos F, Ilias I, Stamoulis K, Dimopoulou I: Predicting ICU survival: a meta-level approach. BMC Health Serv Res. 2008, 8: 157. 10.1186/1472-6963-8-157.
- Wegrzyn JL, Drudge TM, Valafar F, Hook V: Bioinformatic analyses of mammalian 5'-UTR sequence properties of mRNAs predicts alternative translation initiation sites. BMC Bioinformatics. 2008, 9: 232. 10.1186/1471-2105-9-232.
- Moseley LG, Mead DM: Predicting who will drop out of nursing courses: a machine learning exercise. Nurse Educ Today. 2008, 28: 469-475. 10.1016/j.nedt.2007.07.012.
- Williams JR: The Declaration of Helsinki and public health. Bulletin of the World Health Organization. 2008, 86: 650-651. 10.2471/BLT.08.050955.
- Lim JH, Choi KS, Kim SG, Park EC, Park JH: Effects of private health insurance on health care utilization and expenditures in Korean cancer patients: focused on 5 major cancers in one cancer center. J Prev Med Pub Health. 2007, 40: 329-335. 10.3961/jpmph.2007.40.4.329.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6963/9/161/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.