Comparing efficiency of health systems across industrialized countries: a panel analysis

Background Rankings from the World Health Organization (WHO) place the US health care system as one of the least efficient among Organization for Economic Cooperation and Development (OECD) countries. Researchers have questioned this, noting simplistic or inappropriate methodologies, poor measurement choice, and poor control variables. Our objective is to re-visit this question by using newer modeling techniques and a large panel of OECD data. Methods We primarily use the OECD Health Data for 25 OECD countries. We compare results from stochastic frontier analysis (SFA) and fixed effects models. We estimate total life expectancy as well as life expectancy at age 60. We explore a combination of control variables reflecting health care resources, health behaviors, and economic and environmental factors. Results The US never ranks higher than fifth out of all 36 models, but is also never the very last ranked country though it was close in several models. The SFA estimation approach produces the most consistent lead country, but the remaining countries did not maintain a steady rank. Discussion Our study sheds light on the fragility of health system rankings by using a large panel and applying the latest efficiency modeling techniques. The rankings are not robust to different statistical approaches, nor to variable inclusion decisions. Conclusions Future international comparisons should employ a range of methodologies to generate a more nuanced portrait of health care system efficiency.


Background
As key provisions of the 2010 Patient Protection and Affordable Care Act (ACA) roll out in 2014, researchers and policymakers will be asking whether the US health care system is gaining efficiency. The ACA will increase insurance coverage, which may increase the workload of an already overburdened primary care workforce [1][2][3]. Simultaneous adoption of health information technology (that is, the use of computers to store, retrieve, share and use health care information, data and knowledge), spurred by the 2009 Health Information Technology for Economic and Clinical Health (HITECH) Act, has been reported to add to workforce burdens, although the long-run goal is to reduce administrative waste and duplicative and unnecessary services [4][5][6]. As these major changes take effect, efficiency may be an elusive goal in the near term for the US health care system. But even more elusive may be finding a methodology to monitor the efficiency of the US health care system.
The World Health Organization (WHO) published one of the most recognized rankings of health care system performance in the 2000 report titled Health Systems: Improving Performance [7]. The US placed 37th out of 191, behind Costa Rica, on overall health system performance. The report has been criticized for its objectives, confounding of social influences with health care system performance, poor data quality, and narrow scope in methodology [8][9][10]. Some of these critics re-estimated efficiency and rankings using different approaches, which generally led to different rankings [9,11,12]. The US tended to rank higher in these newer studies, although none placed the US at the top.
Since the 2000 WHO report, several studies have focused on the efficiency of the health care systems of industrialized countries [13][14][15][16][17][18]. These studies vary in their choice of output measures and statistical approaches. The most common, although not universal, conclusion is that the US still does not fare well compared to other member countries of the Organization for Economic Cooperation and Development (OECD).
Researchers continue to question whether the methodologies employed in all these aforementioned studies on efficiency and rankings are strong enough to make valid conclusions, and to what extent the limitations impact the conclusions [19,20]. In this paper, we re-visit the question of whether the US health care system is efficient relative to other industrialized health care systems. Our research goals are to apply the recommendations from past studies and capitalize on advancements in modeling techniques to produce new results. We discuss the sensitivity of our results to variations of the model specifications, and how these variations affect the ranking of the US.
In the next section we discuss the dataset used for our analysis and our approach to choosing the variables in our model. We then present the statistical approaches used in the past, and the more recently available approaches, to rank countries. We compare the results of the various permutations of input and output variables along with statistical approaches, and discuss the implications of our results in the Discussion section.

Data
We primarily use data from the 2013 OECD Health Data, which is a comprehensive health systems dataset on 34 members of the OECD [21]. Although some data are available as far back as 1960, for our variables of interest, the most complete set of data runs from 1990 through 2010 (21 years). All data used in this study are available for public use. This study is not human subject research, and is not subject to Institutional Review Board review. The set of countries included in our analyses varied due to the choice of variables and estimation approaches. Our final set of countries includes 25 out of the 34 countries, and excludes Canada, Chile, France, Greece, Ireland, Netherlands, Portugal, Switzerland, and Turkey, due to missing data. Some of our country time series contain gaps of varying length. All international comparisons face this major limitation given that not all countries report all variables for all years; thus we have an unbalanced panel. We did not impute values for missing data. Instead, we dropped observations with missing data, a choice we considered in our sensitivity analyses.
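The handling of missing data described above (no imputation, incomplete country-years dropped) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the variable names `life_exp` and `physicians`, and all values are hypothetical.

```python
def drop_incomplete(rows, variables):
    """Keep only country-year rows in which every listed variable is reported."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


def years_per_country(rows):
    """Count the remaining observations per country; unequal counts mean an unbalanced panel."""
    counts = {}
    for r in rows:
        counts[r["country"]] = counts.get(r["country"], 0) + 1
    return counts


# Hypothetical panel: None marks a value the country did not report that year.
panel = [
    {"country": "A", "year": 1990, "life_exp": 75.1, "physicians": 2.1},
    {"country": "A", "year": 1991, "life_exp": 75.4, "physicians": None},
    {"country": "A", "year": 1992, "life_exp": 75.6, "physicians": 2.3},
    {"country": "B", "year": 1990, "life_exp": 77.0, "physicians": 3.0},
    {"country": "B", "year": 1991, "life_exp": None, "physicians": 3.1},
    {"country": "C", "year": 1990, "life_exp": None, "physicians": None},
]

complete = drop_incomplete(panel, ["life_exp", "physicians"])
counts = years_per_country(complete)
```

In this toy panel, country C drops out entirely and countries A and B keep different numbers of years, which is how adding a sparsely reported variable can shrink both the set of countries and the years available for ranking.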
Our output measure is life expectancy in order to match the approach used in other health system ranking studies [14,15,18]. We estimated three output measures: 1) total life expectancy, 2) life expectancy at age 60 for males, and 3) life expectancy at age 60 for females. Total life expectancy by gender was not available, nor was a combined gender measure of life expectancy at age sixty. A limitation of total life expectancy is that causes of death in the earlier years are more strongly related to social differences than to health care (e.g. drug overdose, accidental injuries, perinatal conditions, and homicide) [22]. Given this limitation, we also included life expectancy at age sixty (for females and males) as an output measure since the health care system plays an increasingly important role at later stages in life and this care may have an impact on the remaining life expectancy.
We tested numerous combinations of 33 input variables based on theory and empirical evidence, estimating each of our three output measures: total life expectancy, life expectancy at age 60 for males, and life expectancy at age 60 for females. Generally, we tested metrics for health care resources, service utilization, lifestyle and risk behaviors, environmental factors, and economic factors. Although the OECD Health Data includes a wide range of potential input variables, our final choice of variables was in part determined by availability of data and frequency of data reporting (Table 1). Our final models included a combination of the following 11 input variables: physicians per 1,000 capita (for which we manually inserted two recent data points for the US from OECD cited data sources), magnetic resonance imaging (MRI) and computed tomography (CT) scanners per million population, years of potential life lost (YPLL) from self-harm, homicide, transportation, and alcohol use per 100,000 capita, calories consumed per capita per day (adjusted for average population height in centimeters), percent of the population fifteen years and older who smoked daily, cubic tons of total nitrogen dioxide (adjusted for square meter of land), and gross domestic product (GDP) per capita (in purchasing power parities and deflated to 2005 dollars) [23][24][25]. We adjusted calories for height and nitrogen dioxide for land mass in an effort to provide more meaningful comparisons across countries. We tested several other input variables in our sensitivity analyses, which we discuss in our Results section.

Estimation approach
We estimated technical efficiency of a health care system by its ability to maximize the life expectancy of its population using a minimum amount of health care and non-health care inputs. We compared three estimation approaches: 1) country fixed effects (FE) model, 2) country and time FE model, and 3) stochastic frontier analysis (SFA). We applied each estimation approach to predict each of the three output measures of life expectancy (i.e., total, at age 60 for males, and at age 60 for females) as a function of various permutations of the input variables described above. In this paper, we present the results from our final 36 models. FE models are common approaches for estimating technical efficiency of health care systems. To identify efficient countries, we compared the size of the country FE component of the residual. The most efficient country is defined as the country with the largest positive residuals. In other words, efficient countries have higher-than-predicted life expectancies given the same set of inputs compared to other countries. The least efficient country has the most negative residuals, or lower-than-predicted life expectancies. A series of F-tests suggested that FE was more appropriate than an ordinary least squares (OLS) approach. Breusch-Pagan Lagrange Multiplier tests suggested that random effects (RE) was also more appropriate than an OLS approach. Hausman test results support the use of FE over RE, although the variance matrix was not positive definite for these models because the Hausman test requires strict conditions for homoscedasticity, which our models most likely do not satisfy. To address heteroscedasticity, we used robust standard errors.
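To make the country FE ranking concrete, the sketch below estimates country effects with the within (demeaning) estimator and orders countries by the size of their effect. It is a deliberately simplified illustration under stated assumptions: a single input instead of the full variable sets, no time effects, no robust standard errors, and hypothetical data. The function names are invented for this example.

```python
from collections import defaultdict
from statistics import mean


def country_fixed_effects(obs):
    """obs: list of (country, x, y) tuples. Returns (slope, {country: effect}).

    Within estimator with one input: demean x and y by country, fit the slope
    on the demeaned data, then recover each country's intercept. Effects are
    centred on zero so they read as above- or below-average efficiency.
    """
    by_country = defaultdict(list)
    for country, x, y in obs:
        by_country[country].append((x, y))
    means = {
        c: (mean(x for x, _ in v), mean(y for _, y in v))
        for c, v in by_country.items()
    }
    sxy = sum((x - means[c][0]) * (y - means[c][1]) for c, x, y in obs)
    sxx = sum((x - means[c][0]) ** 2 for c, x, _ in obs)
    slope = sxy / sxx
    raw = {c: my - slope * mx for c, (mx, my) in means.items()}
    centre = mean(raw.values())
    return slope, {c: a - centre for c, a in raw.items()}


def rank_by_effect(effects):
    """Most efficient first: largest positive country effect ranks number one."""
    return sorted(effects, key=effects.get, reverse=True)


# Hypothetical data: x = an input (e.g. physicians per 1,000), y = life expectancy.
obs = [
    ("A", 2.0, 77.0), ("A", 3.0, 79.0), ("A", 4.0, 81.0),
    ("B", 3.0, 76.0), ("B", 4.0, 78.0), ("B", 5.0, 80.0),
    ("C", 1.0, 69.0), ("C", 2.0, 71.0), ("C", 3.0, 73.0),
]
slope, effects = country_fixed_effects(obs)
ranking = rank_by_effect(effects)
```

Country A reaches a higher life expectancy than country B at the same input level, so it receives the larger country effect and the higher rank, even though B uses more of the input.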
SFA has been used for efficiency comparisons in health care [26]. SFA has appeared only once in the peer-reviewed literature to compare efficiency of health care in OECD countries; however, the comparison focused on hospitals [27]. SFA is a parametric approach that imposes a functional form (i.e., Cobb-Douglas production function estimated as a log-linear model) on the relationship between the inputs and outputs [28]. We tried to estimate SFA models with maximum likelihood techniques [29], but after trying multiple versions of the model with different arrays of variables, we could not reach convergence. Instead, we used a less data-intensive least squares approach, which has been shown to provide reliable results [30,31].
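For reference, a Cobb-Douglas stochastic frontier in log-linear form is conventionally written as below. The notation is an assumption introduced here for illustration and does not come from the original text: $y_{it}$ is the life expectancy output of country $i$ in year $t$, $x_{kit}$ is input $k$, $\beta_k$ are the coefficients, $v_{it}$ is random noise, and $u_{it}$ is the inefficiency term.

```latex
\ln y_{it} = \beta_0 + \sum_{k=1}^{K} \beta_k \ln x_{kit} + v_{it} - u_{it},
\qquad u_{it} \ge 0
```

Because $u_{it}$ enters with a negative sign, a country on the frontier has $u_{it} = 0$, and any inefficiency pulls its observed output below what its inputs would predict.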
The SFA is often preferred over a non-parametric approach, data envelopment analysis (DEA), which is limited in the number of allowable inputs and does not separate out "noise" from the inefficiency term [32]. SFA assumes a random disturbance term that is normally distributed and a technical inefficiency term that has a strictly non-negative distribution. We assumed a half-normal distribution of the inefficiency term, which is a common yet narrow assumption. The choice in the distribution of the inefficiency term has been found to have a negligible influence on empirical results [12]. We allowed technical inefficiency to vary over time rather than stay constant as in the FE models. In SFA, the most efficient country has an inefficiency residual term that is equal to zero; all other countries with an inefficiency residual term less than zero are considered less efficient. For ranking, within a model, the country with the largest positive residual (or zero in the SFA model) is ranked number one. The remaining countries are ordered based on the distance of their residual from the number one ranked country; the larger the difference in residuals from the number one country, the lower the country rank.
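The ranking rule described above can be sketched in a few lines. This is an illustrative sketch with hypothetical residuals, not the authors' implementation; the function name is invented, and ties are left in input order because the text does not specify how they were handled.

```python
def rank_countries(residuals):
    """residuals: {country: efficiency residual from one model}.

    Returns a list of (rank, country, gap), where gap is the distance of the
    country's residual from that of the top-ranked country. In the SFA case
    the top country has a residual of zero; in the FE case it has the largest
    positive residual. Either way its gap is zero.
    """
    best = max(residuals.values())
    gaps = {c: best - r for c, r in residuals.items()}
    ordered = sorted(gaps, key=gaps.get)  # smallest gap first
    return [(i + 1, c, gaps[c]) for i, c in enumerate(ordered)]


# Hypothetical SFA-style residuals: zero for the frontier country, negative otherwise.
sfa_residuals = {"A": 0.0, "B": -0.8, "C": -0.3}
table = rank_countries(sfa_residuals)
```

Here country A sits on the frontier and ranks first, followed by C and then B, whose residual lies furthest from the leader.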

Results
In Table 2, we show the US rankings for 12 out of 36 models (four different sets of input variables by three different estimation approaches) estimating total years of life expectancy at birth of the population. Japan ranks first for over half of the models presented in Table 2, and for 75% of the models estimating female life expectancy at age sixty. The US rank varies considerably, going from as high as sixth out of 25 to as low as thirteenth out of 14 countries. Across all 36 models, we find a generally dispersed picture with the US ranking near the center (Tables 3, 4 and 5). The US never ranks higher than fifth out of all 36 models, but is also never the very last ranked country, though it was close in several models. The SFA estimation approach produces the most consistent lead country, but the remaining countries did not maintain a steady rank. In Tables 6 and 7, we see that physicians per capita is consistently significant in the country-only FE models, but loses significance in the country and time FE and SFA models. YPLL from transportation is the only variable to be significant across the three estimation approaches in Model A, but it loses significance with the addition of technology, environmental, lifestyle, and economic variables. Generally, no consistent pattern emerges in the significance of the input variables, and the significance of the inputs appears to be sensitive to the variation in the set of countries in the analysis.
We note that the inclusion of GDP per capita (Model D) does not add much explanatory power to the models, and GDP loses its significance with the addition of time fixed effects. Including GDP raises concerns about endogeneity. A long literature exists pursuing this line of causation that generally states that when health status improves, workers may be more productive, raising GDP [33]. To gauge how rankings perform relative to a measure of real resources in health care, we plotted the ranks against our main health resources variable: number of physicians per 1,000 capita. In Figure 1, we plot the rankings from the SFA estimation of Model A and find a generally dispersed picture with the US near the center. The pattern generally holds across all models.

Sensitivity analysis
We tried alternative input measures to estimate life expectancy. We found that our results were not sensitive to the height and land mass adjustments to calories and pollution, respectively. Although physicians per 1,000 capita is a very common metric in estimating health system efficiency, we lost about twenty percent of our sample by using this metric. We tested total health employment as an alternative, but the ranking results for the countries in both models remained about the same. We chose physicians over total health employment given the more definitive nature of counting physicians versus the more ambiguous definition of a "health employee"; for example, some countries include social assistance workers as health employees. We tested number of physician visits, but the US was missing one-third of the years of observations. We tried physician-to-nurse and general practitioner-to-specialist ratios, but given that countries did not provide consistent years of data for these variables, many countries were excluded. We tested the inclusion of acute care beds as well as total number of hospital discharges (two variables that were also highly correlated with each other), but found that neither measure was significant in any of our models and their inclusion did not noticeably change the rankings. We tested alcohol consumption as an alternative to YPLL from alcohol consumption, but again generally found the same results. We ideally would have more measures for chronic disease and related risk factors, but data availability was limited. For example, obesity would seem to be an important measure, but each country collects the data differently (i.e., measured versus self-report) and the data have only been reported for a few years. We also would like to lag cigarette use over more years, but again the years of available data are limited.
Note: FE = country fixed effects; FE Full = country and time fixed effects; SFA = stochastic frontier analysis.
Despite these data limitations, our findings remained consistent. The rankings varied based on the set of countries, years and input variables included. We also tried different estimation approaches. In a residual analysis, we saw a clear time trend in the random disturbance from the FE models. We tested and found evidence for serial correlation in our FE models. Although the traditional corrections for serial correlation are available, the latest literature advises against these corrections in the absence of a strong prior on the nature of the serial correlation and recommends the use of robust standard errors instead [34]. With the SFA model, we see a more randomly dispersed pattern in the error term, although there is a slight upward drift. While we control for heteroscedasticity in the FE models, we do not have a mechanism to correct for heteroscedasticity in the SFA models, which may bias our coefficients but not necessarily the relative rankings. SFA also requires at least

Discussion
Our study sheds light on the fragility of health system rankings by using a large panel and applying the latest efficiency modeling techniques. While we find that the US tends to be in the middle to lower third of country rankings, the US is never the last ranked country and is occasionally near the top ranked countries. Rankings vary considerably when different variables are included in the model and also when different statistical approaches are used. Among the statistical approaches presented in this study, the SFA model is preferred given the distribution properties of the error term and the allowance for technical efficiency to vary over time. As a result, the rankings resulting from SFA may be considered superior to those from the other statistical approaches.
Although health care spending measures are often included in discussions around health system performance, we exclude this variable from our analysis since it conflates the discussion around prices versus quantity. We included measures of quantity (e.g., health service utilization and resources), but the OECD Health Data does not have any measures for prices (e.g., medical price index or medical care purchasing power parities exchange rate). Given the complexities of what price captures, such as differences in labor markets, health care financing and organization, and quality of care, health care spending seriously overstates the real resources used in health care in some countries, especially in the US [35].
There are three limitations to our analysis. First, we were limited in the statistical methods that are possible, given data problems. Second, though we have better controls for environmental, lifestyle and behavioral risk factors than past literature, these controls are imperfect. Many important variables are not generally available or are not well measured. Third, while closer to measuring health care system efficiency than previous efforts, our results still capture efficiency differences that are beyond the health care system.

Suggestions for an international data repository
The OECD has been a trusted source of data to compare industrialized countries. It has accomplished the significant and highly valuable task of harmonizing secondary data collected in different manners across countries with different capacities to share data. We put forth two recommendations for improvement. First, we recommend a permanent public repository of the data. Every year the OECD releases a new set of historical data, which includes revisions to older OECD releases. While this is a tremendous asset, there is also an archival concern. Data for specific countries and years that may have been available in the past via the OECD data set are sometimes no longer to be found in new releases. These revisions need to be carefully tracked and documented to protect confidence in the reliability of the data.
Second, we recommend that the OECD invest in a federated data locator system where the location of independent data sources from countries is clearly identified for further analysis. The current process of manual harmonization of data sometimes leaves out existing data. For example, the 2013 OECD Health Data did not include the number of physicians per 1,000 capita for the years 2009 and 2010 for the US. Our team manually extracted these data points from the original source cited by the OECD in order to include them in our analysis. Future analysts of OECD data will be well served by data systems able to locate and harmonize data automatically while allowing the data to be maintained and housed autonomously by member countries.

Conclusion
In this study, we addressed criticisms that we and others have made regarding the validity of international rankings by using a wide array of statistical methods and more inclusive data. Our results do not refute the prevailing belief in the literature that the US is a big health care spender and does not consistently deliver the highest quality health care or achieve the best health outcomes. Depending on the estimation approach and choice of variables, however, the US can rank relatively high in technical efficiency. The lack of consistency in our results suggests that users of the existing set of published rankings should proceed with caution. With the change in US investment in the health economy following the ACA and HITECH Act, change in efficiency will be critical to measure. Doing so consistently must be a priority for accurate comparison of countries' health care systems in the future.