Is there much variation in variation? Revisiting statistics of small area variation in health services research
© Ibáñez et al. 2009
Received: 15 September 2008
Accepted: 02 April 2009
Published: 02 April 2009
The importance of Small Area Variation Analysis for policy-making contrasts with the scarcity of work on the validity of the statistics used in these studies. Our study aims to: 1) determine whether variation in utilization rates between health areas is higher than would be expected by chance; 2) estimate the statistical power of the variation statistics; and 3) evaluate the ability of different statistics to compare the variability of different procedures regardless of their rates.
Parametric bootstrap techniques were used to derive the empirical distribution for each statistic under the hypothesis of homogeneity across areas. Non-parametric procedures were used to analyze the empirical distribution for the observed statistics and compare the results in six situations (low/medium/high utilization rates and low/high variability). A small scale simulation study was conducted to assess the capacity of each statistic to discriminate between different scenarios with different degrees of variation.
Bootstrap techniques proved good at quantifying the difference between the null hypothesis and the variation observed in each situation, and at constructing reliable tests and confidence intervals for each of the variation statistics analyzed. Although the Systematic Component of Variation (SCV) performed well, the Empirical Bayes (EB) statistic shows better behaviour under the null hypothesis: it is able to detect variability when present, it is not influenced by the procedure rate, and it is best able to discriminate between different degrees of heterogeneity.
The EB statistic seems to be a good alternative, because of its robustness, to the more conventional statistics used in small-area variation analysis in health services research.
Small Area Variation Analysis (SAVA) is a method used in health services research to describe how rates of healthcare utilization vary across geographic areas. While utilization rates can be calculated to summarize non-binary events (hospital days, costs), they are usually computed to represent counts (procedures, hospital admissions). Studies based on SAVA have documented dramatic variations across areas in the use of medical and surgical procedures, showing that the amount and type of medical care that the individuals of a population receive depend on where they live. The principal finding of these studies remains unchanged: for medical care, geography is destiny. SAVA methods are thus used extensively to characterize medical care, on the assumption that highly variable conditions are associated with greater uncertainty and supply-sensitive care, and Wennberg constructed an influential general theory describing how to detect physician uncertainty from the variation observed in small area analysis.
The importance of these studies in terms of their impact on policy-making contrasts with the dearth of work testing the validity of the SAVA statistics themselves. Very little has been done to determine whether higher than randomly expected variability across areas is in fact detected, or whether certain procedures are more variable than others [5–12]. Not surprisingly, statistical analysis of area variations in health service research is often informal, consisting of plots and maps illustrating admission or surgery rates by healthcare area, and statistics with important statistical limitations [13, 14].
Two groups of statistics of variation are commonly used: those that describe the distribution of rates (based on standardization by the direct method) and those that use differences between expected and observed cases (based on indirect standardization). The former usually include the high-low ratio or extremal quotient (EQ, maximum rate divided by minimum rate) [5, 7], and the unweighted (CV) and weighted (CVw) coefficients of variation. Among the latter, the systematic component of variation (SCV) proposed by McPherson et al. and the chi-squared statistic (χ2) [7, 10] are the most frequently used. Through simulation studies, these statistics have been shown to be sensitive to specific characteristics, such as the prevalence of the procedure or condition, the possibility of multiple admissions, the number of areas considered, and the population size of the small areas. Simulation studies have also shown that the expected variation when the hypothesis of homogeneous rates is true can be surprisingly large, especially for low-incidence procedures or when readmissions are frequent. It is therefore important to assess how far variation estimates are from the null hypothesis, and how precise the statistics are in each particular situation.
Several studies conducted by Diehr and her colleagues, including work assessing the power of the tests applied, the effect of multiple admissions, and the comparison of variability between procedures, have contributed extraordinarily to the advance of SAVA methodology. Nevertheless, these authors were "unable to recommend a single good descriptive for small-area analysis". Moreover, SAVA statistical methodology still has limitations. The simulation procedure constructed by Diehr et al. did not take into account the well-known age and sex variability of most health conditions (although these authors developed an interesting approach in one appendix). On the other hand, it is important to study not only the behaviour of the statistics under the null hypothesis, along with their power, but also their capacity to discriminate between procedures with different variability. Finally, Diehr carried out the analyses in a setting with a small number of geographic areas, where utilization rates were several times higher than the usual rates observed in the Spanish context.
Our work pursued three objectives: 1) to determine whether variation in rates between areas is higher than would be expected by chance, complementing the study under the null hypothesis of homogeneity by constructing confidence intervals for the observed statistics based on non-parametric bootstrap techniques; 2) to estimate the power of the variation statistics; and 3) to evaluate the ability of the different statistics to compare variability among procedures regardless of their rates. Additionally, we extended the simulation procedure to other rarely used statistics, such as the empirical Bayes (EB) statistic, first proposed in this context by Shwartz et al. The EB focuses on estimating rates rather than on testing significance, and the model underlying this statistic has been applied in some SAVA papers [16, 17]. We also considered the Dean (DT) and Bohning (BT) statistics, which have been used to test whether geographical variation in rates is larger than that assumed under homogeneity in mortality studies, but which had not yet been applied in health-services research variation analysis.
We used data from the Atlas of Variations in Medical Practice in the Spanish National Health System (NHS), a research project designed to inform Spanish decision-makers about differences across geographic areas in parameters such as hospital admissions or surgery for specific conditions (see: http://www.atlasvpm.org). The Spanish Atlas emulates the Dartmouth Atlas of Health Care Project. Hospital Discharge Administrative Databases for the 2002 calendar year, with additional data from ambulatory surgery registries, were used to build the numerators of the rates. These administrative databases, produced by every acute care hospital in the Spanish NHS, provide the following information for every single admission: age, sex, admission and discharge dates, diagnosis and procedure codes [International Classification of Diseases, 9th revision, Clinical Modification (ICD9CM)], and the postal code identifying the patient's area of residence. The postal code was used to assign every admitted patient to the Healthcare Area in which he or she lives.
Population distribution of the geographical areas (population-size bands): 10,000–49,999; 50,000–99,999; 100,000–149,999; 150,000–199,999; 200,000–249,999; 250,000–299,999; 300,000–399,999; 400,000–499,999; 500,000–999,999; and 1,000,000–1,500,000 inhabitants.
Codes of the ICD9CM used for selecting cases:
- Appendectomy: all appendectomies, including laparoscopic and incidental.
- Inguinal hernia repair (53.0x; 53.1x; 53.2x; 53.3x): uni- or bilateral repair, with or without mesh, of femoral or inguinal hernias.
- Lower extremity amputation (84.10 to 84.17): amputation at any level; only emergency admissions.
- Total or partial knee replacement.
- Pacemaker implant (37.80; 37.81; 37.82; 37.83): permanent or not, in programmed or emergency admissions.
Extremal Quotient, Coefficient of Variation, Weighted Coefficient of Variation, Systematic Component of Variation, Empirical Bayes variance component, χ2 statistic, Dean statistic, and Bohning statistic were all studied. Because some of the Spanish Atlas' calculations exclude the 5% of extreme standardized rates at each tail, we also eliminated the outliers beyond the 5th–95th percentiles, labelling these statistics EQ5–95, CV5–95, CVw5–95 and SCV5–95. The formulation of the statistics is given in Additional file 1. The EQ, CV and CVw use direct age-standardized rates for each i-th Healthcare Area, denoted by DSR_i for i = 1, ..., I, and all three are well-known measures of variation in general contexts. The remaining statistics use the observed and expected cases per area, denoted by y_i and e_i respectively. The expected cases were derived from the age-specific rates for 8 groups (0–24, 25–44, 45–64, 65–69, 70–74, 75–79, 80–84, and 85 years and over) and the sex stratum in the standard population, which was the population of the 147 healthcare areas under study. More precisely, e_i = Σ_{j,k} n_ijk R_jk, where n_ijk is the population in area i, age group j and sex stratum k, and R_jk is the age-sex specific rate for the whole region under study. Hence, the quotient of the observed to the expected number of cases is the indirect Standardized Utilization Ratio, SUR_i = y_i/e_i, for the i-th Healthcare Area. This quotient is in fact the maximum-likelihood estimator of r_i, the unknown relative risk of undergoing a given surgical procedure in the area, under the assumption that y_i ~ Poisson(e_i r_i) independently for each i-th Health Area. The Poisson distribution is frequently adopted because the Bernoulli process at the individual level (surgery vs. non-surgery) becomes a Binomial process at the area level, which can be approximated by the Poisson distribution when rare events are modelled.
Hence, the null hypothesis of a homogeneous risk surface for the whole region can be represented by the model y_i ~ Poisson(e_i r), with r the common risk. The χ2, DT and BT versions applied here were derived to detect heterogeneity with respect to this homogeneous Poisson model. Finally, the SCV and the EB statistics are derived under a more general framework, where the number of admissions per area is modelled hierarchically in a two-step procedure. The first step assumes that, conditional on the risk r_i, the count y_i follows a Poisson distribution, y_i|r_i ~ Poisson(e_i r_i), whereas in the second step heterogeneity in rates is modelled by adopting a common distribution π for the risk r_i (or for its logarithm), r_i ~ π(r|θ), with θ the vector of parameters of the density function. Whereas the derivation of the SCV does not require a parametric form for π, as the SCV is precisely the moment estimator of the variance of π, the EB statistic is based on the assumption that the log-relative risks are normally and identically distributed, log(r_i) ~ N(μ, σ2). This model, called the multivariate Poisson log-normal or exchangeable model, is widely used in the disease mapping literature [23, 24], and can easily be extended to accommodate spatial autocorrelation [25, 26]. The marginal distribution of this model is not available in closed form, and two approaches can be used to derive estimates for parameters such as the variance component σ2 and predictions for the random effects representing r_i: the Empirical Bayes (EB) approach, which can be accomplished using the Penalized Quasi-Likelihood method, or the full Bayes (FB) approach [25, 26], for which prior distributions for the parameters are required. The EB statistic used in this paper is the estimate of the variance component σ2 derived under the EB approach, but similar results would have been obtained under the FB approach.
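The two-step hierarchical model can be illustrated by simulation (a sketch of ours, not the Penalized Quasi-Likelihood estimation used in the paper; the expected counts of 5,000 per area and σ = 0.3 are invented so that the Poisson noise is negligible and the empirical variance of log(SUR_i) approaches σ2):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_pln(e, sigma, rng):
    """Draw counts from the exchangeable Poisson log-normal model:
    log(r_i) ~ N(0, sigma^2);  y_i | r_i ~ Poisson(e_i * r_i)."""
    log_r = rng.normal(0.0, sigma, size=e.shape)
    return rng.poisson(e * np.exp(log_r))

# With large expected counts, the Poisson contribution to the spread of
# log(SUR_i) is tiny, so var(log(y_i / e_i)) should be close to sigma^2.
e = np.full(147, 5000.0)          # hypothetical expected cases per area
y = simulate_pln(e, sigma=0.3, rng=rng)
var_log_sur = np.var(np.log(y / e))
```

With small expected counts, by contrast, the Poisson noise inflates this crude variance, which is why a model-based estimate of σ2 (EB or FB) is needed in practice.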
Under the null hypothesis of homogeneity among rates, both the SCV and the EB statistics would be zero.
The null hypothesis of homogeneity tested here is that the expected admission rate for each procedure is the same in all areas, so that differences in observed rates are no bigger than expected by chance, assuming an underlying Poisson process for the admission counts. This hypothesis was tested using bootstrapping [29, 30], a resampling procedure that estimates the properties of an estimator (such as its variance) by sampling from an approximating empirical distribution. There are two types of bootstrap procedures, parametric and non-parametric. The former can be adopted when a parametric model exists from which samples can be randomly generated to derive the empirical distribution of the statistic, whereas the latter relies on the discrete empirical distribution obtained by random sampling with replacement from the original dataset. Given that the homogeneous Poisson model (or alternatively the normal model with common rate) was assumed under the null hypothesis, the parametric bootstrap procedure was used to derive upper and lower limit values from R random samples generated under the hypothesis being tested. Even though the same philosophy was first proposed by Diehr et al., we additionally implemented the age-sex adjustment. The source of information used for each statistic (i.e., standardized rates vs. observed-expected cases) was also considered in the analysis. The steps for carrying out this analysis are shown in Additional file 1.
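The parametric bootstrap under the homogeneous Poisson null can be sketched as follows (an illustration under our own naming and with the χ2 as the example statistic; the recalibration of the expecteds to the observed total is our assumption):

```python
import numpy as np

def chi2_stat(y, e):
    """Chi-squared heterogeneity statistic, with the expecteds
    recalibrated so that their total matches the observed total."""
    E = e * (y.sum() / e.sum())
    return np.sum((y - E) ** 2 / E)

def parametric_bootstrap_null(y, e, stat, R=2000, rng=None):
    """Empirical null distribution of a variation statistic under the
    homogeneous Poisson model y_i ~ Poisson(e_i * r), with the common
    risk r estimated as sum(y) / sum(e).  Returns the one-sided 95%
    critical value, a 95% null band, and a Monte Carlo p-value."""
    if rng is None:
        rng = np.random.default_rng()
    r_hat = y.sum() / e.sum()
    sims = np.array([stat(rng.poisson(e * r_hat), e) for _ in range(R)])
    observed = stat(y, e)
    return {
        'critical': np.percentile(sims, 95),
        'null_band': np.percentile(sims, [2.5, 97.5]),
        'p_value': np.mean(sims >= observed),
    }
```

For data with strong built-in heterogeneity, the observed statistic lies far above the bootstrap critical value and the Monte Carlo p-value is essentially zero.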
In order to assess the alternative hypothesis, confidence intervals for the observed statistics were derived. Here we used non-parametric methodology to avoid parametric assumptions about the distribution of both rates and observed cases. Thus, resampling with replacement R times, from the sample of observed standardized rates or from the paired sample of observed and expected cases (depending on the statistic), was used to calculate the statistics for each of the R simulated samples. This made it possible to obtain confidence intervals from the 2.5 and 97.5 percentiles, as before.
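The non-parametric percentile bootstrap for an observed-expected statistic can be sketched as (again a generic illustration with invented names; the paired resampling of (y_i, e_i) follows the description above):

```python
import numpy as np

def nonparametric_ci(y, e, stat, R=2000, rng=None):
    """Percentile bootstrap confidence interval for an observed
    variation statistic: resample (y_i, e_i) pairs with replacement
    R times and take the 2.5 and 97.5 percentiles of the resampled
    statistic values."""
    if rng is None:
        rng = np.random.default_rng()
    I = len(y)
    sims = np.empty(R)
    for b in range(R):
        idx = rng.integers(0, I, size=I)   # paired resampling of areas
        sims[b] = stat(y[idx], e[idx])
    return np.percentile(sims, [2.5, 97.5])
```

For rate-based statistics, the same routine is applied to the sample of standardized rates instead of the (y_i, e_i) pairs.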
In order to assess the ability of the aforementioned statistics to distinguish different degrees of variability, we simulated several situations emulating different types of induced variability. This exercise pursued three objectives: first, to assess the statistical power of each of the statistics described above, which in this case represents the probability of detecting geographic variability when it is present; second, to evaluate whether any of them were better than the others at distinguishing and ordering six different scenarios with regard to the degree of variability; and finally, to study how different rates influence the statistics of variation when they are used to compare procedures according to their variability.
Apart from the scenario named H0, representing the homogeneous Poisson model with a common risk surface y_i ~ Poisson(e_i r_i), with r_i = 1 for all i in 1, ..., I, and generated as described in Additional file 2, six additional scenarios with different degrees of variability were designed. The population structure was based on that observed in the real geographical pattern with I = 147 areas, whereas the expected counts were derived using the overall age-sex specific rates for the most frequent (hip fracture) and the least frequent (lower extremity amputation) procedure. While most of the areas were assumed to have homogeneous rates, an artificially elevated risk was induced in a randomly selected group of areas. This was carried out using two sources of additional variation: incrementing the risks of the selected areas to r_i = 1.2 (S1–S3) or r_i = 1.6 (S4–S6), thereby multiplying their mean rates by 1.2 and 1.6 respectively; and varying the number of areas with induced elevated risk: 10 (S1 and S4), 20 (S2 and S5), and 40 Healthcare Areas (S3 and S6) out of the 147. Counts in all background areas (all but these 10, 20 or 40, respectively) were generated from the null model with the aforementioned common underlying rate. Hence, the scenarios were numbered from the lowest to the highest expected variability: S1<S2<S3<S4<S5<S6.
Once the scenarios were designed, 2000 samples were simulated from the null distribution following the procedure previously described and named H0. The critical value of the tests was estimated using the 95th percentile of the empirical distribution of the statistics, whereas confidence interval limits were obtained from the same distribution using the 2.5 and 97.5 percentiles. Another 2000 samples were simulated from each scenario S1–S6, and the empirical distribution of the statistics was derived. This allowed us to obtain not only the empirical statistical power of the test for each scenario, by calculating the proportion of values greater than the critical value obtained under H0 (the proportion of times that the null hypothesis is rejected in each scenario), but also confidence intervals for the statistics in each scenario, using the 2.5 and 97.5 percentiles of the empirical distribution.
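The power calculation above can be sketched for a single statistic (a simplified illustration of ours: the expected counts are a flat hypothetical 50 per area rather than the real age-sex adjusted values, the elevated areas are taken as the first 10 instead of being randomly selected, and only the χ2 is shown):

```python
import numpy as np

def empirical_power(e, risk, R=2000, rng=None):
    """Empirical power of a heterogeneity test:
    1) simulate R samples under H0 (risk 1 everywhere) and take the
       95th percentile of the statistic as the critical value;
    2) simulate R samples under the alternative 'risk' vector and
       count how often the statistic exceeds that critical value."""
    if rng is None:
        rng = np.random.default_rng()

    def chi2(y):
        E = e * (y.sum() / e.sum())      # recalibrated expecteds
        return np.sum((y - E) ** 2 / E)

    null = np.array([chi2(rng.poisson(e)) for _ in range(R)])
    crit = np.percentile(null, 95)
    alt = np.array([chi2(rng.poisson(e * risk)) for _ in range(R)])
    return np.mean(alt > crit)

# A scenario in the spirit of S4: 10 of 147 areas with relative risk 1.6.
e = np.full(147, 50.0)
risk = np.ones(147)
risk[:10] = 1.6
```

With many areas and moderate expected counts, even this modest induced heterogeneity is detected almost always, consistent with the observation that significant variation is expected in studies with a large number of areas.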
Number of cases, rates per 10,000 inhabitants and observed statistics of variation in procedures of low (panel A) and high (panel B) variability; the procedures shown include lower extremity amputation.
Regarding the observed variation, confidence intervals for the observed statistics are wider than their null counterparts, and these discrepancies in width are larger for the statistics based on observed-expected comparisons than for the rate-based statistics. Of note is the agreement among the statistics in detecting the most variable procedure: all suggest that knee replacement has the highest point estimate and the widest confidence interval, very far removed from the null hypothesis. However, this agreement is not observed when trying to elucidate which procedure presents the lowest variability. While most statistics indicate that admissions for hip fracture and appendectomy have the lowest point estimates, the χ2, Bohning and Dean tests suggest that pacemaker implant and lower extremity amputation, the two procedures with the lowest rates, have lower point estimates than the rest of the procedures. Plotting the confidence intervals of the statistics together with those obtained under homogeneity in the same graph allows more reliable conclusions to be drawn regarding the underlying variability: the closer they are, the lower the probability of systematic variation (i.e., variation beyond chance). Note also that the effect of excluding the 5% of extreme rates in some statistics seems negligible with regard to the comparison between null and observed intervals, because the exclusion lowers the depicted variability both under the null hypothesis and under the observed variability.
Figure 3 also reveals some interesting findings with regard to the differences in the statistics' behaviour when the procedure rate is considered. To compare the variability of different procedures, a statistic must not be affected by the procedure rate. As Figure 3 shows, only the SCV, SCV5–95 and EB statistics satisfy this requirement. Furthermore, the EB has the narrowest confidence intervals, particularly for the low-rate procedures, compared to the SCV and SCV5–95. This finding explains why the EB is more powerful than the others when low-rate procedures are studied. In contrast, all the statistics based on rates (upper row) would always attribute lower variability to a high-rate procedure than to a low-rate procedure. For the χ2 and the Bohning and Dean statistics the opposite is true, and they are equivalent only under H0.
All analyses were carried out using the free statistical package R 2.4.0. 
Our first objective was to analyze whether variation among areas is higher than would be expected by chance. Our findings are not completely consistent with the previous literature, in which the expected variability when the null hypothesis of homogeneity is true was said to be surprisingly large [7, 8, 10]. In our work, practically all the statistics under the null hypothesis have narrow intervals close to zero (or to 1 for the EQ5–95) compared to those derived from the observed data, which are shifted to the right (see Figure 1). The gap between the upper limits of the null intervals and the lower limits of the observed intervals is present for all procedures and all statistics, indicating that we observed more variability than expected by chance, even for procedures known to have low variation, such as hip fracture. This discrepancy with previous studies could be related to the size of our sample (n = 147 Healthcare Areas), which was larger than the sample size used in the reference article by Diehr et al. (n = 39 counties), and suggests that significant variation is to be expected for most procedures in studies with a large number of Healthcare Areas, such as the Dartmouth Atlas of Health Care, with more than 300 hospital reference areas, or the Spanish NHS Atlas of Variations, with more than 140 areas; this makes the interpretation of procedure variations difficult when only the significance of statistics such as the χ2 is given. Furthermore, the fact that some statistics perform differently under the null hypothesis depending on the rate of the procedure (see Figure 1) indicates that it is not adequate to provide only the observed statistics, because the same observed value may represent different degrees of variability depending on the procedure rate.
These relevant aspects suggest that jointly reporting the null and observed performance is more appropriate than providing a single observed descriptive statistic or a simple p-value.
Another interesting question broached by Diehr and other authors was the better performance of the χ2 statistic (compared to others) because of its lower dependence on population size, condition rates or readmissions. In fact, Diehr et al. recommended its use. However, these findings were described only under the null hypothesis. Our work confirmed the aforementioned behaviour of the χ2 statistic (see scenario H0 in Figure 3), but its stability across rates diminished when the alternative was true (see scenarios S1 to S6, Figure 3). In fact, the χ2 statistic takes higher values as the procedure rate increases, regardless of the actual underlying variability.
With regard to the new statistics we tested, the Dean and Bohning tests performed almost identically to the χ2, as all three were designed to detect departures from homogeneity rather than to discriminate among degrees of variability. In fact, the expected value of the χ2 under the null hypothesis is the number of areas minus one, whereas the Dean and Bohning statistics are asymptotically standard normal; all three show good performance under the null hypothesis, but are highly dependent on the procedure rate under the alternative scenarios. On the other hand, the results obtained with the EB were closer to those found using the SCV and SCV5–95. This concordance among statistics was also expected, since the first three tests are based on the discrepancy between observed and expected cases under the homogeneous Poisson model, while the other three are based on a generalized linear mixed model where the area-specific effect is the random effect. Overall, the last three statistics, and especially the EB, show good performance both under the null and under the alternative hypotheses, remaining stable even when procedure rates change. The EB's good behaviour is consistent with Shwartz's results, and confirms that this statistic should become an essential part of SAVA studies.
The second aim of our paper was to estimate the statistical power of the variation statistics, and our results are consistent with Diehr's. There were relevant differences between their study scenarios and ours: their procedures were more prevalent (18 per 10,000, while ours ranged from 2.2 to 10.6 per 10,000 depending on the condition) and their sample size was smaller (39 counties compared to our 147 Healthcare Areas). In spite of this, the outcomes of both studies point in the same direction: the χ2 test appears to have the most statistical power, and the CV and EQ the least. Nevertheless, Diehr's work did not evaluate other statistics such as the EB, which has practically the same power as the widely recommended χ2 and performs better in terms of stability under the alternative hypothesis.
With regard to our third objective, our work sought to compare variation profiles between different procedures. Traditionally, this objective is pursued in SAVA studies by using simple dot plots, descriptive statistics without significance testing, or ratios between the SCV of the procedures under review and the SCV of hospitalization for hip fracture, a known low-variation condition [22, 33, 34]. In our work, all the statistics evaluated seem to agree when the procedure or condition presents high variability. This finding is important because it confirms that conditions identified as highly variable remain consistent across statistics, suggesting that SAVA analysis is a useful method for targeting conditions for intervention or further study. Moreover, it is important to be aware that the sensitivity of the χ2 statistic (and the Dean and Bohning tests) to low-rate procedures may wrongly suggest low variability, as seems to have happened in the case of lower extremity amputation. Because of this problem, the χ2 statistic appears not to be the best choice in SAVA studies.
In order to truly compare variation among procedures, SAVA studies must use reliable statistics that are able to detect variability when it exists. These statistics must perform robustly when utilization rates differ among procedures and when small samples are studied. The main conclusion of our study is that the SCV and, above all, the EB statistic have been shown to be the best, because they do not seem to be influenced by the utilization rates of the conditions or procedures under study (a relevant advantage when conditions with very different rates are compared), and because they are able to accurately discriminate between different degrees of heterogeneity (see confidence intervals in S1 to S6, Figure 3).
Our work has not included all the statistics suggested in the literature, but has concentrated on those most widely used, and on those commonly used in other contexts, such as mortality analysis. Diehr et al. proposed the use of the CVA, which was recommended when procedures had high prevalence rates. They showed that the CVA, which is derived from an analysis of variance where the response variable is the number of admissions for each person in each area and the area is the random effect, does not correlate with the procedure rate, in contrast to other estimates of variation (CV, CVw). Our study corroborates the influence of prevalence on the latter statistics and also shows that neither the SCV nor the EB has this limitation. Furthermore, the underlying Poisson distribution assumed for the SCV and EB statistics was considered more appropriate than the normality and equal-variance assumptions needed for the CVA calculations. In particular, the peculiarities of the model underlying the EB computation, which weights the information each area contributes to the pooled variation according to its reliability, encouraged us to prefer the EB for these studies. Smoothing techniques such as the EB now dominate the disease mapping literature, and can easily be programmed using standard software such as R.
Our work has several limitations. First, we have not addressed the analysis of recurrent events (i.e., readmissions). Although the six procedures under study are unlikely to involve recurrent events within a one-year period (with the exception of lower extremity amputation), it is important to note that the possibility of multiple counts in recurrent events violates the assumption of independence of Poisson events. The variance may be higher, and the standard approaches may not account for the extra variation, underestimating variability [7, 10, 35]. Different approaches to overcome this problem have been proposed in the literature, including the Multiple Admission Factor or the use of distributions other than the Poisson. Additionally, the assumed null model does not consider the variability that may be present due to variation in disease prevalence. This could have been incorporated with models accounting for overdispersion, and estimated if reliable outpatient registers had been available. Although some interesting attempts are being carried out in this direction [16, 36], at present these registers are not reliable enough in our setting. The approach presented here has also not taken into account the spatial autocorrelation that may exist in the data, because a comparison of smoothing techniques incorporating it did not suggest that its inclusion would lead to different results, given the highly populated regions usually considered in health services research studies. Nevertheless, the EB estimate can easily be extended to account for spatial correlation, and it provides estimates close to the full-Bayes counterparts [28, 37]; we therefore recommend that SAVA studies move in this direction, so as to benefit from the advances produced in disease mapping studies.
Another limitation is related to the simulation study, where only two sources of variation were used, namely the number of heterogeneous areas above the overall level (10, 20 or 40) and the magnitude of the differences (RR = 1.2 or RR = 1.6), and only two different procedure rates were considered. Other settings, with different numbers of regions, different rates, different population distributions or different degrees of induced variability, could have led to different results.
Despite the importance of our findings, some questions remain unresolved. With the exception of the EQ, the statistics assessed in this work do not provide information that is easily translated into action. Unfortunately, while the EQ appears to be the most intuitive statistic, it is also the worst in terms of sensitivity and robustness; it is, moreover, difficult to compute when areas with no cases are considered. As Coory and Gibberd note, we need new measures for reporting the magnitude and impact of small-area variation in rates. In the meantime, it is worth drawing health services researchers' attention to the importance of using adequate measures for its estimation.
For this reason, and in conclusion, we recommend: 1) using bootstrap techniques to obtain a joint picture of the observed variability and that obtained under homogeneity, as they provide a complete and reliable measure of the magnitude of variation; 2) interpreting some statistics with care, particularly the rate-based ones, as their performance differs even under homogeneity depending on the procedure rate; and 3) when the variability of different procedures needs to be compared, using the SCV and, especially, the EB statistic, which are the most robust measures and overcome the problems derived from differences in procedure prevalence rates.
This research is part of the "Atlas of Medical Practice Variation in the Spanish National Health System" a Project funded by the Institute for Health, Ministry of Health, Spain (Grants G03/202, PI05/2490, PI06/1673, CIBERESP) and IBERCAJA.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.