Our first objective was to analyze whether variation among areas is higher than would be expected by chance. Our findings are not completely consistent with previous literature, in which the expected variability when the null hypothesis of homogeneity is true was reported to be surprisingly large [7, 8, 10]. In our work, practically all the statistics under the null hypothesis have narrow intervals close to zero (or to 1 for the EQ_{5–95}), whereas those derived from the observed data are shifted to the right (see Figure 1). The gap between the upper limits of the null intervals and the lower limits of the observed intervals is present for all procedures and all statistics, indicating that we observed more variability than expected by chance, even for procedures known to have low variation, such as hip fracture. This discrepancy with previous studies could be related to the size of our sample (n = 147 Healthcare Areas), which was larger than that used in the reference article by Diehr *et al* (n = 39 counties) [7]. It suggests that significant variation is to be expected for most procedures in studies with large numbers of Healthcare Areas, such as the Dartmouth Atlas of Health Care, with more than 300 hospital reference areas, or the Spanish NHS Atlas of Variations, with more than 140 areas, which makes the interpretation of procedure variations difficult when only the significance of statistics such as the χ^{2} is reported. Furthermore, the fact that some statistics perform differently under the null hypothesis depending on the rate of the procedure (see Figure 1) indicates that it is not adequate to report only the observed statistics, because the same observed value may represent different degrees of variability depending on the procedure rate.
These aspects suggest that reporting the null and the observed performance jointly is more appropriate than reporting a single observed descriptive statistic or a single p-value.

Another interesting question broached by Diehr [7] and other authors [32] was the better performance of the χ^{2} statistic (compared to others) because of its lower dependence on population size, condition rates or readmissions. In fact, Diehr *et al* recommended its use. However, these findings were only described under the null hypothesis. Our work has confirmed the aforementioned behaviour of the χ^{2} statistic (see scenario H_{0} in Figure 3), but its stability across different rates diminished when the alternative was true (see scenarios S_{1} to S_{6}, Figure 3). In fact, the χ^{2} statistic takes higher values as the procedure rate increases, regardless of the actual underlying variability.
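This rate dependence can be reproduced with a small Monte Carlo sketch. The populations, rates and seed below are illustrative assumptions (equal-sized areas, a pure-Python Poisson sampler), not the simulation design of the study: holding the relative risks fixed and raising only the baseline rate leaves the χ^{2} statistic roughly unchanged under homogeneity, but inflates it under heterogeneity.

```python
import math
import random

def rpois(lam, rng):
    # Knuth's multiplicative Poisson sampler; adequate for the moderate
    # means used here (lam > 0)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def chi2(obs, pops):
    # Pearson chi-square of observed counts against the pooled
    # (homogeneous) rate
    rate = sum(obs) / sum(pops)
    return sum((o - n * rate) ** 2 / (n * rate) for o, n in zip(obs, pops))

def mean_chi2(rate, rr_areas, rr, n_areas=147, pop=50_000, reps=200, seed=1):
    # Average chi-square over simulated maps: the first rr_areas areas
    # have their rate multiplied by rr, the rest stay at the baseline
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        obs = [rpois(pop * rate * (rr if i < rr_areas else 1.0), rng)
               for i in range(n_areas)]
        total += chi2(obs, [pop] * n_areas)
    return total / reps
```

With no heterogeneity (`rr_areas=0`) the mean stays near the degrees of freedom (147 − 1 = 146) at any rate; with, say, 20 areas at RR = 1.6, the mean grows substantially as the baseline rate moves from 2.2 to 10.6 per 10,000.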

With regard to the new statistics we have tested, the Dean and Bohning tests performed almost identically to the χ^{2}, as all three were designed to detect departures from homogeneity rather than to discriminate among degrees of variability. In fact, the expected value of the first statistic under the null hypothesis is the number of areas minus one, whereas the other two are asymptotically standard normal; all three perform well under the null hypothesis but are highly dependent on the procedure rate under the alternative scenarios. On the other hand, the results obtained with the EB were closer to those found using the SCV and SCV_{5–95}. This concordance among statistics was also expected, since the first three tests are based on the discrepancy between observed and expected cases under the homogeneous Poisson model, while the other three are based on a generalized linear mixed model in which the area-specific effect is the random effect. Overall, the last three statistics, and especially the EB, perform well both under the null and the alternative hypotheses, remaining stable even when procedure rates change. The EB's good behaviour is consistent with Shwartz's results [12], and confirms that this statistic should become an essential part of SAVA studies.
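As a concrete reference for one of these statistics, McPherson's systematic component of variation subtracts the random Poisson part from the observed squared relative deviations, leaving the between-area component. A minimal sketch (some published variants multiply the result by 100):

```python
def scv(obs, exp):
    """McPherson's systematic component of variation.

    obs: observed cases per area; exp: expected cases under the pooled
    rate. The 1/E term removes the random (Poisson) part of each squared
    relative deviation, leaving the systematic between-area component.
    """
    k = len(obs)
    return sum(((o - e) / e) ** 2 - 1.0 / e for o, e in zip(obs, exp)) / k
```

For example, ten areas each expecting 100 cases and each deviating by 20% give SCV = 0.2² − 1/100 = 0.03; with observed counts exactly equal to the expected ones the estimate is slightly negative, a known feature of this moment estimator.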

The second aim of our paper consisted of estimating the statistical power of the variation statistics, and our results are consistent with Diehr's [9]. There were relevant differences between their study scenarios and ours: their procedures were more prevalent than ours (18 per 10,000, while ours ranged from 2.2 to 10.6 per 10,000 depending on the condition) and their sample size was smaller (39 counties compared to our 147 Healthcare Areas). In spite of this, the outcomes of both studies point in the same direction: the χ^{2} test appears to have the most statistical power, and the CV and EQ the least. Nevertheless, Diehr's work did not evaluate other statistics such as the EB, which has practically the same power as the widely recommended χ^{2} and performs better in terms of stability under the alternative hypothesis.
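The power-estimation recipe itself is straightforward Monte Carlo and can be sketched in a few lines: simulate the statistic under homogeneity to obtain an empirical critical value, then count how often it is exceeded under a heterogeneous scenario. The setup below (equal-sized areas, illustrative rate and seed) is an assumption for the sketch, not the study's actual design.

```python
import math
import random

def rpois(lam, rng):
    # Knuth's multiplicative Poisson sampler (lam > 0)
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def chi2(obs, pop, n_areas):
    # Pearson chi-square against the pooled rate, equal populations
    e = sum(obs) / n_areas
    return sum((o - e) ** 2 / e for o in obs)

def power(rr_areas, rr, rate=5e-4, n_areas=147, pop=50_000, reps=300, seed=2):
    rng = random.Random(seed)
    def sims(heterogeneous):
        out = []
        for _ in range(reps):
            obs = [rpois(pop * rate * (rr if heterogeneous and i < rr_areas
                                       else 1.0), rng)
                   for i in range(n_areas)]
            out.append(chi2(obs, pop, n_areas))
        return out
    null = sorted(sims(False))
    crit = null[int(0.95 * reps)]   # empirical 95th percentile under H0
    return sum(s > crit for s in sims(True)) / reps
```

With 147 areas and 20 of them at RR = 1.6, the estimated power of the χ^{2} is essentially 1, consistent with the large-noncentrality setting discussed above; the same function applied with no elevated areas returns the empirical false-positive rate, close to the nominal 5%.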

With regard to our third objective, our work sought to compare variation profiles between different procedures. Traditionally, this objective is pursued in SAVA studies by using simple dot plots, descriptive statistics without significance testing, or ratios between the SCV of the procedures under review and the SCV of hospitalization for hip fracture, a condition known to have low variation [22, 33, 34]. In our work, all the statistics evaluated seem to agree when the procedure or condition presents high variability. This finding is important because it confirms that conditions identified as highly variable remain consistent across statistics, suggesting that SAVA analysis is a useful method for targeting conditions for intervention or further study. Moreover, it is important to be aware that the sensitivity of the χ^{2} statistic (and the Dean and Bohning tests) to low-rate procedures may suggest low variability, as seems to have happened in the case of lower extremity amputation. Because of this problem, the χ^{2} statistic appears not to be the best choice in SAVA studies.

In order to truly compare variation among procedures, SAVA studies must use reliable statistics that are able to detect variability when it exists. These statistics must perform robustly when utilization rates differ among the procedures and when small samples are studied. The main conclusion of our study is that the SCV and, especially, the EB statistic have proven to be the best, because they do not seem to be influenced by the utilization rates of the conditions or procedures under study (a relevant advantage when conditions with very different rates are compared), and because they are able to accurately discriminate between different degrees of heterogeneity (see confidence intervals in S_{1} to S_{6}, Figure 3).

Our work has not included all the statistics suggested in the literature, but has concentrated on those most widely used, and on those commonly used in other contexts, such as mortality analysis. Diehr *et al* proposed the use of the CVA [7], which was recommended when procedures had high prevalence rates. They showed that the CVA, which is derived from an analysis of variance in which the response variable is the number of admissions for each person in each area and the area is the random effect, does not correlate with the procedure rate, in contrast to other estimates of variation (CV, CVw). Our study corroborates the influence of prevalence on the latter statistics and also shows that neither the SCV nor the EB has this limitation. Furthermore, the underlying Poisson distribution assumed for the SCV and EB statistics [12] was considered more appropriate than the normality and equal-variance assumptions required for the CVA calculations. In particular, the model underlying the EB computation, which weights the information each area contributes to the pooled variation by that area's reliability, encouraged us to prefer the EB for these studies. Smoothing techniques such as the EB now dominate the disease-mapping literature and can easily be programmed using standard software such as R.
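The reliability weighting at the heart of EB smoothing can be illustrated with Marshall's moment-based global estimator. This is an illustration of the idea only, under assumed inputs; the EB statistic in this study may be computed from a different model:

```python
def marshall_eb(counts, pops):
    """Moment-based global empirical Bayes smoothing (Marshall's method).

    counts: observed cases per area; pops: area populations.
    Returns the estimated between-area (systematic) variance A and the
    list of smoothed rates, each crude rate shrunk toward the pooled
    rate in proportion to its reliability A / (A + m / n_i).
    """
    m = sum(counts) / sum(pops)                        # pooled rate
    rates = [o / n for o, n in zip(counts, pops)]
    nbar = sum(pops) / len(pops)
    # population-weighted variance of the crude rates, minus the
    # within-area Poisson noise m / nbar
    s2 = sum(n * (r - m) ** 2 for n, r in zip(pops, rates)) / sum(pops)
    A = max(s2 - m / nbar, 0.0)                        # systematic variance
    eb = [m + (A / (A + m / n)) * (r - m) if A > 0 else m
          for r, n in zip(rates, pops)]
    return A, eb
```

Small, unreliable areas (large m / n_i) are pulled strongly toward the pooled rate, while large areas keep estimates close to their crude rates; when the crude rates are all equal, A = 0 and every area collapses to the pooled rate.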

Our work has several limitations. First, we have not addressed recurrent events (i.e. readmissions) in the analysis. Although the six procedures under study are unlikely to have recurrent events within a one-year period (with the exception of lower extremity amputation), it is important to note that multiple counts from recurrent events violate the independence assumption of Poisson events. The variance may be higher, and standard approaches may not account for this extra variation, underestimating variability [7, 10, 35]. Different approaches to overcome this problem have been proposed in the literature, including the Multiple Admission Factor [10] or the use of distributions other than the Poisson. Additionally, the assumed null model does not consider the variability that may be present due to variation in disease prevalence. This could have been incorporated with models accounting for overdispersion, and estimated, if reliable outpatient registers had been available. Although some interesting attempts are being made in this direction [16, 36], at present these registers are not reliable enough in our setting. Nor has the approach presented here taken into account the spatial autocorrelation that may exist in the data, because a comparison of smoothing techniques incorporating it did not suggest that its inclusion would lead to different results, given the highly populated regions usually considered in health services research. Nevertheless, the EB estimate can easily be extended to account for spatial correlation [20], and it provides estimates close to their full-Bayes counterparts [28, 37]; we therefore recommend that SAVA studies move in this direction to benefit from the advances made in disease mapping studies.
Another limitation relates to the simulation study, in which only two sources of variation were used, namely the number of heterogeneous areas above the overall level (10, 20 or 40) and the magnitude of the differences (RR = 1.6 or RR = 1.2), and only two different procedure rates were considered. Other settings, with different numbers of regions, different rates, different population distributions or different degrees of induced variability, could have led to different results.

Despite the importance of our findings, some questions remain unresolved. With the exception of the EQ, the statistics assessed in this work do not provide information that is easily translated into action. Unfortunately, while the EQ appears to be the most intuitive statistic, it is also the worst in terms of sensitivity and robustness; moreover, it is difficult to compute when some areas have no cases. As Coory and Gibberd note [38], we need new measures for reporting the magnitude and impact of small-area variation in rates. In the meantime, it is worth drawing health services researchers' attention to the importance of using adequate measures when estimating variation.