Outlier identification and monitoring of institutional or clinician performance: an overview of statistical methods and application to national audit data
BMC Health Services Research volume 23, Article number: 23 (2023)
Abstract
Background
Institutions or clinicians (units) are often compared according to a performance indicator such as in-hospital mortality. Several approaches have been proposed for the detection of outlying units, whose performance deviates from the overall performance.
Methods
We provide an overview of three approaches commonly used to monitor institutional performances for outlier detection. These are the common-mean model, the ‘Normal-Poisson’ random effects model and the ‘Logistic’ random effects model. For the latter we also propose a visualisation technique. The common-mean model assumes that the underlying true performance of all units is equal and that any observed variation between units is due to chance. Even after applying case-mix adjustment, this assumption is often violated due to overdispersion and a post-hoc correction may need to be applied. The random effects models relax this assumption and explicitly allow the true performance to differ between units, thus offering a more flexible approach. We discuss the strengths and weaknesses of each approach and illustrate their application using audit data from England and Wales on Adult Cardiac Surgery (ACS) and Percutaneous Coronary Intervention (PCI).
Results
In general, the overdispersion-corrected common-mean model and the random effects approaches produced similar p-values for the detection of outliers. For the ACS dataset (41 hospitals), three outliers were identified in total but only one was identified by all methods above. For the PCI dataset (88 hospitals), seven outliers were identified in total but only two were identified by all methods. The common-mean model uncorrected for overdispersion produced several more outliers. The reason for observing similar p-values for all three approaches could be attributed to the fact that the between-hospital variance was relatively small in both datasets, resulting only in a mild violation of the common-mean assumption; in this situation, the overdispersion correction worked well.
Conclusion
If the common-mean assumption is likely to hold, all three methods are appropriate to use for outlier detection and their results should be similar. Random effects methods may be the preferred approach when the common-mean assumption is likely to be violated.
Background
The detection and management of outliers when monitoring institutional performance is important in maintaining and improving the quality of health care. In the UK, NHS England monitors the performance of hospitals and individual clinicians to help them identify necessary improvements for patient care. For example, national audit programs exist in Diabetes, Dementia, Lung cancer, Cardiovascular Outcomes, and other fields. Nowadays, outlier detection is an essential aspect of such audits. Patients seek to receive the best possible health care, and government bodies and healthcare providers seek to identify high and low performers to guide quality improvement in clinical care.
The implications of being classified as an outlier can be huge. Low-performing hospitals are likely to face intense scrutiny and patients might choose to avoid low-performing hospitals or surgeons. However, failing to identify outliers with poor performance may jeopardise patient safety.
Outlier methodology may be applied to both measures of processes of clinical care (e.g., waiting times) as well as outcomes of care (e.g., complication rates, procedural mortality). For the purposes of this paper, we use the term ‘unit’ to refer to the entities whose performance is monitored; units can be hospitals, individual hospital clinicians, general practices or general practitioners.
The aim is to identify units whose performance diverges substantially from the expected performance of a group of units or from an externally set target. These units are often said to be ‘outliers’. Depending on the degree of divergence from the performance target, units have been described [1, 2] as ‘normal’, ‘high/low alerts’ or ‘high/low alarms’, with each term describing progressively greater deviation from the performance target. For example, we may want to monitor hospital performance with respect to mortality following cardiac surgery. In this case, the units are the hospitals and each observation within a hospital corresponds to a surgical procedure on an individual patient.
Differences in the performance between units will in part be due to differences in the characteristics of patients in each unit (the unit’s case-mix). For example, when hospitals are compared with respect to in-hospital mortality following a cardiac procedure, it is likely that different hospitals treat patients with different risk profiles. Hospitals treating higher-risk patients would be expected to also have higher proportions of in-hospital deaths (raw mortality). Adjusting for the predicted risk of in-hospital death for each patient can account for some of these differences and help to understand differences in outcomes that are due to quality of care provided. Risk-adjustment is often applied by obtaining the predicted risk for individual patients using a risk model. For example, the predicted risk of in-hospital mortality following cardiac surgery can be obtained using the EuroSCORE risk model [3, 4].
Detecting outliers is an important process with potentially significant implications. Therefore, the ability to detect outliers reliably using appropriate methodology is vital. The principle underpinning all approaches to outlier detection is that a distribution is assumed for the unit-level performance to establish allowable variation in unit performance. Deviations from this distribution indicate outliers. Firstly, the most commonly used approach, the ‘Common-mean model’ [1], uses aggregated unit-level data and is visualised using a funnel plot. It assumes a common true performance for all units, subject to sampling variation. Levels of acceptable variation, the control limits, are constructed around the overall average or an externally set target. However, as explained later, the variability in the data is often higher than that expected under the assumed model, e.g., because of imperfect risk-adjustment or problems with data quality. This is called overdispersion and it may be accounted for by applying a post-hoc correction [5] to the levels of acceptable variation. More recently, the use of random effects models has been proposed to account for the clustered nature of the data within units and for overdispersion. A random effects model can be applied to either unit- or individual-level data [2, 6]. The second approach we consider, the ‘Normal-Poisson model’, uses random effects for the units and is applied to unit-level aggregate data. The third approach, the ‘Logistic random effects model’, also uses random effects for the units but is applied to individual-level data (or procedure-level data).
In this paper we provide an overview of these approaches for outlier detection. We examine their corresponding assumptions, discuss their strengths and weaknesses, and review methods for visual representation of the results. For the logistic random effects model, we propose a graphical way to present the results. We illustrate the application of the methods using cardiac data, and include software to implement these approaches in R.
Methods
The common-mean model for unit-level data
We consider the case where the individual-level outcome is binary (e.g., death = 1/alive = 0). For unit-level aggregated data, the performance indicator is often taken to be either the proportion, p_{i} (i = 1, …, K where K denotes the number of units) or the risk-adjusted proportion \(\left({p}_i^{ra}\right)\) of events.
The observed proportion of events (\({\hat{p}}_i)\) in unit i, is the observed number of events (e.g., deaths), O_{i}, divided by the total number of observations in that unit, n_{i}. When a risk model for risk-adjustment is available, each patient within a unit is assigned a predicted risk of having the event. The expected number of events in unit i, E_{i}, is calculated by summing the predicted risks for all observations in that unit. The observed risk-adjusted proportion of events \(\left({\hat{p}}_i^{ra}\right)\) is equal to the ratio of the observed to the expected number of events, multiplied by the overall proportion of events, \(\overline{p}=\frac{\sum {O}_i}{\sum {n}_i}\). If no risk-adjustment is used, then \({E}_i={n}_i\overline{p}\) for all units.
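As a minimal sketch of this calculation, the risk-adjusted proportion for one unit can be computed from its observed event count and the predicted risks of its patients; the counts and risks below are hypothetical:

```python
def risk_adjusted_proportion(observed_events, predicted_risks, overall_p):
    """Risk-adjusted proportion p_i^ra = (O_i / E_i) * p_bar, where E_i is
    the sum of the patients' predicted risks in the unit."""
    expected_events = sum(predicted_risks)  # E_i
    return observed_events / expected_events * overall_p

# Hypothetical unit: 5 observed deaths, predicted risks summing to E_i = 4.0,
# and an overall event proportion of 2% across all units.
p_ra = risk_adjusted_proportion(5, [0.8, 0.8, 0.8, 0.8, 0.8], 0.02)  # 0.025
```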
Without loss of generality, we let p_{i} denote the performance indicator in what follows, where this is either the proportion or risk-adjusted proportion of events. The common population proportion, p, is often the overall proportion of events in the sample of all units, \(\overline{p}\), and the variance of p_{i} is \({\sigma}_i^2=\frac{p_i\left(1-{p}_i\right)}{n_i}\), the binomial variance.
The common-mean model assumes that there exists a single underlying true performance, p, which is common for all units, and that the observed value occurs with variance \({\sigma}_i^2=\frac{p\left(1-p\right)}{n_i}\). Using a Normal approximation,

\({\hat{p}}_i\sim N\left(p,\ \frac{p\left(1-p\right)}{n_i}\right)\)  (1)
Any difference between the observed performance in each unit, \({\hat{p}}_i\), and p is assumed to be due to random sampling variation. To detect outliers, we test the null hypothesis that the underlying true performance of unit i, p_{i}, is equal to the population proportion, p:

\({H}_0:\ {p}_i=p\)  (2)
This can be tested using the following test-statistic:

\({Z}_i^{(1)}=\frac{{\hat{p}}_i-p}{\sqrt{\frac{p\left(1-p\right)}{n_i}}}\)  (3)

If the null hypothesis is true, \({Z}_i^{(1)}\sim N\left(0,1\right)\). The associated p-value for each unit is

\({p_{val}}_i^{(1)}=1-\Phi \left({Z}_i^{(1)}\right)\)  (4)
Often, the assumption of a common mean will be untenable, e.g., due to imperfect risk-adjustment. So, in fact, the underlying true proportion of events for each unit is bound to deviate to some extent from the population proportion of events, and consequently the variability in the outcome will be higher than just the random variation in (1). This excess variability is called ‘overdispersion’. Failing to account for overdispersion means that the assumed variability is smaller than the variability actually present in the data, resulting in too many units being identified as outliers. Overdispersion in the common-mean model can be accounted for by multiplying the variance by (or adding to it) a corrective overdispersion parameter, which may be estimated from the data [1, 5]. For example, if using a multiplicative correction, a value >1 for the overdispersion parameter φ indicates that there is unaccounted variability in the performance indicator, i.e., overdispersion is present. Then, the test statistic in (3) is corrected by multiplying the variance under the null hypothesis (denominator term of (3)) by the factor φ, giving \(\sqrt{\frac{\varphi\;p\left(1-p\right)}{n_i}}\).
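A minimal sketch of the test with a multiplicative overdispersion correction follows; the counts, target and φ are hypothetical, and a one-sided p-value 1 − Φ(Z) is assumed for illustration:

```python
import math

def common_mean_z(p_hat, p, n, phi=1.0):
    """Z-statistic for H0: p_i = p; phi > 1 applies a multiplicative
    overdispersion correction to the null variance p(1 - p)/n."""
    return (p_hat - p) / math.sqrt(phi * p * (1 - p) / n)

def one_sided_p(z):
    """One-sided p-value 1 - Phi(z), computed via the error function."""
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical unit: observed mortality 3% vs a 2% target over 1000 procedures.
z_raw = common_mean_z(0.03, 0.02, 1000)            # uncorrected
z_corr = common_mean_z(0.03, 0.02, 1000, phi=4.0)  # overdispersion-corrected
```

With φ = 4 the corrected statistic is exactly half the uncorrected one, so the correction makes the unit look less extreme.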
Visualisation using a funnel plot
The result from applying the common-mean model to a dataset is a p-value for each of the units obtained using (4). The p-value for a given unit reflects the probability of obtaining the unit’s observed performance if it was actually consistent with the population proportion. A statistically significant p-value at a given significance level suggests that the unit is an outlier. In the literature [1], units have usually been categorised as: outliers at the α = 5% significance level (“Alerts/Better than Expected”), outliers at the α = 0.2% level (“Alarms/Substantially better than expected”) and as “Normal” if they are neither Alerts/Better than Expected nor Alarms/Substantially better than expected.
A common way of visualising the results of the outlier process from the common-mean model is a ‘funnel plot’, where the observed value of the performance indicator for a given unit is plotted against a measure of its precision, e.g., the sample size.
For the common-mean model (1), the assumed true proportion of events (known as the target) needs to be set first. The target value could be the overall proportion (or risk-adjusted proportion) of events or an externally set value, p. On the vertical axis is the observed proportion (or risk-adjusted proportion) of events and on the horizontal axis the sample size. The target value p is first drawn as a horizontal line. Then, under the assumption that the null hypothesis is true, control limits are drawn around this value for a range of sample sizes, n. For a given sample size, n, the control limits (potentially with adjustment for overdispersion via the parameter φ) around the target are \(p\pm {z}_{1-\alpha /2}\times \sqrt{\frac{\varphi\ p\left(1-p\right)}{n}},\ n=1,2,\dots\) These reflect the range of acceptable variation around the common-mean value at the significance level α for a unit of size n, assuming the null hypothesis is true. For proportions, the width of the 95 and 99.8% control limits decreases with increasing sample size (at rate \(1/\sqrt{n}\)), giving rise to the funnel shape of the graph. The observed values of the performance indicators are then plotted against their size for all units. Units whose observed performance lies beyond the control limits are deemed to be inconsistent with the null hypothesis and hence are denoted as outliers.
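The control limits can be computed directly; this sketch uses Python’s `statistics.NormalDist` for the Normal quantile, with a hypothetical 2% target:

```python
import math
from statistics import NormalDist

def funnel_limits(p, n, alpha=0.05, phi=1.0):
    """Control limits p +/- z_{1-alpha/2} * sqrt(phi * p * (1 - p) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(phi * p * (1 - p) / n)
    return p - half_width, p + half_width

# 95% limits around a 2% target: the funnel narrows as the unit size grows.
lo_small, hi_small = funnel_limits(0.02, n=100)
lo_large, hi_large = funnel_limits(0.02, n=1000)
```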
The Normal-Poisson random effects model for unit-level data
The random effects approach relaxes the assumption of model (1) that there is a common underlying true performance for all units and that the variation in the observed performance across units is just by chance. Instead, it assumes that, because of imperfect case-mix adjustment or other reasons, the underlying true performance will differ between units. Denoting by Y_{i} the performance indicator of unit i and by θ_{i} its underlying true performance,

\({Y}_i\sim N\left({\theta}_i,\ {\sigma}_i^2\right),\qquad {\theta}_i\sim N\left(\mu,\ {\tau}^2\right)\)  (5)
where \({\sigma}_i^2\) denotes random variation around population mean μ. So, the observed performance of each unit is subject to two sources of variation: the random variation \({\sigma}_i^2\), as for the common-mean model, and additionally an acceptable between-unit variance τ^{2}.
To detect outliers, we test the null hypothesis that the underlying true performance of unit i, Y_{i}, is equal to μ:

\({H}_0:\ {Y}_i=\mu\)  (6)
When the individual-level outcome is binary, a simple Normal random effects model [2] has been used where the performance indicator of interest is the log standardised event ratio (or standardised mortality ratio (SMR) if the event is death), \({Y}_i=\log \left(\frac{O_i}{E_i}\right)\); as before, O_{i} and E_{i} denote the observed and expected number of events, respectively. As the distribution of the standardised event ratio O/E tends to be skewed, the log transformation is used to produce a more symmetric distribution; other transformations are also possible, including the square-root of O/E [7]. Assuming that O_{i} ∼ Poisson(E_{i}), the random variation component of \(\log \left(\frac{O_i}{E_i}\right)\) can be approximated by \({\sigma}_i^2=\frac{1}{E_i}\). The acceptable between-unit variance, τ^{2}, can be estimated from the data; estimation details are provided in Appendix 1. The population mean, μ, is usually \(\log\left(\frac{O}{E}\right)\), where O and E denote the sum of the observed and expected values across units, respectively. Because of the assumed distributions, this model is called the Normal-Poisson random effects model for unit-level data. The Normal-Poisson model is appropriate as long as the implied Normal approximation holds. This may not hold when the number of events (and unit size) is small; in these situations, further approximations may be necessary [8] or the unit may be excluded from the outlier process.
Under the null hypothesis, model (5) is written as

\({Y}_i\sim N\left(\mu,\ {\tau}^2+{\sigma}_i^2\right)\)  (7)
The test-statistic for testing the null hypothesis is given by:

\({Z}_i^{(2)}=\frac{Y_i-\mu }{\sqrt{{\hat{\tau}}^2+{\sigma}_i^2}}\)  (8)
leading to the p-value \({p_{val}}_i^{(2)}=1-\Phi \left({Z}_i^{(2)}\right)\).
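Putting the pieces together for a single unit (hypothetical counts; μ is set to 0 and τ² to 0.05 purely for illustration):

```python
import math
from statistics import NormalDist

def normal_poisson_z(O, E, mu, tau2):
    """Z = (log(O/E) - mu) / sqrt(tau^2 + 1/E), where 1/E approximates
    the Poisson-based random variation of log(O/E)."""
    return (math.log(O / E) - mu) / math.sqrt(tau2 + 1 / E)

def one_sided_p(z):
    """One-sided p-value 1 - Phi(z)."""
    return 1 - NormalDist().cdf(z)

# Hypothetical hospital: 30 observed vs 20 expected deaths.
z = normal_poisson_z(O=30, E=20, mu=0.0, tau2=0.05)
p_val = one_sided_p(z)
```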
Visualisation using a funnel plot
A funnel-type plot can also be drawn for the Normal-Poisson random-effects model, similar to that used for the common-mean model. However, the quantities on the axes are different because the assumed null model in (7) is different. For a binary outcome such as in-hospital mortality, the performance indicator on the vertical axis is \({Y}_i=\log \left(\frac{O_i}{E_i}\right)\). This is plotted against a measure of its precision, the expected number of events \(\left({\sigma}_i^2=\frac{1}{E_i}\right)\). The target value is usually \(\mu =\log \left(\frac{O}{E}\right)\), which is drawn as a horizontal line on the graph. This value will be close to zero, as the expected number of events will usually be close to the observed number of events (if the risk model is correctly calibrated in an overall sense). The variance of Y_{i} under the null hypothesis incorporates two sources of variation: \({\tau}^2+{\sigma}_i^2\). These can be estimated from the data as \({\hat{\tau}}^2+\frac{1}{E_i}\), where \({\hat{\tau}}^2\) is an estimate of τ^{2}. By varying the expected number of events, the control limits around the target are \(\mu \pm {z}_{1-\frac{\alpha }{2}}\times \sqrt{{\hat{\tau}}^2+\frac{1}{E}},\ E=1,2,\dots\), reflecting the acceptable variation around the target value μ under the null hypothesis. Units whose observed performance \(\log \left(\frac{O_i}{E_i}\right)\) lies beyond the control limits are outliers.
Most often the quantities presented on the vertical axis are the original \(\frac{O}{E}\) ratios, instead of \(\log\left(\frac{O}{E}\right)\). The control limits are then: \(\exp \left(\mu \pm {z}_{1-\frac{\alpha }{2}}\times \sqrt{{\hat{\tau}}^2+\frac{1}{E}}\right).\)
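For a funnel plot on the O/E scale, the limits can be back-transformed by exponentiation; the values below are hypothetical:

```python
import math
from statistics import NormalDist

def oe_control_limits(E, mu, tau2, alpha=0.05):
    """Limits on the O/E scale: exp(mu +/- z_{1-alpha/2} * sqrt(tau^2 + 1/E))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(tau2 + 1 / E)
    return math.exp(mu - half_width), math.exp(mu + half_width)

# With mu = 0 the limits bracket O/E = 1, and they narrow as E increases.
lo_25, hi_25 = oe_control_limits(E=25, mu=0.0, tau2=0.04)
lo_100, hi_100 = oe_control_limits(E=100, mu=0.0, tau2=0.04)
```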
The logistic random effects model for individual-level data
When the outcome is binary, individuals with Y = 1 are said to have experienced the event of interest and individuals with Y = 0 to have not. All units will contain multiple observations and these observations are said to be clustered within units; for example, patients might be clustered within hospitals. For clustered data, the binary outcome can be modelled using the logistic random effects model, an extension of the well-known logistic regression model. The simplest form of the logistic random effects model for π_{ij} = P(Y_{ij} = 1), with random intercept terms, is

\(\log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right)={\beta}_0+{u}_i\)  (9)
where β_{0} is a fixed effect, u_{i} is the random intercept for unit i, and j is the indicator for the jth member of unit i. In this model, β_{0} can be viewed as the average log-odds and the u_{i}’s correspond to unit-specific deviations from β_{0}. Usually, it is assumed that \({u}_i\sim N\left(0,{\sigma}_u^2\right)\), where \({\sigma}_u^2\) is the variance of the random intercepts. When the data are clustered, observations within the same unit tend to be more similar than those from different units. The intra-cluster correlation coefficient (ICC) is often used to quantify the degree of similarity; this is also known as the degree of clustering. The ICC takes values between 0 and 1, and quantifies the proportion of total variation due to the clustering of patients within units. For binary outcomes, the ICC can be estimated by \(ICC=\frac{\sigma_u^2}{\frac{\pi^2}{3}+{\sigma}_u^2}\) [9].
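For example, the ICC implied by a given random-intercept variance is a one-liner; plugging in σ_u = 0.28 (roughly the ACS estimate reported in the Results) gives an ICC of about 0.02:

```python
import math

def icc_binary(sigma_u2):
    """ICC = sigma_u^2 / (pi^2/3 + sigma_u^2) for a logistic random
    intercept model, using the latent-scale residual variance pi^2/3."""
    return sigma_u2 / (math.pi ** 2 / 3 + sigma_u2)

icc = icc_binary(0.28 ** 2)  # sigma_u = 0.28 on the log-odds scale
```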
When a risk model is available, risk-adjustment can be readily incorporated by adding the log-odds of the predicted risk, \({\hat{p}}_{ij}\), for each observation, \({\hat{\eta}}_{ij}=\log \left(\frac{{\hat{p}}_{ij}}{1-{\hat{p}}_{ij}}\right)\), as a covariate:

\(\log \left(\frac{\pi_{ij}}{1-{\pi}_{ij}}\right)={\beta}_0+{\beta}_1{\hat{\eta}}_{ij}+{u}_i\)  (10)
Estimates of the random effects are often obtained using Empirical Bayes prediction, where the estimation of the unit-specific effects is effectively a weighted average of the population proportion and the unit proportion (on the log-odds scale); this is the approach we follow in this paper. The implication of using Empirical Bayes prediction for the random effects is that the effects for smaller units tend to be ‘shrunk’ towards the overall average [10, 11]. Model (10) can be fitted in standard software (e.g., R using the function glmer in package lme4 or Stata using the function melogit) to estimate the fixed and random effects.
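The shrinkage behaviour can be illustrated with a simple precision-weighted average on the log-odds scale. This is a conceptual sketch only, not the glmer estimation itself: the weight w = τ²/(τ² + s²) and the delta-method approximation s² ≈ 1/(n p(1 − p)) for the sampling variance of the log-odds are assumptions of the illustration.

```python
import math

def shrunk_log_odds(p_unit, n, p_pop, tau2):
    """Illustrative Empirical-Bayes-style shrinkage: a weighted average of
    the unit and population log-odds, with weight w = tau2 / (tau2 + s2),
    where s2 ~ 1 / (n * p * (1 - p)) approximates the sampling variance."""
    logit = lambda p: math.log(p / (1 - p))
    s2 = 1 / (n * p_unit * (1 - p_unit))
    w = tau2 / (tau2 + s2)
    return w * logit(p_unit) + (1 - w) * logit(p_pop)

# A small unit (n = 100) is shrunk further towards the population value
# than a large unit (n = 2000) with the same observed proportion.
small_unit = shrunk_log_odds(p_unit=0.04, n=100, p_pop=0.02, tau2=0.1)
large_unit = shrunk_log_odds(p_unit=0.04, n=2000, p_pop=0.02, tau2=0.1)
```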
Under the null hypothesis, all units have random effects from the assumed distribution:

\({H}_0:\ {u}_i\sim N\left(0,{\sigma}_u^2\right)\)  (11)
Rejecting the null hypothesis at a given significance level suggests that the random effect for unit i is unlikely to be consistent with the random effects distribution under the null hypothesis, i.e., the unit is an outlier.
The test-statistic used for testing H_{0} is

\({Z}_i^{(3)}=\frac{{\hat{u}}_i}{{SE}^D\left({\hat{u}}_i\right)}\)  (12)
where \({\hat{u}}_i\) denotes the estimated random effect of unit i and \({SE}^D\left({\hat{u}}_i\right)\) the diagnostic standard error [6]. This leads to a one-sided p-value \({p_{val}}_i^{(3)}=1-\Phi \left({Z}_i^{(3)}\right)\). It is important to highlight that the hypothesis being tested is whether u_{i} is consistent with the assumed distribution in (11). Crucially, the diagnostic standard error does not represent the precision with which \({\hat{u}}_i\) is estimated.
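Given an estimated random effect and its diagnostic standard error (both hypothetical here), the outlier test reduces to:

```python
from statistics import NormalDist

def outlier_p_value(u_hat, se_diag):
    """One-sided p-value 1 - Phi(u_hat / SE^D(u_hat)) for testing whether
    a unit's random effect is consistent with the null distribution."""
    return 1 - NormalDist().cdf(u_hat / se_diag)

# Hypothetical unit with u_hat = 0.45 and diagnostic SE = 0.20:
p_val = outlier_p_value(0.45, 0.20)  # flagged at the 5% but not the 0.2% level
```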
Visualisation using a two-panel plot
We now describe an approach to present the results from the outlier process based on an individual-level logistic random effects model.
A two-panel plot is used to present key information about the observed and predicted risks in each unit (left panel) and the results of the outlier process based on the logistic random effects model (right panel). An example of a two-panel plot is given in Fig. 5.
Left panel
For the left panel, the units and their sizes are presented on the vertical axis. The observed and predicted risks are on the horizontal axis, denoted with the following signs:

- Dashed vertical line: the overall proportion of events across all units.
- Square: the proportion of events in each unit (e.g., in-hospital death after cardiac surgery).
- Cross: the average predicted risk per unit. A low predicted mortality relative to the overall mortality across all units, i.e., a ‘cross’ positioned to the left of the population average mortality line, indicates that the unit deals with lower-risk patients compared to the average.
Right panel
For the right panel, the units are also on the vertical axis, and the estimated random effects with intervals for outlier detection are on the horizontal axis, giving rise to a forest plot. This plot includes a vertical line at zero, the ‘target value’ for the random effects. The estimated random effect, \({\hat{u}}_i\), for each unit, and its 100 × (1 − α)% ‘interval for outlier detection’, \({\hat{u}}_i\pm {z}_{1-\frac{\alpha }{2}}\times {SE}^D\left({\hat{u}}_i\right)\), are added as a point and a horizontal bar, respectively. Intervals that do not include the target value of 0 suggest that the corresponding units are outliers at the significance level α.
It is important to note that the intervals for outlier detection based on diagnostic standard errors do not represent the precision with which the random effect is estimated, but the evidence that the given unit is an outlier. Hence, their width does not necessarily tend to decrease with increasing hospital size. For each unit, the usual 95 and 99.8% intervals for outlier detection are shown with black and grey solid horizontal bars, respectively. These specify whether a hospital is an outlier at the given significance level.
Results
Data
We illustrate the application of the methods described earlier and discuss their results using data from two cardiac audit datasets from England and Wales. The first dataset includes adult patients from 41 hospitals who underwent Cardiac Surgery (ACS). The second dataset includes patients who underwent Percutaneous Coronary Intervention (PCI) at 88 hospitals. Both datasets were obtained for procedures performed during the three-year period April 2015–March 2018 (97,173 procedures in total for the ACS data and 262,035 for the PCI data). The outcome of interest for both datasets was early mortality (in-hospital mortality for ACS, and 30-day mortality for PCI). The average mortality was 1.8% for ACS and 2.7% for PCI. The median number of procedures per hospital (interquartile range) for the ACS data was 2361 (1064) and for the PCI data was 2535 (2887). For each dataset, the aim is to compare hospitals with respect to mortality to identify outlying hospitals.
Riskadjustment models
For both datasets, suitable risk models were available for risk-adjustment. For the ACS data, a recalibrated EuroSCORE logistic risk model [3, 4] to predict the probability of in-hospital death was used (details about the risk factors and the model recalibration are provided in Appendix 1). For the PCI data, the British Cardiovascular Intervention Society (BCIS) logistic regression model [12] was used to obtain the predicted risk of 30-day mortality (details about the risk factors are provided in Appendix 1).
We assessed the quality of the models used for risk-adjustment using measures of calibration (calibration slope and calibration-in-the-large) and discrimination (C-statistic). A value of 0 for the calibration-in-the-large suggests that the average predicted probability is equal to the observed proportion of events. A value of 1 for the calibration slope suggests a perfectly calibrated model. The C-statistic takes values between 0.5 and 1, with higher values meaning higher ability to discriminate between high- and low-risk patients. The estimated model performance measures with 95% confidence intervals are provided in Table 1. The models were well calibrated. This is also confirmed by the calibration plots, which show the agreement between the observed proportion of deaths and the average predicted risk in groups defined by deciles of the predicted risks (Fig. 1 and Fig. S1 in Appendix 1 for the ACS and the PCI data, respectively). The model used for the PCI data had greater discrimination than that used for the ACS data.
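The C-statistic can be computed from predicted risks and observed outcomes as the proportion of concordant event/non-event pairs; this all-pairs sketch is quadratic in the sample size, so it is suitable only for small illustrations (the toy data are made up):

```python
from itertools import product

def c_statistic(risks, outcomes):
    """Concordance: probability that a randomly chosen event carries a higher
    predicted risk than a randomly chosen non-event (ties score 1/2)."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    pairs = list(product(events, non_events))
    score = sum(1.0 if e > ne else 0.5 if e == ne else 0.0 for e, ne in pairs)
    return score / len(pairs)

# Perfectly discriminating predictions give C = 1; uninformative ones give 0.5.
c_perfect = c_statistic([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
c_random = c_statistic([0.5, 0.5], [1, 0])
```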
Results: commonmean model
In the plots that follow, unless otherwise stated, it should be assumed that the common-mean model was corrected for overdispersion. Hospitals which are not outliers at either the 5% or 0.2% level are said to be ‘Normal’ (black colour). Outliers at the 5% level are said to be ‘Better than Expected’ (blue) if they perform better than normal and ‘Alerts’ (purple) if they perform worse than normal. Outliers at the 0.2% level are said to be ‘Substantially Better than Expected’ (green) if they perform better than normal and ‘Alarms’ (red) if they perform worse than normal.
Funnel plots based on the common-mean model without and with correction for overdispersion for the ACS data are shown in Fig. 2 and Fig. 3, respectively, where the risk-adjusted proportion of events is on the vertical axis. Figure 2 shows that without correction for overdispersion several units are identified as outliers. Figure 3 shows that after correction for overdispersion there was just one ‘Alert’ (hospital 2) and one hospital ‘Better than expected’ (hospital 6). The overdispersion parameter was estimated to be 4.39, indicating that overdispersion was indeed present. An analogous funnel plot for the PCI data (for which the overdispersion parameter was 4.45) is presented in Appendix 1 (Fig. S2).
Results: Normal-Poisson random effects model
A funnel plot based on the Normal-Poisson random effects model for unit-level data for the ACS data is shown in Fig. 4. The estimated between-hospital standard deviation after risk-adjustment was \(\hat{\tau}=0.38\) (\(\hat{\tau}=0.18\) for the PCI data). Figure 4 shows there was one hospital ‘Better than expected’ (hospital 16) and one ‘Substantially better than expected’ (hospital 6). An analogous plot for the PCI data is presented in Appendix 1 (Fig. S3).
Results: logistic random effects model
The between-hospital variability in the outcome after risk-adjustment was \({\hat{\sigma}}_u=0.28\) and ICC = 0.024 for the ACS data (\({\hat{\sigma}}_u=0.20\), ICC = 0.012 for the PCI data). These figures suggest that the degree of clustering was small in both datasets, partially due to the high quality of risk-adjustment.
The two-panel plot for the ACS data is shown in Fig. 5. It shows that there was just one hospital with a mortality rate ‘Substantially better than expected’ (hospital 6). For this hospital, the observed mortality was markedly lower than the predicted mortality (left panel). An analogous plot for the PCI data is presented in Appendix 1 (Fig. S4) and shows more outlying hospitals.
Comparison of the results from the three approaches
In the analysis of the two cardiac datasets, the results were similar between the two random effects approaches and the common-mean model (Table 2). Figure 6 shows the value of the Z test-statistic for each hospital for each pairwise combination of the methods above, showing very high correlation between the Z test-statistic values. This similarity was perhaps to be expected because the between-hospital variance was relatively small in both datasets (ICC < 0.03), resulting in only a mild violation of the common-mean model assumption that the hospitals share a single underlying true performance; consequently, the overdispersion correction for the common-mean model appears to have worked well.
As the variance of the random effects, \({\sigma}_u^2\) (and the between-hospital variance in the Normal-Poisson model, τ^{2}), increases, one would expect the correlation between the Z test-statistic values from the two random effects approaches to remain very high, and the correlation between the Z test-statistic values from either of the random effects models and the common-mean model to gradually decrease. This hypothesis was confirmed by artificially inducing higher between-hospital variance (σ_{u} = 0.76, ICC = 0.15) and generating new outcomes for the ACS data. The Z test-statistic values from the two random effects models were very highly correlated with each other, and slightly less correlated with the Z test-statistics from the common-mean model (Fig. S5 in Appendix 1).
Discussion
When comparing the performance of different units with respect to a performance indicator, e.g., risk-adjusted proportion of in-hospital deaths following cardiac surgery, it is often of interest to identify units whose performance deviates from the overall performance across units (outliers). The methods for identifying outliers rely on specifying an underlying assumed model that describes the performance of all units. Any units whose observed performance is found to be inconsistent with the underlying model are denoted outliers.
In this paper we have provided an overview of three of the main methods to identify outliers for binary outcomes: the common-mean model and the Normal-Poisson random effects model for unit-level data, and the logistic random effects model for individual-level data. The common-mean model is straightforward to apply, and its results are conveniently visualised via a funnel plot. It also seems to be commonly used in practice. However, it assumes that all units share the same underlying true performance, which may be incorrect, e.g., due to imperfect risk-adjustment. As a result, a correction for overdispersion is important, otherwise it will tend to identify too many outliers. In the two random effects models the common-mean assumption is relaxed, and the units’ underlying true performance is allowed to vary around the common mean. Therefore, the random effects approaches may be more appropriate in most scenarios.
Of the two random effects approaches, the Normal-Poisson random effects model uses aggregated unit-level data, effectively simplifying the data structure. The results can be visualised using a funnel plot. In contrast, the logistic random effects model is applied directly to individual-level data. This avoids a loss of information if one plans to apply risk-adjustment at the individual level. Therefore, in principle, the logistic random effects model may be considered more appropriate. It is straightforward to implement in standard software and it can readily accommodate risk-adjustment via a risk prediction model, as well as additional individual- and unit-level risk factors (e.g., whether a hospital is located in an urban or a rural area), by simply including them in the model as explanatory variables.
To identify outliers using the logistic random effects model we followed the approach of Skrondal et al. (2009) [6] for testing based on the diagnostic standard errors. This outlier process may be visualised using the two-panel plot proposed in this paper. Alternative ways of presenting the results from the logistic random effects model also exist. Possibilities include the use of odds ratios or different types of SMRs derived from the logistic random effects model [13]. For example, exponentiating the estimated random effect for a given unit provides the odds of the event for a given patient in that unit over the odds of the event had the patient belonged to the average unit [14].
One issue in the implementation of the random effects approaches is obtaining a value for the variance of the random effects. The variance is often estimated from the data, as it was in this paper. This estimate, however, may be unduly inflated by a few units with extreme performances, which would ultimately mask their detection as outliers. An alternative is to estimate the variance using a robust procedure that downweights extreme units, such as Winsorisation or cross-validation [15]. These approaches come with their own challenges, e.g., choosing a suitable proportion for Winsorisation. Another approach is to set a fixed value for the random effects variance, representing a degree of tolerable variation between units. As an external judgement, it may be specified based on historical data and published before the analysis. Alternatively, expert knowledge may be incorporated into the model via a suitable prior distribution for the between-unit variance (as well as other model parameters), leading to a fully Bayesian approach [16, 17].
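A minimal sketch of Winsorisation, assuming the simple recipe of shrinking z-scores beyond chosen percentiles back to those percentiles before averaging their squares; the simulated data and the 10% proportion are purely illustrative.

```python
import numpy as np

def winsorised_phi(z, prop=0.1):
    """Overdispersion factor estimated from z-scores after Winsorising
    the most extreme values, so that a few genuinely outlying units do
    not inflate the estimate (and thereby mask their own detection)."""
    lo, hi = np.quantile(z, [prop, 1 - prop])
    z_w = np.clip(z, lo, hi)  # shrink extreme z-scores to the percentiles
    return float(np.mean(z_w ** 2))

rng = np.random.default_rng(1)
z = rng.standard_normal(50)   # 50 well-behaved units...
z[:2] = [6.0, -5.0]           # ...plus two extreme ones

# The Winsorised estimate is smaller than the raw mean squared z-score,
# which the two extreme units inflate
print(winsorised_phi(z), float(np.mean(z ** 2)))
```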
In practice, when applying outlier detection methods, there is a chance of a false-positive result or type I error. Setting the significance level at a very low value decreases the risk of a type I error but also decreases the power to detect a truly outlying unit (a true-positive result). The choice of significance level depends on the implications of a false-positive result and the importance of identifying true outliers as such.
In our data illustration (cardiac clinical audits) we used two commonly used significance levels for the detection of outliers, 5% (alert) and 0.2% (alarm), corresponding to two different tolerances for a false-positive result when testing each unit's outlier status. The purpose of an alert is to advise that the standards of care may be drifting in the wrong direction; it is not a declaration of an immediate cause for concern, but rather a process of flagging the issue with the hospital or individual concerned. An alarm, on the other hand, means that the result is concerning and a review process might be activated.
Often the number of units being compared is large. For example, when comparing the performance of cardiac surgeons, the number of units may run to several hundred. In this scenario, a large number of surgeons might be identified as outliers due to chance alone, so it is advisable to apply a post-processing of the p-values to reduce this number. The Bonferroni correction, often used to correct for multiple testing, might be too conservative [18]: reducing the probability of a single false-positive test result when the number of units is large comes at the cost of reduced power to identify outliers. An alternative strategy is instead to control the False Discovery Rate (FDR) [19] to ensure that the majority of the rejected null hypotheses are correctly rejected.
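The Benjamini-Hochberg step-up procedure for controlling the FDR [19] can be implemented in a few lines; the p-values below are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: flag units while
    controlling the false discovery rate at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # find the largest k such that p_(k) <= k * q / m
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True  # reject the k smallest p-values
    return reject

# Hypothetical p-values from outlier tests on 10 units
p = [0.001, 0.004, 0.02, 0.03, 0.20, 0.35, 0.41, 0.60, 0.75, 0.90]
print(benjamini_hochberg(p))  # only the two smallest are flagged
```

In practice one could equivalently call `statsmodels.stats.multitest.multipletests(p, method="fdr_bh")`, which implements the same procedure.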
Conclusion
Random effects approaches should be preferred when the assumption of the simple common-mean model is unlikely to hold. The logistic random effects model can flexibly accommodate risk-adjustment based on a suitable existing risk model and/or additional risk factors, simply by including these factors in the model as explanatory variables. The two-panel plot presented in this paper can be used to visualise the results of the outlier detection process using the logistic random effects model.
Availability of data and materials
The data that support the findings of this study are available from NICOR but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from Menelaos Pavlou (m.pavlou@ucl.ac.uk) upon reasonable request and with permission of NICOR.
Abbreviations
BCIS: British Cardiovascular Intervention Society
ICC: Intra-cluster Correlation Coefficient
MLE: Maximum Likelihood Estimation
NICOR: National Institute for Cardiovascular Outcomes Research
PCI: Percutaneous Coronary Intervention
SMR: Standardised Mortality Ratio
References
Spiegelhalter DJ. Funnel plots for comparing institutional performance. Stat Med. 2005;24(8):1185–202.
Jones HE, Spiegelhalter DJ. The identification of “unusual” healthcare providers from a hierarchical model. Am Stat. 2011;65(3):154–63.
Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg. 1999;16(1):9–13.
Roques F, Michel P, Goldstone AR, Nashef SA. The logistic EuroSCORE. Eur Heart J. 2003;24(9):881–2.
Spiegelhalter DJ. Handling overdispersion of performance indicators. Qual Saf Health Care. 2005;14(5):347–51.
Skrondal A, Rabe-Hesketh S. Prediction in multilevel generalized linear models. J R Stat Soc Ser A Stat Soc. 2009;172(3):659–87.
Spiegelhalter D, Sherlaw-Johnson C, Bardsley M, Blunt I, Wood C, Grigg O. Statistical methods for healthcare regulation: rating, screening and surveillance. J R Stat Soc Ser A Stat Soc. 2012;175(1):1–47.
Breslow NE, Day NE. Statistical methods in cancer research. Volume II: The design and analysis of cohort studies. IARC Sci Publ. 1987;82:1–406.
Eldridge SM, Ukoumunne OC, Carlin JB. The intracluster correlation coefficient in cluster randomized trials: a review of definitions. Int Stat Rev. 2009;77(3):378–94.
Efron B, Morris C. Stein's paradox in statistics. Sci Am. 1977;236:119–27.
MacKenzie TA, Grunkemeier GL, Grunwald GK, O’Malley AJ, Bohn C, Wu Y, et al. A primer on using shrinkage to compare inhospital mortality between centers. Ann Thorac Surg. 2015;99(3):757–61.
McAllister KS, Ludman PF, Hulme W, de Belder MA, Stables R, Chowdhary S, et al. A contemporary risk model for predicting 30day mortality following percutaneous coronary intervention in England and Wales. Int J Cardiol. 2016;210:125–32.
Mohammed MA, Manktelow BN, Hofer TP. Comparison of four methods for deriving hospital standardised mortality ratios from a single hierarchical logistic regression model. Stat Methods Med Res. 2016;25(2):706–15.
DeLong ER, Peterson ED, DeLong DM, Muhlbaier LH, Hackett S, Mark DB. Comparing riskadjustment methods for provider profiling. Stat Med. 1997;16(23):2645–64.
Ohlssen DI, Sharples LD, Spiegelhalter DJ. A hierarchical modelling framework for identifying unusual performance in health care providers. J R Stat Soc Ser A Stat Soc. 2007;170(4):865–90.
Racz MJ, Sedransk J. Bayesian and frequentist methods for provider profiling using riskadjusted assessments of medical outcomes. J Am Stat Assoc. 2010;105(489):48–58.
Austin PC, Alter DA, Tu JV. The use of fixed and randomeffects models for classifying hospitals as mortality outliers: a Monte Carlo assessment. Med Decis Making. 2003;23(6):526–39.
Jones HE, Ohlssen DI, Spiegelhalter DJ. Use of the false discovery rate when comparing multiple health care providers. J Clin Epidemiol. 2008;61(3):232–40.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57(1):289–300.
Acknowledgements
Not applicable.
Funding
The work was supported by NICOR.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation and analysis were performed by Menelaos Pavlou. The first draft of the manuscript was written by Menelaos Pavlou, Gareth Ambler and Rumana Omar, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This is an observational study. All methods were carried out in accordance with relevant guidelines and regulations, in line with the Declaration of Helsinki. This work was performed as part of the methodological review of the National Cardiac Audit Programme (NCAP), commissioned by the Healthcare Quality Improvement Programme (HQIP), and overseen by the National Institute for Cardiovascular Outcomes Research (NICOR). The review and application of the methodologies was agreed as part of the commission. NICOR conforms to legislation within the Data Protection Act for the collection and use of patientidentifiable data and has approval to hold patient identifiable information without patient consent under section 251 of the NHS Act 2006. The applications for approval were reviewed by the NIGB Ethics and Confidentiality Committee.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12913_2022_8995_MOESM1_ESM.docx
Additional file 1.
12913_2022_8995_MOESM2_ESM.txt
Additional file 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Pavlou, M., Ambler, G., Omar, R.Z. et al. Outlier identification and monitoring of institutional or clinician performance: an overview of statistical methods and application to national audit data. BMC Health Serv Res 23, 23 (2023). https://doi.org/10.1186/s12913-022-08995-z
DOI: https://doi.org/10.1186/s12913-022-08995-z
Keywords
 Outlier detection
 Funnel plot
 Random effects model
 Overdispersion