Separating variability in healthcare practice patterns from random error

Improving the quality of care that patients receive is a major focus of clinical research, particularly in the setting of cardiovascular hospitalization. Quality improvement studies seek to estimate and visualize the degree of variability in dichotomous treatment patterns and outcomes across different providers, whereby naive techniques either over-estimate or under-estimate the actual degree of variation. Various statistical methods have been proposed for similar applications including (1) the Gaussian hierarchical model, (2) the semi-parametric Bayesian hierarchical model with a Dirichlet process prior and (3) the non-parametric empirical Bayes approach of smoothing by roughening. Alternatively, we propose that a recently developed method for density estimation in the presence of measurement error, moment-adjusted imputation, can be adapted for this problem. The methods are compared by an extensive simulation study. In the present context, we find that the Bayesian methods are sensitive to the choice of prior and tuning parameters, whereas moment-adjusted imputation performs well with modest sample size requirements. The alternative approaches are applied to identify disparities in the receipt of early physician follow-up after myocardial infarction across 225 hospitals in the CRUSADE registry.


Introduction
Clinical studies frequently seek to improve the quality of care provided to patients by identifying discrepancies between providers. Provider units may be individual physicians, clinics, or hospitals. The first step is to establish that systematic variation exists and is of sufficient magnitude to justify a detailed investigation. Wide variation in outcomes supports the hypothesis that institutional factors play a role in affecting outcomes. Variation in treatment patterns suggests that the quality of care may be improved by identifying the practices of best performing institutions and adopting them widely. If the discrepancies between providers are small, improvement in quality will most likely come from new developments in care, rather than existing models. Consequently, recent publications emphasize conclusions about the magnitude in variation across providers: ''the degree of variability in clinical practice we observed represents a potential quality improvement opportunity''; 1 ''in-hospital major bleeding rates varied widely across hospitals''; 2 and ''ICU admission rates for heart failure (HF) varied markedly across hospitals''. 3 These data arise hierarchically, with a sample of providers, and within them, a sample of patients who experience a binary treatment or outcome. The goal is to quantify variation at the provider level in the proportion of patients who experience the outcome. Such data are commonly modeled by hierarchical logistic regression with a Gaussian random effect for provider whereby the provider-level variance can be estimated. This variance, measured on the log-odds scale, has limited the clinical interpretability, and the investigators prefer to visualize the variation in provider probabilities on an absolute scale. Therefore, numerous examples in the cardiovascular literature include a histogram of provider-specific sample proportions. [3][4][5] For example, Hess et al. 5 sought to identify disparities in the receipt of early follow-up (within seven days) with a physician after myocardial infarction. Across 225 hospitals enrolled in the CRUSADE registry, the hospital-specific sample

Methods
For clarity of exposition, suppose we are interested in studying treatment variation across hospitals. Other binomial outcomes of clusters could easily be substituted. Let i ¼ 1, 2, . . . , N represent the ith hospital and j ¼ 1, 2, . . . , n i represent the jth patient from the ith hospital, where N is the number of hospitals and n i is the number of patients at the ith hospital. Denote Y ij as a binary indicator for whether the jth patient from the ith hospital receives treatment, taking a value of 1 if that patient received treatment, and 0 otherwise. The sum of Y ij over patients within the same hospital is denoted Y i . Y i may be viewed as independent, binomial counts with a hospital-specific probability of receiving treatment, p i , and take observed values y i . With N hospitals, there are p 1 , . . . , p N , potentially unique probabilities of treatment. The data can be described by a two-stage sampling model: The hospital-specific parameter p i comes from a population distribution, F p , that depends on parameters g. Let f p be the probability density function (PDF) associated with F p . We are interested in estimating f p and its variance 2 As noted by Paddock and Louis, 19 there are applications where the observed N hospitals constitute the full population of interest. In that case, one is interested in the finite probabilities p 1 , . . . , p N , with variance 2 If p 1 , . . . , p N were observed, one might calculate 2 N and F N as consistent estimators for 2 p and F p . Here, we focus on the former, population-level inference, and address the problem that p 1 , . . . , p N are not observed. In accordance with the CRUSADE early follow-up example, our conclusions about variability are meant to apply to hospitals from a broader population, not limited to those participating in the registry.

Raw proportions
As described above, a standard approach to this problem is to obtain sample proportions for each hospital aŝ The EDF is defined as FpðxÞ ¼ N À1 P N i¼1 I ðp i xÞ , and the distribution ofp 1 , . . . ,p N is converted to a smooth density using a kernel density estimate (KDE) with the default bandwidth in the R density() function, denoted fp.

Gaussian hierarchical model
The generic two-stage model for Y i is provided in Equation (1). Here, we focus on the Gaussian, logistic model by adding parametric assumptions. The model assumes: where logit(x) ¼ logfx=ð1 À xÞg. A common goal is to estimate 0 and 2 b , while acknowledging b i as a source of variation; b i is integrated out of the likelihood, and not directly estimated as part of the model fit. When there is interest in estimating or predicting p i (as a function of b i ), a variety of approaches can be implemented in a second phase of estimation. 20,21 An empirical Bayes (EB) approach is widely utilized for this purpose. 8,20 We denote the EB predictions asp b i . The variance ofp b 1 . . .p b N , 2 p b is over-shrunk compared to 2 p . Although this fact is well known, EB predictions are nearly synonymous with hierarchical models in cardiovascular research. To emphasize the difference between EB prediction and the following approach, 2 p b is included in our simulation study of provider-level variance. For the current application, we are not interested in the individual p i , but in describing their distribution. Given that logit(p) is assumed to follow a normal distribution, Normalð 0 , 2 b Þ, we can estimate the parameters of this distribution, plug them in, and parametrically estimate 2 p and f p . Implementation is as follows. Using standard statistical packages, fit the hierarchical model to obtain estimates of 0 and 2 b . Substitute estimates for the unknown parameters and generate q ¼ logit(p) according to the normal distribution, Normal( b 0 , 2 b ). This can be done either by simulation or by applying the normal quantile function to a sequence of uniform probabilities between 0 and 1. Then, convert to the probability scale by taking expðqÞ=f1 þ expðqÞg ¼p. By taking a large number of data points q, and thereforep, the density of these data will be arbitrarily close to the density ofp, fp, which is regarded as an estimator of f p . The variance ofp, 2 p , is an estimator of 2 p . Alternative methods could be used to derive fp, such as the transformation theorem applied to the function expðqÞ=f1 þ expðqÞg. This would also have to be obtained numerically.
Clearly, fp is derived under the assumption that q has a normal distribution. In many cases, this is a reasonable assumption. As with many natural phenomenon, it is plausible that hospitals would have symmetric variation about a common mean on the log-odds scale. The Gaussian hierarchical model (GHM) approach, just described, illustrates the result on a clinically meaningful scale. If normality is not plausible, estimating provider variation may still be insensitive to such mis-specification, as previous authors have observed robust results, even when the random effects distribution is mis-specified. 22 The robustness of the variance parameter to violations of the Gaussian assumption is explored below.

Semi-parametric Bayesian density estimation
Bayesian methods offer flexibility in the specification of hierarchical models and density estimation. Rather than specifying a parametric prior distribution for cluster-specific probabilities, the Dirichlet process (DP) prior has been widely used for parametric mixture modeling. 11,12,14,15,17,23 Numerous options and software are reviewed by Jara et al. 23 The DPs mixture can approximate many smooth distributions by estimating the number of mixture components according to the data. One such model, based on the methods of Escobar and West 14 and Liu, 17 is implemented by the DPbetabinom function in R DPpackage. 23 The function fits a semi-parametric version of the beta-binomial model For this study, we assume F 0 $ Uniformð0, 1Þ, i.e., a 1 ¼ 1, b 1 ¼ 1. The baseline distribution, F 0 , is conjugate in this model which allows Markov chain Monte Carlo with Gibbs sampling for estimating the posterior density f pjY . The posterior variance is defined as 2 The posterior density f pjY and 2 pjY are estimators of f p , and 2 p , respectively. In preliminary studies, we evaluated potential values for the hyper-parameters a 0 and b 0 or the prior parameter (removing the hyper-prior). Plausible values for range from 1= log N to N= log N (0.1 to 80 in the simulation settings below). 11 When is near 0, the model places high probability on p i being all the same, and the posterior density tends to be highly peaked around finite points. As increases, this leads to a higher probability of more unique values of p i , and the posterior density tends to be more smooth. Excessively large will treat each provider probability as if they are totally unique and tend toward a fixed effects model. The gamma hyperprior is employed to increase robustness by allowing to be better adapted to the data. We considered 2 f0:1, 1, 5, 10, 20, 50g and multiple sets of hyper-parameters for the gamma prior on . None performed better than those of Paddock et al., 15 a 0 ¼ 4 and b 0 ¼ 4 (subsequently denoted DP-1) and a 0 ¼ 10 and b 0 ¼ 0.10 (subsequently denoted DP-2). We therefore focus on this specification in the subsequent evaluation. Given a sample of 100 hospitals, 95% of the prior mass on the number of clusters in the DP mixture for these two pairs of hyper-parameters falls on 10 or fewer clusters (DP-1) and between 53 and 83 clusters (DP-2). Results for a fixed ¼ 20 are also displayed as DP-3.

Smoothing by roughening
Rather than specifying a parametric prior distribution for the provider-specific parameters, Laird 24,25 developed a non-parametric maximum likelihood (NPML) estimator of the prior, F p . Despite numerous advantages, this approach yields discrete priors with too narrow support and under-dispersion, 13 motivating the adaptation of a non-parametric EB alternative called ''smoothing by roughening'' (SBR). 13,26 The process starts with a smooth prior distribution, that is not necessarily correct, and uses the EM algorithm to update it in the direction of the NPML. Fewer iterations results in a more smooth distribution, while increasing iterations are successively closer to the NPML. To speed up computation, the initial smooth density is discretized at a large number of mass points The distribution of a is updated over ¼ 1, . . . , N steps, as follows: This can be expanded further as The initial density f 0 is taken to be Uniform(0, 1). After N iterations, the algorithm stops, and f N is the SBR estimator of f p . The posterior variance, defined as 2 is the SBR estimator of 2 p . When iterated to convergence, the SBR algorithm produces the NPML estimate; however, the goal of SBR is to produce a smooth distribution by the selection of finite N . The appropriate choice of N is unclear, though Shen and Louis 13 provide some guidance. They suggest that ''There is no need to be too strict on the exact value of N , as long as for large N, N is of order log N, . . . For small or moderate samples, we recommend N=3 N 2N.'' In the sample sizes considered below, log N would be about 6, whereas N=3 would range from 30 to 150, and 2N would range from 200 to 1500. Several values of N are compared in online Supplemental Appendix E. We selected N ¼ 50 as it yielded superior results in a variety of cases, but no value of N was observed to be universally preferable.

Moment-adjusted imputation
By viewing this as a measurement error problem, additional options become available. Thomas et al. 18 introduced moment-adjusted imputation (MAI) which replaces mis-measured data, W, with estimators that have asymptotically the same distribution as a latent variable of interest, X, up to some finite number of moments. They show that MAI is related to other measurement error methods but has superior performance in many settings. MAI is applicable whenever the error in W is due to added noise; W i ¼ X i þ U i , where U i is normally distributed random error with mean 0 and variance 2 ui . This type of error arises as a result of device error or biological fluctuations but also by random sampling error in parameter estimates. To see that MAI applies to the present application we can view p as the unobserved latent variable (X) andp as a mis-measured version (W).p i can be written as p i þ U i where U i is a random error with mean 0 and variance 2 ui ¼ p i ð1 À p i Þ=n i . For hospitals with a reasonably large number of patients, the error inp i is approximately normal. When the number of patients per site is small, this approximation may not hold. The performance across a range of finite site sizes is evaluated by the simulation given below.
The details of MAI have been described previously, with application to a broad variety of problems. 18 Here, we review a special case of MAI that is applicable to the current focus on cluster-specific proportions. For the clarity of exposition, we use general MAI notation in this section and refer to X and W. The objective is to constructadjusted versions of the W i , sayX i , where the first M sample moments ofX i unbiasedly estimate the corresponding moments of X i ; that is, The distribution ofX i approximates that of X i up to M moments. In particular, the variance ofX i will be unbiased for the variance of X i . Unbiased estimates,m r , for the moments, EðX r Þ, are derived according to the assumption of normal, additive error, and variance 2 ui . The adjustedX i is obtained by minimizing P N i¼1 ðW i À X i Þ 2 subject to constraints on the moments. Using Lagrange multipliers ð 1 , . . . , , M Þ ¼ Ã, the objective function is The adjusted data are obtained by taking the derivative of Q M with respect to ðX 1 , . . . , X N , ÃÞ, equating this to 0, and solving for ðX 1 , . . . ,X N ,ÃÞ by the Newton-Raphson method. This process has been integrated into an R function (online Supplemental Appendix A). Once the adjusted dataX i are obtained, they can be ''imputed'' in place of X i . In particular, the data pointsX i can be displayed in a histogram or kernel density plot and will resemble the distribution of X i up to M moments. At a minimum, the mean and variance ofX i will be unbiased for For the current application, MAI can be applied directly to the ''mis-measured''p i , which takes the place of W i , and the measurement error variance estimated by 2 ui ¼p i ð1 Àp i Þ=n i . Adjusted values,p MAI , will have M moments that are consistent for the corresponding moments of F p , as both n i and N get large. The sample variance, 2 p MAI , EDF, Fp MAI , and density fp MAI are calculated as forp, but substituting adjusted valuesp MAI .
Thomas et al. 18 provide an extensive simulation study of MAI, with application to kernel density estimation. They recommend matching an even number of moments (two or four). When p has a normal distribution, it is adequate to match two moments, when p has a Chi-square or bi-modal distribution, it is necessary to match four moments in order to capture unique features. The tradeoff between matching two versus four moments is a biasvariance tradeoff. Higher order moments are estimated with less precision, and MAI uses estimated moments. Matching higher moments reduces bias but adds variation. Thomas et al. 18 show that matching four moments strikes a good balance in many reasonable scenarios. As the advantage of MAI is the flexible handling of nonnormal distributions, a general recommendation is to match four moments.

Small sites
From a subject matter standpoint, excluding hospitals with very few eligible patients is clinically reasonable; researchers do not expect to characterize a hospital based on a handful of patients. The ad hoc convention is to exclude n i < 25. Although this exclusion is insufficient, it can be relaxed when the methods considered here are employed. However, there is a limit to the feasibility of recovering f p when hospital sizes become small. Effectively, the ratio of noise to signal becomes very large; a scenario in which f p cannot be identified without strong parametric assumptions. 27 Thus, we do not expect good performance of non-parametric or semi-parametric methods for the estimation of f p with extremely small hospital sizes. Preliminary simulations indicate that a minimum n i 4 10 and median n i > 20 is sufficient to avoid numerical problems and poor performance. We focus our attention on applications such as CRUSADE where it is reasonable to require n i > 10, and the typical cluster size is at least 20. Even so, we consider cluster sizes that are smaller than have been previously evaluated (to the best of our knowledge). A recent tutorial on Bayesian methods for a similar application (hospital comparisons) excluded sites with less than 20 patients (median n i ¼ 295). 12 Another example focused on even larger clusters (cities). 13

Simulation studies
In this section, we compare the preceding methods by simulation across a range of conditions, including (1) number of providers N: 100, 300, and 500; (2) number of patients per provider n i : 20, median % 25, 30, median % 65; (3) distribution of logit ð p i Þ (i.e., b i in Equation (2)): Normalð0, 2 b Þ; 2 df¼1 , standardized to have mean 0 and variance 2 b ; and a bi-modal mixture of normals; and (4) variation across providers 2 b : 0.05 (small) and 0.50 (large). Two settings of n i involve heterogeneous patients per site, taking values similar to those observed in two databases. Histograms of n i are available in online Supplemental Appendix B. The first represents a distribution of n i from the CRUSADE early follow-up study which, as a large registry, tends to have larger numbers of patients per site (median % 65, IQR: 38-126). The second replicates the distribution of site sizes from a recent clinical trial, more heavily weighted toward a small number of patients (median % 25, IQR: 17-43). In both, we have excluded sites with n i 5 10 (less than 15% of hospitals and less than 1% of patients). In each simulated data set, a sample of N provider probabilities, p 1 , . . . , p N , is generated according to the distributions described above. At each hospital, n i patient responses are generated according to a Bernoulli distribution with probability p i .

Illustration
In Figures 2 and 3, the simulation settings are demonstrated along with select results for single simulated data sets. For this illustrative example, we fix the number of providers at 300 and the number of patients per provider at 20. Densities fp and fp MAI corresponding to the GHM and MAI are shown in Figure 2, with comparison to f p and fp. The densities based on SBR and DP-2 are displayed separately, in Figure 3, although applied to the same data set and compared to the same f p and fp. Each density is displayed for both small and large variation across providers. These figures emphasize four things: (1) that our simulation conditions cover a wide range of provider distributions; (2) variation across providers is severely over-estimated by raw proportions when true variation is small, but not when the true variation is large; (3) the adjustment methods can achieve clinically meaningful improvement; and (4) SBR and DP-2 appear to create false modes in some cases. These examples orient us to the simulation settings and illustrate the potential advantages and disadvantages of alternative approaches. A full Monte Carlo (MC) simulation is conducted in order to establish performance over repeated samples and different conditions.

Monte Carlo simulation
In B ¼ 1000 simulated data sets, a sample of N providers is drawn, with probabilities of outcome generated according to the distributions for logit(p) described above. For each data set and method, we obtain estimates of 2 p , F p , and f p . Success in estimating the provider-level variance is measured by bias and mean squared error (MSE). Bias is quantified by comparing the MC average of provider-level variance estimates, 1 The Euclidean distance (ED) is calculated between cumulative distribution function estimates and F p (ED-CDF), as well as between PDF estimates and f p (ED-PDF). In a single data set ED-CDF ¼ ffi p R fFðtÞ À F p ðtÞg 2 dt and ED-PDF ¼ ffi   Figure 4 shows the MC average provider-level variance estimates plotted as a ratio, relative to the true variance, on the log scale. The ratio is taken so that different underlying distributions and magnitudes of variance can be plotted on a common scale. Ratios less than 1 reflect estimators that under-estimate the variance, whereas ratios greater than 1 reflect estimators that over-estimate the variance. The log scale is used so that a 50% under-estimate is the same distance from 1 as a 200% over-estimate. Figure 4 shows that the naive method, based on raw proportions, substantially over-estimates the variance. As expected, the Gaussian EB predictions are just as bad, in the opposite direction. The candidate adjustment methods, GHM, MAI, SBR, and DP-1, are almost perfectly unbiased. Of note, the GHM variance estimator is unbiased even when the underlying distribution is chisquare or bi-modal and the Gaussian assumption is incorrect. However, the Bayesian semi-parametric method is sensitive to the specification of prior parameters; priors DP-2 and DP-3 slightly over-estimated the cluster-specific variance, particularly when the underlying variation is small. Results for the MSE track closely with bias and are reported in online Supplemental Appendix C. Adjustment methods reduced the MSE, relative to raw proportions, by 60% to 99%.
The ED provides a global measure of closeness between the estimated distributions and their target. Figure 5 shows the average over simulated data sets of ED-CDF for each adjustment method, plotted as a percent reduction relative to the average ED-CDF for raw proportions. The goal is to quantify improvement in estimating F p compared to taking a naive approach. MAI is the only adjustment method that reduces the ED-CDF in all scenarios. As expected, GHM does best when the underlying distribution function is Gaussian, but poorly otherwise. SBR performs similar to MAI, but in a few cases increases the ED-CDF. Investigation of individual data sets (online Supplemental Appendix D) reveals that this can be attributed to SBR identifying too many peaks or multiple modes when the target distribution is smooth. Among the DP methods, DP-2 and DP-3 perform better than DP-1. This makes sense considering that DP-2 and DP-3 favor a smooth distribution, whereas DP-1 identifies too many peaks and modes rather than a smooth distribution (online Supplemental Appendix D). The many peaks have minimal impact on the variance estimate but matter with respect to estimating F p .
The ED-CDF metric does not tell the full story. Provider variation is frequently illustrated by plotting a density estimate, which may also be of interest. Even if the primary goal focuses on variation, a density estimate with multiple modes or unique skew is likely to attract attention. Therefore, accuracy for density estimation is important. Figure 6 shows the average over simulated data sets of ED-PDF, plotted as a percent reduction relative to the average ED-PDF for raw proportions. On this metric, improving density estimation relative to raw proportions is not guaranteed by any adjustment method. As expected, the GHM provides large improvements only if the underlying distribution is Gaussian. Otherwise it is generally worse. MAI improves the ED-PDF as much as 25% to 75% when the underlying distribution is normal or bi-modal, though very little when the underlying distribution is Chi-square. In online Supplemental Appendix D, MAI  produces a rare outlier in the case of small n i and small, chi-square provider variation. The outlier was easily recognizable for having mass on a few discrete points and may have skewed the average performance in this scenario. This never occurred with larger n i (! 30), regardless of the underlying distribution. SBR and DP-1 increased the ED-PDF by as much as 500% in many cases. The problem was most severe in large variation scenarios, worsened with increasing cluster size n i , and did not improve with increasing N. Investigation of individual data sets (online Supplemental Appendix D) reveals density estimates with 10 to 20 modes, when the underlying distribution had only one or two. DP-2 and DP-3 favor a more smooth distribution and yet the same problem was apparent.

Analysis of early physician follow-up
To demonstrate these methods, we revisit the example of early physician follow-up described in Section 1. Using the raw proportions, calculated within each provider, Hess et al. 5 saw wide variation in early follow-up with a physician after MI (Figure 7(a)). In Figure 1, we re-analyze the CRUSADE data and directly estimate the density of the provider-specific proportion of patients receiving early follow-up using a GHM, MAI, SBR, and DP methods. In Figure 7(c), SBR is displayed for N ¼ 20 and 50, in order to visualize the sensitivity to the choice of iterations. Each of the DP priors is included in Figure 7(d).
Comparing Figure 7(a) with 7(b) to (d), the standard deviation is smaller when chance variations are removed. The adjusted distributions in Figure 7(b) to (d), have nearly identical spread. Compared to GHM, MAI preserves some unique features of the raw distribution, such as slight bi-modality near the peak, whereas SBR and DP methods create clusters of provider probabilities. Our simulation studies suggest that those clusters may be false and occur even if the underlying distribution of provider probabilities is smooth. Figure 7(b) to (d) conveys the same essential information that the wide variation in the use of early follow-up observed by Hess et al. 5 was largely reflective of real variation.

Discussion
When the goal of an analysis is to estimate the magnitude of systematic differences among providers, the random variation associated with observed rates muddies the picture. We illustrate the use of existing methods for this purpose and assess their merits under settings that are representative of hospital variation studies, including smaller cluster sizes than have previously been evaluated. Further, we demonstrate a novel application of MAI.
Our results emphasize that each has limitations. The choice of methods should be guided by the priorities of a given study and the available data. When the goal is to quantify cluster-level variation and flexible density estimation is not a priority, the GHM performs well; its variance is unbiased regardless of whether the underlying distribution is Gaussian. This result does not depend on the number of hospitals, N, or hospital sizes, n i . The GHM is particularly useful if the number of providers is small (N 100) and/or the number of patients per provider is small (median n i 25), but the parametric assumption imposed by the model must be acknowledged. If flexible density estimation is desired, larger site sizes are needed. Otherwise all of the methods considered here have potential problems. With median n i 25, MAI may produce densities with mass on a few discrete points; while exceedingly rare, this cannot be ruled out without the knowledge of the underlying density. With hospital sizes larger than 30, MAI performs well, whereas SBR and DP methods estimate incorrect multi-modal densities. The gains in density estimation from MAI are modest, with a reduction in ED-PDF of 0 to 75%. However, this compliments a large improvement in variance estimation (60%-95% reduction in MSE of the variance).
Although DP-1, DP-2, and DP-3 reduce bias in the estimation of provider-level variability, the resulting densities are not accurate. Given that our motivating applications seek to illustrate variability via a density estimate, this method could be misleading. The current analysis is based on a mixture of DPs that has been proposed for the purpose of flexible density estimation and incorporated into software that is apparently straightforward to use. 14,17,23 Our specification of hyper-prior parameters is consistent with the previous literature, 15 and DP-1 is similar to fixing ¼ 1, which is default in the documentation of DPbetabinom(). However, the problems we observe are not easily resolved by modifying the hyper-prior parameters. We explored a large number of hyper-prior parameters prior to focusing on DP-1, DP-2, and DP-3. Alternatives were not superior. Therefore, our results are likely representative of a typical application of this software. This does not implicate all Bayesian methods; a Bayesian analysis with Gaussian prior would be nearly identical to the GHM. However, the Bayesian semi-parametric model based on a DP prior was frequently worse than naive methods at estimating the provider-level density. This method does not appear suitable to hospital variation studies that are similar to the settings we evaluate.
SBR exhibits some of the same problems as DP methods. Even with a large number of hospitals and many patients per hospital, the SBR density estimate is often worse than the naive method, with false multi-modality. Shen and Louis 13 noted that the SBR estimate of F p behaves similarly to the one based on the DP hyper-prior, depending on the number of iterations N . SBR converges to the nonparametric maximum likelihood estimator (NPML) as the number of iterations increases. The NMPL has been criticized for being discrete, and frequently under-dispersed. Therefore it is not surprising that SBR would exhibit similar features at some number of iterations. Using fewer iterations of SBR is an option. Smaller N will create more smoothing and avoid multimodality. However, the tradeoff, at least with an uninformative prior, is that the results remain over-dispersed. This phenomenon can be seen in Supplemental Appendix E.
These problems observed with DP and SBR methods were not observed in previous studies that focused on continuous outcomes, larger cluster sizes and estimation of CDFs. In these cases, where there is more information in the data, results would not be as sensitive to choice of prior parameters or iterations. In addition, focusing on the CDF ( Figure 5) does not emphasize a problem because the ED-CDF ( Figure 5) is insensitive to false modes, since it is primarily driven by differences in center and spread. One advantage to SBR is that it is easy to investigate the sensitivity of results to the choice of N . By saving the output at each step, one can plot the results across a range of N . If the results were insensitive that might provide an additional level of confidence. In our settings, the results were very sensitive. Therefore, there is a high risk of inaccurate results or the potential to pick and chose among them.
Compared to the DP and SBR methods, MAI does not involve decisions regarding hyper-prior or tuning parameters. MAI does not directly target density estimation but ''adjusts'' the observed data to have unbiased moments up to a fixed number. However, the choice of how many moments to match is analogous to a tuning parameter. We implemented MAI, as proposed by Thomas et al., 18 by fixing the number of unbiased moments at four (mean, variance, skewness, and kurtosis). If the number of moments matched were increased, the results would vary. For our purpose, MAI may be preferable because the decision regarding moments is easily interpreted and related to the goals. In order to visualize variability across providers, without any strong parametric assumptions, it may be adequate to have unbiased mean, variance, skewness, and kurtosis. Indeed, this approach performed well across a range of scenarios.
There are a variety of opportunities for extensions and future research. For clarity, we have focused on the example of variation in early follow-up across hospitals, but the same issues and methods apply to any binary outcome and across different types of provider-based clusters (site, region, and physician). Occasionally, the goal is to describe variation across providers, after adjusting for patient characteristics. It is relatively simple to adapt MAI to allow for case-mix adjusted rates, as MAI starts with a set of noisy estimates (potentially adjusted) and removes the noise. However, there are a lot of ways in which adjustment can be defined in terms of the target of inference. This is more complicated because the variation in proportions depends on the mean; different methods of adjustment that alter the center of the distribution may target different variance parameters. In addition, one might consider whether fixed provider-level information could be used to augment the adjustment process. Provider volume, for example, is often of interest. MAI was developed to target the joint distribution of multiple variables including mis-measured and error-free variables. Moments and cross-products can be targeted to achieve this end. MAI could be adapted to remove measurement error in a volume-specific way; preserving the joint distribution between site-specific performance and volume. These extensions are an important avenue for future research.
Quality improvement studies involve many objectives that are not considered here and excellent resources are available. [6][7][8] Different objectives within the same study may require different methods. Here, we provide methods that may complement other analyses. Whenever the magnitude of variability is a key finding, it should be estimated and illustrated unbiasedly.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded, in part, by the Patient-Centered Outcomes Research Institute (PCORI) grant #130, PI: Prvu Bettger, an independent non-profit research organization created to help people make informed healthcare decisions and improve healthcare delivery, and by grant R01-HL118336 from the National Heart, Lung, and Blood Institute of the National Institutes of Health (NIH). The content is solely the view of the authors and does not represent the view of the NIH or PCORI.

Supplemental material
Supplemental material is available online for this article.