Comparison of imputation variance estimators

Appropriate imputation inference requires both an unbiased imputation estimator and an unbiased variance estimator. The commonly used variance estimator, proposed by Rubin, can be biased when the imputation and analysis models are misspecified and/or incompatible. Robins and Wang proposed an alternative approach, which allows for such misspecification and incompatibility, but it is considerably more complex. It is unknown whether in practice Robins and Wang’s multiple imputation procedure is an improvement over Rubin’s multiple imputation. We conducted a critical review of these two multiple imputation approaches, a re-sampling method called full mechanism bootstrapping and our modified Rubin’s multiple imputation procedure via simulations and an application to data. We explored four common scenarios of misspecification and incompatibility. In general, for a moderate sample size (n = 1000), Robins and Wang’s multiple imputation produced the narrowest confidence intervals, with acceptable coverage. For a small sample size (n = 100) Rubin’s multiple imputation, overall, outperformed the other methods. Full mechanism bootstrapping was inefficient relative to the other methods and required modelling of the missing data mechanism under the missing at random assumption. Our proposed modification showed an improvement over Rubin’s multiple imputation in the presence of misspecification. Overall, Rubin’s multiple imputation variance estimator can fail in the presence of incompatibility and/or misspecification. For unavoidable incompatibility and/or misspecification, Robins and Wang’s multiple imputation could provide more robust inferences.


Introduction
Multiple imputation (MI) 1 is the most commonly used technique for analysing incomplete data, which are frequently encountered in health-related research. Imputations are repeatedly and independently drawn from the posterior predictive distribution of the missing data given the observed data, under a Bayesian model, to generate mð! 2Þ imputed datasets. These imputed datasets are then analysed separately using complete data methods, and the m sets of results are combined using a simple set of rules. Appropriate imputation inference requires both an unbiased imputation estimator and an unbiased variance estimator.
MI was first developed for analysing large incomplete public shared datasets, for which there may be many analysts, with a wide range of scientific questions, who may not have the specialized knowledge required to generate statistically valid inferences when data are incomplete. It was envisioned that the task of generating multiple imputed datasets would be undertaken by someone with specialized knowledge of missing data methods. 2 The analysts need only apply standard complete data statistical procedures to each imputed dataset separately and then combine the multiple estimates according to Rubin's rules. 1 At the time of imputation, it may not be possible to anticipate all analysis procedures that will be applied to the imputed datasets. Consequently, the assumptions of a given analysis procedure may differ to those of the imputation model. We define an analysis procedure to be incompatible with an imputation model when one or more assumptions of the imputation model contradict with those made by the analysis procedure. For example, the imputer may assume that two subgroups have the same population mean of an incomplete variable (where missingness is independent of subgroup status) whilst the analyst estimates the population mean in one subgroup only.
As with any modelling or statistical procedure, the imputation model may be misspecified; that is, the imputation model's assumptions about the missing data mechanism (MDM) or the complete data are incorrect. For example, the imputation model could incorrectly assume homoscedastic errors. Similarly, for the analysis procedure.
The MI literature has focused primarily on generating methods and guidelines for reducing the bias of the imputation estimator. However, correct imputation inference also requires an unbiased imputation variance estimator and an efficient interval estimator with at least nominal coverage. In certain settings of misspecification and incompatibility of the models, for an unbiased imputation estimator, Rubin's variance estimator was either upwardly or downwardly biased, which led to conservative or anti-conservative confidence intervals, respectively. [2][3][4][5][6][7] There are alternatives to Rubin's MI. Robins and Wang have proposed imputing under a frequentist procedure, 5 that fixes the imputation model parameters at the observed data maximum likelihood estimate, and an imputation variance estimator that allows for misspecification and incompatibility of the imputation model and the analysis procedure, and for non-or semi-parametric analysis procedures. However, it is unclear whether the Robins and Wang MI procedure is an improvement over Rubin's MI in situations typical of the use of imputation in practice. Also, the Robins and Wang imputation variance estimator is considerably more complex to compute than Rubin's and importantly is more technically difficult for the analyst, although the greater burden of calculating the variance estimator is still placed on the imputer. With ever faster computers, a simpler solution could be to calculate the variance of the imputation estimator using re-sampling methods. Full mechanism bootstrapping (FMB) is a bootstrapping approach to imputation that can be applied to parametric problems. 8,9 In this paper, we have conducted the first comparative evaluation of Rubin's MI, our modified version of Rubin's MI, Robins and Wang's MI and FMB with respect to variance estimation and interval estimation. We have compared these four methods of imputation inference in four scenarios of incompatibility and misspecification of the imputation and analysis models that can occur in practice, via a data example (see Sections 2 and 4) and simulations (see Sections 5 and 6). In Section 2, we describe a motivating example for seeking an alternative to Rubin's MI. Section 3 describes all four methods considered in the context of a specific example. Section 4 revisits the motivating example, applying these methods to data from the British Child and Adolescent Mental Health Study 1999 (B-CAMHS99). 10 In Sections 5 and 6, we present our simulation study and conclude with a general discussion in Section 7.

Motivating example
There are several examples in the literature, where observations used for estimation at the imputation stage of MI are not used or available at the analysis stage. For example, confidential information may be used to improve estimation of the imputation model but cannot be disseminated to the analysts. Or, when external data are used to account for measurement error using MI and the external data are not included at the analysis stage because it may not be representative of the target sample. 11 We briefly describe a case study where external data were used to impute measurements, which were not collected for any of the subjects of the target study.
The B-CAMHS99 measured conduct, hyperactive and emotional problems in 15 year olds using the strengths and difficulties questionnaire (SDQ). 12 Researchers wished to compare the results of this study to previous UK population studies, which had used the Rutter A scale. 13 A calibration study was undertaken, which measured both the Rutter A scale and the SDQ scale on an independent sample of adolescents. 10 These external data were used to impute the Rutter A subscales (for conduct, hyperactive and emotional problems) in the B-CAMHS99 study, but were not included at the analysis stage.
In this scenario, the imputations were generated using extra information that was not utilized by the analyst; i.e. that the target and external data were assumed to be identically distributed. Rubin calls such imputation superefficient from the analyst's perspective. 2

Methods for imputation inference
We describe Rubin's MI and its modified version, Robins and Wang's MI and FMB. Robins and Wang have described their complex method for a general missing data setting. 5 To aid further understanding of this method we restrict ourselves to a more simplified setting, which we describe in Section 3.1. The description of Rubin's MI and FMB remains the same for a more general missing data setting.

Notation and setup
Suppose g random variables Y ¼ ðY 1 , . . . , Y g Þ are intended to be observed on n subjects. We use subscripts i and j to index subjects and variables, respectively, ði ¼ 1, . . . , n; j ¼ 1, . . . , gÞ. Let y ¼ ð y ij Þ denote a (n Â g) matrix, whose i, j element is y ij . Row i of matrix y is denoted by y i ¼ ð y i1 , . . . , y ig Þ. The rows of the matrix are assumed to be independent and identically distributed. In practice, some subjects have missing observations, and we write y ¼ ðy obs , y mis Þ with obs and mis denoting the observed and missing parts, respectively. In our simplified example, missingness is confined to one variable and without any loss of generality, we assume variable Y g is incompletely observed for t (t < n) subjects. The MDM is assumed to be ignorable for Bayesian inference, 14, p.120 and hence is also ignorable for likelihood inference.
Let, e Y denote p À 1 ð1 p À 1 g À 1Þ variables in the set Y 1 , . . . , Y gÀ1 and e y i the row vector of observations on e Y for subject i. The imputation model is the normal linear regression y ig je y i $ Nðe y i l, Þ, where l is a column vector of dimension p À 1. Let h ¼ ðl, Þ T , a (p Â 1) vector of unknown parameters.
Let, y k ig denote the kth imputed value of variable Y g for subject i, and y k i is row i of the kth imputed dataset y k ¼ ðy k 1 , . . . , y k n Þ T ðk ¼ 1, . . . , mÞ. If y ig is observed then y k ig ¼ y ig , and so y k i ¼ y i , for all k. Let € Y denote q ð1 q g À 1Þ variables in the set Y 1 , . . . , Y gÀ1 and € y k i the set of observations on € Y for subject i of the kth imputed dataset. Similarly,ỹ k i is the set of observations onỸ for subject i of the kth imputed dataset. Note, in this example, € y k i ¼ € y i andỹ k i ¼ỹ i for all k. In the interest of completeness, we have included the superscript k.
The analysis procedure is the normal linear regression model y k ig j € y k i $ Nð € y k i b, !Þ, where b is a column vector of dimension q. The imputation estimate for the vector of coefficients b ¼ ð 1 , . . . , q Þ T is denoted byb I ¼ ð I 1 , . . . , I q Þ T and, for j ¼ 1, . . . , q, the variance estimate for I j iŝ V I j , withV I ¼ ðV I 1 , . . . ,V I q Þ T .

Rubin's MI inference
Missing values are imputed independently m times, to generate m imputed datasets. Imputations are drawn independently from the posterior predictive distribution of the missing data given the observed data under a Bayesian model. For k ¼ 1, . . . , m, the analysis model is fitted to each imputed dataset y k separately, generating regression coefficients estimateb k ¼ ð k 1 , . . . , k q Þ T with associated variance estimatesŴ k 1 , . . . ,Ŵ k q , whereŴ k j is the variance estimate for coefficient estimate k j . The set of m coefficient estimates is combined into one MI inference using a simple set of rules. When the estimand of interest is a set of regression coefficients it is simpler to apply these rules separately to each regression coefficient as followŝ Confidence intervals are based on the Student's t-distribution. When n is small a modified degrees of freedom formula is recommended. 15 The modified degrees of freedom Ã j is calculated, separately for each regression coefficient, as followŝ where com j is the degrees of freedom about j when there are no missing values. We shall also consider a modification in which the within imputation variances, W k j , are calculated using the robust sandwich variance estimator. 16,17 This modification only affects the calculation of the variance of the imputation estimate; the imputation estimate remains the same as for Rubin's MI. We refer to this modified method as 'robust Rubin's MI'.

Robins and Wang's imputation inference
The Robins and Wang variance estimator for imputation does not require multiply imputed datasets, although it may be more efficient when a dataset is multiply imputed. 5 To our knowledge, there does not exist any commercial or freely available software that calculates this variance estimator. All derivatives involving scalars, vectors and matrices are defined as in, 18 e.g. for n Â 1 vectors a and b, define @a=@b T ¼ ½@a r =@b c , a square matrix of dimension n where c and r, respectively, refer to column and row numbers.
The m sets of imputations are drawn independently from the predictive distribution of the missing data given the observed data under the imputation model evaluated atĥ, the observed data maximum likelihood estimate of h. Therefore, each set of imputations is drawn conditionally on the same parameter estimate. In our simple case,ĥ is the complete case estimate of h (i.e. estimation is based on the n-t subjects with observed Y g ). The m imputed datasets are then stacked into a mn Â g dataset ðy 1 , . . . , y m Þ T .
The analysis procedure is applied to the stack of imputed datasets ðy 1 , . . . , y m Þ T , treating the mn rows as independent, and the estimate of b is the Robins and Wang imputation estimateb I . The analysis procedure can be non-, semi-or fully parametric such thatb I is the solution to the estimating equation To calculate the Robins and Wang variance estimator, both the imputer and analyst must generate additional information. The imputer supplies two further datasets based on the score function of the imputation model. The analyst must generate a dataset and a matrix, which are both based on the estimating equation of the analysis procedure evaluated at b ¼b I . The analyst then inputs these pieces of information into a set of matrix formulae to generateV I . Below we provide details on how to calculate these pieces of information and the matrix formulae.
The score function of the imputation model is the derivative, with respect to h, of its loglikelihood function. The contribution from subject i to the log-likelihood function is log f ð y ig je y i , hÞ ¼ 0:5ðÀ log 2 À log À À1 ð y ig À e y i lÞ 2 Þ and its derivative is column vector 2 2 For subject i, when y i is completely observed let s obs i ¼ s obs i ðĥÞ ¼ ð@ log f ð y ig je y i , hÞ=@hÞ T j h¼ĥ and when y i is incompletely observed let s obs i be a zero row vector of dimension p. Conversely, for k ¼ 1, . . . , m, when y i is incompletely observed let s k, mis i ¼ s k, mis i ðĥÞ ¼ ð@ log f ðy k ig je y k i , hÞ=@hÞ T j h¼ĥ and when y i is completely observed let s k, mis i be a zero row vector of dimension p. For each k we then have n Â p dataset S mis, k ¼ ðs mis, k 1 , . . . , s mis, k n Þ T and stacking these m datasets we have S mis ¼ ðS mis, 1 , . . . , S mis, m Þ T .
For the second dataset based on the score function of the imputation model, first calculate the derivative of column vector @ log f ð y ig je y i , hÞ=@h with respect to row vector h T , which is the p Â p matrix 2 3 For subject i with observed y ig define When y ig is incompletely observed d i is a zero row vector of dimension p. These n row vectors form the second dataset D ¼ ðd 1 , . . . , d n Þ T . The imputer's role is now completed and the stacked imputed datasets ðy 1 , . . . , y m Þ T and datasets S mis and D are passed on to the analyst.
Evaluating u k i ðĥ, bÞ at the imputation estimateb I , the analyst generates m ðn Â qÞ datasets . . , mÞ. These m datasets are stacked to form dataset U ¼ ðU 1 , . . . , U m Þ T . The matrix generated by the analyst is s ¼ sðĥ,b I Þ, which is calculated by differentiating column vector u k i ðĥ, bÞ T with respect to row vector b T , to generate a square matrix of dimension q where r and c, respectively, denote the row and column of the matrix and u k i ðĥ, bÞ r ¼ y k ir ð y k ig À € y k i bÞ. We can then calculate The analyst now inputs datasets D, S mis and U and matrix s into the following matrix formulae the jth diagonal entry of matrix C is the variance estimate corresponding to the coefficient imputation estimator I j , i.e.V I j ¼ À jj .

Full mechanism bootstrapping
FMB can be applied to parametric and non-parametric problems, 8 and under all three MDM assumptions. 19 Unless, the MDM is assumed to be missing completely at random (MCAR) then FMB requires modelling of the MDM. In Efron's worked example of the procedure a deterministic imputation model was used. Shao and Sitter 9 describe FMB with a random regression imputation method. FMB is implemented as follows: 8 (1) Impute the incomplete dataset y once to generate imputed dataset y I . Apply the analysis procedure to dataset y I . The estimate of b is the imputation estimateb I .
(2) Sample with replacement n rows from y I to obtain bootstrapped dataset y IÃ .
(3) Set observations in y IÃ to missing by applying the same missing data pattern (MDP) as for y if MCAR is assumed. Otherwise, infer missingness under a model for the MDM. (4) Singly impute the incomplete dataset from step 3, using the same imputation model as in step 1, to generate the bootstrapped imputed dataset y I~. (5) Apply the analysis procedure to dataset y I~a nd store the estimate of b as bootstrap replicationb I~. (6) Repeat steps 2-5 T times to obtain T bootstrap replicates.
Standard bootstrap formulae 20 can be applied to the bootstrap replicatesb I~t o calculate the bootstrap variance and confidence intervals for each I j . Currently, there does not exist an algorithm or formula for the calculation of the acceleration constant, which is used to generate the biascorrected and accelerated bootstrap confidence intervals. However, calculating the acceleration constant based on the formula used to construct non-parametric bias-corrected and accelerated confidence intervals can give a reasonable approximation. 8

Return to motivating example
We applied the methods of Section 3 to data from the B-CAMHS99 study discussed in Section 2. The data (from the B-CAMHS99 study) consisted of fully observed measurements for sex and the SDQ subscales in 855 (433 boys and 422 girls) adolescents aged 15. The external data consisted of fully observed measurements for sex, the SDQ subscales and Rutter A subscales in 380 (203 boys and 177 girls) adolescents with median age 15 years (interquartile range 13-15 years).
For the purposes of this case study, each Rutter A subscale was imputed separately under an ordinal logistic regression model, with the three SDQ subscales (for conduct, hyperactive and emotional problems) and sex as covariates. The analysis of interest was a linear regression of the Rutter A subscale, with sex and the constant term as covariates, estimated in the B-CAMHS99 study only. We generated 50 imputed datasets for Rubin's MI, robust Rubin's MI and Robins and Wang's MI, and 2500 bootstrap replications for FMB. Table 1 presents the results of the analyses of the B-CAMHS99 study. As expected, all point estimates were comparable. Robins and Wang's MI had the smallest standard errors and narrowest confidence intervals for all cases. The maximum percentage difference in the standard errors of Rubin's MI and Robins and Wang's MI was just under 23%, and the corresponding Rubin's confidence interval was 26.5% longer than the Robins and Wang's confidence interval. In almost all cases, FMB had smaller standard errors than Rubin's MI, and the maximum percentage difference in the standard errors was just under 14%. The larger standard errors of Rubin's MI (or robust Rubin's MI) indicate potential inflation due to superefficient imputations. Robust Rubin's MI showed little improvement over Rubin's MI.

Simulation study methods
We compared the methods described in Section 3 in four scenarios of incompatibility and misspecification of the imputation and analysis models. The simulation study was based on a hypothetical dataset of one binary variable, sex (0 denoting male and 1 denoting female), and four continuous variables, age, height, weight and natural log of insulin index (hereafter referred to as loginsindex). The data were generated under the following model sex $ BernoulliðÞ, age, heightjsex $ Nð 0 þ 1 sex, AEÞ where error W and error L are error distributions and sex ¼ 1 when sex ¼ 0 and sex ¼ when sex ¼ 1.
Model (equation (1)) was based on a dataset of standard anthropometric measurements of 951 young adults enrolled in the Barry Caerphilly Growth study. 21 The parameters of the data model The analysis model was the normal linear regression of loginsindex on weight, with adjustment for other variables. The aim of the simulation study was to evaluate the methods with respect to the imputation variance estimator when the imputation estimator was unbiased. To avoid bias in the imputation estimator due to the MDM we set weight to be missing, for a subset of subjects, under an MCAR mechanism. The missing weight measurements were imputed under a normal linear regression model. Both imputation and analysis models included a constant term and assumed that the variance of the error terms was constant for all values of the outcome variable (i.e. homoscedasticity). Unless otherwise stated, error distributions error W and error L were normal, weight measurements were missing in men and women and the imputation and analysis models were fitted to the entire sample. The scenarios considered were as follows: Scenario 1: Subgroup analysis. We set the true conditional distributions of age, height, weight and loginsindex to be the same in men and women; i.e. 1 ¼ ð0, 0Þ, 1 ¼ 0, 1 ¼ 0 and ¼ 1. Weight measurements were missing (completely at random) in men only. The covariates of the imputation model were age, height and loginsindex. The covariates of the analysis model were age and weight, and the model was only fitted to the men's observations. There was incompatibility between the imputation and the analysis model since only the imputation model assumed that the continuous variables were identically distributed in men and women. Scenario 2: Heteroscedastic errors. We set the true conditional distributions of age, height, weight and loginsindex to have different means in men and women; i.e. 1 , 1 and 1 were set to their nonnull values stated earlier. Additionally, among women the variance of weight and loginsindex was set to be 1/4 of the variance in men; i.e. ¼ 1/2. The covariates of the imputation model were sex, age, height and loginsindex, and the covariates of the analysis model were sex, age and weight. The imputation and analysis models were compatible but incorrectly specified because they assumed homoscedastic errors. Scenario 3: Omitted interaction. We set the true conditional distributions of age, height, weight and loginsindex to have different means in men and women. The variances of weight and loginsindex were set to be the same in men and women; i.e. ¼ 1. The covariates of the imputation model were sex, age, height and loginsindex, and the covariates of the analysis model were sex, age, weight and the interaction between weight and sex. The imputation model was correctly specified but because the analysis model included an unnecessary interaction term the imputation and analysis models were incompatible. Scenario 4: Moderate and severe non-normality. This scenario was motivated by a simulation study that investigated the performance of MI methods with non-normal distributions. 22 Unlike these authors, we were only interested in the effect of the shape of the error distribution, not the size of the error variances. The true conditional distributions of age, height, weight and loginsindex were set to be the same in men and women, i.e. 1 ¼ ð0, 0Þ, 1 ¼ 0, 1 ¼ 0 and ¼ 1. For moderate departures from non-normality, we investigated nine different parameter specifications by setting independently the distributions of error W and error L to be the uniform distribution, Student's t-distribution with six degrees of freedom or the log-normal distribution expfNð0, 1=4 2 Þg. For severe departures from non-normality, we investigated eight different parameter specifications by setting independently the distributions of error W and error L to be the uniform distribution, Student's t-distribution with three degrees of freedom or the log-normal distribution expfNð0, 1Þg. All error distributions had mean zero and unit variance. The imputation and analysis models were the same as in the subgroup analysis scenario, although both models were fitted to the entire sample. The imputation and analysis models were compatible, but misspecified because they assumed a normal error distribution.
For all scenarios, we repeated the simulation study for sample sizes n ¼ 100 and n ¼ 1000 and probabilities 0.6 and 0.4 that weight was observed. For the subgroup analysis scenario only, the probability that weight was observed was one among women and 0.6 or 0.4 among men. For each combination of scenario, parameter specification, sample size and observation probability we generated 2500 independent simulated datasets. Based on 2500 simulations the Monte Carlo standard error for the true coverage probability of 0.95 is p ð0:95ð1 À 0:95Þ=2500Þ ¼ 0:0044, 23 implying that the estimated coverage probability should lie within the range 0.941 and 0.960 (with 95% probability). For Rubin's MI and robust Rubin's MI, we imputed the data using the independent Jeffrey's prior. For FMB, the data were imputed using the same frequentist imputation method used for Robins and Wang's MI. Each incomplete dataset was imputed 50 times for Rubin's MI, robust Rubin's MI and Robins and Wang's MI, and 2500 bootstrap samples were generated for FMB. Table 2 presents the results of imputation inference according to the methods of Rubin, robust Rubin and Robins and Wang in the four scenarios of misspecified or incompatible imputation and analysis models, where the probability of observing weight was 0.4. For moderate departure from normality and severe departure from normality, we have reported the results corresponding to the error distributions error W $ expfNð0, 1=4 2 Þg, error L $ expfNð0, 1=4 2 Þg and error W $ Student's t with three degrees of freedom, error L $ expfNð0, 1Þg. The results for other parameter specifications are summarized in the text below.

Simulation study results
First consider the left-hand side of q contained the true value of 3 . For each of these three scenarios, the mean of the Robins and Wang variance estimate,V I 3 , was close to the sampling variance of I 3 and the confidence interval coverage probability was close to the nominal level. For the subgroup analysis and omitted interaction scenarios, Rubin's variance estimates were upwardly biased (i.e. the mean ofV I 3 was larger than the sampling variance of I 3 ), leading to conservative confidence intervals that were on average wider than those of Robins and Wang. Conversely, for the heterogeneous errors scenario, Rubin's variance estimate was downwardly biased and the coverage probability of the confidence intervals was more than 4% below the nominal level. The robust Rubin's MI method performed similar to Rubin's MI for the subgroup analysis and omitted interaction scenarios and showed a

2552
Statistical Methods in Medical Research 25 (6) slight improvement over Rubin's MI (i.e. higher coverage probability) for the heterogeneous errors scenario.
For the moderate non-normality scenario, the imputation estimate was unbiased and the confidence interval coverage probability was close to the nominal level for all three methods. However, for the severe non-normality scenario, the imputation estimates were upwardly biased for all three methods and for both sample sizes. Robins and Wang MI under-estimated the sampling variance of I 3 the least and had the highest coverage probability, although this was still 3% below the nominal level. The robust Rubin's MI method was an improvement on Rubin's MI such that the confidence interval coverage probability for robust Rubin's MI was less than 1% below that of Robins and Wang's MI. Now consider the right-hand side of Table 2, corresponding to sample size 100. The results for Rubin's MI and robust Rubin's MI followed the same patterns as noted for sample size 1000. There was a deterioration in the performance of Robins and Wang's MI when applied to a dataset of sample size 100. Firstly, the imputation estimates for all but the subgroup analysis scenario were (slightly) upwardly biased. Secondly, the Robins and Wang variance estimates were downwardly biased for all scenarios, resulting in confidence interval coverage probabilities that were at least 3% below the nominal level. Table 3 reports the corresponding results for FMB. The FMB imputation estimates were unbiased for both sample sizes, and for all scenarios except severe non-normality. However, estimation of 3 was less efficient than for the other methods. Of the three types of confidence intervals, in almost all cases the percentile confidence interval had the highest coverage probability for the same average confidence interval width.
For the subgroup analysis, heterogeneous errors, omitted interaction and moderate nonnormality, when the sample size was 1000 Robins and Wang MI outperformed FMB; i.e. had the narrowest average confidence interval and at least nominal coverage. When the sample size was 100, for the subgroup analysis and omitted interaction scenarios FMB with the percentile confidence interval was marginally better than Rubin's MI because (for comparable coverage probabilities) the bootstrap confidence intervals were narrower on average than those of Rubin's MI. The relative inefficiency of FMB to Rubin's MI was outweighed by the upward bias of Rubin's variance estimates. For the heterogeneous errors scenario, sample size 100, FMB with the percentile confidence interval had the highest coverage probability, although still outside the expected range (0.941-0.960). For the moderate non-normality scenario, Rubin's MI was the best method for sample size 100, with close to nominal coverage and the narrowest mean confidence interval width. For the severe non-normality scenario, and for both sample sizes, FMB had the highest coverage probabilities, although still below 0.941.
When the probability of observing weight was 0.6, the pattern of results was the same as in Tables  2 and 3, although the differences between the methods were less marked. Across the nine different error distributions investigated for the moderate non-normality scenario, and for the severe nonnormality specifications error W $ expfNð0, 1Þg, error L $ uniform and error W $ Student's t-distribution with three degrees of freedom, error L $ uniform the results were virtually identical to those reported for moderate departures from normality in Tables 2 and 3. Across the remaining six error distributions investigated for severe non-normality, the pattern of the results was very similar to the severe non-normality scenario in Tables 2 and 3, so that the conclusions drawn with respect to the comparisons of the methods remained the same. For the other coefficients of the analysis model, either the pattern of results was the same as in Tables 2 and 3 but with smaller differences between the methods, or all methods had coverage probabilities of 94-95%, with Robins and Wang MI having the narrowest confidence interval on average and FMB having the widest. Table 3. Summary of the full mechanism bootstrapping simulation results for the weight coefficient: mean and empirical variance (Emp. var) of the imputation estimate I 3 , mean of the estimated standard errorV I 3 and empirical coverage probability (CP) and mean width of the 95% confidence interval for For the subgroup analysis and interaction scenarios in several instances, the Robins and Wang variance estimate was smaller than the corresponding variance estimate that resulted from analysing the fully observed data, i.e. data without missing observations (data not shown). This is due to the fact that the imputations are superefficient with respect to the analysis procedure; i.e. the imputations contain extra information that is not contained in the true data. 2 We conducted a second simulation study to assess the robustness of FMB when data were missing at random (MAR) and the MDM (for simulating missingness in a bootstrapped dataset) was not modelled; that is, missingness was simulated using the observed MDP as above. The design of this simulation study was based on the simulation study described above, with three modifications. There were three changes. First, data were simulated to be missing dependent upon the outcome variable of the analysis model (loginsindex), thus ensuring the complete case analysis produced biased estimates. Second, the imputation and analysis models were compatible and correctly specified. Third, we conducted two FMB methods: FMB with the MDM correctly modelled (which we shall call FMB correct MDM) and FMB with missingness simulated using the observed MDP (which we shall call FMB observed MDP). For both sample sizes, when data were MAR, omitting to model the MDM for FMB resulted in downwardly biased variance estimates, which led to under-coverage of the confidence intervals. Variance estimates were not downwardly biased when the MDM was correctly modelled (FMB correct MDM). For sample size 1000, any differences in the performances of Rubin's MI, robust Rubin's MI, Robins and Wang's MI and FMB correct MDM were very small. For sample size 100, the Robins and Wang method outperformed the other methods with respect to point estimation but its variance estimates were again downwardly biased. For interval estimation only FMB correct MDM with the percentile confidence interval provided coverage probabilities greater than 0.941 for all coefficients. (Results of this simulation study are available upon request from the authors.)

Discussion
We have conducted the first comparative evaluation of Rubin's MI, a modified version of Rubin's MI, Robins and Wang's MI and FMB. Our simulation study shows that Rubin's MI variance estimator failed in four common scenarios of misspecification of the imputation and analysis models, and of incompatibility between them. The variance estimates were substantially upwardly or downwardly biased, resulting in confidence intervals that over-and under-covered, respectively. For moderate sample size (n¼1000) and all scenarios apart from severe nonnormality, Robins and Wang's MI produced the narrowest confidence intervals on average, with close to nominal coverage. When the imputation and analysis models were both misspecified due to severe non-normality all methods had the same biased imputation estimate, but the larger imputation variance estimate of FMB resulted in confidence intervals with coverage probabilities closest to the nominal level.
A key feature of Rubin's MI is the separation of the imputation procedure from the substantive analysis. This has the advantage that the more technical process of imputation can be done by a specialist, following which multiple analyses can be done on the imputed datasets by non-specialists using standard software. However, this separation may also lead to incompatibility between the imputation and analysis models, when assumptions made during imputation are not carried forward to the analysis stage. For example, estimation for domains (i.e. subgroups or subpopulations) is common for the analysis of survey data and, in particular for large surveys analysed by many users, imputations can be generated ignoring the domain indicator. 24 In some situations, such as the subgroup analysis and omitted interaction scenarios of our simulation study, incompatibility can in principle be avoided, if the imputer provides sufficient documentation of the imputation model to future users. However, in some instances incompatibility may be unavoidable; e.g. for confidentiality reasons the provider of the imputed data may only disseminate to the users a subset of the observations (or records) used during the imputation stage. 11,25 Misspecification of imputation and analysis models is a more general problem, which will arise whenever model assumptions are not justified. We investigated two scenarios, in which misspecification was due to heteroscedastic errors and non-normality. Our results suggest that when misspecification arises from heteroscedastic errors, there has to be a sizable difference between the subgroup variances in order for Rubin's imputation variance estimator to materially under-estimate the sampling variance of the imputation estimator. In this case, heteroscedastic errors in the imputation and analysis models could have been accommodated by conducting separate MI analyses in men and women.
A limitation of the Robins and Wang MI method is that it makes large sample assumptions, which led to downwardly biased variance estimates and confidence interval coverage when applied to small datasets. A major disadvantage of the Robins and Wang method is that calculation of the imputation variance estimate is considerably more complicated than for Rubin's MI and FMB, with a greater burden placed on both the imputer and the analyst. To our knowledge, there is no generally available software implementing the Robins and Wang method. The analyst must make available derivatives of the estimating equations for use in calculation of variance estimates, and these become harder to calculate as the complexity of the analysis procedure increases. Also, the complexity of the calculations conducted by the imputer increases when there is multiple incomplete variables with a general MDP. For this reason, our simulation scenarios were restricted to data missing in a single variable, as were the scenarios considered in the papers proposing the approach. The Robins and Wang method requires the data to be imputed under a single imputation model. Therefore, currently, it cannot be applied if imputation is conducted using chained equations imputation, 26 a flexible and commonly used method of imputation that imputes under two or more imputation models. In contrast, calculation of the variance of an imputation estimator by Rubin's MI method and FMB is straightforward for more complex MDPs and analysis procedures, and can be applied when data are imputed using chained equations imputation.
A limitation of FMB was its inefficiency relative to the other imputation inference methods. Furthermore, FMB requires modelling of the MDM when data are MAR, which is not required by the MI methods of Rubin or Robins and Wang. A further simulation study we conducted showed that FMB required modelling of the MDM when data were MAR. Further work is needed to compare Rubin's MI with FMB when data are MAR and the missing data model assumed by FMB is incorrectly specified.
In summary, accurate inference requires an unbiased estimator and variance estimator. Rubin's MI variance estimator may be biased in the presence of incompatibility between the imputation and analysis models and model misspecification. This can lead to over-or under-coverage of confidence intervals. These limitations should be noted in guidelines on the appropriate use of Rubin's MI, 27 which should emphasize how incompatibility can be avoided, and the pitfalls that can arise because of model misspecification. The simplicity and flexibility of Rubin's MI mean that it is likely to remain the method of choice to deal with data that are MAR. However, where these problems of incompatibility and misspecification cannot be avoided, Robins and Wang MI has the potential to provide more robust inferences, should the considerable challenges in provision of software implementing the procedure be overcome.