Multiple imputation of multilevel missing data: An introduction to the R package pan

The treatment of missing data can be difficult in multilevel research because state-of-the-art procedures such as multiple imputation (MI) may require advanced statistical knowledge or a high degree of familiarity with certain statistical software. In the missing data literature, pan has been recommended for MI of multilevel data. In this article, we provide an introduction to MI of multilevel missing data using the R package pan, and we discuss its possibilities and limitations in accommodating typical questions in multilevel research. In order to make pan more accessible to applied researchers, we make use of the mitml package, which provides a user-friendly interface to the pan package and several tools for managing and analyzing multiply imputed data sets. We illustrate the use of pan and mitml with two empirical examples that represent common applications of multilevel models, and we discuss how these procedures may be used in conjunction with other software.


Multiple Imputation of Multilevel Missing Data: An Introduction to the R Package pan
In recent years years, multilevel models have become one of the standard tools for analyzing clustered empirical data. Such data often occur in organizational and educational psychology and other fields of the social sciences, for example, when employees are nested within work groups, students are nested within school classes, or in longitudinal studies when measurement occasions are nested within persons. In addition, empirical data are often incomplete, for example, when participants drop out of the study or do not answer all of the items on a questionnaire. Several authors have advocated the use of modern missing data techniques such as multiple imputation (MI) rather than traditional approaches such as listwise or pairwise deletion (Allison, 2001;Enders, 2010;Newman, 2014;Schafer & Graham, 2002;van Buuren, 2012). One central requirement of MI is that the imputation model must be at least as general as the model of interest in order to preserve relationships among variables (Enders, 2010). In the case of incomplete multilevel data, it is important that the imputation model takes the multilevel structure into account in order to ensure valid statistical inferences in subsequent multilevel analyses (Black, Harel, & McCoach, 2011;Graham, 2012;van Buuren, 2011).
Although MI is gaining popularity among applied researchers, multilevel imputation models are rarely used in practice. One of the most commonly recommended software solutions for multilevel imputation is the pan package (Schafer & Yucel, 2002;Schafer & Zhao, 2014), which is freely available in the statistical software R (R Core Team, 2015; see also Culpepper & Aguinis, 2011). However, the application of pan can be challenging, and its documentation is rather technical, especially for users who are not familiar with R. For instance, for multilevel missing data, Graham (2012) recommended "that you obtain a copy of the PAN program (...), and that you find an expert in R who can help you get started" (p. 137).
The present paper is intended as a gentle introduction to the pan package for MI of multilevel missing data. We assume that readers have a working knowledge of multilevel models (see Hox, 2010;Raudenbush & Bryk, 2002;Snijders & Bosker, 2012). In order to make pan more accessible to applied researchers, we make use of the R package mitml, which provides a user-friendly interface to the pan package and some additional tools for organizing and analyzing multiply imputed data (Grund, Robitzsch, & Lüdtke, 2016). The first section of this paper introduces an empirical example that is used for illustrating the application of pan to multilevel data. In the following section, we briefly describe the main ideas behind pan and MI, and we discuss which features of multilevel models must be considered when conducting MI. Finally, we use the mitml package to carry out MI for the empirical example. In that context, we will discuss possibilities for model diagnostics and tests of nonstandard statistical hypotheses (e.g., model constraints, model comparisons).

Multilevel Modeling: An Empirical Example
Multilevel models account for dependencies in the data and allow relationships between variables to be estimated at different levels of analysis or effects that may vary across higher-level observational units. For the purpose of this article, we assume that the multilevel structure consists of persons (e.g., students, employees) nested within groups (e.g., classes, work groups).
If only the regression intercept varies across groups, the model is referred to as a random-intercept model. For example, Chen and Bliese (2002) examined the effects of individual characteristics (e.g., psychological strain) and leadership climate on the self-efficacy of U.S. soldiers. Kunter, Baumert, and Köller (2007) investigated the effects of student-and group-level ratings of classroom management on students' interest in mathematics. If the effects of additional predictor variables vary across groups, the model is referred to as a random-slope or random-coefficients model. For example, Hofmann, Morgeson, and Gerras (2003) investigated varying effects of leader-member exchange on safety behavior across work teams in the U.S. army.
The example data set used in this article is from the field of educational research and was taken from the German sample of primary school students who participated in the Progress in International Reading Literacy Study (PIRLS; Bos et al., 2005;Mullis, Martin, Gonzales, & Kennedy, 2003). The data set includes test scores in both mathematics and reading achievement, a measure of cognitive ability, a measure of socioeconomic status (SES), students' ratings of the quality of teaching in their math and reading classes (the prevalence of disciplinary problems), and ratings of the general learning environment (school climate). For the purpose of this article, we considered only students for whom reading achievement and cognitive ability scores were available, which was true for approximately 99.3% of the sample (8,767 students in 475 classes).
Ratings of disciplinary problems in math classes were missing for half of the sample due to a planned missing data design (Graham, Taylor, Olchowski, & Cumsille, 2006). Table 1 provides an overview of the data set, along with the observed correlations and the percentages of missing values among variables. Some variables contain additional, unplanned missing data. In such cases, it is useful to examine the missing data patterns that occur in the data set. This is shown in Table 2. Approximately 50% of the sample adhered to the planned missing data design (Patterns 1 and 2). In another 25% of the sample (Patterns 3 and 4), SES was additionally missing. The remaining patterns were more diverse, and data were missing for math achievement scores, disciplinary problems in reading classes, or school climate. Planned missing data designs are becoming increasingly popular in large-scale observational studies because such designs can reduce the burden that is placed on each individual participant (Graham et al., 2006). The missing data mechanism is usually ignorable for variables recorded in this manner, thus enabling us to focus on more specific aspects of MI in multilevel research.

Example 1: Random-Intercept Model
Our first model of interest examined the effect of teaching quality in math classes (disciplinary problems; DPM) on students' math achievement scores (MA). In addition, we included SES in order to control for differences in socioeconomic background between students and classes. The student-level variables were centered around the group mean, and the group means were included as predictor variables in order to separate within-group from between-group effects (see Enders & Tofighi, 2007;Raudenbush & Bryk, 2002). For student i in class j, Here, the β coefficients denote fixed effects, and υ 0 j and i j denote the residuals at the class and student level, respectively. We refer to the effects of the average DPM and SES of a class as between-group effects, whereas within-group effects accounts for the students' individual deviations from that average. For example, β 4 denotes the effect of a class' average SES on class-level math achievement, whereas β 3 denotes the effect of students' individual deviations from the class average on their individual math achievement scores. The student-and class-level residuals are each assumed to follow a normal distribution with zero mean and variances Var(υ 0 j ), independently and identically across classes, and Var( i j ), independently and identically across students.

Example 2: Random-Slope Model
Our second model of interest examined the relationship between students' cognitive ability (CA) and their math achievement scores. We assumed that the relationship between the two variables would vary across groups (random slope) because some teachers may nurture students' individual strengths and weaknesses, whereas others may strive to "equalize" them. As before, we included SES to control for differences in socioeconomic background. In line with recent recommendations for analyzing random-slope variation, we centered the variables around the group means (Aguinis, Gottfredson, & Culpepper, 2013;Hofmann & Gavin, 1998). The group means were included as additional predictors in order to "reintroduce" the group-level construct into the model. The model reads where υ 1 j denotes the random effect of cognitive ability on math achievement per class. The two random effects (intercept and slope) are assumed to follow a multivariate normal distribution, independently and identically across classes, and the remaining notation is the same as above.

Multiple Imputation of Incomplete Multilevel Data
Missing data could be addressed by restricting the analyses to completely observed cases (listwise deletion). However, this approach is more likely to suffer from low power and to give biased results (e.g., Little & Rubin, 2002; see also Newman, 2014). Multiple imputation has become one of the preferred methods for overcoming these problems (Rubin, 1987;Schafer & Graham, 2002). Using MI, a number of replacements for the missing data are drawn from the distribution of the missing values, given the observed data and an imputation model. The completed data sets are then analyzed separately, and the results are combined across data sets to form final parameter estimates and inferences (see Enders, 2010, for details about the general MI procedure).

General Aspects of MI
In most applications of MI, the data are assumed to be missing at random (MAR), a notion that was introduced by Rubin (1976) in his well-known classification of missing data mechanisms. Consider the hypothetical complete data matrix Y which is decomposed into observed and unobserved portions Y = (Y obs , Y mis ). An indicator matrix R denotes whether values are observed or missing. If the missing data are simply a random sample of the hypothetical complete data, that is, P(R|Y) = P(R), then the data are missing completely at random (MCAR). One such scenario occurs in planned missing data designs, where missing values are "assigned" randomly to each participant. If the occurrence of missing data depends on the observed data but missing data occur "at random" with these taken into account, that is, P(R|Y) = P(R|Y obs ), then the data are missing at random (MAR). The two missing data mechanisms MCAR and MAR are often called "ignorable" because the exact missing data mechanism need not be known in order to perform MI (for a more general discussion of the role of "ignorability", see Enders, 2010). If neither condition holds, that is, P(R|Y) = P(R|Y obs , Y mis ), then the data are missing not at random (MNAR). Most software implementations of MI rely on the assumption that the data are MAR. Performing MI under MNAR is possible but requires making strong assumptions about the missing data mechanism and is most often used for sensitivity analyses (see Carpenter & Kenward, 2013). In order to enhance the plausibility of the MAR assumption, it has been suggested that auxiliary variables be included in the imputation model. These variables are related to either the propensity of missing data or the missing values themselves, without necessarily being part of the model of interest (Collins, Schafer, & Kam, 2001). In our empirical example, some data are missing by design and are thus MCAR. For the remaining data, we will assume that the data are MAR, given the observed portions of the data that can be included as auxiliary variables.
Furthermore, the imputation model must be at least as complex as the analysis model. If variables or parameters that are relevant for the analysis model are not included in the imputation model, then the procedure could yield biased results (Meng, 1994;Schafer, 2003). For example, assume that a researcher is interested in testing an interaction between two variables in a multiple regression analysis with partially missing data. In this case, it would be important that the interaction effect (i.e., product term) is incorporated in the imputation model (Enders, Baraldi, & Cham, 2014). Similarly, if one is interested in estimating the intraclass correlation (i.e., the variance within and between groups) with incomplete data, it would be crucial to take into account the clustered data structure (Taljaard, Donner, & Klar, 2008). If the model of interest includes random slopes, then the imputation model should allow for different slopes across groups. Choosing an appropriate imputation model can be challenging, and it may be tempting to resort to ad hoc methods for treating multilevel missing data (Graham, 2012). For example, it has been suggested that the multilevel structure be represented by creating a set of dummy indicator variables (Drechsler, 2015;Graham, 2009;White, Royston, & Wood, 2011). In this approach, the dummy indicators are included in the single-level imputation model, and a separate intercept (or fixed effect) is estimated for each group. However, recent simulation research has indicated that such methods can distort parameter estimates and standard errors in multilevel analyses (Andridge, 2011;Enders, Mistler, & Keller, 2016;Lüdtke, Robitzsch, & Grund, in press).
Two broad approaches to performing MI can be distinguished. In the joint modeling approach, a single statistical model is used for imputing all incomplete variables simultaneously (e.g., Schafer & Yucel, 2002). In contrast, in the fully conditional specification of MI, each variable is imputed in turn using a sequence of imputation models (van Buuren & Groothuis-Oudshoorn, 2011). In the present article, we focus on the pan package, which follows the joint modeling paradigm (for a discussion, see Carpenter & Kenward, 2013).

The Multivariate Linear Mixed-Effects Model
The statistical model underlying the pan package is a multivariate extension of regular (univariate) multilevel models; that is, it represents multiple dependent variables simultaneously.
In addition, the model may feature a number of predictor variables with associated fixed and random effects. Formally, we refer to this model as the multivariate linear mixed-effects model (MLMM; see Schafer & Yucel, 2002). The model reads where y i j is the (1 × r) vector of responses for person i in group j, x i j and z i j are (1 × p) and (1 × q) vectors of covariate values, β is a (p × r) matrix of fixed effects, b j is a (q × r) matrix of random effects, and e i j is a (1 × r) vector of residuals. In most cases, the matrix z i j contains a subset of the values in x i j , and both will contain at least a "one" for the regression intercept. The random effects matrix b j , with columns stacked upon another, is assumed to follow a normal distribution with mean zero and covariance matrix Ψ, independently and identically for all groups. The vector of residuals e i j is assumed to follow a normal distribution with mean zero and covariance matrix Σ, independently and identically for all individuals. The MLMM imputes all variables on the left-hand side of the model equation given the variables on the right-hand side (with fixed and random effects). Only the variables on the left-hand side (i.e., in y i j ) may contain missing values, whereas the variables on the right-hand side must be completely observed (i.e., in x i j and z i j ). In the following, we will distinguish between two broad approaches to MI of incomplete multilevel data using pan's MLMM (see Table 3 for an illustration).
Multivariate empty model. In the first approach, the emphasis is placed on the left-hand side of the model (i.e., the y i j ), whereas the right-hand side includes only the intercept . For all variables included on the left-hand side, the MLMM decomposes their variances and covariances into separate between-(Ψ) and within-group portions (Σ). We refer to this approach as the multivariate empty model. This model can be understood as a multivariate variant of the regular empty multilevel model-also known as the null model or the intercept-only model-, in which the dependent variable is also decomposed into between-and within-group components, but the predictor side of the model remains empty. The upper half of Table 3 contains an example with three variables, each of which may or may not contain missing data. As can be seen in Table 3, the three variables decompose into a fixed term common to all persons and groups, a random intercept unique to each group, and an error term unique to each person. The covariance matrices of random effects and errors, Ψ and Σ, contain the variances of the y i j and allow for relations between the dependent variables at the group and the person level, respectively. For that reason, the empty model is especially useful if researchers are interested in estimating relationships at the individual and the group level as in random-intercept models with group-level predictors (see Example 1).
Full mixed-effects model. The second approach utilizes both sides of the model. For all variables included on the right-hand side (i.e., in x i j and z i j ), the MLMM estimates fixed and/or random effects, respectively. The lower half of Table 3 contains an example with two dependent variables with missing data and one fully observed predictor variable x 1 in x i j and z i j . As can be seen, the model includes both fixed and random effects for the intercept and x 1 . We refer to this model as the full mixed-effects model because it includes both random intercepts and slopes where possible. Note that x 1 is not decomposed in this model, and the fixed and random effects represent the overall effects of that variable on the dependent variables. In order to include separate within-and between-group effects of x 1 , the variable must be decomposed into betweenand within-group portions prior to performing MI (e.g., by including the group mean as an additional predictor). The full mixed-effects model is particularly useful if the model of interest includes random slopes because the slope variance is represented in the imputation model (see

Software Alternatives
A number of software packages have introduced procedures for MI of multilevel data. The software Mplus (Muthén & Muthén, 2012) implements a two-level model similar to the empty model in pan (denoted H1) as well as a second procedure (denoted H0) for more complex models (e.g., random-slope models; see Asparouhov & Muthén, 2010b). Joint modeling approaches are also available in SAS (Mistler, 2013), REALCOM (Carpenter, Goldstein, & Kenward, 2011), and the R package jomo (Quartagno & Carpenter, 2016). A fully conditional specification of MI is available in the R package mice (van Buuren & Groothuis-Oudshoorn, 2011). For some of these packages, it is possible to follow similar analysis steps as outlined in this article for the pan package. We return to this possibility in a later section.

Example Applications with Multilevel Missing Data
In order to demonstrate the application of pan for imputing incomplete multilevel data, we made use of the mitml package. This package provides a more convenient interface for the pan algorithm and some additional tools for handling multiply imputed data sets and combining their results (Grund, Robitzsch, & Lüdtke, 2016). Following the imputation, we used the package lme4 for estimating the two models of interest (Bates, Maechler, Bolker, & Walker, 2014). We repeated the imputation and estimation in both examples using the popular software Mplus 1 . The results were mostly consistent with those of pan and will not be discussed in detail. Input files for Mplus are provided in Supplement A in the supplemental online materials.
The example data set is structured as follows. The first variable (ID) denotes the class membership of each student. The remaining variables are as described above and may contain different amounts of missing data, which are denoted as NA. Treating and analyzing multilevel missing data usually involves the following steps. First, an appropriate imputation model must be specified. As outlined above, the analysis model must be considered at that point so that the relevant variables, parameters, and auxiliary variables are included in the imputation model. Second, the imputation procedure must be carried out, resulting in a number of imputed data sets. Third, the data sets must be analyzed separately, and the resulting parameter estimates are combined according to the rules described in Rubin (1987; for alternatives, see Carpenter & Kenward, 2013;Reiter & Raghunathan, 2007). These steps can be carried out using the mitml package. In order to illustrate the impact of different approaches for handling incomplete multilevel data, we also provide the results obtained from single-level MI, which ignores the multilevel structure, and from listwise deletion (LD; i.e., complete case analysis). The computer code and output files are provided in Supplement B in the supplemental online materials.

Example 1: Random-Intercept Model
In the first example, the model of interest examined the between-and within-group effects of disciplinary problems in math classes (DPM) on math achievement (MA), while controlling for SES at the individual and class level.
Choosing an appropriate imputation model is straightforward in this case because the multivariate empty model is suitable for random-intercept models in general. In addition, the empty model includes between-as well as within-group relations as required by the model of interest (in Ψ and Σ; see Table 3). Recall that the empty model is specified by writing all variables on the left-hand side of the model equation. In R, the imputation model for this example is set up as follows.
# SETUP: imputation model (variance decomposition model) fml <-MathAchiev + MathDis + SES + ReadAchiev + CognAbility + ReadDis + SchClimate~1 + (1|ID) The mitml package uses formula objects to represent the imputation model. The "∼" symbol separates the left-and right-hand side of the model. The left-hand side contains the three variables of interest and the auxiliary variables (i.e., reading achievement, cognitive ability, ratings of disciplinary problems in reading classes and school climate). On the right-hand side, the intercept is specified both as a fixed (1)  The full procedure is divided into a burn-in phase and an imputation phase (see Enders, 2010).
During burn-in, the algorithm performs a number of iterations without saving any imputations, thus ensuring that the parameters of the imputation model have converged to stationary distributions. In other words, the burn-in phase must be long enough for the algorithm to "stabilize" before any replacements are drawn. Then, during the imputation phase, a number (m) of imputed data sets are drawn, each spread a number of iterations apart. The fact that imputations are not drawn directly from consecutive iterations ensures that the imputed data sets constitute independent random draws from the posterior predictive distribution. Specifically, consecutive iterations in MCMC are often correlated to some degree (autocorrelation), whereas multiply imputed data sets must be drawn independently of one another. Thus, the number of iterations chosen between imputations must be large enough for autocorrelation to vanish.
In the first example, we ran pan for 50,000 burn-in iterations, after which m = 100 imputed data sets were drawn, each spread 5,000 iterations apart. While these numbers may seem large, recent studies have advocated generating such large numbers of imputations, particularly when large portions of the data are missing (Bodner, 2008;Graham, Olchowski, & Gilreath, 2007).
The number of iterations for burn-in and between imputations was chosen such that convergence could be ensured, as described below. The respective command using mitml was as follows.
# IMPUTATION: imp <-panImpute(dat, formula=fml, n.burn=5 , n.iter=5 , m=1 , seed=1234) The mitml package saves the imputation in a special format that is designed to handle large data sets. In order to obtain a list containing all the imputed data sets, the function mitmlComplete is used. The necessary command is printed below.
# list of imputed data sets impList <-mitmlComplete(imp, print="all") Convergence diagnostics. For the analysis to yield reliable results, it must be ensured that the pan algorithm has converged and that the imputed data sets are approximately independent draws from the posterior predictive distribution (for a detailed discussion of convergence assessment in MCMC, see Cowles & Carlin, 1996;Gill, 2014;Jackman, 2009). The mitml package offers two ways of doing so. The first option is to examine the potential scale reduction factor (also calledR; Gelman & Rubin, 1992) for the parameters of the imputation model.
Originally intended for analyzing multiple MCMC chains,R is calculated here by discarding the burn-in iterations and dividing the single MCMC chain for each parameter into multiple segments (see Asparouhov & Muthén, 2010a). TheR statistic then compares the variance within and between segments in order to detect a potential "drifting" of the chain, that is, chains that are more variable overall than one would expect, based on the variability within segments. In the mitml package,R is included in the summary of an imputed data object.
# DIAGNOSTIC: summary and potential scale reduction summary (imp) In addition to the potential scale reduction, the output of summary includes details about the imputation procedure and the missing data rate per variable. In this example, the output was as follows (truncated for better readability). Call: panImpute(data = dat, formula = fml, n.burn = 5 , n.iter = 5 , m = 1 , seed = 1234) Ideally,R should be close to one for all parameters (Gelman & Rubin, 1992). If larger values occur (say, above 1.050), a longer burn-in period may be required. Due to the potentially large number of statistical parameters, the mitml package displays only summary statistics for these values while emphasizing the parameters with the largestR. As shown in the output, theR was well below 1.050 for all parameters. The parameter with the largestR was the first diagonal entry of the random-effects covariance matrix Ψ, that is, the intercept variance for math achievement scores (R = 1.011). However,R has been criticized, and large values ofR need not always indicate poor convergence (e.g., Geyer, 1992). Therefore, as a second option, diagnostic plots should be Here, we discuss the diagnostic plots only for the fixed intercept and the intercept variance for math achievement, which exhibited the worst convergence behavior of all parameters (see Figure 1). The trace plots showed no sign of "drifting" or substantial change after the burn-in phase, indicating that 50,000 iterations were sufficient for the parameters to reach their respective target distributions. Autocorrelation was quite persistent for the intercept variance but had essentially died out by lag 5,000. Therefore, imputations spread 5,000 iterations apart could be considered independent. We concluded that the parameters had converged and that the imputed data sets constituted independent draws from the posterior predictive distribution of the missing data.
Intraclass correlations. Usually the first step in analyzing multilevel data is to estimate the intraclass correlation (ICC) of the variables of interest. Therefore, before proceeding with the model of interest, we will illustrate the analysis of multiply imputed data sets by fitting intercept-only models for math achievement, SES, and DPM to estimate their ICCs. In order to obtain final parameter estimates from multiply imputed data sets, the analysis model must be fit separately to each data set, and the resulting estimates must be combined. In the mitml package, the list of imputed data sets (here impList) can be analyzed by using the functions with and within. The within function is used to transform the imputed data sets and carry out smaller computations prior to fitting the analysis model. The with function returns the model fit itself.
The intercept-only model for math achievement can be fit as shown below (for DPM and SES, see Supplement B). We used the lmer function from the lme4 package to fit the analysis models.
# FIT: null model for math achievement fit <-with( impList, lmer(MathAchiev~1 + (1|ID)) ) This results in a list of 100 fitted analysis models, one for each imputed data set. The parameter estimates of the fitted models can be combined by using the rules described in Rubin (1987). The mitml package implements Rubin's rules in the testEstimates function, which returns the combined estimates for all fixed effects and, when used with lme4, the variance components and the residual ICC (see Supplement B). The final estimates can be requested as given below.

# final parameter estimates (Rubin's rules) testEstimates(fit, var.comp=TRUE)
The resulting estimates of the ICCs are presented in Table 4 along with the estimates from single-level MI and LD. Most notably, multilevel MI (using pan) led to much larger estimates of the ICCs than single-level MI, especially for variables with large amounts of missing data (DPM and SES). This illustrates the importance of accounting for the multilevel structure when conducting MI for multilevel data. The estimates obtained from LD were closer to those of multilevel MI without any obvious pattern emerging. These results are consistent with previous research that was based on simulation studies (e.g., Taljaard et al., 2008;van Buuren, 2011).
Model of interest. The procedures outlined above can also be used for fitting the model of interest (Equation 1). Prior to fitting the model, the group means for DPM and SES must be calculated in each imputed data set, and the student-level variables must be centered around their respective group means. Such computations can be carried out using within as shown below. As before, testEstimates returned the final parameter estimates and inferences.

# final parameter estimates (Rubin's rules) testEstimates(fit, var.comp=TRUE)
The output of testEstimates includes the final parameter estimates, the MI standard errors, the degrees of freedom and value of the reference t distribution 2 , the fraction of missing information (FMI), and the relative increase in variance due to nonresponse (RIV). Even though the FMI is not frequently reported in empirical studies, it holds great value for the interpretation of results and has been recommended as a diagnostic tool for analyzing multiply imputed data sets (Bodner,2 By default, testEstimates uses the standard t distribution proposed by Rubin (1987), which provides a test statistic that is appropriate in larger samples. Alternatively, the degrees of freedom may be adjusted for smaller samples as described in the package documentation (see also Barnard & Rubin, 1999;Reiter, 2007). 2008). The FMI represents the amount of information about an estimand that is lost due to missing data (Allison, 2001;Enders, 2010). In other words, the FMI shows the loss of "efficiency" when estimating parameters from multiply imputed data sets (Savalei & Rhemtulla, 2012). Similar to the FMI, the RIV denotes the increase in sampling variability in each estimand that can be attributed to missing data (see Enders, 2010). The output for the model of interest is given below. The results for multilevel MI, single-level MI, and LD are presented in Table 5. In general, a higher SES was associated with higher math achievement scores, whereas test scores tended to be lower if students reported disciplinary problems in class. The estimates at the class level were roughly twice as large as those at the student level. Single-level MI led to similar estimates of within-group effects, but the estimates of the between-group effects were consistently larger than those obtained from multilevel MI. Listwise deletion produced larger standard errors (especially at the student level) and smaller estimates of class-level effects.
Researchers are often interested in estimating contextual effects, that is, group-level effects when controlling for effects at the student level. For example, the contextual effect of SES can be calculated simply by subtracting its within-group coefficient from its between-group coefficient (Kreft, de Leeuw, & Aiken, 1995). Effects constrained in such a way can be tested against zero using the testConstraints function as shown below.
# contextual effect via model constraints testConstraints(fit, "SES.CLS -SES.STU") Testing constrained parameters is based on the delta method (e.g., Casella & Berger, 2002;Raykov & Marcoulides, 2004), and the pooled test for multiply imputed data sets is based on the method by   In this example, the contextual effect of SES was statistically significant at p < .001 (F = 15.297, df 1 = 1, df 2 = 1293.0). Thus, it appeared that classes with a higher SES tended to have higher math achievement scores, even after controlling for SES at the student-level.
Notice that, throughout this example, we used manifest group means as predictor variables in the multilevel analyses. This is different from the imputation model, where the group-level portions of variables are represented as latent variables (i.e., random effects). In general, an imputation model based on latent group means (i.e., random effects) yields similar results as one that is based on manifest means, and both can be considered correct imputation models for multilevel data (Carpenter & Kenward, 2013;Lüdtke et al., in press;Mistler, 2015). However, when estimating the model of interest, the predictors' group means may again be considered as latent, and slightly different results are expected for such an analysis model (Asparouhov & Muthén, 2006;Lüdtke et al., 2008). A further discussion can be found in Supplement C in the supplemental online materials along with the Mplus syntax files for fitting the latent analysis model. In this example, the two analysis models led to essentially the same conclusions.

Example 2: Random-Slope Model
In the second example, the model of interest examined the effect of student's cognitive ability (CA) and socioeconomic status (SES) on students' math achievement scores (MA). The effect of SES is assumed to be fixed, whereas the effect of cognitive ability is allowed to vary across groups.
As discussed before, the imputation model must consider the model of interest. In this example, the effect of cognitive ability is assumed to vary across groups, which must be reflected in the imputation model. The full mixed-effects model was used for this task (see Table 3). Furthermore, we calculated the group means and the group-mean-centered cognitive ability scores so that we could use them in the imputation model. This was achieved using within as shown below.
# TRANSFORM: group mean centering (prior to performing MI) dat <-within(dat, { CognAbility.CLS <-clusterMeans(CognAbility,ID) CognAbility.STU <-CognAbility -CognAbility.CLS } ) Because cognitive ability scores are available for all students, it can be included on the right-hand side of the imputation model, which also allows the slope variance to be specified. The imputation model was set up as follows.
# SETUP: imputation model (random effects model) fml <-MathAchiev + SES + ReadAchiev + MathDis + ReadDis + SchClimate~1 + CognAbility.STU + CognAbility.CLS + (1+CognAbility.STU|ID) The model includes math achievement, SES, and the auxiliary variables on the left-hand side of the equation. In order to include the slope variance, cognitive ability is featured on the right-hand side, where (1+CognAbility.STU|ID) allows the intercept and the effect of the group-mean-centered cognitive ability scores to vary across groups.
It is worth noting that the MLMM assumes the same random effects structure for all dependent variables in the model. In other words, the full mixed-effects model includes not only the intercepts and slopes for the regressions of math achievement and SES on cognitive ability but also for the four remaining variables. Thus, users of pan should be wary of including too many variables if the model contains multiple random effects. The number of parameters can increase rapidly by adding dependent variables or predictors with random effects to the model, possibly requiring a large number of iterations for the model to converge.
As in the first example, the imputation procedure is started by using panImpute while referring to the data set and the model equation. In this example, we let pan perform 100,000 burn-in iterations, after which we generated m = 100 imputed data sets, each spread 20,000 iterations apart. The respective command was as follows.
# IMPUTATION: imp <-panImpute(dat, formula=fml, n.burn=1 , n.iter=2 , m=1 , seed=1234) As before, a list of imputed data sets was extracted using mitmlComplete. The code is not displayed here because it is identical to the previous example (see Supplement B).
Convergence diagnostics. Before proceeding with the analysis, it must be ensured that the pan algorithm has converged during burn-in and that the interval between imputations was sufficiently large. Again,R gives an idea of possible problems with convergence and is accessed through the summary. The largest value ofR was 1.001 in this case, indicating that the MCMC chain had become stationary for all parameters. Examining the diagnostic plots supported this impression but also indicated that some parameters were affected by autocorrelation. As shown in Figure 2, the parameters related to the variables of interest converged quickly and did not suffer greatly from autocorrelation. For some parameters, especially the group-level variance components, the autocorrelation was quite persistent but vanished for all parameters with a lag of 15,000 to 20,000 iterations.
Model of interest. In order to estimate the model of interest, the student-level variables were centered around their group means (using within), and the model was fit using the lme4 package (using with). We changed the method for estimating the multilevel model from restricted maximum likelihood (REML) to full information maximum likelihood (FIML) because the model comparison that was conducted as a later step in this analysis required that the analysis models were estimated using FIML. The code for fitting the model of interest is given below. The final parameter estimates and inferences were obtained using testEstimates. These are presented in Table 6, along with the estimates from single-level MI and LD. Students with higher cognitive ability (as compared with their class average) tended to score higher on the math achievement test after controlling for SES. This relation appeared to vary substantially across groups. In comparison, single-level MI produced lower estimates of the intercept and slope variance and a slightly larger estimate of the class-level effect of cognitive ability. For LD, the estimates of the fixed effects and variance components were slightly different from those obtained with MI but comparable altogether. Results obtained using the H0 imputation in Mplus yielded results similar to those produced by pan 4 .
When estimating multilevel models with random slopes, researcher are often interested in whether or not the regression coefficients vary substantially across groups. For this purpose, likelihood-ratio tests (LRTs), which compare the model of interest with an alternative model that constrains the slope variance to zero, are often conducted (see Snijders & Bosker, 2012). A method for pooling the LRT across multiply imputed data sets was suggested by Meng and Rubin (1992). This procedure is accessible in mitml through the testModels function. The alternative model is similar to the model of interest, but only the intercept is allowed to vary across groups.
The code for fitting the alternative model is given below.
# LRT for nonzero slope variance testModels(fit, fit.null, method="D3") The output of testModels for testing the slope variance is printed below. The pooled LRT was statistically significant at p = .006 (F = 5.119, df 1 = 2, df 2 = 10386.2) indicating that the slope variance was statistically different from zero. Thus, it appeared that students with different cognitive ability may differ more or less strongly in their math achievement scores, depending on the class to which they belong. It may be interesting to examine the determinants of this variation, for example, teachers' attributes or aspects of the learning environment. However, for the purpose of this article, we will not discuss these questions in detail. Research has shown that the LRT for variance components may suffer from low statistical power (see LaHuis & Ferguson, 2009;Stram & Lee, 1994). However, there are currently very few options for performing hypothesis tests for variance components with multiply imputed data sets other than Meng and Rubin's (1992) method.
Analyzing Imputations Generated by Alternative Software As outlined above, there are a number of software alternatives for generating imputations for multilevel missing data, some of which are similar in scope to pan, and some of which provide further support for categorical, ordinal or group-level variables. For example, if the model of interest also includes categorical variables with missing data, researchers may prefer using the R packages jomo or mice, or standalone software such as Mplus. In general, the analysis steps presented here can be carried out on multiply imputed data sets irrespective of their origin. The requirement for using mitml's analysis functions is that the multiply imputed data sets are represented as a "list" of data sets in R. This can be achieved by either generating imputations using its wrapper functions, or by converting the imputed data into a list of data sets. The mitml package currently includes wrapper functions for pan (panImpute) and jomo (jomoImpute) as well as functions to convert imputed data sets generated by mice (mids2mitml.list). For other software packages, however, the conversion must be performed manually (e.g., using long2mitml.list, or as.mitml.list). The use of these functions is illustrated in the documentation of the package. In most applications, using the wrapper functions is recommended because it allows for using the tools for convergence diagnostics provided by mitml.

Discussion
Even though multilevel models are frequently used in psychology and the social sciences, MI of multilevel missing data is seldom discussed in the applied literature. As a result, listwise deletion, single-level MI, and ad hoc methods for representing the clustered data structure prevail in research practice (e.g., the dummy-indicator approach) even though research has shown that these methods can result in distorted parameter estimates in subsequent multilevel analyses. In the present article, two empirical examples were used to illustrate the application of the two R packages pan and mitml to multilevel data. In Example 1, we discussed the application of pan to random-intercept models and for estimating between-and within-group effects. In Example 2, we focused on MI for multilevel models with random slopes and on estimating and testing the slope variance. We believe that researchers can benefit greatly from incorporating pan in their statistical analyses. Specifically, pan allows the special features of multilevel data to be preserved, a practice that is essential for obtaining reliable estimates from multilevel analyses and for understanding their results. Moreover, pan allows researchers to use all of the available information in the data and to include auxiliary information without altering the model of interest.
By contrast, many interesting features of multilevel models may be distorted or even lost when using simpler methods for handling multilevel missing data. For example, the results from Example 1 showed that parameter estimates can be distorted if the imputation model ignores the multilevel structure of the data.
The field of statistical software is always in motion, and there continue to be a number of promising developments regarding multilevel MI. However, some problems still provide challenges for the future. For example, using multilevel MI can be difficult if missing data occur on predictor variables in models with random slopes or interaction effects. Graham (2012) recommended that MI for models with random slopes should be conducted separately for each group using single-level MI. Schafer (2001) proposed that incomplete predictor variables be treated as outcome variables in the imputation model, thus accepting a (possibly small) bias for the slope variance (see also Grund, Lüdtke, & Robitzsch, 2016a). To mitigate this problem, it has been suggested to generate imputations for predictor variables in such a way that they are consistent with the model of interest (e.g., Goldstein, Carpenter, & Browne, 2014;Wu, 2010; see also Bartlett, Seaman, White, & Carpenter, 2015). These methods may provide an improvement over current implementations of multilevel MI in complex multilevel models with random slopes and missing values in predictor variables (Erler et al., in press). Unfortunately, they are currently not available in standard software.
Even though many algorithms exist for MI of multilevel data, the analysis often remains a challenge when software does not provide the tools for combining the results from multiply imputed data sets. Using the mitml package, we provided examples for combining simple parameter estimates, model comparisons, and model constraints with multiply imputed data sets.
The treatment of multilevel missing data offers many challenges, and state-of-the-art procedures are often not very accessible unless researchers are deeply familiar with missing data and MI. We hope that the present article will provide guidance for applied researchers and promote the use of modern missing data techniques such as MI. In general, we believe that pan is a powerful tool for treating multilevel missing data because many features of typical research questions can easily be represented in pan's MLMM. Future research should devote attention to increasing the accessibility of modern methods for handling and analyzing missing data.
Currently, the use of MI in multilevel research, while largely desirable, is often hindered by the lack of accessible software and appropriate tools for analyzing multiply imputed data sets in real-world research scenarios. For future studies, the topic of multilevel missing data yields many interesting research questions that have yet to be explored. Note. MA = math achievement; RA = reading achievement; CA = cognitive ability; SES = socioeconomic status; DPM = disciplinary problems in math class; DPR = disciplinary problems in reading class; SC = school climate. Note. The patterns displayed here account for ≥ 95% of the sample. o = observed; x = missing; MA = math achievement; RA = reading achievement; CA = cognitive ability; SES = socioeconomic status; DPM = disciplinary problems in math class; DPR = disciplinary problems in reading class; SC = school climate. Table 3 Two Multivariate Linear Mixed-Effects Models for Missing Data General notation   Note. The predictor x 1 is assumed to be completely observed. Vectorization of the random-effects matrix b j is achieved by stacking its columns.  Note. Estimates were significant at p < .001; SE = standard error; MA = math achievement; SES = socioeconomic status; DPM = disciplinary problems in math classes; υ 0 j = random intercepts; i j = residuals at Level 1. Note. Estimates for the fixed effects were significant at p < .001; SE = standard error; CA = cognitive ability; SES = socioeconomic status; υ 0 j = random intercepts; υ 1 j = random slopes; i j = residuals at Level 1.