On the Treatment of Missing Data in Background Questionnaires in Educational Large-Scale Assessments: An Evaluation of Different Procedures

Large-scale assessments (LSAs) use Mislevy’s “plausible value” (PV) approach to relate student proficiency to noncognitive variables administered in a background questionnaire. This method requires background variables to be completely observed, a requirement that is seldom fulfilled. In this article, we evaluate and compare the properties of methods used in current practice for dealing with missing data in background variables in educational LSAs, which rely on the missing indicator method (MIM), with other methods based on multiple imputation. In this context, we present a fully conditional specification (FCS) approach that allows for a joint treatment of PVs and missing data. Using theoretical arguments and two simulation studies, we illustrate under what conditions the MIM provides biased or unbiased estimates of population parameters and provide evidence that methods such as FCS can provide an effective alternative to the MIM. We discuss the strengths and weaknesses of the approaches and outline potential consequences for operational practice in educational LSAs. An illustration is provided using data from the PISA 2015 study.

One of the main goals of educational large-scale assessments (LSAs) such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS) is to provide information about student proficiency and the relations between student proficiency and other cognitive and noncognitive variables such as students' socioeconomic status, self-concept, attitudes, or interests (Singer & Braun, 2018). To this end, LSAs employ scaling procedures by which proficiency scores are estimated for each student on the basis of (a) their performance on an achievement test and (b) the information they provide in a background questionnaire. Combining these two sources of information, the scaling procedure is often used to generate a set of "plausible values" (PVs) for the proficiency of each student (Mislevy, 1991). This method follows Rubin's (1987) multiple imputation (MI) approach by regarding the latent proficiency scores as missing data and allows for unbiased estimates of population parameters (Mislevy, 1991; Mislevy, Johnson, & Muraki, 1992; for other approaches to estimating population parameters, see also Culpepper & Park, 2017; Rijmen et al., 2014; von Davier & Sinharay, 2007, 2010). An important requirement of these scaling procedures is that the background variables are completely observed. However, missing data are common in LSAs. They occur, for example, because students fail to answer some of the items in the background questionnaire or because some studies use a rotated questionnaire (or planned missing data) design (Graham et al., 2006; Rhemtulla & Hancock, 2016). For example, socioeconomic status is often missing for a substantial number of participants, thereby making it difficult to derive statements about the relation between students' socioeconomic background and educational attainment. This difficulty raises the question of how missing data in background variables should be treated in educational LSAs.
The treatment of missing data in LSAs is further complicated by the fact that the data are prepared and used by two different parties sometimes called the "imputer," whose task is to generate the PVs, and the secondary "analyst" who uses the data to answer a substantive research question (Meng, 1994; see also Shin, 2013). For the imputer, an ideal statistical procedure would allow for the generation of PVs while simultaneously treating the missing data in background variables (e.g., using MI; see also Blackwell et al., 2017a, 2017b). No such procedure is currently used in educational LSAs. Instead, LSAs rely on the missing indicator method (MIM; Cohen & Cohen, 1975) whose application has been discouraged in the statistical literature (e.g., Jones, 1996; Schafer & Graham, 2002). For the analyst, this method creates additional challenges because, although it provides PVs, it does not provide imputations for missing data and requires that these be treated by other means.
Several articles have investigated the treatment of missing data in background variables. von Davier (2013) provided a discussion of the challenges associated with rotated questionnaires in LSAs. Adams et al. (2013) investigated the effects of rotated questionnaires on the estimates of population parameters and concluded that these designs allow for accurate estimates of the marginal properties of student proficiency (e.g., population means). However, Rutkowski (2011) and Rutkowski and Zhou (2015) found that the current treatment of missing data in LSAs can lead to biased parameter estimates in subpopulations (e.g., means and mean differences) when missing data occur in a systematic manner. Su (2016, 2018) showed that different rotated questionnaire designs can differ greatly in how well they recover variables' marginal properties and the relations between them. For the treatment of missing data, Assmann et al. (2015) proposed a Bayesian estimation procedure for estimating the scaling model
with missing background data, and Weirich et al. (2014) evaluated procedures based on two-stage (or nested) MI (see also Harel, 2007;Rubin, 2003). Finally, Wetzel et al. (2015) considered alternatives to the current practice based on latent class models.
In writing this article, we had three goals. First, we aimed to evaluate the procedures currently used for handling missing data in background variables in educational LSAs (i.e., those based on the MIM) in an attempt to clarify the conditions under which they can cause problems. Second, we aimed to implement and evaluate a strategy that allows for a joint modeling of PVs for the proficiency scores and imputations of missing data in background variables. Finally, we attempted to contrast the strengths and weaknesses of these approaches and provide recommendations for the operational practices applied in educational LSAs.
The article is organized as follows. In the first section, we review the statistical procedures used in the scaling of proficiency data and the generation of PVs in educational LSAs. In the second and third sections, we extend our discussion to the case with missing data in background variables, reviewing (a) the current practices applied in educational LSAs (based on the MIM) and (b) the joint modeling of measurement error and missing data. We then present the results of two simulation studies in which we evaluated the performance of these methods in two scenarios with different measurement models and statistical complexity. Finally, we illustrate their application with data from the PISA 2015 study (Organization for Economic Cooperation and Development [OECD], 2017). We close with a discussion of our findings and recommendations for practice.

Statistical Models in LSAs
The statistical procedures used in educational LSAs are applied to represent the relations between students' responses on the achievement test, their latent (i.e., unobserved) proficiency on these tests, and their responses on the background questionnaire. Let y denote item responses on the achievement test, x the responses on the background questionnaire, and θ the (multidimensional) latent proficiency. Then, under the conditional independence and invariance assumptions usually employed in LSAs (e.g., absence of differential item functioning), the joint distribution of y, x, and θ can be written as

P(y, x, θ; ξ, φ) = P(y | θ; ξ) P(x, θ; φ),    (1)

where P(y | θ; ξ) is a measurement (or scaling) model describing the relations between the item responses and the latent proficiency, and P(x, θ; φ) is a structural (or population) model describing the relations between the latent proficiency and the background variables. In the following, we briefly review the scaling procedures and the generation of PVs in the hypothetical case in which no data are missing in the background variables.¹

Scaling Model
In educational LSAs, the students' responses to the items on the achievement test are modeled with item response theory (IRT), which describes item responses as a function of both person ability and item characteristics. In general, IRT models express the probability of observing a response pattern y_i = (y_i1, …, y_iJ) for a person i (i = 1, …, N) to a set of items j (j = 1, …, J) in a statistical model for each item,

P(y_i | θ_i; ξ) = ∏_{j=1}^{J} P(y_ij | θ_i; ξ_j),    (2)

where θ_i denotes the latent proficiency of person i and ξ = (ξ_1, …, ξ_J) denotes a set of item parameters. One particular IRT model that is frequently used in educational LSAs is the generalized partial credit model (Muraki, 1992; for an overview, see von Davier & Sinharay, 2013; von Davier et al., 2006). Suppose that a multidimensional latent proficiency θ_i is measured with a set of items, each with K_j + 1 ordered categories, that is, with possible scores k = 0, …, K_j. Then the probability of person i achieving score k on item j can be modeled as

P(y_ij = k | θ_i; ξ_j) = exp[ Σ_{v=0}^{k} (a_j′θ_i + b_jv) ] / Σ_{u=0}^{K_j} exp[ Σ_{v=0}^{u} (a_j′θ_i + b_jv) ],    (3)

where the item parameters ξ_j = (a_j, b_j) denote item slopes and intercepts, respectively (with b_j0 = 0). The response probabilities in the scaling model give rise to an individual likelihood function about θ_i, thus allowing proficiency scores to be estimated for each person. Although point estimates such as the maximum likelihood estimate (MLE; Lord, 1983) and the weighted likelihood estimate (WLE; Warm, 1989) are often used to measure an individual's proficiency, their application in LSAs has been shown to be inappropriate because they do not fully account for measurement error, thus leading to biased estimates of population parameters (e.g., inflated variance of θ; distorted estimates of subgroup characteristics, correlations, and regression coefficients; Braun & von Davier, 2017; von Davier et al., 2009; Wu, 2005).
Instead, proficiency scores in educational LSAs are generated in the form of PVs, which allow for an accurate estimation of population parameters (e.g., Mislevy, 1991; Mislevy, Beaton, et al., 1992).
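To make the scaling model concrete, the category probabilities of the generalized partial credit model can be computed in a few lines. The snippet below is an illustrative Python sketch (the operational software differs), assuming a unidimensional θ and a single item with slope `a` and intercepts `b` (with `b[0]` fixed at 0), in the slope-intercept form described above.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities for one item under the generalized
    partial credit model in slope-intercept form: the (unnormalized)
    log-probability of score k is the cumulative sum of a*theta + b_v
    over v = 0, ..., k, with b[0] fixed at 0."""
    z = np.cumsum(a * theta + np.asarray(b, dtype=float))
    z -= z.max()                      # guard against overflow
    p = np.exp(z)
    return p / p.sum()

# one item with three ordered categories (scores 0, 1, 2)
probs = gpcm_probs(theta=0.5, a=1.2, b=[0.0, 0.3, -0.2])
```

Because b_j0 = 0, the v = 0 term contributes equally to every category and cancels in the normalization, which is why the convention is harmless.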

Population Model and PVs
The generation of PVs follows Rubin (1987) by regarding the latent proficiency scores as missing data, for which imputations can be generated by drawing repeatedly from the posterior predictive distribution of the proficiency θ, given the responses on the achievement test, the parameters of the scaling model, and the responses on the background questionnaire (Mislevy, 1991). Let x_i denote the values on the background variables for person i. Then the posterior distribution of θ_i can be expressed as

P(θ_i | y_i, x_i; ξ, φ) ∝ P(y_i | θ_i; ξ) P(θ_i | x_i; φ),    (4)

where the first term denotes the scaling model with item parameters ξ and the second term denotes a population model for the latent proficiency with parameters φ. This model often takes the form of a latent regression model with parameters φ = (Γ, Σ), in which the proficiency scores are regressed on the responses from the background questionnaire (see Mislevy, 1991; Mislevy, Johnson, & Muraki, 1992):

θ_i = Γ′x_i + ε_i,  ε_i ∼ N(0, Σ),    (5)

where Γ is a matrix of regression coefficients, and Σ is the residual variance-covariance matrix. Notice that the expression for the posterior distribution in Equation 4 is proportional to the joint distribution in Equation 1. In other words, sampling PVs from the posterior distribution given the observed data for each respondent is equivalent to obtaining samples from the joint distribution of y, x, and θ. In practice, the parameters of the population model can be estimated with an expectation-maximization (EM) algorithm (Dempster et al., 1977; for alternative methods, see also Culpepper & Park, 2017; Johnson & Jenkins, 2004), and samples from the posterior distribution of θ can be obtained by substituting ξ, Γ, and Σ with their MLEs or with approximate draws from their posterior distributions (Bock & Aitkin, 1981; Mislevy, Johnson, & Muraki, 1992; Thomas & Gan, 1997; for an overview, see von Davier & Sinharay, 2013; von Davier et al., 2006). This procedure for generating PVs can be used directly when the background variables are completely observed.
In practice, however, the background variables often contain missing data and require treatment before the PVs can be generated (Adams et al., 2013;Wetzel et al., 2015). In the following, we describe the method currently used to deal with missing data in the background questionnaire in LSAs and consider its statistical properties.
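The logic of drawing PVs from the posterior described above can be illustrated with a simplified sketch. Assuming the individual likelihood is approximated by a normal distribution centered at a point estimate `theta_hat` with standard error `se`, and the population model is a univariate latent regression, the posterior is itself normal and PVs are direct draws from it. All names here are illustrative, not part of any operational pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_pvs(theta_hat, se, x, gamma, sigma2, n_pv=5):
    """Draw plausible values for one examinee.

    The IRT likelihood is approximated by N(theta_hat, se^2); the
    population model is the latent regression theta ~ N(x @ gamma,
    sigma2). Combining the two normals gives a normal posterior with
    a precision-weighted mean, from which the PVs are drawn."""
    prior_mean = x @ gamma
    post_var = 1.0 / (1.0 / se**2 + 1.0 / sigma2)
    post_mean = post_var * (theta_hat / se**2 + prior_mean / sigma2)
    return rng.normal(post_mean, np.sqrt(post_var), size=n_pv)

pvs = draw_pvs(theta_hat=0.8, se=0.4, x=np.array([1.0, 0.5]),
               gamma=np.array([0.1, 0.6]), sigma2=0.7)
```

Note how the posterior mean shrinks the test-based point estimate toward the prediction from the background variables; it is this conditioning that lets PVs, unlike WLEs, recover population parameters without attenuation.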

Missing Data in the Background Questionnaire
In educational LSAs, the background variables are seldom complete. Suppose that some elements of x are missing, with their observed and missing parts denoted as x_obs and x_mis, and let r be a set of indicator variables that describe which values in x are missing and observed, respectively. Rubin (1976) referred to data as missing completely at random (MCAR) if the propensity for missing data is independent of both the observed and the missing data, that is, P(r | x, y) = P(r). This can occur, for example, when missing data result from the use of a rotated questionnaire (or planned missing data) design (e.g., Graham et al., 2006). By contrast, if the data are missing at random (MAR), then the propensity for missing data depends on the observed data but not on the missing data once the observed data are taken into account, that is, P(r | x, y) = P(r | x_obs, y). In the following, we assume that the background data are either MCAR or MAR.² The current method for dealing with missing data in the background variables in LSAs relies on including the indicator variables r in the latent regression model and recoding the data to temporarily render them complete (Adams et al., 2013). This strategy is also known as the missing indicator method (MIM) in the missing data literature (Allison, 2001; Cohen & Cohen, 1975). In the following, we describe this method in more detail.

Missing Indicator Method
The general idea behind the MIM is to recode the incomplete background questionnaire data by replacing missing data with predefined values and by including additional indicator variables that represent differences between cases with missing and observed data, respectively (Adams et al., 2013). The MIM can be applied to both continuous and categorical data, as illustrated in Table 1. For any background variable with missing data x_q, an indicator variable r_q is created such that

r_q = 0 if x_q is observed, and r_q = 1 if x_q is missing.

In addition, the background variables are recoded to render them temporarily complete. A continuous background variable x_q, possibly centered at its mean or median, is recoded into a new variable x*_q such that

x*_q = x_q if x_q is observed, and x*_q = 0 if x_q is missing.

This is illustrated in Table 1. For categorical data, the same principle is used but combined with dummy or effect coding. For example, when using dummy codes to represent a categorical variable x_q with L_q categories and values 0, …, L_q − 1, the variable is recoded into L_q − 1 dummy variables x*_ql (l = 1, …, L_q − 1) such that

x*_ql = 1 if x_q = l and x_q is observed, and x*_ql = 0 otherwise,

where the remaining category acts as a reference category. This strategy is sometimes described as treating missing responses as an "extra category" because the missing indicator r_q acts as an additional dummy variable in the coding scheme. In other words, treating missing responses as an extra category is equivalent to coding missing responses as zero and adding a missing indicator to the coding scheme. This is illustrated in Table 1 for a single variable with three categories.
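The recoding for a continuous variable can be illustrated with a small sketch (Python with pandas; the function name and inputs are hypothetical): missing values are replaced by a constant and flagged by an indicator.

```python
import numpy as np
import pandas as pd

def mim_recode(x, fill=0.0):
    """MIM recoding for a continuous variable: return the recoded
    variable x* (observed value, or `fill` where missing) and the
    missing indicator r (1 = missing, 0 = observed)."""
    r = x.isna().astype(int)
    x_star = x.fillna(fill)
    return x_star, r

x = pd.Series([1.2, np.nan, -0.5, np.nan])
x_star, r = mim_recode(x)
# x_star: [1.2, 0.0, -0.5, 0.0], r: [0, 1, 0, 1]
```

For a categorical variable, the same effect is obtained by dummy coding the observed categories (e.g., with `pd.get_dummies`), coding missing responses as zero on all dummies, and adding r as an extra column.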
The recoded variables x* and the indicators r are then used as conditioning variables in the latent regression model, often in the form of principal components or a similar lower dimensional representation, and the PVs are generated from the (adjusted) posterior distribution

P(θ_i | y_i, x*_i, r_i; ξ, φ) ∝ P(y_i | θ_i; ξ) P(θ_i | x*_i, r_i; φ).

Because the variables x* take on a constant value whenever the corresponding values in x are missing, the recoded background variables contribute to the posterior distribution only to the extent to which they have data (for further discussion, see von Davier, 2013). Despite its convenience, the MIM has been criticized because it can distort parameter estimates (Jones, 1996). The same may be true in LSAs when PVs are generated under the MIM (see also Rutkowski, 2011). Furthermore, because the MIM provides only a temporary solution to missing data in background variables, secondary analysts must take additional steps to treat them, for example, by removing cases with incomplete data (e.g., listwise deletion) or by using MI (see also Schafer & Graham, 2002).³ It is largely an open question whether using the MIM, by itself or in combination with these methods for handling missing data, can provide unbiased estimates of population parameters in LSAs. In the following, we consider this question in more detail.

Estimators Under the MIM
Depending on the method used to treat missing data after generating the PVs via the MIM, it is possible to obtain different estimators for population parameters. In the following, we consider three such methods: pairwise estimation of variances and covariances (MIM-PE), listwise deletion (MIM-LD), and nested MI of the incomplete data (MIM-MI). The three methods provide different estimators for the population parameters that differ in how they make use of the data. Specifically, MIM-PE estimates variances and covariances on the basis of the (pairwise) available data, and other parameters such as correlation or regression coefficients are then estimated on the basis of the variance-covariance matrix. MIM-LD estimates variances and covariances in the same way, but other parameters such as correlation and regression coefficients are obtained on the basis of only the subset of the data in which all of the required variables are complete. Finally, MIM-MI generates replacements for missing data in background variables on the basis of the PVs generated under the MIM and the observed data, and the population parameters of interest are then estimated on the basis of the imputed data.
To illustrate the difference between the estimators, consider a hypothetical scenario with two variables u and v, both of which contain missing data. In this case, there are two missing data indicators r_u and r_v. Suppose the parameter of interest is the correlation between u and v. Using MIM-PE, the correlation is estimated by first estimating the variances and covariances of the variables using the (pairwise) available data, that is,

r̂_uv^PE = ŝ_uv;r_u=0,r_v=0 / (ŝ_u;r_u=0 · ŝ_v;r_v=0),

where the subscripts refer to the subsets of the data in which u and v are observed (r = 0) or missing (r = 1), respectively. By contrast, using MIM-LD, the correlation is estimated using only the subset of the data in which both u and v are observed, that is,

r̂_uv^LD = ŝ_uv;r_u=0,r_v=0 / (ŝ_u;r_u=0,r_v=0 · ŝ_v;r_u=0,r_v=0).

Although the two estimators use the same information about the covariance between the variables, they make different use of the information that is available about the variances. The same principle applies more generally to all parameters whose sufficient statistics include the variance-covariance matrix of the variables. For example, suppose that v now includes two or more variables, and the parameters of interest are the regression coefficients in the multiple regression model

u = β_0 + β′v + ε.

The sufficient statistics for the regression coefficients (ignoring the intercept) are the variances and covariances among the variables, and the coefficients can be estimated as

β̂ = Ŝ_vv⁻¹ ŝ_vu,

where Ŝ_vv is the estimated variance-covariance matrix of v, and ŝ_vu is the vector of estimated covariances between v and u. In MIM-PE, the variances and covariances are estimated using the (pairwise) available data, that is, every element in the variance-covariance matrix is
estimated separately using the data available for the respective (pair of) variables. In MIM-LD, all values in the variance-covariance matrix are estimated at once using the subset of the data in which all variables are observed. Finally, in MIM-MI, the variances and covariances are estimated using the imputed data. In the following, we discuss the statistical properties of the different estimators under the MIM in more detail.
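The difference between the pairwise and listwise estimators of a correlation can be made concrete with a small numeric sketch (Python, purely illustrative): both use the jointly observed cases for the covariance, but MIM-PE estimates each standard deviation from all cases observed on that variable, whereas MIM-LD restricts every moment to the complete cases.

```python
import numpy as np

def cor_listwise(u, v):
    """MIM-LD style: all moments from cases observed on both variables."""
    both = ~np.isnan(u) & ~np.isnan(v)
    return np.corrcoef(u[both], v[both])[0, 1]

def cor_pairwise(u, v):
    """MIM-PE style: covariance from jointly observed cases, standard
    deviations from all cases available for each variable."""
    both = ~np.isnan(u) & ~np.isnan(v)
    s_uv = np.cov(u[both], v[both])[0, 1]
    return s_uv / (np.nanstd(u, ddof=1) * np.nanstd(v, ddof=1))

u = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
v = np.array([2.0, 1.0, 4.0, 3.0, 5.0, 6.0])
r_ld = cor_listwise(u, v)   # uses only the first four cases
r_pe = cor_pairwise(u, v)   # the variance of v uses all six cases
```

In this toy example the two estimates differ because the variance of v is larger in the full sample than among the complete cases; with real data, which estimator is closer to the truth depends on the missing data mechanism.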

Statistical Properties of the MIM
To better understand the statistical properties of the MIM, we derived the asymptotic bias in the parameter estimates under the MIM in a "minimal" setting with one latent variable θ, measured by a single indicator y, and two background variables x_1 and x_2. We assumed that the sample was large (n → ∞) and that x_1 was MCAR. The parameters of interest were the variances and covariances among the variables as well as correlation and regression coefficients. In the following, we present the main results of this investigation. The full derivation is presented in Supplement A in the online version of this article. The main results for the variances and covariances are also summarized in Table 2, and an additional simulation study confirming and illustrating the results is presented in Online Supplement B.
Pairwise estimation. Our investigation showed that, under MIM-PE, the variances and covariances can be estimated without bias under MCAR. Consequently, this would also allow for an unbiased estimation of other parameters such as correlation and regression coefficients by employing a two-step procedure in which the variances and covariances of the variables are estimated in a pairwise fashion, and regression coefficients are then calculated on the basis of the variance-covariance matrix. This is an important finding because it shows that unbiased estimates of many key parameters can be obtained under the MIM at least when the data are MCAR.
Listwise deletion. Under MIM-LD, the variances and covariances are estimated without bias, as with MIM-PE. However, the estimates of correlation and regression coefficients can be biased even when the data are MCAR. This is because variances and covariances can be biased in the subsets of the data that pertain to complete and incomplete cases, respectively (MIM_r=0 and MIM_r=1; see Table 2). For example, the estimated regression coefficients for the regression of θ on x_1 and x_2 under MIM-LD are biased by an amount that depends on the conditional reliability λ_c and on the quantity

B = b_1 p r²_12 / [1 − (1 − p) r²_12],

where p is the proportion of missing data, λ_c is the conditional reliability of y as a measure of θ given x_1 and x_2, b_1 is the true coefficient of x_1 in the regression of θ on x_1 and x_2, and r_12 is the correlation between x_1 and x_2 (for details, see Online Supplement A). For other parameters, the bias is also a function of the proportion of missing data, the conditional reliability, the true regression coefficients from the regression of θ on x_1 and x_2, and the correlation between x_1 and x_2. The bias appears to be relatively small in most of the cases that are relevant for practice, but it may increase if the reliability is low or the proportion of missing data is high. This illustrates that naive estimates of correlation and regression coefficients, which often correspond to the default methods in statistical software, can be biased even if the estimates of the variances and covariances are not and even if the data are MCAR.
Nested MI. Under MIM-MI, some but not all variances and covariances can be estimated without bias (see Table 2). Specifically, because the estimates of the variances and covariances in the subsets of the data pertaining to complete and incomplete cases (MIM_r=0 and MIM_r=1) are biased under the MIM, the parameters of the imputation model that is used to impute missing data in x_1 are biased as well. Consequently, the estimates of the parameters relating to x_1 can be biased even when the data are MCAR, including some of the variances and covariances as well as the correlation and regression coefficients. The form of the bias tends to be complex even for simple parameters such as correlation and regression coefficients but is again a function of the proportion of missing data, the conditional reliability of y, the true regression coefficients of x_1 and x_2, and the correlation between x_1 and x_2. In summary, our investigation suggested that, although unbiased estimates can be obtained with MIM-PE, naive use of MIM-LD or even MIM-MI can lead to distorted parameter estimates even if the data are MCAR. Nonetheless, these results should be interpreted with care due to the restrictive nature of the assumptions made in their derivation. In the following section, we present an alternative to the MIM based on a joint treatment of PVs and missing data in the background questionnaire. Then, we present the results of two simulation studies that we conducted to evaluate these procedures in more general and realistic scenarios.

Joint Treatment of PVs and Missing Data
If there are missing data in the background variables, the joint distribution of y, x, and θ in Equation 1 can be extended to include both the observed and missing parts of x, that is, both x_obs and x_mis. Under the assumption that conditional independence and invariance hold as before and that the data are MAR, the joint distribution can be written as

P(y, x_obs, x_mis, θ, r; ξ, φ, ψ) = P(y | θ; ξ) P(x_obs, x_mis, θ; φ) P(r | y, x_obs; ψ),

where P(y | θ; ξ) is a measurement model, P(x_obs, x_mis, θ; φ) is a population model describing the (hypothetical) relations between θ and both x_obs and x_mis, and P(r | y, x_obs; ψ) is the missing data mechanism (i.e., MAR).
In order to draw inferences on the basis of only the observed data, the missing data can be "integrated out" of the function used to fit the model of interest (e.g., the likelihood function or the posterior distribution; see Little & Rubin, 2002). The following posterior distribution for the missing data and the latent proficiency scores can then be used to draw PVs for θ and imputations for x_mis:

P(θ, x_mis | y, x_obs; ξ, φ) ∝ P(y | θ; ξ) P(x_obs, x_mis, θ; φ).

Finally, inferences about population parameters can be drawn by averaging over the PVs for θ and the imputations for x_mis generated in this manner (see also Tanner & Wong, 1987). In the following, we describe a general strategy for sampling from the joint distribution of θ and x_mis.

Sampling From the Joint Distribution
There are several computational approaches to sampling from the joint distribution of x_mis and θ. In this article, we outline a fully conditional specification (FCS) algorithm that can be used to draw from the joint distribution using a sequence of conditional models (Raghunathan et al., 2001; van Buuren et al., 2006; for a similar approach, see also Assmann et al., 2015). Other approaches are possible and are considered in the Discussion section.
FCS algorithm. The general idea of FCS is to approximate draws from the joint posterior distribution of θ and x_mis by iterating along a sequence of conditional models and generating imputations for θ and x_mis in a step-by-step manner, that is, by iterating back and forth between sampling steps pertaining to θ and x_mis. This procedure is very flexible because it can employ different imputation models for each variable with missing data. The sampling algorithm in the FCS approach can be summarized as follows. Let φ_p = (γ_p, σ²_p) denote the parameters of the latent regression model for the pth latent proficiency θ_p (p = 1, …, P), and suppose that there is a measurement model for each proficiency that induces an individual likelihood function about θ_ip, given the item responses y_i and the item parameters ξ_p. Further, let ψ_q denote the parameters for the imputation model of the qth background variable with missing data (q = 1, …, Q). Then, at iteration t, the algorithm cycles through two sets of steps: (a) for each proficiency p = 1, …, P, draw updated parameters φ_p^(t) for the latent regression model and new PVs θ_ip^(t) from the posterior distribution implied by the individual likelihood and the latent regression on the current (imputed) background data, and (b) for each incomplete background variable q = 1, …, Q, draw updated parameters ψ_q^(t) and new imputations for the missing values in x_q from the imputation model, conditioning on the remaining background variables and the current draws of θ_i. Notice that the FCS algorithm divides the sampling of the multidimensional proficiency θ across separate univariate models. Notice also that the sampling step for drawing the imputations for x_mis,i conditions only on θ_i but not on y_i. This step follows the conditional independence assumption in the scaling model (Equation 1) but can be modified to include y_i if needed. For computational convenience, it is possible to approximate the posterior distribution of θ_ip with a normal distribution and to use a measurement model with fixed item parameters (e.g., as estimated in an earlier step; see also Thomas & Gan, 1997; von Davier et al., 2006). Posterior draws for the model parameters can be obtained most easily by employing conjugate prior distributions and combining them with their MLEs (e.g., using the EM algorithm). This approach is implemented in the miceadds package (Robitzsch et al., 2019) for the statistical software R. The
imputation models for missing data in background variables may be chosen from among any univariate (or multivariate) models suited for the data (e.g., normal distribution, predictive mean matching; see Schafer, 1997; van Buuren et al., 2006). In the following, we present the results of two simulation studies that we conducted to evaluate these procedures.
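The back-and-forth structure of the FCS algorithm can be conveyed in a toy implementation. The sketch below (Python, purely illustrative; the article's implementation is in R via miceadds) handles one proficiency dimension and one incomplete background variable, approximates the individual likelihood by a normal with mean `theta_hat` and standard deviation `se`, and re-estimates model parameters by least squares within each cycle rather than drawing them from their full posteriors, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(7)

def fcs_step(theta_hat, se, x, miss, n_iter=20):
    """Toy FCS cycle for one proficiency theta and one incomplete
    background variable x[:, 0] (x[:, 1] is complete).

    Step (a): draw theta from a normal posterior combining a normal
    approximation of the IRT likelihood, N(theta_hat, se^2), with the
    latent regression of theta on the currently completed x.
    Step (b): impute x[:, 0] from a normal regression on theta and
    x[:, 1], conditioning on theta rather than the item responses.
    """
    n = len(theta_hat)
    x = x.copy()
    x[miss, 0] = np.nanmean(x[:, 0])          # crude starting imputation
    theta = theta_hat.copy()
    for _ in range(n_iter):
        # (a) latent regression parameters via least squares
        X = np.column_stack([np.ones(n), x])
        gamma, *_ = np.linalg.lstsq(X, theta, rcond=None)
        sigma2 = (theta - X @ gamma).var()
        post_var = 1.0 / (1.0 / se**2 + 1.0 / sigma2)
        post_mean = post_var * (theta_hat / se**2 + (X @ gamma) / sigma2)
        theta = rng.normal(post_mean, np.sqrt(post_var))
        # (b) imputation model for the incomplete background variable
        Z = np.column_stack([np.ones(n), theta, x[:, 1]])
        beta, *_ = np.linalg.lstsq(Z, x[:, 0], rcond=None)
        tau2 = (x[:, 0] - Z @ beta).var()
        x[miss, 0] = rng.normal((Z @ beta)[miss], np.sqrt(tau2))
    return theta, x

# Illustrative data; theta_hat plays the role of a point estimate
n = 200
x2c = rng.normal(size=n)
x1c = 0.5 * x2c + rng.normal(scale=0.8, size=n)
theta_true = 0.4 * x1c + 0.4 * x2c + rng.normal(scale=0.6, size=n)
theta_hat = theta_true + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.3
xb = np.column_stack([x1c, x2c])
xb[miss, 0] = np.nan
theta_draw, xb_imp = fcs_step(theta_hat, np.full(n, 0.5), xb, miss)
```

After the cycles, the last draws of `theta_draw` and `xb_imp` constitute one set of PVs and imputations; repeating the whole procedure yields multiple completed data sets.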

Study 1
In the first study, we focused on a minimal setting, which included a 2PL IRT measurement model and two continuous background variables, one of which contained missing data.

Data Generation
The data were simulated from a multivariate normal distribution with three variables θ, x_1, and x_2, where x_1 and x_2 are two background variables, and θ represents the latent proficiency that is measured by a number of dichotomous indicators y_j. For person i,

(θ_i, x_1i, x_2i)′ ∼ N(0, Σ),

where Σ is the population correlation matrix. The measurement model was a 2PL IRT model defined as follows. For person i and item j,

P(y_ij = 1 | θ_i; ξ_j) = exp[a_j(θ_i − b_j)] / (1 + exp[a_j(θ_i − b_j)]),

where the item parameters ξ_j = (a_j, b_j) were chosen in such a way that the item difficulties were equidistant in the interval [−2, 2], the item discrimination parameters were equidistant in the interval [0.5, 1.5], and the item difficulty and discrimination were uncorrelated for any given set of items.⁴

Missing data. Missing data were induced in x_1 on the basis of a latent response propensity r*. Specifically, we used the following linear model to simulate r* given the values of some predictor variable z:

r*_i = a + d z_i + e_i,

where a is a quantile of the standard normal distribution according to the probability of missing data (e.g., a = −0.84 for 20% missing data), and d denotes the effect of z on missingness in x_1. A value x_1i was deleted if r*_i > 0. We varied the missing data mechanism by (a) changing which variable predicted r* (i.e., which variable was represented by z) and (b) choosing d in such a way that the predictor explained more or less of the variance in r*.
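A design like this can be mimicked with a short simulation sketch (Python, illustrative; the function and its defaults are assumptions rather than the authors' code). It assumes z is standardized and gives r* unit total variance, so that the deletion rule r* > 0 produces the intended missing data rate.

```python
import numpy as np

rng = np.random.default_rng(11)

def simulate(n=2000, J=15, rho=0.5, d=0.7, a0=-0.84):
    """Generate data loosely following the Study 1 design:
    (theta, x1, x2) are multivariate normal with common correlation
    rho; theta is measured by J dichotomous 2PL items; x1 is made MAR
    via a latent response propensity r* = a0 + d*x2 + e with unit
    total variance, so P(missing) = Phi(a0) (a0 = -0.84 gives ~20%)."""
    S = np.full((3, 3), rho)
    np.fill_diagonal(S, 1.0)
    theta, x1, x2 = rng.multivariate_normal(np.zeros(3), S, size=n).T
    b = np.linspace(-2.0, 2.0, J)        # difficulties in [-2, 2]
    a = np.linspace(0.5, 1.5, J)         # discriminations in [0.5, 1.5]
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    y = (rng.random((n, J)) < p).astype(int)
    r_star = a0 + d * x2 + rng.normal(scale=np.sqrt(1 - d**2), size=n)
    x1_obs = np.where(r_star > 0, np.nan, x1)
    return y, x1_obs, x2, theta

y, x1_obs, x2, theta = simulate()
```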
The simulated conditions are summarized in Table 3 and were chosen so that they reflected typical conditions in LSAs or were of theoretical interest.

Specifically, we varied the sample size (n = 100, 500, 2,000) to reflect applications with small, moderate, or large samples (e.g., subgroups vs. countries); the number of indicators y_j (J = 7, 15, 60) to mimic conditions in which the test had low, moderate, or high reliability (e.g., facets vs. general domains); and the correlation between the variables to reflect conditions with more or less information about θ in the background variables. The data in x_1 were either MCAR, MAR as a function of x_2 (MAR_x), or MAR as a function of the WLE of θ (MAR_θ; for a similar approach, see Cham et al., 2017). Both MAR_x and MAR_θ were simulated in such a way that either 50% or 100% of the variance in the response propensity was explained by x_2 or the WLE of θ, respectively. Finally, we varied the probability of missing data (20% or 40%). This resulted in a total of 180 conditions, each replicated 1,000 times.
Scaling and imputation procedures. In the scaling model for θ, the item parameters were fixed to their true values. The population model was a regression of θ on x_1 and x_2 as suggested by the data-generating model. The procedures used to treat missing data are summarized in Table 4 and included MIM-PE, MIM-LD, and MIM-MI as outlined above as well as the FCS approach. For the purpose of comparison, we also considered PVs generated from the complete data (CD), that is, without missing data in x_1. For CD, FCS, and the MIM, we generated 20 PVs. For FCS, this also provided 20 imputations of the missing data. For MIM-MI, we generated 10 imputations of the missing data for every PV generated under the MIM, resulting in a total of 200 imputed data sets. All procedures were implemented in the statistical software R, using the packages mice (van Buuren & Groothuis-Oudshoorn, 2011), miceadds (Robitzsch et al., 2019), and TAM (Robitzsch et al., 2018).
Parameters of interest and pooling. The parameters of interest included the means, variances, and covariances for all variables as well as their correlations and the regression coefficients for the regression of y on x1 and x2. For CD, FCS, and MIM-MI, the parameter estimates were obtained on the basis of the PVs for y and the complete or imputed background data. For MIM-PE and MIM-LD, they were obtained on the basis of the PVs for y and the incomplete background data by using pairwise estimation and listwise deletion, respectively. The accuracy of the parameter estimates was evaluated in terms of bias and the root mean squared error (RMSE). In addition, we calculated the coverage rate of the 95% confidence interval to assess the accuracy of the statistical inferences. The standard errors for MIM-PE were obtained through a grouped jackknife procedure with 50 jackknife samples. The standard errors for the correlation coefficients were obtained by applying the Fisher transformation (Fisher, 1915). Results obtained from the CD, FCS, MIM-PE, and MIM-LD were pooled with Rubin's (1987) rules; those obtained from MIM-MI were pooled by applying pooling rules for nested MI (Rubin, 2003; Shen, 2000).
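For reference, Rubin's (1987) rules combine the point estimate and squared standard error from each of the M completed data sets into a pooled estimate, a total variance, and degrees of freedom for a t-based interval. The Python sketch below (the study used R) shows the standard single-stage rules; the nested-MI rules used for MIM-MI extend these with a separate between-nest variance component and are not shown here. The toy numbers are illustrative, not results from the study.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool point estimates and squared standard errors from M imputed
    data sets with Rubin's (1987) rules.

    Returns the pooled estimate, its total variance, and the degrees of
    freedom for a t-based confidence interval.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()                  # pooled point estimate
    ubar = variances.mean()                  # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    t = ubar + (1 + 1 / m) * b               # total variance
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2
    return qbar, t, df

# Toy example: estimates and squared SEs from M = 5 imputations
est = [0.42, 0.45, 0.40, 0.44, 0.43]
var = [0.010, 0.011, 0.009, 0.010, 0.010]
qbar, t, df = pool_rubin(est, var)
print(round(qbar, 3), round(t, 4))  # → 0.428 0.0104
```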

Results
Because the simulation yielded many results, we focus on only the main findings here.
Bias. Figure 1 shows the estimated bias for the mean and variance of y for conditions with different sample sizes and missing data mechanisms. The estimates of the overall mean and variance of y were approximately unbiased for all procedures regardless of the sample size and the missing data mechanism. Similarly, Figure 2 shows the bias for the mean and variance of x1 as well as the covariances between the variables. The estimates of these parameters were approximately unbiased for CD and FCS. By contrast, the results for MIM-PE, MIM-LD, and MIM-MI were more mixed. Specifically, MIM-PE and MIM-LD provided unbiased estimates only under MCAR but not under MARx and MARy. MIM-MI provided less biased estimates under MARx, but some bias remained for the covariance of x1 with y and (to a lesser extent) with x2. In addition, MIM-MI yielded slightly biased results under MCAR, which was not the case for MIM-PE and MIM-LD. Under MARy, MIM-PE, MIM-LD, and MIM-MI showed noticeable bias in these parameter estimates, where the bias was smaller with MIM-MI and pointed in the opposite direction (see Note 5). The extent of the bias with MIM-PE, MIM-LD, and MIM-MI also depended on the number of items (J) and the proportion of missing data. This effect is illustrated in Figure 3 for the bias in the covariance of y with x1 under MARx. Bias with MIM-PE and MIM-LD was larger when more data were missing (40%). MIM-MI reduced the bias under MARx, but the reduction was less effective in conditions with a small or moderate number of items (J = 7 and 15) and more missing data (40%).
In addition to the means, variances, and covariances, we also investigated the bias in the estimates of the correlations between variables and the regression coefficients in the multiple regression of y on x1 and x2. The results are summarized in Figure 4: CD, FCS, and MIM-PE provided approximately unbiased estimates of the parameters. By contrast, the estimates were slightly biased with both MIM-LD and MIM-MI. This was true even when the data were MCAR (see Figure 4). The bias for the two regression coefficients pointed in opposite directions and was largest in conditions with a small number of items (J = 7) and a moderate correlation between y and x2 (r = .35). In most of the other conditions, this bias remained relatively small (usually below 10%). Under MIM-LD, the bias was larger for the regression coefficient of x2 (b2) than for that of x1 (b1), that is, for the variable not affected by missing data. Under MIM-MI, the opposite was true.
RMSE. The RMSE followed a pattern similar to the bias and was usually lower under FCS than under MIM-LD, MIM-PE, and MIM-MI. However, despite the fact that the parameter estimates were sometimes biased under MIM-LD and MIM-MI, they were sometimes more accurate (i.e., had a lower RMSE) than under FCS. This was the case primarily in small samples (n = 100) and in some conditions with moderate samples (n = 500, only for MIM-MI) when the data were MCAR or MARx (50%). This finding may be explained by the fact that the bias for MIM-LD and MIM-MI was often negative in the respective conditions (i.e., toward zero), which may have reduced the variability in the parameter estimates. By contrast, the RMSE of the estimates under MIM-PE was sometimes substantially larger than for the other methods when the data were MARx and MARy.
Coverage. The results for the coverage rates of the 95% confidence intervals are shown in Table 5 for the estimated correlation between y and x1 in selected conditions with data that were MCAR. The coverage rates under FCS were usually close to those obtained with the complete data and to the nominal value of 95%. Under MIM-LD and MIM-MI, the coverage rates sometimes dropped well below the nominal value of 95%, especially in conditions with moderate or large samples (n = 500 and 2,000) and a small number of items (J = 7 and 15). This can be regarded as a direct consequence of bias in the parameter estimates. Under MIM-PE, the coverage rates were often too low in conditions with smaller samples (n = 100 and 500) and a lot of missing data (40%). This indicates that the standard errors under MIM-PE obtained from the grouped jackknife procedure were sometimes too small.

Summary
The results of Study 1 illustrate several important points. First, the application of the MIM allows for unbiased estimates of the mean and variance of y regardless of the missing data mechanism. Second, the MIM allows for unbiased estimates of means, variances, and covariances when the data are MCAR. For more complex parameters, such as correlation and regression coefficients, estimates based on the MIM are unbiased when combined with a pairwise estimation approach (MIM-PE) but not when combined with listwise deletion (MIM-LD) or MI (MIM-MI), even when the data are MCAR. Third, parameter estimates based on the MIM can be biased when the data are MAR (i.e., MARx or MARy), in which case using MI (MIM-MI) may reduce the bias. Finally, the FCS approach led to unbiased and efficient parameter estimates in most conditions, with good inferential properties overall. These findings were in line with our expectations, which were based on the theoretical properties of the MIM. However, it is important to take into account that, in Study 1, we considered only the minimal case with two background variables. For this reason, we conducted a second study with a more complex data structure that bears a closer resemblance to educational LSAs. The main purpose of this study was to investigate whether and to what extent our findings can be expected to carry over to real data, where there is often a large number of background variables with a complex correlation structure and diverse patterns of missing data, which may produce compensatory effects that are not observable in cases with fewer variables.

Study 2
In the second study, we extended the simulation setting to be more similar to conditions commonly found in LSAs. Because the simulation design is much more complex than in the first study, we only provide a brief overview here. For a complete description, we refer to Online Supplement D.

Data Generation, Procedures, and Parameters of Interest
The second simulation featured an extended design with a latent proficiency variable y as well as a total of 54 categorical background variables, which we divided into three blocks, x1, x2, and x3, with 18 variables each. The data were generated as follows. First, we treated the variables as if they were continuous and generated them from a multivariate normal distribution. Then, we discretized the background variables using thresholds so that we could create different types of variables with three to eight categories and varying distributions (symmetric, skewed, or bimodal), which we distributed equally across the three blocks. The measurement model for y was a 2PL IRT model with items assigned to persons according to a booklet design. Finally, we induced missing data in all variables in the first and second blocks of background variables, where missing data in the first block were MCAR, MARx, or MARy as in the first study, and missing data in the second block were induced according to a three-form rotated questionnaire design (i.e., MCAR; see Graham et al., 2006). The simulated conditions were chosen in such a way that they reflected conditions found in educational LSAs (e.g., PISA) and are summarized in Table 3. For a complete description of the simulation design and the simulated conditions, we refer to Online Supplement D. The scaling and imputation procedures were the same as in the previous study (i.e., CD, FCS, MIM-PE, MIM-LD, and MIM-MI). In accordance with the operational practices applied in LSAs, we converted the background variables into contrast codes and used principal components analysis (PCA) to extract 100 principal components, which were then used as conditioning variables in the latent regression model. Under the MIM, the contrast coding also included additional codes to indicate missing responses for each variable (i.e., treating missing responses as an "extra category").
For FCS and MIM-MI, we imputed missing data in the background variables using predictive mean matching (PMM; see Note 6). For CD, FCS, and the MIM, we generated 10 PVs, and for MIM-MI, we generated five imputations for every PV, resulting in a total of 50 imputations.
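Predictive mean matching replaces each missing value with an observed value from a "donor" case whose predicted mean is close to that of the incomplete case, which preserves the (possibly categorical or skewed) distribution of the observed data. A minimal Python sketch of the idea follows; the study used mice in R, which additionally draws the regression parameters from their posterior in each imputation, a Bayesian step omitted here for brevity. The toy data and parameter values are illustrative.

```python
import numpy as np

def pmm_impute(y, X, k=5, rng=None):
    """Minimal predictive mean matching for one incomplete variable y
    given a complete predictor X.

    Fits a linear model on the observed cases, then replaces each missing
    value with the observed value of a randomly chosen donor among the k
    cases with the closest predicted means.
    """
    if rng is None:
        rng = np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
    yhat = Xd @ beta
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        # Indices (within the observed cases) of the k closest predictions
        donors = np.argsort(np.abs(yhat[obs] - yhat[i]))[:k]
        y_imp[i] = y[obs][rng.choice(donors)]
    return y_imp

rng = np.random.default_rng(1)
X = rng.normal(size=500)
y = 0.6 * X + rng.normal(scale=0.8, size=500)
y[rng.random(500) < 0.3] = np.nan       # 30% MCAR missingness
y_imp = pmm_impute(y, X, rng=rng)
print(np.isnan(y_imp).sum())            # → 0
```

Because every imputed value is copied from an observed case, PMM never produces values outside the observed range, which is one reason it is attractive for the mixed categorical and ordinal variables found in background questionnaires.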
The parameters of interest were the means, variances, and covariances among the variables as well as the correlations between the variables and the regression coefficients for the multiple regression of y on pairs of any two background variables. Because the total number of parameters was too large to consider them all, we only estimated a subset of the parameters that pertained to five of the 18 background variables in each block (i.e., one per type and block). The parameter estimates were evaluated in terms of bias, RMSE, and the coverage of the 95% confidence interval. However, in contrast to the previous study, we used the average estimates obtained with CD as a point of reference in the calculation of these criteria because the original population values could not be applied to the discretized background variables. In the interest of space, we focus on the bias in the presentation of the results.
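The three evaluation criteria used in both studies can be computed as in the generic Python sketch below (the variable names are ours). It uses a normal-theory 95% interval; the study's intervals for correlation coefficients additionally used the Fisher transformation, which is not shown here.

```python
import numpy as np

def evaluate(estimates, ses, true_value, z=1.96):
    """Compute bias, RMSE, and 95% CI coverage over simulation replications.

    `estimates` and `ses` hold the parameter estimate and its standard
    error from each replication; `true_value` is the reference value
    (the population value or, as in Study 2, the average complete-data
    estimate).
    """
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    lower, upper = estimates - z * ses, estimates + z * ses
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    return bias, rmse, coverage

# Toy check: unbiased estimates with correct SEs should cover ~95%
rng = np.random.default_rng(7)
reps = rng.normal(loc=0.5, scale=0.05, size=2000)
bias, rmse, cov = evaluate(reps, np.full(2000, 0.05), true_value=0.5)
```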

Results
The estimated bias in the means, variances, and covariances of the variables is summarized in Table 6 for each type of parameter pertaining to background variables in the different blocks of the background questionnaire (x1, x2, and x3). In contrast to the first study, all procedures provided estimates of the means, variances, and covariances with little or no bias as long as the data were MCAR. Under MARx, MIM-PE and MIM-LD led to noticeable bias in parameter estimates pertaining to variables in the first block of the background questionnaire (i.e., x1). By contrast, MIM-MI yielded parameter estimates with only little bias, that is, with the median bias near zero for all types of parameters. Likewise, FCS provided parameter estimates with little or no bias, as in the first study. Under MARy, MIM-PE and MIM-LD again led to noticeable bias in some parameter estimates. In contrast to the first study, this included the estimates for the variance of y. MIM-MI reduced the bias in most parameters with the exception of the bias in the variance of y. By contrast, FCS provided estimates with little or no bias for all parameters.
Similar to the first study, we also investigated bias in the correlation and regression coefficients. This is illustrated in Figure 5 for the correlations of y with the background variables in the first block (i.e., x1) and the regression coefficients for the regressions of y on pairs of variables in the first and third blocks of the background questionnaire (i.e., x1 and x3). Under MCAR, all methods led to approximately unbiased estimates of these parameters. This is in contrast to the first study, in which only MIM-PE provided unbiased estimates of these parameters under MCAR. Under both MARx and MARy, both MIM-PE and MIM-LD led to noticeable bias in the parameter estimates. MIM-MI provided a substantial reduction of the bias in most cases, although some bias remained in individual parameters. By contrast, FCS provided estimates with little or no bias for these parameters in all conditions.

Summary
The second simulation study provided some important insights about the statistical properties of the methods we considered. First, although the theoretical analysis suggested that a naive use of MIM-LD and MIM-MI may lead to biased parameter estimates even when the data are MCAR, this may not necessarily be a reason for concern in the context of educational LSAs, where the large number of background variables may compensate for the effects of missing data under the MIM. This is an encouraging finding because it illustrates that the MIM allows for approximately unbiased parameter estimates, at least under MCAR. However, under MARx and MARy, MIM-PE and MIM-LD still had the potential to provide biased parameter estimates. Second, FCS appeared to provide estimates with good statistical properties even with a larger number of background variables. The same was true for MIM-MI, which led to a substantial reduction in the bias in most cases, although some bias remained in a few individual parameters under MARx and MARy.

Example Analysis: PISA 2015
To illustrate the procedures considered in this article, we used data from the German subsample (N = 6,504) of PISA 2015 (OECD, 2017). The data included the 184 cognitive items from the science domain and 214 variables from the students' background questionnaire. Missing data occurred in all cognitive items (range: 77.8%-92.6%) and almost all of the background variables (range: 10.9%-54.0%). The procedures were implemented in a manner similar to the operational practices used in PISA 2015. In the interest of space, we provide only a brief description of the procedures here and give a full description along with the computer code in Online Supplement E. We generated the PVs using the item parameters and contrast coding scheme used in PISA 2015 (OECD, 2017) and the R package TAM (Robitzsch et al., 2018). This entailed the use of dummy codes to indicate missing responses for all background variables, which were then subjected to PCA in order to obtain components that we used as conditioning variables in the latent regression model. To impute the missing data, we used an FCS approach based on the R package mice (van Buuren & Groothuis-Oudshoorn, 2011), where categorical and ordinal variables were imputed with polytomous regression models and PMM, respectively (see also Kaplan & Su, 2016). The imputation models for the background variables also used the contrast-coded background data but relied on partial least squares (PLS) to reduce the dimensionality of the data. All steps in the procedure included the final student weights.
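The "extra category" coding of missing responses followed by PCA can be sketched as follows. This Python illustration is a simplified stand-in for the operational pipeline (which is in R): the toy questionnaire, the category counts, and the number of extracted components are our assumptions, not PISA's actual specification.

```python
import numpy as np

def mim_dummy_codes(column, n_categories):
    """Dummy-code one categorical background variable, treating a missing
    response as an extra category (the missing indicator method)."""
    codes = np.zeros((len(column), n_categories + 1))
    for i, v in enumerate(column):
        codes[i, n_categories if np.isnan(v) else int(v)] = 1.0
    return codes

rng = np.random.default_rng(3)
n = 1000
# Toy questionnaire: 10 variables with 4 categories each, ~20% missing
data = rng.integers(0, 4, size=(n, 10)).astype(float)
data[rng.random((n, 10)) < 0.2] = np.nan

# One block of 5 codes (4 categories + missing indicator) per variable
coded = np.hstack([mim_dummy_codes(data[:, j], 4) for j in range(10)])

# Extract principal components (via SVD) as conditioning variables
centered = coded - coded.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
n_comp = 20
components = centered @ vt[:n_comp].T
```

The resulting component scores would then enter the latent regression model as conditioning variables in place of the raw (incomplete) background data.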
For the analysis of the data, we obtained WLEs for each of the 25 noncognitive scales from the background questionnaire using the item parameters used in PISA 2015. The parameters of interest were the regression coefficients in the regressions of science proficiency on any two of the noncognitive scales from the background questionnaire. This resulted in 600 regression coefficients (not counting the intercepts) obtained from 300 regression analyses. For MIM-PE and MIM-LD, the WLEs were calculated only for students who responded to at least three items of a given scale and were set to missing otherwise (see also OECD, 2017). The results are summarized in Figure 6, which provides a direct comparison of the parameter estimates obtained from the different procedures. Overall, the regression coefficients obtained from the procedures were similar, with differences primarily occurring between MIM-PE and MIM-LD on the one hand and FCS and MIM-MI on the other, such that the estimates under MIM-PE and MIM-LD tended to be somewhat closer to zero. This difference is visible both in the correlation and the mean absolute difference between the procedures as well as in the slight "tilt" that is present in some of the panels of the figure. In summary, these results are in line with the results of the simulation studies and illustrate that, whereas differences between the procedures can sometimes be observed, the differences were relatively small in most cases, especially when many background variables were available and the (conditional) reliability of the test was high.

Discussion
In this study, we investigated different procedures for the treatment of missing data in background variables in educational LSAs. This included the joint treatment of proficiency scores and missing data (i.e., FCS) as well as the procedures currently used in operational practice (i.e., the MIM) combined with different methods for the treatment of missing data (MIM-PE, MIM-LD, MIM-MI). From a theoretical perspective, our investigation revealed that the MIM can provide unbiased results when the data are MCAR. This included estimates of means, variances, and covariances, but it required the use of a pairwise estimation strategy (MIM-PE) for more complex parameters such as correlation and regression coefficients, whereas its use with listwise deletion (MIM-LD) or MI (MIM-MI) sometimes led to biased parameter estimates even under MCAR. In two simulation studies, we investigated these properties across a wider range of conditions. In all conditions, we found that the FCS procedure provided unbiased estimates with good inferential properties. By contrast, the MIM provided unbiased estimates only when the data were MCAR but not when they were MAR, in which case the bias tended to be lowest when the MIM was combined with MI (MIM-MI). However, in conditions with a larger number of background variables, which are most similar to the conditions encountered in LSAs, the differences between the procedures were much less pronounced, indicating that some of the negative properties of the MIM can be compensated for by the large amount of information available in LSAs. In particular, we found that the MIM, when combined with MI, often provided results similar to FCS in these conditions for all but a few of the parameters of interest.
Overall, these results are encouraging because they (a) illustrate a number of positive theoretical properties of the MIM that hold when the data are MCAR (i.e., with rotated questionnaires; see also Adams et al., 2013) and (b) provide additional support for the recommendations already in place regarding the analyses of incomplete background data with rotated questionnaires in educational LSAs (e.g., for MIM-PE and MIM-MI; see also OECD, 2014). However, the results also highlight some possible weaknesses of the MIM, namely, that (a) parameter estimates might not be completely unbiased when the data are MAR even when MI is used and (b) some parameter estimates may be biased even when the data are MCAR when obtained with listwise deletion, which is the default setting for many analyses in statistical software.
Based on our findings, it may be worth considering the joint treatment of proficiency scores and missing data as an alternative to the operational practices applied in educational LSAs. However, this idea can be met with scrutiny: Not only does a joint treatment complicate the generation of PVs, but it also blurs the line between the imputer and the (secondary) analyst by placing the burden of specifying an imputation model for the missing background data on the imputer. This requires the imputer to anticipate potential analyses, so that the imputation model will "fit" the intended analyses (Meng, 1994). However, this problem is not unique to the treatment of missing data and applies to the generation of the PVs in the same way. In principle, it may be worth considering whether the data for secondary analyses could be released in two versions: one containing PVs and imputations for missing background data and one containing only the PVs but with the imputations for missing data deleted. In this context, two-stage or nested MI (Rubin, 2003)-in which the PVs for the latent proficiency variables and the imputations for the missing background data are generated in two separate stages-may be considered as an alternative. The little research that has evaluated these approaches seems to imply that they enjoy qualities similar to the joint imputation of PVs and missing data (Weirich et al., 2014; see also Kaplan & Su, 2018).
The implementation of the FCS approach considered here is only one of several ways to adopt a joint imputation of PVs and missing data. First, the FCS approach is not restricted to the use of univariate models to generate PVs, and multivariate models can be used instead to maintain the multidimensional scaling procedures employed in educational LSAs (see von Davier & Sinharay, 2013). To our knowledge, such an approach is currently not implemented in statistical software. Second, instead of using an FCS approach, the joint distribution of the variables can sometimes be modeled directly (Blackwell et al., 2017a, 2017b; King et al., 2001; see also Cole et al., 2006). Third, the FCS approach presented here uses parametric models for the treatment of missing data. By contrast, nonparametric methods (e.g., Assmann et al., 2015; Si & Reiter, 2013) or procedures based on latent class models may allow for an even more flexible representation of the relations (and possible interactions) between the variables (Vermunt et al., 2008; Vidotto et al., 2018; Wetzel et al., 2015; Xu & von Davier, 2019; see also von Davier, 2013). Finally, the imputation of PVs need not be restricted to the achievement test data but can also be applied to the noncognitive constructs in the background questionnaire (e.g., interest and attitude scales). In other words, missing data need not be imputed at the item level but may be combined with intermediate steps to generate PVs for these scales (see also Gottschall et al., 2012).
This study comes with multiple limitations and points to consider. First, educational LSAs often feature hundreds of variables, which requires the dimensionality of the data set to be reduced before the scaling model can be applied. For this purpose, PCA or PLS can be used (Oranje & Ye, 2013). This was illustrated in our analysis of the PISA 2015 data, in which we used PLS to reduce the dimensionality of the imputation models for the background variables. Second, the data in educational LSAs often have a multilevel structure that results from students being clustered in schools. This structure needs to be taken into account in the scaling of the proficiency data (Adams et al., 1997; Adams & Wu, 2007; Li et al., 2009), the imputation of missing background data (Enders et al., 2016; Lüdtke et al., 2017), and secondary analyses (Monseur & Adams, 2009). By contrast, if the multilevel structure is represented with fixed effects, such as in PISA, then the methods considered in this article can be applied directly by including an additional set of indicator variables to represent school membership in the imputation models for the missing data and the PVs (see the example with the PISA 2015 data).
Further research is needed to fully understand the proper treatment of missing data in background variables in educational LSAs. This includes the comparison of the available methods in a wider range of settings, for example, when the data are MNAR (see also Ibrahim et al., 2005; Moustaki & Knott, 2000; O'Muircheartaigh & Moustaki, 1999). The same is true for applications with rotated questionnaire designs, which are gaining popularity in educational LSAs. The procedures studied in this article can be applied to missing data that stem from rotated questionnaires, but their performance under different forms of rotation is an important topic that has yet to receive sufficient attention (see also Kaplan & Su, 2016). Finally, future research should take into account the relationship between the imputer and the (secondary) analyst (Meng, 1994). This is particularly important because the data from educational LSAs are often used for a variety of purposes, comprising an immense number of potential analyses (e.g., models with nonlinear effects, multilevel models; see also Li et al., 2009; Schofield, 2015; Schofield et al., 2015).

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD
Simon Grund https://orcid.org/0000-0002-1290-8986

Notes
1. The procedures presented in this article can also be applied when data are missing unsystematically in the achievement test (e.g., with rotated booklet or multimatrix designs). The case in which data are missing systematically in the achievement test is a different topic and will not be considered here (for further discussion, see Moustaki & Knott, 2000; Pohl et al., 2014; Rose et al., 2017).
2. If this assumption is not satisfied, and the data are missing not at random, that is, P(r|x, y) = P(r|x_obs, x_mis, y), then the treatment of missing data is more complicated and requires the missing data mechanism to be modeled explicitly (for a detailed discussion, see Carpenter & Kenward, 2013; Little & Rubin, 2002).
3. In practice, full information maximum likelihood (FIML) estimation is also a popular alternative for dealing with missing data. We did not consider FIML further in this study because the statistical properties of parameter estimates under FIML and multiple imputation (MI) appeared to be virtually the same. For interested readers, we provide simulation results comparing the parameter estimates under the missing indicator method (MIM) when combined with FIML or MI in Online Supplement B.
4. The optimal permutations of the item difficulty and discrimination parameters that best fulfilled these criteria were determined a priori by using a large simulated data set (N = 20,000) and comparing all possible permutations for any given number of items. The same permutations of the item parameters were used throughout the study and are included in Online Supplement C.
5. The relatively large bias with MIM-PE, MIM-LD, and MIM-MI under MARy may be explained by the fact that this mechanism implies a dependency between the item responses (y1, ..., yJ) and the missing data indicator r. This violates the assumptions of the measurement model used with the MIM.
6. In principle, the imputation of missing data in background variables can also be combined with dimension reduction techniques such as principal components analysis (PCA) or partial least squares (PLS). In the present case, no dimension reduction was used because none was needed. However, in an additional simulation study, we found that PLS yielded approximately the same results as predictive mean matching without dimension reduction, whereas the results were more mixed with PCA.