High-dimensional Imputation for the Social Sciences: a Comparison of State-of-the-art Methods

Including a large number of predictors in the imputation model underlying a multiple imputation (MI) procedure is one of the most challenging tasks imputers face. A variety of high-dimensional MI techniques can help, but there has been limited research on their relative performance. In this study, we investigated a wide range of extant high-dimensional MI techniques that can handle a large number of predictors in the imputation models and general missing data patterns. We assessed the relative performance of seven high-dimensional MI methods with a Monte Carlo simulation study and a resampling study based on real survey data. The performance of the methods was defined by the degree to which they facilitate unbiased and confidence-valid estimates of the parameters of complete data analysis models. We found that using a lasso penalty or forward selection to select the predictors used in the MI model and using principal component analysis to reduce the dimensionality of auxiliary data produce the best results.

Standard MI yields valid inferences when two conditions hold, in which case the missingness is ignorable. The first of these conditions is that the missing data are missing at random (MAR), meaning that the probability of being missing is the same within groups defined by the observed data (i.e., conditioning on the observed data).
When this condition is violated, standard MI can lead to biased parameter estimates, even if the analysis and imputation models are congenial.
Meeting the MAR assumption requires specifying imputation models that include the variables that correlate with the missingness and the analysis model variables.
Omitting such variables from the imputation model results in imputation under MNAR (Collins et al., 2001, p. 339). Applying standard MI under MNAR can lead to bias in the parameter estimates and can invalidate inferences involving the imputed variables (Collins et al., 2001, pp. 341-343). Therefore, including as many good predictors of the variables under imputation as possible in the imputation model is generally advisable.
In this study, we focus on methods that assume MAR data. However, a considerable amount of research has been devoted to developing missing data treatments for MNAR data. We refer interested readers to the review by Enders (2010, pp. 287-328) of the two classes of MNAR models (i.e., selection models and pattern mixture models) and to the subsample ignorable multiple imputation of Little and Zhang (2011), a method that can yield valid MI inferences under MNAR given certain additional assumptions.
We refer to all variables that are not targets of imputation as potential auxiliary variables. This set of potential auxiliaries may include important predictors of missingness, variables that correlate with the imputation targets, and variables that are not useful for imputation. Discerning which of the potential auxiliary variables may be useful predictors in the imputation model can be a daunting task. Following an inclusive approach (i.e., including numerous auxiliary variables in the imputation model) reduces the chances of omitting important correlates of missingness, thereby making the MAR assumption more plausible (Rubin, Stern, & Vehovar, 1995, pp. 826-827; Schafer, 1997, p. 23; White, Royston, & Wood, 2011; Van Buuren, 2018, p. 167). Furthermore, Collins et al. (2001) showed that the inclusive strategy reduces estimation bias and increases efficiency. When designing broad and intermediate imputation models, the inclusive strategy can also grant congeniality with a wider range of analysis models.
Although following the inclusive strategy may be beneficial for the imputation procedure, it is often infeasible to use all potential auxiliaries as predictors with standard imputation methods. Standard imputation methods, such as imputation under the normal linear model (Van Buuren, 2018, p. 68), face computational limitations in the presence of many predictors. For example, using traditional (unpenalized) regression models for the imputation model requires the number of predictors (p) in the imputation models to be smaller than the number of observed cases (n) to avoid mathematical singularity of the underlying system of equations (James, Witten, Hastie, & Tibshirani, 2013, p. 203). As a result, imputers need to balance the benefits of the inclusive strategy with its computational limits. The large number of variables available in modern social scientific data sets makes the difficult step of deciding which predictors to include in the imputation models even more arduous.
In addition to their size, other aspects of social surveys and other social scientific data can further complicate the task of specifying good imputation models. Sociologists and researchers working with large social surveys often want to estimate analysis models that use composite scores (i.e., aggregates of multi-item scales). When working with multi-item scales, the imputer needs to decide whether variables should be imputed at the item level or at the scale level. When all of a scale's items are usually missing or observed together, scale-level imputation can be effective (Mainzer, Apajee, Nguyen, Carlin, & Lee, 2021). When item-level missingness predominates, however, the literature generally suggests imputing multi-item scales at the item level (Eekhout et al., 2014; Gottschall, West, & Enders, 2012; Van Buuren, 2010), but pursuing such a strategy increases the dimensionality of the imputation models (Eekhout, de Vet, de Boer, Twisk, & Heymans, 2018).
Furthermore, social surveys are often longitudinal, and it is usually most convenient to impute such a data structure in wide format (Van Buuren, 2018, p. 312).
A wide data set has a single record for each unit, with observations made at subsequent time points coded as additional columns in the data set. As a result, long-running panel studies might easily induce large pools of potential auxiliary variables with which the imputer must contend.

High-dimensional imputation
The factors discussed above-or combinations thereof-may result in high-dimensional imputation problems wherein the pool of potential auxiliary variables is larger than the available sample size. Such high-dimensional problems preclude a straightforward application of MI and force researchers to choose which variables to include in the imputation model or otherwise regularize the imputation model. One possible solution to this problem is using high-dimensional prediction models as the imputation model. When we say "high-dimensional prediction", we are referring to the branch of statistical prediction concerned with improving prediction in situations where the number of predictors is larger than the number of observed cases (the so-called p > n problem). Recent developments in high-dimensional imputation techniques leverage high-dimensional prediction methodology to offer opportunities for embracing an inclusive strategy while substantially diminishing its downsides.
MI has been combined with high-dimensional prediction models in algorithms that use shrinkage methods (Deng, Chang, Ido, & Long, 2016;Zhao & Long, 2016) and dimensionality reduction (Howard, Rhemtulla, & Little, 2015;Song & Belin, 2004) to avoid the obstacles of an inclusive strategy. Tree-based imputation strategies (Burgette & Reiter, 2010;Doove, Van Buuren, & Dusseldorp, 2014) also have the potential to overcome the computational limitations of the inclusive strategy. The nonparametric nature of decision trees bypasses the identification issues most parametric methods face in high-dimensional contexts. To the best of our knowledge, no study to date has directly compared the performance of the various high-dimensional MI methods recommended in the literature.

Scope of the current project
The goal of this project was to compare how different high-dimensional MI (HD-MI) methods fare when imputing data sets with many variables. In particular, we were interested in the types of imputation problems that may arise in large social scientific data sets. Such data sets do not need to be strictly high-dimensional to be too large for standard MI routines. Even in low-dimensional settings (i.e., n > p), including too many auxiliary variables in the imputation model can bias analysis model estimates and lead to convergence problems and other computational issues (Hardt, Herke, & Leonhart, 2012). The high-dimensional imputation approaches we compared in this project can be used to simplify the process of specifying a good imputation model in both high- and low-dimensional problems.
We compared seven state-of-the-art HD-MI algorithms in terms of their ability to support statistically valid analyses. We chose these techniques because they stood out as the most promising candidates in our review of the HD-MI literature. The comparison was based on two numerical experiments: a Monte Carlo simulation study and a resampling study using Wave 5 of EVS. The simulation study allows us to compare the imputation methods in an artificial scenario with maximum experimental control. In a simulation study, we are able to precisely manipulate data features to match our experimental goals because we define the population model. However, the variables in a simulation study are usually sampled from simple multivariate distributions with regular, unrealistic mean and covariance structures. The resampling study allows us to shed the artifice of the simulation study and compare the methods using real social scientific data. EVS is a large-scale, cross-national survey on human values administered in almost 50 countries across Europe. The EVS data contain both numerical and categorical variables associated via a complicated, heterogeneous covariance structure. Performing a resampling study on this data set allows us to estimate bias and coverage in a more ecologically valid, albeit still somewhat artificial, scenario than is possible with a Monte Carlo simulation study.¹

¹ We chose to use the EVS data for a resampling study rather than an applied example because there is no ground truth in an applied example. The resampling study offers much of the ecological validity of an applied example with the added benefit of supporting the same types of generalizable conclusions provided by a simulation study.
The imputation techniques we compared are best suited to data-driven imputation with an intermediate or broad scope. The potential benefits of HD-MI methods lie in the automatic imputation model specification that these techniques offer.
Therefore, we focused on data-driven imputation tasks where the objective is accommodating a wide range of analysis models. However, the techniques we compared do not exclude the possibility of specifying more narrowly scoped imputation models.
With a little tweaking, one can always force specific variables into the imputation model.
In what follows, we first introduce the missing data treatments that we compared in our study. Then, we present the methodology and results of the two numerical experiments, we discuss the implications of the results for applied researchers, and we provide recommendations. We conclude by discussing the limitations of the study and suggesting future research directions.

Imputation methods and algorithms
We use the following notation: scalars, vectors, and matrices are denoted by italic lowercase, bold lowercase, and bold uppercase letters, respectively. A scalar belonging to an interval is indicated by $s_1 \in [s_2, s_3]$, while a scalar taking values in a set is represented as $s_1 \in \{s_2, s_3\}$. We use the scope resolution operator, ::, to designate a function provided by a specific software package. So, for example, mice::quickpred() represents the quickpred() function provided by the mice package.
Consider an $n \times p$ data set, $Z$, comprising variables $z_1, z_2, \ldots, z_p$. Assume that the first $t$ variables of $Z$ have missing values and that these $t$ variables are the targets of imputation. Denote the columns of $Z$ containing $z_1$ to $z_t$ as the $n \times t$ matrix, $T$. The remaining $(p - t)$ columns of $Z$ contain variables that are not targets of imputation. These variables constitute a pool of potential auxiliary variables that could be used to improve the imputation procedure. Let $A$ be an $n \times (p - t)$ matrix denoting this set of potential auxiliary variables, and write $Z$ as $Z = (T, A)$. For a given $z_j$, with $j = 1, \ldots, p$, denote its observed and missing components as $z_{j,\text{obs}}$ and $z_{j,\text{mis}}$, respectively. Let $Z_{-j} = (z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_p)$ be the collection of $p - 1$ variables in $Z$ excluding $z_j$. Denote by $Z_{-j,\text{obs}}$ and $Z_{-j,\text{mis}}$ the components of $Z_{-j}$ corresponding to the data units in $z_{j,\text{obs}}$ and $z_{j,\text{mis}}$, respectively.

Multiple imputation by chained equations
Assume that $Z$ is the result of $n$ random samples from a multivariate distribution defined by an unknown set of parameters $\theta$. The multiple imputation by chained equations (MICE) approach obtains the posterior distribution of $\theta$ by sampling iteratively from conditional distributions of the form

$$P(z_1 \mid Z_{-1}, \theta_1), \ldots, P(z_t \mid Z_{-t}, \theta_t),$$

where $\theta_1, \ldots, \theta_t$ are imputation model parameters specific to the conditional distributions of each variable with missing values. More precisely, the MICE algorithm takes the form of a Gibbs sampler.² At the $m$th iteration ($m = 1, \ldots, M$), samples are drawn for the $j$th target variable ($j = 1, \ldots, t$) from the following distributions:

$$\theta_j^{(m)} \sim P\left(\theta_j \mid z_{j,\text{obs}}, Z_{-j,\text{obs}}^{(m)}\right), \tag{1}$$

$$z_{j,\text{mis}}^{(m)} \sim P\left(z_{j,\text{mis}} \mid Z_{-j,\text{mis}}^{(m)}, \theta_j^{(m)}\right), \tag{2}$$

where $Z_{-j}^{(m)}$ contains the most recently imputed values of all variables other than $z_j$. The draws of $z_{j,\text{mis}}$ at the final iteration are used as imputations. Any analysis model can then be estimated on each of the $d$ completed data sets, and the parameter estimates can be pooled using Rubin's rules (Rubin, 1987).
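To make this concrete, the following minimal sketch shows how such a MICE run could be carried out with the mice package in R. The incomplete data frame Z and the analysis model variables z1, z2, and z3 are hypothetical placeholders.

library(mice)

# Impute Z with d = 10 imputations and M = 20 Gibbs iterations, using the
# Bayesian normal linear model as the elementary imputation method
imp <- mice(data = Z, m = 10, maxit = 20, method = "norm", seed = 2023)

# Estimate the analysis model on each completed data set and pool the
# parameter estimates with Rubin's rules
fits <- with(imp, lm(z1 ~ z2 + z3))
summary(pool(fits))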
In the following, we describe all the missing data treatments we compared in this study. First, we describe the seven high-dimensional MICE strategies. They follow the general MICE framework, but they differ in which elementary imputation methods they use to define equations (1) and (2).

MICE with stepwise forward selection. If $n$ is not much larger than $p$, the regression estimates will have large variances, and, if $p > n$, there is no unique solution for the regression coefficients. Researchers have been studying model-building strategies to overcome these limitations for decades (e.g., Dempster, Schatzoff, & Wermuth, 1977). One of these strategies, known as forward stepwise subset selection (Efroymson, 1966), has been implemented in the popular imputation software IVEware (Raghunathan, Solenberger, & Van Hoewyk, 2002). We refer to this method as MI Step-Forward (MI-SF).

MICE with Bayesian ridge regression. In BRidge, every variable in $Z_{-j}$ is used as a predictor in the imputation model, and the ridge penalty, $\kappa$, is the only precaution taken to address a large number of predictors. The value of $\kappa$ is usually chosen to be close to zero (e.g., $\kappa = 0.0001$), because values larger than 0.1 may introduce excessive systematic bias (Van Buuren, 2018, p. 68). However, larger values of $\kappa$ may be necessary to adequately stabilize the estimation in certain scenarios. In the present work, we chose the value of $\kappa$ by means of cross-validation.

Direct use of regularized regression.⁴
Lasso regression (least absolute shrinkage and selection operator; Tibshirani, 1996) is another popular shrinkage method. Unlike the ridge penalty, which does not exclude any variables, the lasso penalty achieves both shrinkage and automatic variable selection. The extent of the lasso penalization depends on a tuning parameter, $\lambda$, which is selected from a set of possible values by means of cross-validation. For sufficiently large values of $\lambda$, lasso will force some coefficient estimates to be exactly zero, thereby excluding the associated predictors from the fitted model. When applied to an imputation model, lasso will automatically select which predictors enter the imputation model. Zhao and Long (2016) and Deng et al. (2016) used lasso regression as the univariate imputation model in a MICE algorithm to impute high-dimensional data and referred to this approach as direct use of regularized regression (DURR).
At iteration $m$, for a target variable $z_j$, DURR replaces equations (1) and (2) with the following procedure: generate a bootstrap sample of the observed data; on this bootstrap sample, estimate a lasso regression of $z_{j,\text{obs}}$ on $Z_{-j,\text{obs}}$, with the penalty parameter chosen by cross-validation; then use the estimated model to predict $z_{j,\text{mis}}$, and obtain draws from the posterior predictive distribution of the missing data as in equation (2).
Hence, at every iteration, each elementary imputation model is estimated as a lasso regression, and uncertainty regarding the parameter values is included by bootstrapping.
Because lasso can select at most as many predictors as there are observations, no elementary imputation model will contain more predictors than the number of observed cases on the corresponding outcome. Deng et al. (2016) compared lasso with the elastic net, which does not have this restriction, for high-dimensional MI, but they did not find evidence to favor the elastic net over lasso.
Lasso is also computationally simpler than the elastic net because lasso only has one tuning parameter to estimate whereas the elastic net has two. Therefore, we chose to implement DURR with lasso as the regularization method.
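To illustrate the core of a DURR update, the following sketch implements a single draw for one target variable using the cross-validated lasso in the glmnet package. The function name durr_step, the residual-based noise model, and the data objects (a numeric vector z_j with missing entries and a numeric predictor matrix Z_mj) are illustrative assumptions, not the exact implementation used in this study.

library(glmnet)

durr_step <- function(z_j, Z_mj) {
  obs  <- !is.na(z_j)
  # 1. Bootstrap the observed cases
  boot <- sample(which(obs), sum(obs), replace = TRUE)
  # 2. Fit a cross-validated lasso on the bootstrap sample
  cv <- cv.glmnet(x = Z_mj[boot, ], y = z_j[boot], alpha = 1)
  # 3. Predict the missing cases and add normal noise scaled by the
  #    residual standard deviation (one way to mimic equation (2))
  pred  <- as.numeric(predict(cv, newx = Z_mj[!obs, ], s = "lambda.min"))
  resid <- z_j[boot] - as.numeric(predict(cv, newx = Z_mj[boot, ], s = "lambda.min"))
  z_j[!obs] <- pred + rnorm(sum(!obs), mean = 0, sd = sd(resid))
  z_j
}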
Indirect use of regularized regression. In the indirect use of regularized regression (IURR; Zhao & Long, 2016; Deng et al., 2016), the lasso is used only to select the predictors of each elementary imputation model. The imputation model itself is then estimated with standard (unpenalized) Bayesian procedures based on the selected predictors, and imputations are drawn as in equations (1) and (2).

MICE with Bayesian lasso. Following Zhao and Long (2016), we used the Bayesian lasso specification given by Hans (2010b). Given data with a sample size, $n$, a dependent variable, $y$, and a set of predictors, $X$, the Bayesian lasso model has the following form:

$$y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n), \tag{6}$$

$$\beta \mid \tau, \sigma^2 \sim \prod_{j=1}^{p} \frac{\tau}{2\sigma} \exp\left(-\frac{\tau\,|\beta_j|}{\sigma}\right), \tag{7}$$

$$\sigma^2 \sim \text{Inverse-Gamma}(a, b), \qquad \tau \sim \text{Gamma}(r, s), \tag{8}$$

where $a$, $b$, $r$, and $s$ are user-specified hyper-parameters.
The R code used to perform the BLasso imputation was based on the R Package blasso (Hans, 2010a) and can be found in the code repository for this article (Costantini, 2023a). For a detailed description of the Bayesian lasso MI algorithm in a univariate missing data context see Zhao and Long (2016).

MICE with principal component analysis. MICE with principal component analysis (MI-PCA) uses PCA to reduce the dimensionality of the auxiliary data. The approach can be summarized in three steps:
1. Extract principal components (PCs) from the potential auxiliary variables in $A$.
2. Construct a reduced data set, $Z'$, by combining the variables under imputation with the first few PCs.
3. Use the standard MICE algorithm with a Bayesian normal linear model and no ridge penalty to obtain multiply imputed data sets from $Z'$.
The MI-PCA method was inspired by Howard et al. (2015) and the PcAux R package (Lang, Little, & PcAux Development Team, 2018). For this study, we used the R function stats::prcomp() to perform the PCA estimation via truncated singular value decomposition. Hence, p > n data are not a problem. When A has more columns than rows, prcomp() will simply extract a maximum of n components.
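The following sketch illustrates the three steps under the 50%-of-variance heuristic used later in this article. The objects T_mat (the incomplete target variables) and A (the complete auxiliary variables) are hypothetical placeholders.

library(mice)

# Step 1: extract PCs from the auxiliary block
pca <- prcomp(A, center = TRUE, scale. = TRUE)

# Keep enough components to explain 50% of the total variance
expl  <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pcs <- which(expl >= 0.50)[1]

# Step 2: combine the targets of imputation with the retained PCs
Z_prime <- data.frame(T_mat, pca$x[, 1:n_pcs, drop = FALSE])

# Step 3: run standard MICE on the reduced data set
imp <- mice(Z_prime, method = "norm", m = 10)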

MICE with classification and regression trees. MICE with classification and regression trees (MI-CART; Burgette & Reiter, 2010) is a MICE algorithm that uses classification and regression trees (CART) as the elementary imputation method. Given an outcome variable $y$ and a set of predictors $X$, CART is a nonparametric recursive partitioning technique that models the relationship between $y$ and $X$ by sequentially splitting observations into subsets of units with relatively more homogeneous $y$ values. At every splitting stage, the CART algorithm searches through all variables in $X$ to find the best binary partitioning rule to predict $y$. The resulting collection of binary splits can be visually represented by a decision tree structure where each terminal node (or leaf) represents the conditional distribution of $y$ for units that satisfy the splitting rules.
For each $z_j$, the $m$th iteration of MI-CART proceeds as follows:
1. Train a CART model to predict $z_{j,\text{obs}}$ from the corresponding $Z_{-j,\text{obs}}$.
2. Pass each element of $Z_{-j,\text{mis}}$ down the fitted tree to find its terminal node.
3. Create imputations for each element of $z_{j,\text{mis}}$ by sampling from the pool of $z_{j,\text{obs}}$ in the terminal node containing $z_{j,\text{mis}}$. This procedure corresponds to sampling from the missing data posterior predictive distribution in Equation (2).
This approach does not consider uncertainty in the imputation model parameters since the tree structure is not perturbed between iterations. Therefore, MI-CART cannot produce proper imputations in the sense of Rubin (1986). The implementation of MI-CART used in this paper corresponds to the one presented by Doove et al. (2014, p. 95, algorithm 1) and the mice.impute.cart() function from the mice package.
CART searches for the best splitting criterion one variable at a time. As a result, p > n does not pose the same computational limitations that plague methods based on linear regression. More variables can increase estimation times but will not result in computational obstructions.
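In practice, MI-CART can be requested directly in mice, as in the following sketch (the incomplete data frame Z is a placeholder; minbucket and cp are tuning parameters passed to the underlying rpart trees).

library(mice)

# CART as the elementary imputation method for all incomplete variables
imp_cart <- mice(Z, method = "cart", m = 10, minbucket = 5, cp = 1e-04)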

MICE with random forests. MICE with random forests (MI-RF) is a MICE algorithm that uses random forests as the elementary imputation method. The random forest algorithm (e.g., Hastie et al., 2009, p. 588) entails fitting many decision trees (e.g., CART models) to subsamples of the original data. These subsamples are derived by resampling rows with replacement and sampling subsets of columns without replacement. The random forest algorithm results in an ensemble of fitted decision trees that generate a sample of predictions for each outcome value. Consequently, random forests often demonstrate better prediction performance than individual trees by reducing the variance of the estimated prediction function.
For each $z_j$, the $m$th iteration of MI-RF proceeds as follows:
1. Generate $k$ bootstrap samples from $Z_{-j,\text{obs}}$.
2. Use these bootstrap samples to fit $k$ single trees predicting $z_{j,\text{obs}}$ from a random subset of the variables in $Z_{-j,\text{obs}}$.
3. Generate a pool of $k$ terminal nodes for each element of $z_{j,\text{mis}}$ by applying the splitting rules from each of the $k$ fitted trees to the appropriate columns of $Z_{-j,\text{mis}}$.
4. Create imputations for each element of $z_{j,\text{mis}}$ by sampling from the $z_{j,\text{obs}}$ contained in the pool of terminal nodes defined above.
Bootstrapping and random input selection introduce uncertainty regarding the imputation model parameters (i.e., the tree structure), as required by a proper MI procedure. For more details on the MI-RF algorithm, see Doove et al. (2014, p. 103).
To perform MICE with random forests, we used the R function mice::mice.impute.rf().
As with CART, the random forests algorithm is not subject to computational limitations in high-dimensional problems because random forests simply aggregate a collection of univariate decision trees.
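A corresponding MI-RF run can be requested in mice as sketched below (Z is again a placeholder; ntree corresponds to the number of trees, k, in the algorithm above, and the "rf" method relies on a random forest back end such as the randomForest package).

library(mice)

imp_rf <- mice(Z, method = "rf", m = 10, ntree = 10)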

MICE with quickpred.
A simple way to select predictors for an imputation model is to include variables that relate to the nonresponse or explain a considerable amount of variance in the targets of imputation. One popular implementation of this idea is to select as predictors those variables whose association with the variables under imputation, or their response indicators, exceeds some threshold. This selection strategy was proposed by Van Buuren et al. (1999) and has been implemented in the quickpred function provided by the popular R package mice (Van Buuren, 2018, p. 267). We refer to this approach as MI-QP. As both an intuitive, pragmatic option and the default method of selecting predictors in one of the most popular MI software packages, MI-QP represents an important benchmark against which to compare the performance of the more theoretically sound approaches described above.
The MI-QP approach has two main drawbacks. First, selecting predictors based on their correlations with the targets of imputation and the associated response indicators can still select collinear, redundant predictors. If one predictor is highly correlated with another and with a variable under imputation, both will be selected.
Second, when applied to p > n scenarios, MI-QP is not guaranteed to select fewer predictors than observations available for a given imputation model. As a result, MI-QP often needs to be augmented by other techniques to address collinearity and linear dependencies in the data.
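As a sketch, the following shows how quickpred can be combined with mice, assuming an incomplete data frame Z and the 0.3 correlation threshold used later in this article.

library(mice)

# Binary predictor matrix: a 1 flags predictors whose correlation with the
# target or its response indicator exceeds mincor
pred_mat <- quickpred(Z, mincor = 0.3)
imp_qp   <- mice(Z, predictorMatrix = pred_mat, method = "norm", m = 10)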

MICE with analysis model variables as predictors.
According to our review of the articles published in AJS and ASR, a common approach to address the large number of possible predictors is to use only the analysis model variables in the imputation model. We refer to this approach as MI-AM. Consider a researcher working with EVS data who wants to estimate a linear model by regressing one item on 10 others afflicted by non-response. The MI-AM imputation strategy would imply using only these 11 variables in the imputation models, instead of manually searching all of the 250 variables contained in the survey for meaningful imputation predictors.
The MI-AM strategy ensures the congeniality of the analysis and imputation models. Furthermore, as long as the analysis model does not include more variables than the number of observed cases, MI-AM is not affected by the dimensionality of the data. However, by following this strategy, any MAR predictors that are not part of the analysis model will be excluded from the imputation. In such cases, the MAR assumption is violated, and the missingness is MNAR.
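The following sketch shows one way to set up MI-AM in mice by restricting the predictor matrix to the analysis model variables (the variable names y and x1 to x10 are hypothetical).

library(mice)

am_vars <- c("y", paste0("x", 1:10))

# Start from an empty predictor matrix and allow only the analysis model
# variables to predict one another
pred_mat <- make.predictorMatrix(Z)
pred_mat[, ] <- 0
pred_mat[am_vars, am_vars] <- 1
diag(pred_mat) <- 0   # no variable predicts itself

imp_am <- mice(Z, predictorMatrix = pred_mat, method = "norm", m = 10)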

Oracle MICE.
As hinted by the previous two approaches, the MI literature recommends following three principles to decide which predictors to include in the imputation models (Van Buuren, 2018, p. 168):
1. Include all variables that are part of the analysis model(s).
2. Include all variables that are related to the nonresponse.
3. Include all variables that are correlated with the targets of imputation.
In practice, the first criterion can be met only if the analysis model is known before imputation, which is not always true. Furthermore, researchers can never be sure that the second criterion is entirely met, as there is no way to know exactly which variables are responsible for missingness. However, with simulated data, we know which variables define the response model. The oracle MICE approach (MI-OR) is an ideal specification of the MICE algorithm that uses this knowledge to include only the relevant predictors in the imputation models. As such, this method cannot be used in practice, but it provides a useful reference point for the desirable performance of an MI procedure. The MI-OR imputations were generated using the Bayesian normal linear model as the univariate imputation method.

Complete case analysis.

Complete case analysis (CC) estimates the analysis model using only the cases with complete data on the analysis model variables. This approach discards a considerable amount of information and thereby lowers statistical efficiency (Little & Rubin, 2002, p. 42; Schafer & Graham, 2002). Furthermore, unless the data are missing completely at random (MCAR), CC can bias parameter estimates (Rubin, 1987, p. 8; Schafer & Graham, 2002). Nevertheless, the continued popularity of CC makes it an important benchmark method.

Gold standard.
We also estimated the analysis models directly on the fully observed data before imposing any missing values. In the following, we refer to the results obtained in this fashion as the gold standard (GS). These results represent the counterfactual analysis that would have been performed if there had been no missing data.

Simulation study
We investigated the performance of the methods described above with a Monte Carlo simulation study. Following a procedure similar to that employed by Collins et al. (2001), we crossed two levels of data dimensionality (p ∈ {50, 500}) with two proportions of missing cases (pm ∈ {0.1, 0.3}). Table 1 summarizes the four resulting crossed conditions.
We chose the values of n and p to reflect extreme dimensionality situations that would tease apart the relative strengths and weaknesses of the imputation methods considered here. Nonetheless, we selected these values to be somewhat plausible for real-world social scientific studies. Consider, for example, that a typical EVS wave has around 55,000 observations and 250 items in its questionnaire. Therefore, data structures similar to those in both our low- and high-dimensional conditions could arise by taking reasonable subsets of EVS data (potentially over several waves). As for the levels of pm, we chose the lower level to match the 10% of missing cases that is typical of variables in EVS data. We also included a more extreme level to create more challenging, but still realistic, conditions for the imputation methods. For every repetition, we imposed missing values on six target items, and then we used all the missing data treatments described above to obtain estimates of the means, variances, and covariances of these incomplete variables.

The correlation structure of real survey data would likely be more complex than what we defined for this study. However, when specifying imputation models for survey data, the main challenge is often finding a few important auxiliary variables in a large collection of possible predictors. We defined the population correlation matrix with the three-block structure described above to replicate this type of situation in an experimentally unequivocal way.⁹

⁹ We used fixed correlation levels instead of varying the correlation values (e.g., to manipulate collinearity) for the same reason. Varying the strengths of association within blocks would have diluted this key feature of our population model.

Missing data imposition.
Missing values were imposed on six of the items in $Z$: three variables in the block of highly correlated variables $\{z_1, z_2, z_3\}$ and three in the block of lowly correlated variables $\{z_6, z_7, z_8\}$. Item nonresponse was imposed by sampling from a Bernoulli distribution with individual probabilities of nonresponse defined by the logistic model

$$P(z_{i,j} \text{ is missing} \mid \tilde{\boldsymbol{z}}_i) = \frac{\exp(\gamma_0 + \tilde{\boldsymbol{z}}_i^{\top} \boldsymbol{\gamma})}{1 + \exp(\gamma_0 + \tilde{\boldsymbol{z}}_i^{\top} \boldsymbol{\gamma})}, \tag{9}$$

where $z_{i,j}$ is the $i$th subject's response on $z_j$, $\tilde{\boldsymbol{z}}_i$ is a vector of responses to the set of missing data predictors for the $i$th individual, $\gamma_0$ is an intercept parameter, and $\boldsymbol{\gamma}$ is a vector of slope parameters. $\tilde{Z}$ was specified to include two fully observed variables from the strongly correlated set and two from the weakly correlated set $\{z_4, z_5, z_9, z_{10}\}$.
Therefore, the probability of nonresponse for a variable depended on variables present in the data, but never on the variable itself. As a result, when the elements of $\tilde{Z}$ are included as predictors in the MI procedures, the MAR assumption is satisfied. All slopes in $\boldsymbol{\gamma}$ were fixed to 1, while the value of $\gamma_0$ was chosen through numerical optimization to produce the desired proportion of missing values.¹⁰
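As an illustration, the sketch below imposes MAR nonresponse on one target variable following equation (9). The function impose_mar and its arguments are illustrative; z_j is the target variable, Z_tilde holds the fully observed missing data predictors, and pm is the desired proportion of missing values.

impose_mar <- function(z_j, Z_tilde, pm, gamma = rep(1, ncol(Z_tilde))) {
  lin <- as.numeric(as.matrix(Z_tilde) %*% gamma)

  # Numerically find the intercept that yields an expected proportion of
  # missing values equal to pm
  gamma0 <- uniroot(function(g0) mean(plogis(g0 + lin)) - pm,
                    interval = c(-50, 50))$root

  # Bernoulli draws with individual logistic probabilities of nonresponse
  miss <- rbinom(length(z_j), 1, plogis(gamma0 + lin)) == 1
  z_j[miss] <- NA
  z_j
}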

Imputation.
We generated ten imputed data sets by imputing the missing values with all of the methods described in the preceding section. To evaluate the convergence of the imputation models, we ran ten replications of the high-dimensional, high-pm condition and monitored the convergence of the imputation algorithms across iterations.

Both IURR and DURR could have been implemented with a variety of penalties (e.g., lasso, Tibshirani, 1996; elastic net, Zou & Hastie, 2005; adaptive lasso, Zou, 2006).
In this study, we used lasso as it is computationally efficient, and it performed well for imputation in Zhao and Long (2016) and Deng et al. (2016). A 10-fold cross-validation procedure was used at every iteration of DURR and IURR to choose the penalty parameter. To maintain consistency with previous research, we specified the BLasso hyper-parameters in equations (6), (7), and (8) following Zhao and Long (2016). For MI-PCA, we needed to decide how many principal components to retain, and graphical decision rules (e.g., inspecting a scree plot) are not feasible within a Monte Carlo simulation. We, therefore, applied the intuitively appealing, albeit arbitrary, heuristic of using enough components to explain 50% of the total variance in the data.
Running MI-QP in the high-dimensional condition led to frequent convergence failures. A more common use of the method accompanies the quickpred approach with a ridge penalty and data-driven checks that exclude collinear variables. We decided to run MI-QP in this more favorable manner by applying the ridge penalty and the collinearity checks implemented in the mice package.

For each repetition, we estimated the 6 means, 6 variances, and 15 covariances of the six items with missing values on each imputed data set and pooled the estimates via the Rubin (1987) pooling rules. We then compared the performances of the imputation methods by computing the bias, confidence interval coverage, and confidence interval width for each estimated parameter.
Since we generated multivariate normal data, the sample means, variances, and covariances were the sufficient statistics for the joint distribution of the analysis model variables. Hence, we can infer that a method which demonstrates good performance when estimating these statistics will perform equally well when estimating other parameters that describe the same joint distribution. For example, the slopes of any general linear model can be computed directly from these statistics. Using only this mean vector and covariance matrix, we could also factor analyze these six variables (Bartholomew, Knott, & Moustaki, 2011, pp. 53-55) or estimate their structural relations via a structural equation model (Bollen, 1989, pp. 104-106).
Importantly, the inverse implication does not generally hold. For example, in the special case noted above wherein CC can produce unbiased slope estimates, the estimated means, variances, and covariances of the underlying data could still be biased unless the data were MCAR. By focusing our analysis on a general set of sufficient statistics, we dissociated our results from any specific statistical model or test and increased the generalizability of our findings.
For a given parameter of interest $\theta$, we used the absolute percent relative bias (PRB) to quantify the estimation bias introduced by the imputation procedure:

$$\text{PRB} = \left| \frac{\hat{\theta} - \theta}{\theta} \right| \times 100,$$

where $\theta$ is the true value of the focal parameter, defined as $\theta = \sum_{s=1}^{S} \hat{\theta}_s^{\text{GS}} / S$, with $\hat{\theta}_s^{\text{GS}}$ being the Gold Standard parameter estimate for the $s$th repetition. The averaged focal parameter estimate under a given missing data treatment was computed as $\hat{\theta} = \sum_{s=1}^{S} \hat{\theta}_s / S$, with $\hat{\theta}_s$ being the estimate obtained from the treated incomplete data in the $s$th repetition. Following Muthén, Kaplan, and Hollis (1987), we considered PRB > 10 as indicative of problematic estimation bias.
To assess the performance in hypothesis testing and interval estimation, we evaluated the confidence interval coverage (CIC) of the true parameter value:

$$\text{CIC} = \frac{\sum_{s=1}^{S} I(\theta \in \widehat{\text{CI}}_s)}{S},$$

where $\widehat{\text{CI}}_s$ is the confidence interval of the parameter estimate $\hat{\theta}_s$ in the $s$th repetition, and $I(\cdot)$ is the indicator function that returns 1 if the argument is true and 0 otherwise. We also reported the average width of the confidence intervals (CIW), an indicator of statistical efficiency. An imputation method with a narrower confidence interval is more efficient and is therefore preferable. Nevertheless, the narrower CIW should not come at the expense of a lower-than-nominal CIC (Van Buuren, 2018, p. 52).
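For clarity, the two outcome measures can be computed as in the following sketch, where est and est_gs are vectors of per-repetition estimates for one parameter under a missing data treatment and under the gold standard, and lo and hi are the corresponding confidence limits.

prb <- function(est, est_gs) {
  theta <- mean(est_gs)                  # "true" value from the gold standard
  abs((mean(est) - theta) / theta) * 100
}

cic <- function(lo, hi, est_gs) {
  theta <- mean(est_gs)
  mean(lo <= theta & theta <= hi)        # proportion of CIs covering theta
}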

Results
We computed both PRB and CIC for each of the 27 parameters in the analysis model (six means, six variances, and 15 covariances). To summarize the results, we focus on the expected and extreme values of these measures. In Figures 1 and 2, we report the average, minimum, and maximum PRB and CIC obtained with the different missing data treatments, for each parameter type. As the GS estimates were used to define the "true" values of the parameters, the bias for this method was by definition 0.

Confidence interval width.
In Figure 3, we report the CIW obtained with the different missing data treatments, averaged per parameter type across the repetitions. All methods maintained similar CIWs independent of the dimensionality of the data. The two exceptions were MI-QP and BRidge. While the average CIW for MI-QP in the low-dimensional condition was in line with that of all the other methods, the CIWs obtained with this method for all parameter types became larger than 10 for p = 500. In the high-dimensional case, the item variance CIWs obtained by BRidge were four times as large as those obtained in the low-dimensional scenario.

A note on collinearity
Following feedback provided by a reviewer of an earlier draft of this article, we included an additional simulation study to explore the effect of collinearity. We used the same simulation procedure described above, but we adjusted some of the design parameters. We fixed the proportion of missing cases to the highest value (pm = 0.3), as this factor did not affect the relative performances of the methods. We varied the number of columns in the data (p ∈ {50, 500}) and the strength of the correlation between the potential auxiliary variables ($\rho_{pav} \in \{0, 0.6, 0.8, 0.9\}$). Correlations higher than 0.6 are unlikely in survey data, but including the higher levels provides the opportunity to explore how the imputation methods perform when faced with problematic levels of collinearity.
In Figure 4, we report the average, minimum, and maximum PRB and CIC of the 15 covariances between two imputed items. In this report, we focus on the high-dimensional condition (p = 500), and we omit $\rho_{pav} = 0$, which can be considered equivalent to the results already reported. The interactive dashboard (Costantini, 2023b) contains the complete set of results. The relative performances of the methods were mostly unchanged. However, a few key differences should be noted. First, shrinkage-based methods resulted in lower PRB and closer-to-nominal CIC for higher levels of $\rho_{pav}$. In particular, the high PRB and low CIC that characterized BRidge in the original study results were mitigated as $\rho_{pav}$ increased. For $\rho_{pav} = 0.9$, the highest bias returned by BRidge was lower than 10, and the lowest CIC was higher than 0.80.
Similar trends arose for IURR and DURR, for which higher values of $\rho_{pav}$ led to lower PRB and closer-to-nominal CIC. Second, for higher values of $\rho_{pav}$, the PRB and CIC from MI-PCA essentially mirrored those of MI-AM. Finally, in the high-dimensional condition (p = 500), MI-QP had a prohibitively long imputation time. In a small trial run of the simulation, MI-QP required around 360 minutes to impute a single data set generated with $\rho_{pav} = 0.6$ and around 1130 minutes to impute a data set generated with $\rho_{pav} = 0.9$. IURR and MI-SF, the next two most computationally intensive methods, each took around 10 minutes to impute these data sets. Consequently, we included MI-QP only in the low-dimensional condition of this additional simulation study.

EVS resampling study
We performed a resampling study based on the EVS data to assess whether the results of our simulation study would replicate in more realistic data. EVS is a high-quality survey widely used by sociologists for comparative studies between European countries (EVS, 2020b). Furthermore, it is freely available and represents the type of data social scientists regularly analyze. Variables in the EVS data are discrete numerical and categorical items following a variety of distributions.
To perform the resampling study, we treated the original EVS data as a population. We then resampled S = 1000 data sets of n units from this population, and we used these replicates as we used the multivariate normal samples in the simulation study. For each replicate, we imposed missing values, and we treated these missing values with the same methods explored in the simulation study. This procedure was repeated for low-dimensional and high-dimensional conditions. As the number of predictors in the data was fixed at p = 243, we controlled the dimensionality of the data by varying the sample size (n ∈ {1000, 300}). When the sample size was 300, after dummy coding categorical predictors, even a small proportion of missing values (pm = 0.1) led to a high-dimensional (p > n) situation. Although n = 300 might be too low to represent the typical use of EVS data, we do not see this as a limitation on the following results, for two reasons. First, our purpose in conducting this resampling study was primarily to see if our simulation results would carry over into data generated from a more realistic population model, not necessarily to see if those results would hold in a typical social science data set. Increasing the sample size to match ranges typically seen in analyses of EVS data would remove the high-dimensional condition where we saw the most interesting results in the simulation, thereby greatly reducing the utility of the resampling study. Second, many social science studies analyze data with around 300 observations, so our samples are not unrealistic in a general sense.

We restricted the population data to a subset of four countries. Properly accounting for the cross-national structure of the full data would have required either modeling the country structure directly in every imputation model or imputing under some form of a multilevel model. Since both of these options fall outside the scope of the current study, we opted to subset the data as described. We excluded all columns that contained duplicated information (e.g., recoded versions of other variables) or metadata (e.g., time of the interview, mode of data collection).
The original EVS data set contained missing values. We needed to treat these missing data before we could use the EVS data in the resampling study. We used the mice package to fill the missing values with a single round of predictive mean matching (PMM). We used the quickpred function to select the predictors for the imputation models. We implemented the variable selection by setting the minimum correlation threshold in quickpred to 0.3. The number of iterations in the mice() run was set to 200. We used a single imputation, and not MI, because this imputation procedure was used only to obtain a set of pseudo-fully observed data to act as the population in our resampling study and not for statistical modeling, estimation, or inference with respect to the true population from which the EVS data were sampled. For the same reason, the relatively poor performance that we observed for MI-QP in the simulation study is not relevant here. At the end of the data cleaning process, we obtained a pseudo-fully observed data set of 8,045 observations across four countries with p = 243 variables. For every replicate in the resampling study, we generated a bootstrap sample by sampling n observations with replacement from this data set.

Analysis models.
To define plausible analysis models, we reviewed the models reported in the repository of publications using EVS data that is available on the EVS website (EVS, 2020b). As a result, we defined two linear regression models.

Missing data imposition.
We imposed missing data on six variables using the same strategy as in the simulation study. The targets of missing data imposition were the two dependent variables in models 1 and 2 (i.e., euthanasia acceptance and left-to-right voting tendency), religiosity, and the three items making up the "nativist attitudes" scale. The response model was the same as in Equation (9), and three variables were included in $\tilde{Z}$: age, education, and an item measuring trust in new people.¹¹ We chose these predictors because older people tend to have higher item nonresponse rates than younger people, and lower educated people tend to have higher item nonresponse rates than higher educated people (De Leeuw, Hox, & Huisman, 2003; Guadagnoli & Cleary, 1992). We also assumed that people with less trust in strangers would withhold more information from the interviewer (a stranger) and would therefore show a higher nonresponse tendency.

Imputation.
We treated the missing values with the same methods used in the simulation study. MI-AM used all the variables present in either of the analysis models as predictors for the imputation models. MI-PCA was performed considering all the fully observed variables as possible auxiliary variables. In other words, the six variables with missing values were used in their raw form, while the remaining 237 were used to extract PCs. The other imputation methods were parameterized in the same way as in the simulation study, and convergence checks were performed in the same way. These convergence checks suggested that the imputation models had converged after 60 iterations.

Results
When estimating linear regression models, all partial regression coefficients can be influenced by missing values on a subset of the variables included in the model. Therefore, it is important to evaluate the estimation bias and CIC rates for all model parameters. Figure 5 reports the absolute PRBs for the intercept and all partial slopes from Model 2 obtained after using each imputation method, for both the low-and high-dimensional conditions. Model 2 has an intercept and 13 regression coefficients.
Every horizontal line in the figure represents the PRB for the estimation of one of these 14 parameters. Figures 6 and 7 report CIC and CIW results in the same way. For ease of presentation, results for Model 1 are reported in the supplementary materials.
As seen in Figure 5, in both the high-and low-dimensional conditions, DURR, IURR, BLasso, MI-CART, and MI-SF showed only slightly larger PRBs than MI-OR.
However, even MI-OR did not provide entirely unbiased parameter estimates. After imputing with MI-OR, almost half of the parameters in Model 2 were estimated with large bias (PRB > 10%). MI-PCA, MI-RF, and CC showed similar trends but produced larger PRBs (particularly CC). BRidge demonstrated the same results described in the simulation studies. It was competitive in the low-dimensional scenario, but it was inadequate with high-dimensional data (all PRBs > 10). In the low-dimensional condition, MI-QP resulted in only three parameter estimates with acceptable bias and only one in the high-dimensional condition. MI-AM resulted in six parameter estimates with acceptable bias in the low-dimensional condition but only one in the high-dimensional condition.
The CIC results are reported in Figure 6. Finally, the average CIW for every parameter estimate is reported in Figure 7. In the low-dimensional condition, all methods resulted in similar CIWs. All methods resulted in larger confidence intervals in the high-dimensional condition, reflecting a natural loss of information due to the smaller sample size used. However, BRidge, MI-QP, and MI-AM showed drastically larger CIWs for the majority of the parameters.

Imputation time.
Figure 8 reports the average imputation time for the different methods. IURR and DURR were the most time-consuming methods, with imputation times above one hour in the low-dimensional condition. In the high-dimensional condition, IURR and DURR were less time-intensive due to the smaller sample size but still took more than ten times longer than MI-PCA and BLasso. MI-PCA was the fastest method, with imputation times of under a minute in both the high- and low-dimensional conditions. BLasso, MI-OR, and MI-AM were close seconds, with imputation times of two minutes or less in both conditions. BRidge, MI-CART, MI-RF, MI-SF, and MI-QP fell in the middle, with imputation times ranging from 3.5 (MI-CART) to 15.8 (MI-SF) minutes in the low-dimensional condition and from 1.2 (MI-CART) to 12.8 (MI-QP) minutes in the high-dimensional condition.

Methods that work well
On balance, IURR, MI-SF, and MI-PCA were the strongest performers across the simulation study and the resampling study. In the simulation study, IURR and MI-SF produced trivial estimation bias for all parameters in the low-dimensional condition and for the means in the high-dimensional condition. Furthermore, the covariance estimation bias introduced by these two methods in the high-dimensional condition only slightly exceeded the PRB = 10 threshold, while most of the other MI methods resulted in covariance PRBs larger than 20 (with MI-PCA being the most salient exception).
IURR and MI-SF produced good coverages in the low-dimensional condition but tended to under-cover in the high-dimensional condition, especially for variances and covariances. In the resampling study, IURR and MI-SF were also among the strongest performers. Although they did not demonstrate the best performance, there were no conditions in which IURR or MI-SF produced unacceptable results.
The confidence interval widths of IURR and MI-SF were in line with those of the other methods. In the simulation study, the confidence intervals produced by these methods were not influenced by the dimensionality of the data. In the resampling study, their confidence intervals were wider in the high-dimensional condition than in the low-dimensional one. However, the same pattern affected most methods, and it was caused by the smaller sample size we used to achieve the p > n scenario in a data set with a fixed number of predictors. Overall, the confidence interval width pattern followed by IURR and MI-SF suggests that their imputation precision is not affected by a larger number of possible predictors.
From the end-user's perspective, IURR is an appealing method. IURR does not require the imputer to make choices regarding which variables are relevant for the imputation procedure. The only additional decision required of the imputer is selecting the number of folds to use when cross-validating the penalty parameter. As a result, an IURR imputation run is easy to specify, which makes IURR an appealing method for the imputation of large social scientific data sets. However, IURR is relatively computationally intensive. If the number of variables with missing values is large, IURR might result in prohibitive imputation time.
Similarly, an MI-SF run is easy to specify and only requires the user to choose the minimum sufficient increase in R² to use in the step-forward algorithm. However, the lack of clear guidelines on how to tune this parameter introduces more researcher degrees of freedom than the other methods. Finally, the imputation time of MI-SF was among the longest of the methods we considered.
In the simulation study, MI-PCA showed small bias and good coverage for both item means and covariances. Although it exhibited a large bias for the item variances, the (arguably more interesting) covariance relations between variables with missing values were always correctly estimated. Notably, MI-PCA was the only method resulting in small bias and close-to-nominal CIC for the covariances, even in the high-dimensional condition. When the CICs obtained with MI-PCA deviated significantly from nominal rates, they over-covered. In most situations, over-coverage is less worrisome than under-coverage as it leads to conservative, rather than liberal, inferential conclusions. In terms of confidence interval width, MI-PCA demonstrated the same pattern as IURR. In the resampling study, MI-PCA demonstrated middle-of-the-pack performance: somewhat worse than IURR, but still within acceptable levels.
In the additional simulation study evaluating the effects of collinearity, MI-PCA resulted in the same bias and confidence interval coverage as MI-AM when the potential auxiliary variables were highly correlated. This trend was caused by a subtle interaction between the data-generating model and the rule used to select the number of PCs. Importantly, this finding does not suggest that MI-PCA cannot treat highly collinear data. Rather, the poor performance seen here suggests that heuristic decision rules, such as keeping the first PC or enough components to explain 50% of the total variance, should not be mindlessly applied when running MI-PCA. Using a different non-graphical decision rule (e.g., the Kaiser criterion; Guttman, 1954; Kaiser, 1960) should preclude the problem described above and allow MI-PCA to compete with other automatic model-building strategies.
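For instance, under the Kaiser criterion, the number of components could be chosen as in the following sketch (A again denotes the complete auxiliary variables):

pca   <- prcomp(A, center = TRUE, scale. = TRUE)
n_pcs <- sum(pca$sdev^2 > 1)   # keep components with correlation-matrix eigenvalues above 1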
On balance, we believe the strong performance demonstrated by MI-PCA in the simulation study outweighs the mediocre performance shown in the resampling study.
Furthermore, as noted above, the poor performance of MI-PCA in the high-collinearity study merely represents a weakness of our current implementation, not a general flaw in the underlying method. Consequently, we view MI-PCA as a promising approach for data analysts interested in testing theories on large social scientific data sets with missing values.

Methods with mixed results
In both the simulation study and the resampling study, BRidge manifested the same mixed performance. This method worked well when the imputation task was low-dimensional but led to extreme bias and unacceptable CI coverage in nearly all the high-dimensional conditions. Furthermore, the high-dimensionality of the data led to much wider confidence intervals compared to the ones obtained by other methods. Our results suggest that BRidge is effective only for low-dimensional imputation problems or in the presence of highly collinear data. The poor performance of BRidge compared to the other shrinkage methods might be explained by the fact that BRidge used a fixed ridge penalty across all iterations, while DURR, IURR, and BLasso allowed the penalty parameter to adapt to the improved imputations.
As implemented here, MI-QP was only effective in low-dimensional settings. The instability of MI-QP in high-dimensional scenarios was apparent not only in its larger bias but also in its very wide confidence intervals. The much wider confidence intervals obtained by MI-QP in the high-dimensional scenario resulted in 100% coverage for the 95% confidence intervals, despite the large bias, revealing grossly imprecise imputations. MI-QP is also unable to address collinearity, as it selects predictors based on their bivariate relations with the variable under imputation and its missing data indicator, without considering associations between the selected predictors.
Hence, when faced with many highly correlated predictors, MI-QP can also be extremely computationally intensive due to the need to invert near-singular matrices.
DURR performed very well in the resampling study but quite poorly in the simulation study. In the resampling study, DURR was probably the best overall method in terms of bias and coverage, but it performed very badly in the high-dimensional condition of the simulation study.

MI-CART and MI-RF also produced mixed results. In the simulation study, both methods showed a large under-coverage of the true values in the high-dimensional condition. In the resampling study, MI-CART and MI-RF both showed somewhat better performance than in the simulation study, but not enough to outweigh the mediocre simulation study performance. Although the nonparametric nature of these approaches elegantly avoids over-parameterization of the imputation models, these methods were still outperformed by IURR and MI-PCA.
In the simulation study, BLasso resulted in small biases for item means and variances, even in the high-dimensional conditions, but it produced unacceptably biased covariance estimates in both the low- and high-dimensional conditions. On the other hand, BLasso seemed to recover the relationships between variables well in the resampling study, where the overall bias levels for the regression coefficients were similar to those of MI-OR. However, in terms of CI coverage, BLasso showed poor performance in both studies, resulting in either under-coverage or over-coverage for most parameters in the high-dimensional conditions.
The mixed performance of BLasso is also accompanied by a few obstacles to its application in social scientific research. Using the Bayesian lasso of Hans (2010b) requires the specification of six hyper-parameters, which introduces more researcher degrees of freedom and demands a strong grasp of Bayesian statistics. Furthermore, the method has not yet been extended to the imputation of multi-categorical data, a common task in the social sciences. As a result, we do not recommend BLasso for the imputation of large social science data sets.
Finally, we do not recommend using MI-AM to impute large social science data sets. MI-AM bypasses the need to select which of the many potential auxiliary variables should be included in the imputation models by using only the analysis model variables as predictors. Therefore, MI-AM can be effective if the MAR predictors are part of the analysis model, but, as shown in the simulation study, it can lead to biased parameter estimates if they are not. In our simulation study, smaller biases and better coverages could always be achieved by using at least one of the alternative methods we evaluated.

Limitations and future directions
The present study aimed to compare current implementations of existing high-dimensional MI methods. More generally, the results reported in this article only apply to the specific implementations of the algorithms we used. Many of the methods discussed could have been implemented differently. Zhao and Long (2016) proposed versions of IURR and DURR using the elastic net penalty (Zou & Hastie, 2005) and the adaptive lasso (Zou, 2006) instead of the lasso penalty. Although no substantial performance differences between penalty specifications emerged from the work of Zhao and Long (2016), these alternative penalties might behave differently in the imputation conditions considered here.

Conclusions
Our objective in this project was to find a good data-driven way to select the predictors that go into an imputation model. A wide range of methods have been proposed to address this issue, but little research has been done to compare their performance. With this article, we start to fill this gap and provide initial insights into applying such methods in social science research. IURR, MI-SF, and MI-PCA showed promising performance when compared to the other high-dimensional imputation approaches. While all of these methods represent good options for automatically defining the imputation models of an MI procedure, MI-PCA is the most practically appealing option due to its much greater speed. However, the current implementation of MI-PCA is limited, and making the most of this method will require further research and optimization, especially regarding how to select the number of components. Finally, Bayesian ridge regression is a good alternative when the imputer wants an automatic way of defining the imputation models in a low-dimensional setting (n ≫ p).

Disclosure statement
No authors reported any financial or other conflicts of interest in relation to the work described.