Missing data are a common problem that can compromise estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. When a large number of covariates have missing values, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values may have a different distributional form and is imputed with its own imputation model using the ridge penalty. When the number of predictors is large relative to the sample size, the quadratic penalty guarantees unique parameter estimates and leads to better predictions than the usual maximum likelihood estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides more accurate imputed values even for a large number of covariates with small samples. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies, in which missing at random was the main assumed missingness mechanism. Imputation performance is evaluated with the mean squared imputation error and the mean absolute imputation error. The mean squared error of the estimated regression coefficients, MSE(β̂), and the parameter estimates with their standard errors and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. Its performance becomes considerably better than that of the existing algorithms as the number of covariates increases, especially when the number of predictors is close to or even greater than the sample size. Two real-life datasets are also used to examine the performance of the proposed algorithm using simulations.
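
The idea of sequential ridge-penalized imputation can be illustrated with a minimal sketch. This is not the mispr implementation: it assumes that all variables are continuous, uses a fixed penalty `lam` rather than a tuned one, and adds residual noise to the predictions without propagating parameter uncertainty, so it is only a rough approximation of proper multiple imputation. The function names (`ridge_fit`, `sequential_ridge_impute`, `imputation_errors`) and the simulated data are hypothetical, and missingness is generated completely at random for simplicity. The example uses more variables than observations to show that the ridge system remains solvable in that setting.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate with an unpenalized intercept.

    For lam > 0 the matrix Xc'Xc + lam*I is positive definite, so the
    solution is unique even when X has more columns than rows.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def sequential_ridge_impute(data, lam=1.0, n_iter=10, rng=None):
    """Impute NaN entries by cycling ridge regressions over incomplete columns."""
    rng = np.random.default_rng() if rng is None else rng
    miss = np.isnan(data)
    filled = data.copy()
    # Start from column means of the observed values.
    col_means = np.nanmean(data, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in np.where(miss.any(axis=0))[0]:
            obs = ~miss[:, j]
            X = np.delete(filled, j, axis=1)  # all other variables as predictors
            b0, beta = ridge_fit(X[obs], filled[obs, j], lam)
            resid = filled[obs, j] - (b0 + X[obs] @ beta)
            pred = b0 + X[~obs] @ beta
            # Simplification: noise from the residual spread only.
            filled[~obs, j] = pred + rng.normal(0.0, resid.std(ddof=1), pred.shape)
    return filled

def imputation_errors(truth, imputed, miss):
    """Mean squared and mean absolute imputation error over the missing cells."""
    diff = imputed[miss] - truth[miss]
    return np.mean(diff ** 2), np.mean(np.abs(diff))

# Simulated example with more variables (p = 40) than observations (n = 30).
rng = np.random.default_rng(0)
n, p = 30, 40
latent = rng.normal(size=(n, 3))
truth = latent @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))

miss = rng.random((n, p)) < 0.10
miss[:2, :] = False  # keep two complete rows so every column has observed values
data = truth.copy()
data[miss] = np.nan

# Five imputed datasets, each from an independent run.
imputations = [sequential_ridge_impute(data, lam=1.0, rng=rng) for _ in range(5)]
imputed = imputations[0]
msie, maie = imputation_errors(truth, imputed, miss)
print(f"MSIE = {msie:.3f}, MAIE = {maie:.3f}")
```

In practice the penalty would be chosen per imputation model, for example by cross-validation, and categorical variables would need penalized logistic or multinomial models instead of the linear model used here.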
