Abstract
Missing data are a common problem that can compromise estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. When a large number of covariates have missing values, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values is assumed to have its own distributional form and is imputed with its own imputation model using the ridge penalty. When the number of predictors is large relative to the sample size, the quadratic penalty guarantees unique parameter estimates and leads to better predictions than the usual maximum likelihood estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides more accurate imputed values even for a large number of covariates with small samples. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies, in which missing at random is the main assumption. The imputation performance of the proposed algorithm is evaluated with the mean squared imputation error and the mean absolute imputation error. The mean squared error, parameter estimates with their standard errors, and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. Its performance becomes considerably better than that of the existing algorithms as the number of covariates increases, especially when the number of predictors is close to or even greater than the sample size. Two real-life datasets are also used to examine the performance of the proposed algorithm using simulations.
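The idea of sequential ridge-penalized imputation can be illustrated with a minimal sketch. It is not the mispr implementation: it assumes all variables are continuous, uses a fixed penalty `lam`, deletes values completely at random rather than at random, and ignores parameter uncertainty, which a proper multiple imputation procedure would also draw. The function names `ridge_fit`, `sequential_ridge_impute` and `imputation_errors` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # For lam > 0 the matrix below is positive definite, so the solution
    # is unique even when X has more columns than rows.
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta

def sequential_ridge_impute(data, lam=1.0, n_iter=10, seed=0):
    """One imputed dataset; call with different seeds for several."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(data)
    # Start from column means, then refine one variable at a time.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        for j in np.flatnonzero(missing.any(axis=0)):
            obs = ~missing[:, j]
            X = np.delete(filled, j, axis=1)
            intercept, beta = ridge_fit(X[obs], filled[obs, j], lam)
            resid = filled[obs, j] - (intercept + X[obs] @ beta)
            pred = intercept + X[~obs] @ beta
            # Add residual noise so imputations are draws, not fitted means.
            filled[~obs, j] = pred + rng.normal(0.0, resid.std(), size=pred.shape)
    return filled

def imputation_errors(truth, imputed, missing):
    """Mean squared and mean absolute imputation error over missing cells."""
    diff = imputed[missing] - truth[missing]
    return np.mean(diff ** 2), np.mean(np.abs(diff))

# Toy example with more variables (40) than observations (30).
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 3))
complete = latent @ rng.normal(size=(3, 40)) + 0.5 * rng.normal(size=(30, 40))
missing = rng.random(complete.shape) < 0.2   # simplification: MCAR deletion
data = complete.copy()
data[missing] = np.nan

imputed = sequential_ridge_impute(data, lam=1.0)
msie, maie = imputation_errors(complete, imputed, missing)
print(f"MSIE = {msie:.3f}, MAIE = {maie:.3f}")
```

In this sketch the ridge term `lam * I` is what keeps each conditional regression solvable when the predictors outnumber the observed rows. An unpenalized least-squares fit would not have a unique solution in that setting.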

