Abstract
The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and include any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully conditional specification (FCS). JM is based on parametric statistical theory, and leads to imputation procedures whose statistical properties are known. JM is theoretically sound, but the joint model may lack the flexibility needed to represent typical data features, potentially leading to bias. FCS is a semi-parametric and flexible alternative that specifies the multivariate model by a series of conditional models, one for each incomplete variable. FCS provides tremendous flexibility and is easy to apply, but its statistical properties are difficult to establish. Simulation work shows that FCS behaves very well in the cases studied. The present paper reviews and compares the two approaches. JM and FCS were applied to pubertal development data of 3801 Dutch girls who had missing data on menarche (two categories), breast development (five categories) and pubic hair development (six stages). Imputations for these data were created under two models: a multivariate normal model with rounding and a conditionally specified discrete model. The JM approach introduced biases in the reference curves, whereas FCS did not. The paper concludes that FCS is a useful, easily applied and flexible alternative to JM when no convenient and realistic joint distribution can be specified.
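The idea of specifying the multivariate model through one conditional model per incomplete variable can be illustrated with a minimal sketch. It is not the procedure used in the paper: it assumes continuous variables with normal linear conditional models (the paper's FCS application uses discrete conditional models), approximates parameter uncertainty with a simple bootstrap of the observed rows, and requires every column to have at least one observed value. The function name `fcs_impute` and its parameters are hypothetical.

```python
import numpy as np

def fcs_impute(X, n_iter=10, seed=0):
    """Return one imputed copy of X (NaN = missing) using chained
    normal linear regressions, one conditional model per incomplete column.
    Illustrative sketch only; assumes continuous data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    out = X.copy()

    # Initialise missing cells with random draws from the observed values.
    for j in range(X.shape[1]):
        observed = X[~miss[:, j], j]
        out[miss[:, j], j] = rng.choice(observed, size=int(miss[:, j].sum()))

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            m = miss[:, j]
            if not m.any():
                continue
            # Design matrix: intercept plus all other (currently completed) columns.
            A = np.column_stack([np.ones(len(out)), np.delete(out, j, axis=1)])
            # Bootstrap the observed rows so that each pass refits the model
            # on a perturbed sample (a rough stand-in for a parameter draw).
            obs_idx = np.flatnonzero(~m)
            boot = rng.choice(obs_idx, size=obs_idx.size)
            beta, *_ = np.linalg.lstsq(A[boot], out[boot, j], rcond=None)
            resid = out[boot, j] - A[boot] @ beta
            dof = max(boot.size - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Impute from the fitted conditional distribution of column j.
            out[m, j] = A[m] @ beta + rng.normal(0.0, sigma, size=int(m.sum()))
    return out
```

Calling the function several times with different seeds, for example `[fcs_impute(X, seed=s) for s in range(5)]`, yields multiple completed data sets whose analyses could then be pooled in the usual multiple-imputation fashion.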
