Abstract
Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them to achieve a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation (C-TMLE) procedure. The original instantiation of the C-TMLE template can be presented as a greedy forward stepwise C-TMLE algorithm. It does not scale well when the number p of covariates increases drastically. This motivates the introduction of a novel instantiation of the C-TMLE template where the covariates are pre-ordered. Its time complexity is O(p), as opposed to the original O(p²), a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation, called the SL-C-TMLE algorithm, that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is O(p) as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real-world large electronic health database, and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy C-TMLE algorithm is unacceptably slow. Simulation studies seem to indicate that our scalable C-TMLE and SL-C-TMLE algorithms work well. All C-TMLEs are publicly available in a Julia software package.
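The difference in scaling between the greedy and the pre-ordered searches can be illustrated by counting candidate model fits. The sketch below is not the C-TMLE algorithm itself. It assumes, for illustration only, that each candidate covariate costs one fit, that the greedy search runs until all p covariates are added, and that the pre-ordered search adds covariates one at a time in a fixed order. The function names `greedy_forward_fits` and `preordered_fits` are hypothetical.

```python
def greedy_forward_fits(p: int) -> int:
    """Count candidate fits in a greedy forward stepwise search over p covariates.

    Assumption: at each step, every covariate not yet selected is tried once,
    and the best one is then added to the model.
    """
    fits = 0
    remaining = p
    while remaining > 0:
        fits += remaining  # one fit per remaining candidate covariate
        remaining -= 1     # the selected covariate leaves the candidate pool
    return fits


def preordered_fits(p: int) -> int:
    """Count fits when covariates are added in a fixed, pre-computed order.

    Assumption: exactly one fit per step, since there is no search
    over candidates at each step.
    """
    return p


if __name__ == "__main__":
    for p in (10, 100, 1000):
        print(p, greedy_forward_fits(p), preordered_fits(p))
```

Under these assumptions the greedy count is p(p + 1)/2, which grows quadratically in p, while the pre-ordered count grows linearly. With p = 1000, that is 500500 candidate fits versus 1000.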