Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of these is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them to achieve a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation (C-TMLE) procedure. The original instantiation of the C-TMLE template can be presented as a greedy forward stepwise C-TMLE algorithm. It does not scale well when the number p of covariates increases drastically. This motivates the introduction of a novel instantiation of the C-TMLE template where the covariates are pre-ordered. Its time complexity is O(p), as opposed to the original O(p²), a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation, the SL-C-TMLE algorithm, that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is O(p) as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real-world large electronic health database, and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy C-TMLE algorithm is unacceptably slow. Simulation studies seem to indicate that our scalable C-TMLE and SL-C-TMLE algorithms work well. All C-TMLEs are publicly available in a Julia software package.
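The gap between the O(p²) greedy search and the O(p) pre-ordered search can be illustrated by counting candidate fits. The Python sketch below is only a toy under stated assumptions, not the C-TMLE algorithm itself: the function names, the `score` callback (standing in for whatever loss a real implementation would compute after each targeting step), and the `weights` table are all hypothetical, and the sketch is unrelated to the Julia package mentioned above.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for the loss of a candidate estimator; lower is better.
Score = Callable[[List[str]], float]


def greedy_forward(covariates: Sequence[str], score: Score) -> Tuple[List[str], int]:
    """Greedy forward stepwise search: at each step, score every remaining
    covariate and add the one giving the lowest score."""
    selected: List[str] = []
    remaining = list(covariates)
    n_fits = 0
    while remaining:
        scores = {c: score(selected + [c]) for c in remaining}
        n_fits += len(remaining)  # one candidate fit per remaining covariate
        best = min(remaining, key=scores.__getitem__)
        selected.append(best)
        remaining.remove(best)
    return selected, n_fits


def preordered(
    covariates: Sequence[str], rank: Callable[[str], float], score: Score
) -> Tuple[List[str], int]:
    """Pre-ordered search: sort the covariates once, then score the p nested
    candidates obtained by adding them in that fixed order."""
    ordered = sorted(covariates, key=rank)
    n_fits = 0
    for k in range(1, len(ordered) + 1):
        score(ordered[:k])  # one candidate fit per step
        n_fits += 1
    return ordered, n_fits


# Toy setting (assumption): each covariate has a fixed, additive usefulness.
weights = {"w1": 0.5, "w2": 2.0, "w3": 1.0, "w4": 3.0, "w5": 0.1}


def toy_score(subset: List[str]) -> float:
    return -sum(weights[c] for c in subset)


greedy_order, greedy_fits = greedy_forward(list(weights), toy_score)
pre_order, pre_fits = preordered(list(weights), lambda c: -weights[c], toy_score)
print(greedy_fits, pre_fits)  # 15 5
```

With p covariates the greedy search scores p + (p - 1) + ... + 1 = p(p + 1)/2 candidates, while the pre-ordered search scores exactly p. In this toy case the score is additive, so both searches visit the covariates in the same order; in general a pre-ordering only approximates the greedy path, which is why the choice of ordering strategy matters.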
