Abstract
Random forests are one of the state-of-the-art supervised machine learning methods and achieve good performance in high-dimensional settings where p, the number of predictors, is much larger than n, the number of observations. Repeated measurements generally provide additional information, so they are worth accounting for, especially when analyzing high-dimensional data. Tree-based methods have already been adapted to clustered and longitudinal data through semi-parametric mixed-effects models in which the non-parametric part is estimated using regression trees or random forests. We propose a general random forest approach for high-dimensional longitudinal data. It includes a flexible stochastic model that allows the covariance structure to vary over time. Furthermore, we introduce a new method that takes the intra-individual covariance into account when building the random forest. Through simulation experiments, we study the behavior of different estimation methods, especially in the context of high-dimensional data. Finally, the proposed method was applied to an HIV vaccine trial including 17 HIV-infected patients with 10 repeated measurements of 20,000 gene transcripts and of blood concentration of human immunodeficiency virus RNA. The approach selected 21 gene transcripts whose association with HIV viral load was fully relevant and consistent with results observed during primary infection.
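The semi-parametric stochastic mixed-effects model alluded to above can be summarized as follows; this is a minimal sketch, and the notation and the particular stochastic process named are illustrative assumptions rather than a verbatim statement of the estimator. For individual i, observed at times t_i with high-dimensional predictors X_i and random-effects design matrix Z_i,

\[
Y_i = f(X_i) + Z_i b_i + \omega_i(t_i) + \varepsilon_i,
\qquad
b_i \sim \mathcal{N}(0, B), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_{n_i}),
\]

where f is the unknown mean-behavior function estimated non-parametrically by random forests, b_i are individual random effects, and \omega_i(\cdot) is a zero-mean, serially correlated stochastic process (for example a Brownian motion or an Ornstein–Uhlenbeck process) that lets the intra-individual covariance structure vary over time. Estimation in such approaches typically alternates, in an EM-like fashion, between (i) fitting the forest to pseudo-responses from which the current estimates of Z_i b_i and \omega_i(t_i) have been removed, and (ii) updating the variance components given the current forest predictions.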