Abstract
The classical and most commonly used approach to building prediction intervals is the parametric approach. However, its main drawback is that its validity and performance depend heavily on the assumed functional link between the covariates and the response. This research investigates new methods that improve the performance of prediction intervals with random forests. Two aspects are explored: the method used to build the forest and the method used to build the prediction interval. Four methods of building the forest are investigated: three from the classification and regression tree (CART) paradigm and one based on transformation forests. For CART forests, two alternative splitting criteria are investigated in addition to the default least-squares splitting rule. We also present and evaluate the performance of five flexible methods for constructing prediction intervals, yielding 20 distinct method variations. To reliably attain the desired confidence level, we include a calibration procedure performed on the out-of-bag information provided by the forest. The 20 method variations are thoroughly investigated and compared with five alternative methods through simulation studies and in real data settings. The results show that the proposed methods are very competitive, outperforming commonly used methods both in simulation settings and with real data.
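To illustrate the out-of-bag calibration idea, the sketch below pairs quantile regression forests (via the quantregForest R package) with a simple grid search over nominal levels. The simulated data, the grid, and the search rule are illustrative assumptions on my part, not the paper's exact procedure, which applies calibration to each of its forest/interval combinations.

```r
## Minimal sketch of OOB calibration of a prediction interval, assuming the
## quantregForest package; data, grid, and search rule are illustrative only.
library(quantregForest)

set.seed(1)
n <- 500
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- 2 * x$x1 + sin(8 * x$x2) + rnorm(n, sd = 0.3)

qrf <- quantregForest(x, y, nodesize = 10)

## Empirical OOB coverage of the symmetric interval at nominal level `lev`;
## in quantregForest, predict() without `newdata` returns out-of-bag quantiles.
oob_coverage <- function(lev) {
  q <- predict(qrf, what = c((1 - lev) / 2, (1 + lev) / 2))
  mean(y >= q[, 1] & y <= q[, 2])
}

## Calibrated level: the smallest nominal level whose OOB coverage reaches
## the 95% target (coverage is roughly increasing in `lev`).
grid     <- seq(0.80, 0.995, by = 0.005)
cover    <- vapply(grid, oob_coverage, numeric(1))
lev_star <- grid[which.max(cover >= 0.95)]
```

Intervals for new observations would then be built at the calibrated level, e.g. `predict(qrf, newdata, what = c((1 - lev_star) / 2, (1 + lev_star) / 2))`, so that the nominal level is tuned on OOB data rather than taken at face value.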