Abstract
This paper addresses hypothesis testing in lasso regression, where one wishes to assess the statistical significance of regression coefficients in models involving many covariates. To obtain reliable p-values, we propose a new lasso-type estimator based on the idea of induced smoothing, which makes it relatively straightforward to derive an appropriate covariance matrix and the corresponding Wald statistic. Simulation experiments show that our approach performs well when contrasted with recent inferential tools in the lasso framework. Two real-data analyses illustrate the proposed framework in practice.
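The general idea sketched in the abstract, replacing the non-differentiable lasso penalty with a smooth surrogate so that standard Hessian-based (sandwich) covariance and Wald machinery become available, can be illustrated as follows. This is a generic sketch, not the authors' estimator: the particular surrogate \(\sqrt{\beta_j^2 + \varepsilon} \approx |\beta_j|\), the plug-in variance estimate, and all function names are illustrative assumptions.

```python
import numpy as np
from math import erfc, sqrt

def smoothed_lasso_fit(X, y, lam=0.1, eps=1e-4, n_iter=5000, lr=0.01):
    """Minimize (1/2n)||y - Xb||^2 + lam * sum_j sqrt(b_j^2 + eps)
    by gradient descent; sqrt(b^2 + eps) is a smooth stand-in for |b|."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n + lam * b / np.sqrt(b**2 + eps)
        b -= lr * grad
    return b

def wald_pvalues(X, y, b, lam=0.1, eps=1e-4):
    """Wald-type p-values from a sandwich covariance built on the
    Hessian of the smoothed objective (illustrative plug-in version)."""
    n, p = X.shape
    resid = y - X @ b
    sigma2 = resid @ resid / max(n - p, 1)          # naive variance plug-in
    # Hessian: curvature of the loss plus curvature of the smooth penalty
    H = X.T @ X / n + lam * np.diag(eps / (b**2 + eps) ** 1.5)
    Hinv = np.linalg.inv(H)
    # sandwich: Hinv * Var(score) * Hinv, with Var(score) = sigma2 * X'X / n^2
    cov = Hinv @ (sigma2 * X.T @ X / n**2) @ Hinv
    se = np.sqrt(np.diag(cov))
    z = b / se
    return np.array([erfc(abs(zi) / sqrt(2.0)) for zi in z])  # two-sided
```

Because the smoothed objective is everywhere differentiable, the estimator and its covariance come from ordinary calculus; no resampling or post-selection conditioning is needed, which is the practical appeal the abstract points to.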