Abstract
The genomics era has led to an increase in the dimensionality of the data collected to investigate biological questions. In this context, dimension-reduction techniques can be used to summarise high-dimensional signals into low-dimensional ones, which can then be tested for association with one or more covariates of interest. This paper revisits one such approach, previously known as principal component of heritability and renamed here as principal component of explained variance (PCEV). As its name suggests, the PCEV seeks the linear combination of outcomes that maximises the proportion of variance explained by one or several covariates of interest. By construction, this method optimises power; however, due to its computational complexity, it has received little attention in the past. Here, we propose a general analytical PCEV framework that builds on the strengths of the original method, namely its conceptual simplicity and freedom from tuning parameters. Moreover, our framework extends the range of applications of the original procedure by providing a computationally simple strategy for high-dimensional outcomes, along with exact and asymptotic testing procedures that drastically reduce its computational cost. We investigate the merits of the PCEV using an extensive set of simulations. Furthermore, the use of the PCEV approach is illustrated using three examples taken from the fields of epigenetics and brain imaging.
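To make the maximisation described above concrete, the following is a minimal sketch of the basic low-dimensional case only, assuming more observations than outcomes so that the residual variance matrix is invertible. It splits the total variance of the outcomes into a part explained by the covariates and a residual part, and then solves the resulting generalised eigenvalue problem. The function name `pcev_sketch` and the simulated data are hypothetical and chosen for illustration; the sketch does not implement the high-dimensional strategy or the testing procedures proposed in the paper.

```python
import numpy as np

def pcev_sketch(Y, X):
    """Illustrative PCEV for Y (n x p outcomes) and X (n x q covariates).

    Assumes n - q - 1 >= p so that the residual matrix V_R is invertible.
    Returns the weight vector w and the proportion of variance of Y @ w
    explained by X.
    """
    n = Y.shape[0]
    X = np.asarray(X, dtype=float).reshape(n, -1)
    Xd = np.column_stack([np.ones(n), X])        # design matrix with intercept

    # Least-squares fit of every outcome on the covariates
    B, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    fitted = Xd @ B
    resid = Y - fitted

    # Variance decomposition: total = V_M (explained) + V_R (residual)
    Fc = fitted - fitted.mean(axis=0)
    V_M = Fc.T @ Fc
    V_R = resid.T @ resid

    # Generalised eigenproblem V_M w = lambda V_R w, solved by whitening
    # with the Cholesky factor V_R = L L^T
    L = np.linalg.cholesky(V_R)
    A = np.linalg.solve(L, np.linalg.solve(L, V_M).T)   # L^{-1} V_M L^{-T}
    A = (A + A.T) / 2                                    # enforce symmetry
    evals, evecs = np.linalg.eigh(A)                     # ascending order
    lam, u = evals[-1], evecs[:, -1]

    w = np.linalg.solve(L.T, u)                  # map back to original scale
    explained = lam / (1.0 + lam)                # w'V_M w / w'(V_M + V_R) w
    return w, explained

# Simulated example (hypothetical data): one covariate, five outcomes
rng = np.random.default_rng(0)
x = rng.normal(size=200)
Y = rng.normal(size=(200, 5))
Y[:, 0] += 0.5 * x
Y[:, 1] += 0.3 * x
w, explained = pcev_sketch(Y, x)
```

Because each single outcome is itself a linear combination of the outcomes, the proportion returned here can never fall below the largest univariate R² among the individual outcomes.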
