Abstract
Common limitations of clustering methods include slow convergence, instability arising from the need to pre-specify several intrinsic parameters, and a lack of robustness to outliers. A recently proposed clustering approach uses a fast search for cluster centers based on their local densities. However, the selection of the key intrinsic parameters of that algorithm has not been systematically investigated. The “optimal” parameters are relatively difficult to estimate because the algorithm's original definition of local density is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, in which the local density is estimated by nonparametric multivariate kernel estimation. The model parameter can then be calculated from equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method that maximizes an average silhouette index. The advantages and flexibility of the proposed method are demonstrated through simulation studies and the analysis of several benchmark gene expression data sets. The method runs in a single step without iteration; it is therefore fast and has great potential for application to big data analysis. A user-friendly R package, ADPclust, has been developed for public use.
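To make the procedure described above more concrete, the following Python sketch illustrates the general idea: estimate a local density with a Gaussian kernel, measure each point's distance to its nearest higher-density neighbor, and choose the number of centers by maximizing the average silhouette. It is not the ADPclust implementation or its API. All function names are hypothetical, the bandwidth is a generic rule of thumb rather than the equations derived in the paper, and ranking candidate centers by the product of density and distance is an assumption made for illustration.

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance between every pair of rows of X.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def kernel_density(D, h, d):
    # Gaussian kernel density estimate evaluated at each data point.
    n = D.shape[0]
    K = np.exp(-0.5 * (D / h) ** 2)
    return K.sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** d)

def peak_distance(D, rho):
    # delta[i]: distance from i to its nearest point of higher density.
    n = len(rho)
    delta = np.empty(n)
    parent = np.full(n, -1)
    order = np.argsort(-rho, kind="stable")  # decreasing density
    delta[order[0]] = D.max()  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(D[i, higher])]
        delta[i], parent[i] = D[i, j], j
    return delta, parent, order

def assign(centers, parent, order, D):
    # Each non-center point inherits the label of its nearest
    # higher-density neighbor, visited in decreasing density order.
    labels = np.full(len(order), -1)
    labels[centers] = np.arange(len(centers))
    for i in order:
        if labels[i] == -1:
            if parent[i] >= 0:
                labels[i] = labels[parent[i]]
            else:  # densest point not chosen as a center: use nearest center
                labels[i] = labels[centers[np.argmin(D[i, centers])]]
    return labels

def average_silhouette(D, labels):
    # Mean silhouette width; requires at least two clusters.
    ks = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a singleton's silhouette is defined as 0
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == k].mean() for k in ks if k != labels[i])
        if max(a, b) > 0:
            s[i] = (b - a) / max(a, b)
    return s.mean()

def adp_sketch(X, k_range=range(2, 7), h=None):
    n, d = X.shape
    D = pairwise_dist(X)
    if h is None:
        # Assumption: a generic rule-of-thumb bandwidth, not the paper's formula.
        h = X.std(axis=0, ddof=1).mean() * n ** (-1.0 / (d + 4))
    rho = kernel_density(D, h, d)
    delta, parent, order = peak_distance(D, rho)
    # Assumption: rank candidate centers by density times distance.
    ranking = np.argsort(-(rho * delta), kind="stable")
    best = None
    for k in k_range:
        centers = ranking[:k]
        labels = assign(centers, parent, order, D)
        score = average_silhouette(D, labels)
        if best is None or score > best[0]:
            best = (score, k, centers, labels)
    return best  # (average silhouette, k, center indices, labels)

# Usage on synthetic data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + 0.5 * rng.standard_normal((40, 2)) for m in means])
score, k, centers, labels = adp_sketch(X)
```

The sketch computes the densities and distances once and then only re-evaluates the silhouette for each candidate number of centers, which mirrors the non-iterative character of the method described above. The full pairwise distance matrix makes it quadratic in the number of points, so it is suitable only for small illustrative data sets.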
