Common limitations of clustering methods include slow convergence, sensitivity to the pre-specified values of several intrinsic parameters, and a lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm that finds cluster centers based on their local densities. However, the selection of the key intrinsic parameters of that algorithm was not systematically investigated. The “optimal” parameters are relatively difficult to estimate because the algorithm originally defines local density through a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, in which the local density is estimated by nonparametric multivariate kernel estimation. The model parameter can then be calculated from equations that have a statistical theoretical justification. We also develop an automatic cluster centroid selection method that maximizes an average silhouette index. The advantages and flexibility of the proposed method are demonstrated through simulation studies and the analysis of several benchmark gene expression data sets. The method runs in a single step without any iteration, so it is fast and has great potential for big data analysis. A user-friendly R package, ADPclust, has been developed for public use.
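
The procedure summarized above can be illustrated with a minimal Python sketch. It is not the ADPclust implementation, and every function name in it is hypothetical. Two details are assumptions that the abstract does not specify: the bandwidth uses a normal-reference (Scott-style) rule of thumb, and candidate centers are ranked by the product of local density and distance to the nearest denser point. The number of clusters is then chosen by maximizing the average silhouette over a small candidate range.

```python
import math
import random


def bandwidth(points):
    """Assumed rule of thumb: mean per-axis SD times n^(-1/(d+4))."""
    n, d = len(points), len(points[0])
    sds = []
    for k in range(d):
        col = [p[k] for p in points]
        mean = sum(col) / n
        sds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1)))
    return (sum(sds) / d) * n ** (-1.0 / (d + 4))


def kernel_density(points, h):
    """Unnormalised Gaussian kernel density evaluated at each sample point."""
    return [sum(math.exp(-math.dist(p, q) ** 2 / (2 * h * h)) for q in points)
            for p in points]


def nearest_denser(points, dens):
    """For each point: distance to, and index of, the nearest denser point."""
    n = len(points)
    delta, parent = [0.0] * n, [-1] * n
    for i in range(n):
        # The index breaks density ties so the ordering is strict.
        denser = [j for j in range(n) if (dens[j], j) > (dens[i], i)]
        if denser:
            parent[i] = min(denser, key=lambda j: math.dist(points[i], points[j]))
            delta[i] = math.dist(points[i], points[parent[i]])
        else:
            # Global density peak: use its largest distance to any point.
            delta[i] = max(math.dist(points[i], q) for q in points)
    return delta, parent


def assign(dens, delta, parent, k):
    """Pick k centers by density * delta (assumed heuristic); every other
    point inherits the label of its nearest denser neighbour."""
    n = len(dens)
    ranked = sorted(range(n), key=lambda i: (dens[i] * delta[i], dens[i], i),
                    reverse=True)
    label = {c: cid for cid, c in enumerate(ranked[:k])}
    # Visit points from densest to sparsest so each parent is labelled first.
    for i in sorted(range(n), key=lambda i: (dens[i], i), reverse=True):
        if i not in label:
            label[i] = label[parent[i]]
    return [label[i] for i in range(n)]


def average_silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / (len(own) - 1)
        b = min(sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
                for other, idx in clusters.items() if other != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def adaptive_density_peaks(points, k_candidates=range(2, 6)):
    """Single pass: densities, deltas, then the k with the best silhouette."""
    dens = kernel_density(points, bandwidth(points))
    delta, parent = nearest_denser(points, dens)
    scored = [(average_silhouette(points, assign(dens, delta, parent, k)), k)
              for k in k_candidates]
    best_k = max(scored)[1]
    return best_k, assign(dens, delta, parent, best_k)


# Toy data: two well-separated Gaussian blobs of 30 points each.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)] + \
       [(rng.gauss(12, 1), rng.gauss(12, 1)) for _ in range(30)]
best_k, labels = adaptive_density_peaks(data)
print(best_k, labels)
```

The densities and nearest-denser distances are computed once and reused for every candidate number of clusters, which mirrors the single-step, non-iterative character described in the abstract. The pairwise loops are quadratic in the number of points, so this sketch is meant for small examples only.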
