An integrative analysis of microRNA and mRNA expression--a case study.

Background MicroRNAs are believed to play an important role in gene expression regulation. They have been shown to be involved in cell cycle regulation and cancer. MicroRNA expression profiling became available owing to recent technology advancement. In some studies, both microRNA expression and mRNA expression are measured, which allows an integrated analysis of microRNA and mRNA expression. Results We demonstrated three aspects of an integrated analysis of microRNA and mRNA expression, through a case study of human cancer data. We showed that (1) microRNA expression efficiently sorts tumors from normal tissues regardless of tumor type, while gene expression does not; (2) many microRNAs are down-regulated in tumors and these microRNAs can be clustered in two ways: microRNAs similarly affected by cancer and microRNAs similarly interacting with genes; (3) taking let-7f as an example, targets genes can be identified and they can be clustered based on their relationship with let-7f expression. Discussion Our findings in this paper were made using novel applications of existing statistical methods: hierarchical clustering was applied with a new distance measure—the co-clustering frequency—to identify sample clusters that are stable; microRNA-gene correlation profiles were subject to hierarchical clustering to identify microRNAs that similarly interact with genes and hence are likely functionally related; the clustering of regression models method was applied to identify microRNAs similarly related to cancer while adjusting for tissue type and genes similarly related to microRNA while adjusting for disease status. These analytic methods are applicable to interrogate multiple types of -omics data in general.


Background
MicroRNAs (miRNAs) are a class of small non-coding RNAs that are believed to regulate gene expression [1,2]. The fi rst two miRNAs, lin-4 and let-7, were experimentally discovered in 1993 and 2000 [3,4]. Since then more than 4300 miRNAs have been identifi ed in plants, animals, and viruses using cDNA sequencing and computational predictions [5][6][7][8]. MiRNAs regulate their target genes, through basepairing, by inducing mRNA degradation and translational repression [1,2]. In humans, miRNAs might regulate as many as a third of the protein coding genes [9]. MiRNAs are likely to have an important impact on development in various cellular processes, such as cancer. MiRNAs have been shown to be linked to a number of cancer types in several studies on individual miRNAs [10,11].
Cancer is a complex and heterogeneous disease whose initiation and progression are infl uenced by a variety of molecular changes [12]. A complete characterization of the genomic changes may help predict the pathologic behavior of cancer. Genome-wide profi ling of gene expression has been increasingly applied in clinical settings to understand genomic changes at the mRNA level. Examples can be found in breast, colon, ovarian, and prostate cancer [13][14][15][16]. Such knowledge has improved our understanding of cancer biology and facilitated the discovery of new cancer subtypes and new biomarkers for cancer diagnosis, prognosis, and treatment [17,18].
Large-scale expression profi ling has recently become available for miRNAs as well [19]. Profi ling methods for miRNA expression are mostly based on glass-slide microarrays [20][21][22] and the latest development is the bead-based fl ow cytometry technique [23]. Genome-wide miRNA studies allow the investigation of genomic changes in cancer at the miRNA level and are likely to provide additional clues to the mechanisms of tumorigenesis [23][24][25]. In particular, when miRNA and mRNA expression are both measured on the same samples, an integrative analysis can be performed to compare miRNAs and mRNAs profi les and to study their interaction patterns. The goal of this paper is to demonstrate three aspects of such an integrative analysis through novel applications of existing statistical methods.
In this paper, we will focus on the following three aspects of an integrative analysis of mRNA and miRNA expression data, while taking into account of sample phenotype data (for example, tumor versus normal samples).
(1) Stable sample clustering based on miRNA expression in comparison with that based on gene expression. (2) Identifi cation of cancer-related miRNAs and clustering of these miRNAs into groups that similarly interact with genes and into groups that are similarly affected by cancer. (3) Identifi cation of candidate target genes for a given miRNA and clustering of these genes based on their relationship with miRNA expression and disease status.
We will demonstrate these three aspects of an integrative analysis using a published study of miRNA and mRNA expression in various types of tumor samples [23]. A set of 46 samples, whose miRNA expression and gene expression were both measured, was used in our analysis (Supplementary  Table 1). These 46 samples consist of 28 tumor samples belonging to fi ve tissue types and their 18 normal counterparts (Ͼ1 normal per tissue type). MiRNAs and genes with truncated values in Ͼ10% samples are excluded, which results in 128 miR-NAs and 7149 genes in our analysis.

Clustering samples
Pioneered by Eisen et al. [26], hierarchical clustering is the most commonly used method for sample clustering using expression profi les. With hierarchical clustering, a distance measure is calculated between the expression profi les of each gene (or gene cluster) pair, and a recursive bottom-up or top-down algorithm is then employed to merge or split genes based on their distance. Examples of distance measures include the Euclidean distance and one minus the Pearson correlation coefficient. Hierarchical clustering does not require the number of clusters to be pre-specified and has nice visualization properties (dendrogram and heatmap). Similar to many other clustering algorithms, a wellrecognized drawback of hierarchical clustering, however, is that it always generates a clustering even when there is no real underlying clustering in the data. It is not apparent whether the clustering structure refl ects a 'true' pattern in the data or is just an artefact of the clustering algorithm. Methods based on resampling have been proposed to evaluate the significance of a clustering [27][28][29]. These methods simulate perturbations of the original data and assess the stability of the clustering results.
Also based on resampling, Monti et al. proposed a method, called 'consensus clustering', that makes use of the resampling results to guide clustering [30]. Briefl y, consensus clustering quantifi es the agreement among clustering runs over the perturbed data sets, measured by a consensus matrix whose elements are the frequency that two samples are clustered together, and then performs hierarchical clustering using the consensus matrix as similarity matrix.
In the consensus clustering, the co-clustering frequency measure counts co-clustering frequency of two samples among perturbed data sets that include both samples. Instead, we apply the clustering of each perturbed data set to classify samples in the original data set using the nearest-centroid method and then count the frequency of two samples being classifi ed together among all perturbations. We will call this method as 'stable hierarchical clustering'. We used a partitional clustering method, PAM (partitioning around medoids) [31], to cluster each perturbed data set in this paper. Details of the stable hierarchical clustering method are provided in Method section.
We fi rst applied stable hierarchical clustering to identify stable sample clusters based on miRNA expression (Fig. 1a). Interestingly, except for three colon tumors, tumor samples were well separated from normal samples, regardless of tissue type. A potential explanation of the mis-clustering of the three colon tumors is normal tissue contamination, which colorectal cancer is prone to. The three colon tumors were excluded from our subsequent analysis. Nonetheless, this clustering result suggests that miRNA expression has the potential of distinguishing tumors from normal samples for clinical diagnosis.
We also applied stable hierarchical clustering to cluster samples based on gene expression. It did not separate tumors from normal samples as efficiently and it tended to recognize tissue types rather than disease status (Fig. 1b). A possible explanation is the following: (i) only a small number of genes have signals differentiating cancer from normal and a large number of genes are only adding noise, which might obscure or distort the signals; and (ii) miRNAs are upstream in the regulatory network and thus might contain more accurate information about the state of the sample.
Hierarchical clustering has been applied to cluster samples in the original publication by Lu et al. [23]; however, they clustered tumor samples only and discovered clusters refl ecting various tumor characteristics. We showed, through subjecting both tumors and normals to clustering and adopting a new distance measure for clustering, that miRNA expression can clearly distinguish tumors from normals, regardless of tissue type. Lu et al. showed the inferiority of mRNA expression in distinguishing GI versus Non-GI tumors, while we showed its inferiority in distinguishing tumors versus normals.

Identifying and clustering cancer-related miRNAs
Filtering has been commonly adopted as a useful pre-processing step, to remove uninformative genes and to reduce computational burden, for gene clustering. We applied similar fi ltering step for miRNA clustering. Specifi cally, we selected a subset of miRNAs that are related to cancer, by modelling miRNA expression using a per-gene linear regression model with disease status and tissue type as covariates [32] and evaluating the signifi cance of the correlation with disease status using an Empirical Bayesian t test [33]. Among the 128 miRNAs, 89 (70%) were found to be related to cancer at the signifi cance level of 0.001. The 89 miRNAs were all down-regulated in tumors. In the subsequent cluster analysis, we focused on a subset of 38 miRNAs, which had both a small p-value (Ͻ0.001) and a large fold change (Ͼ3) ( Table 1, Fig. 2).
To better understand the grouping structure among these 38 cancer-related miRNAs, we clustered them in two ways. One is to identify groups of miRNAs that similarly interact with genes. The other is to identify groups of miRNAs that are similarly affected by disease status.
MiRNAs were clustered based on their correlation patterns with gene expression. The  correlation matrix between miRNA expression and gene expression were calculated and graphically displayed with a heatmap, where rows are genes and columns are miRNAs. MiRNAs were then clustered based on this correlation matrix, so that miRNAs with similar correlation profi les (ie. similar columns of the correlation matrix) were clustered together. For the clustering step, the hierarchical clustering method was employed, with the Euclidean distance as the distance measure. MiRNAs closer on the dendrogram share similar correlation pattern with genes and are thus likely functionally related. The miRNA-gene correlation heatmap and the corresponding miRNA clustering were generated separately for tumor samples and normal samples (Figs. 3a-b). Of note, several let-7 family miRNAs, although belonging to different clusters based on mean expression levels, share similar correlation patterns with gene expression profi les in both tumor and normal tissues.
MiRNAs were also clustered based on their relationship to disease status. A new clustering method, the clustering of regression models (CORM) method, models expression using regression and assumes that miRNAs in the same cluster share the same regression coeffi cients [34]. This method tends to provide more stable clustering than K-means clustering, as it explicitly models different sources of variations and bases clustering solely on the systematic variation [35]. Using the CORM method with disease status and tissue type as the covariates, miRNAs were clustered so that miRNAs in the same cluster have similar mean expression among tumor samples and normal samples for each tissue type (Table 1, Fig. 4). It is reassuring to see that variants of the same miRNA tend to belong to the same cluster. For example, let-7f and let-7g are both in cluster 2. Of note, miRNAs in cluster 2 (let-7f, let-7g, miR-15a, miR-30c, and miR-126) are expressed at a similar level among normal samples and downregulated at a similar level among tumors, regardless of tissue type. In addition, miRNAs in cluster 4 (miR-199a, miR-199b, miR-200a, and miR-214) are signifi cantly down-regulated in kidney tumors. Interestingly, miR-200a was fi rst cloned in mouse kidney tissue and its expression was confi rmed in humans [36].

Identifying and clustering target genes for let-7f
MiRNAs are thought to negatively regulate mRNA in one of two ways depending on the degree of complementarity between the miRNA and its target [37]: (1) miRNAs that bind perfectly to their target's coding sequence are thought to result in mRNA degradation, and (2) miRNAs that bind with imperfect complementarity to the 3' UTR block target gene expression at the level of protein translation.
Target gene prediction is an important but complicated task for miRNA studies [38]. Several    algorithms have been proposed for miRNA target prediction, which mostly rely on the assumption of base pairing and evolutionary conservation. Examples of the sequence-based prediction algorithms include MiRanda, PicTar, and TargetScanS [39][40][41][42]. A large number of miRNA targets are still unknown [43]. MiRNA expression profi ling provides an alternative for identifying target genes, especially those targeted through degradation, by correlating miRNA and gene expression. It can potentially provide in vivo evidence of gene targeting, as opposed to the in silico evidence provided by the sequence-based prediction algorithms. We took let-7f as an example to demonstrate how to identify and cluster miRNA target genes using miRNA and gene expression. For each gene, the expression level is modeled using a linear regression model with let-7f expression, disease status, and their interaction as the covariates. This full model is then compared to a reduced linear regression model with disease status as the covariate, using a likelihood ratio test, to evaluate the association between gene expression and let-7f expression. A set of 178 genes showed signifi cant association with let-7f expression using a signifi cance cut-off of p-value = 0.001 (Supplementary Table 2). These 178 are potential let-7f target genes, with a false discovery rate of about 0.015 [44]. These 178 genes are enriched in nucleic acid binding and regulation of DNA replication or transcription; a number of the predicted target genes are related to cancer, such as RASSF7, RAB34, ARAF, BCL2L14, MLL3, MORF4L2, PERP, and SELENBP1. The RAS family has been shown to be let-7 targets experimentally [45]. In our analysis, RASSF7 (alias HRAS1) and RAB34 (member of RAS super-family) were predicted to be let-7f targets. Specifi cally, RASSF7 is negatively correlated with let-7f expression among normal samples and positively correlated among tumors (Fig. 5a). The opposite pattern holds for RAB34 (Fig. 5b).
Using CORM with let-7f expression, disease status, and their interaction as covariates, the 178 genes were clustered so that genes in the same cluster vary similarly as let-7f varies given the disease status (Supplementary Table 2). Interestingly, clusters 1-4 seem to be mirror images of clusters 5-8, respectively (Fig. 6). Genes in cluster 1 are negatively correlated with let-7f in normal samples; genes in cluster 2 and 3 are negatively correlated in normal samples and positively correlated in tumors; genes in cluster 4 are positively correlated in tumors; and genes in cluster 9 are positively correlated in normal samples and negatively correlated in tumors. RASSF7 and ARAF belong to cluster 3, while RAB34 and PERP belong to cluster 6.

Stable hierarchical clustering
Stable hierarchical clustering groups samples based on the co-clustering frequency among repeated bootstrap sampling. Specifically, (i) bootstrap sample sets are generated by resampling with replacement from the original sample set; (ii) for each bootstrap sample set, samples were partitioned to a prespecifi ed number of clusters using PAM (partitioning around medoids) method and the corresponding cluster centers are applied to classify the samples in the original sample set using the nearest-centroid method; (iii) for each sample pair in the original sample set, the frequency of being assigned to the same cluster is calculated across bootstrap sample sets. The co-clustering frequency is then used as a similarity measure for hierarchical clustering to identify stable clusters of samples.

Clustering of regression models method
As for per-gene regression analysis, CORM uses regression to model systematic variation in expression levels but, in addition, assumes that  genes in the same cluster share the same values of regression coefficients [34]. To identify gene clusters, CORM was applied using the same regression model as for their per-gene analysis.
Let X gi (n gi × p) denote the design matrix for gene g and sample i, F βk,ξk the conditional distribution of genes in cluster k given the covariates with parameters β k and ξ k , β k (p × 1) the vector of regression coeffi cients, and µ(.; .) the regression function. The model underlying CORM can be written as where u g is a random variable on (1, 2, ... , K) with probabilities (π 1 , π 2 , ... , π K ). Complete specification of the CORM modeling framework requires identifi cation of the error structure The clustering of linear models (CLM) method can be applied to cross-sectional data to fi nd genes whose expression levels are similarly related to a set of covariates. In cross-sectional data, a single expression value is measured for a gene on a sample; hence, y gi reduces to y gi and X gi to x gi . The underlying model for CLM can be written as The EM algorithm can be used to fi t the CLM model, and implementation details can be found in [34].
For the analysis of the human cancer data in this paper, the CLM method was used to (1) identify miRNAs similarly related to disease status and tissue type with disease status and tissue type as covariates, and (2) identify genes similarly related to let-7f and Figure 6. Profi le plot for CORM clusters of the 178 genes predicted as let-7f targets. Genes were clustered based on their relationship with let-7f and disease status using the CORM method. Each panel is for a gene cluster and each line is for a gene. X axis represents four conditions: normal, normal with one unit let-7f expression, tumor, and tumor with one unit let-7f expression. Y axis represents the average gene expression level for a given condition.
disease status with let-7f expression level, disease status, and their interaction as covariates.
CLM is closely related to K-means clustering, both being partitional clustering; however, Kmeans clusters gene based on the expression levels directly, while CLM based on the relationship between expression and the covariates and hence pools information across samples. Comparing to K-means, CLM tends to identify more stable clusters across samples [35].

Discussion
The study of miRNAs has received a lot of attention lately [1,2]. There is evidence that miRNAs are involved in animal development and cell cycle regulation. MiRNAs might play an important role in cancer and in regulating cancer-related genes [10,11,23,45]. Our fi ndings in this paper suggest the signifi cance of microRNA expression itself in cancer diagnosis. Using let-7f as an example, we showed that its expression is correlated with the expression of a number of known oncogenes and the directionality of the correlation (positive correlation versus negative correlation) may be different in tumors and in normal tissues.
Target prediction is an important component in understanding miRNAs and their functions. As an alternative to existing sequence-based algorithms, an expression-based strategy for miRNA target prediction was proposed in this paper and its feasibility was demonstrated through an application to a human cancer data set. Like the sequencebased predictions, the expression-based predictions also have limitations. For example, correlation is not direct interaction and genes correlated with a miRNA might be down-stream genes of miRNA direct targets. Rather the two predictions are complementary and could be combined to prioritize candidate targets for experimental validation.
Many types of high-throughput '-omics' data have recently emerged, such as gene copy number, gene expression, and proteomics data. The interpretation and integration of these data pose a challenge for both experimental and quantitative scientists in this fi eld. The analytic methods in this paper provide a new tool to interrogate these high-throughput data in an integrative fashion. In particular, CORM has been previously applied to data collected under various experimental designs, such as cross sectional, longitudinal with no replication, and longitudinal with replications. In this paper, we demonstrated yet another application of CORM to clustering miRNAs or genes with respect to specifi c covariates of interests. We have focused on the cluster analysis in this paper, including the clustering of samples, miRNAs, and genes. These exploratory analyses are one aspect of an integrative analysis of miRNA and gene expression. We will investigate other types of integrative analysis in the future to gain a better understanding of the relationship between miRNAs and genes as well as their joint behaviors.