Public databases and software for the pathway analysis of cancer genomes.

The study of pathway disruption is key to understanding cancer biology. Advances in high-throughput technologies have led to the rapid accumulation of genomic data. This explosion in available data has generated opportunities to investigate concerted changes that disrupt biological functions, which in turn has created a need for computational tools for pathway analysis. In this review, we discuss approaches to the analysis of genomic data and describe the publicly available resources for studying biological pathways.


Background
The development of cancer involves the accumulation of genetic and epigenetic alterations. Genetic events such as chromosomal rearrangements, changes in gene dosage, and sequence mutations can influence gene expression patterns, which contribute to the hallmark phenotypes of cancer 1,2. The interaction between pathways and the involvement of pathways in multiple phenotypes complicate the interpretation of gene expression patterns. For example, the epidermal growth factor receptor (EGFR, HER1, ERBB-1) signaling pathway plays a role in specific phenotypes including resistance to apoptosis, increased proliferation, mitogenesis, transcription of numerous target genes, and actin reorganization, in several cancers (Fig. 1) 1,3,4. In order to decipher the interactions within and between pathways, computational tools are necessary to annotate components, to identify co-regulated expression, and to identify sets of genes or pathways that are statistically over- or under-represented within a dataset.

Methods for Gene Classification
A major analytical step in mining large microarray datasets is the classification of samples or the identification of gene sets with characteristic biological functions. Entrez Gene at the National Center for Biotechnology Information (NCBI) provides unique identifiers for genes, and is a searchable database providing gene-specific information and links to external databases, including the Gene Ontology (GO) consortium, KEGG, and Reactome 5. A limitation of Entrez Gene is that genes are searched individually, which can be time-consuming. Here, we describe the Gene Ontology (GO), a structured language for annotating gene functions that supports batch processing, and also methods of clustering analysis. Clustering algorithms identify patterns within a dataset, and these patterns can subsequently be examined by GO analysis to identify the underlying biology.

Gene Ontology annotation
The Gene Ontology (GO) Consortium was established in 2000 to provide a controlled vocabulary for annotating homologous gene and protein sequences in different organisms 6,7. GO classifies genes and gene products using three hierarchical structures that describe a given entry's biological processes, cellular components, and molecular functions, and organizes the terms into parent-child relationships 6. Through easy online access (http://www.geneontology.org), the genome databases are being unified to expedite the retrieval of information on genes and proteins based on biology shared among multiple organisms. Several software tools, including GoMiner 8,9, use GO terms to categorize gene expression changes.
Figure 1. Example of EGFR-mediated signaling changes, a commonly disrupted pathway in lung cancer. The EGFR pathway could be disrupted by an increased expression of growth factor ligands. By targeting EGFR with tyrosine kinase inhibitors (TKIs) and monoclonal antibodies (MAbs), EGFR activity can be eliminated. However, a downstream factor (e.g. the MAPK signaling pathway) may also be activated to disrupt the pathway, thus making TKIs ineffective. Pathway data were obtained and selected from the Cancer Cell Map database and drawn using Cytoscape.
Biological systems are integrative, with tightly regulated processes, and genes with similar functions often exhibit coordinated expression patterns 13-16. Transcriptional profiling studies typically aim to identify patterns of change among clinically related samples or to classify subgroups of samples 15-17. Clustering of microarray data, performed in a supervised or unsupervised manner, is widely used to identify groups of genes that display coordinated expression patterns (Fig. 2) 13,14,17-21. Unsupervised clustering classifies data without a priori labeling of samples, whereas supervised clustering classifies data based on knowledge of sample type (e.g. cancer subtype) 21-24. Clustering techniques are generally classified into two types: hierarchical and partitional 25,26. Hierarchical clustering is constructed by either agglomerative (bottom-up) or divisive (top-down) approaches 25. Agglomerative algorithms begin with separate clusters and merge them into successively larger clusters, while divisive algorithms begin with the whole dataset and divide the data into successively smaller clusters 25. The output of agglomerative clustering is a tree of clusters called a dendrogram, in which each branch represents a group of genes that have a higher-order relationship (Fig. 2B) 25,27. Partitional clustering directly reduces the dataset into a set of non-overlapping clusters 26. Representative algorithms of partitional clustering include k-means clustering and self-organizing maps (SOM) 25. k-means clustering requires the user to define the number of clusters, k 26,28, and SOM partitions data into a two-dimensional grid of clusters 13,29,30. Hierarchical clustering, however, is more frequently used 17-20,30. Detailed reviews of clustering algorithms are available, and this topic will not be discussed further in this review 26,31-33.
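The agglomerative (bottom-up) procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any tool cited in this review: it assumes Euclidean distance and average linkage (neither choice is specified in the text), and the function name, gene names, and expression values are hypothetical placeholders.

```python
from itertools import combinations
from math import dist


def agglomerative(profiles, n_clusters):
    """Average-linkage agglomerative clustering (illustrative sketch).

    profiles: dict mapping a gene name to its expression vector.
    Returns (clusters, merges): the final clusters as sorted lists of names,
    and the merge history, which is the information a dendrogram displays.
    """
    # Bottom-up start: every gene is its own cluster.
    clusters = [[name] for name in profiles]
    merges = []

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters), merges


# Made-up expression values across four samples (placeholders, not real data).
profiles = {
    "geneA": [1.0, 1.1, 5.0, 5.2],
    "geneB": [0.9, 1.0, 5.1, 5.0],
    "geneC": [5.0, 5.1, 1.0, 0.9],
    "geneD": [5.2, 4.9, 1.1, 1.0],
}
clusters, merges = agglomerative(profiles, n_clusters=2)
```

Here geneA and geneB merge first, then geneC and geneD, so `clusters` holds two groups of co-expressed genes and `merges` records the order and distance of each merge. A divisive algorithm would run in the opposite direction, starting from one cluster containing all four genes.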

Dimensionality reduction
Figure 2. Graphical output display of heatmap, hierarchical clustering, and principal component analysis. A: A heatmap representation of 30 simulated profiles helps the user to easily visualize four groups of samples along the x-axis with distinct characteristic expression patterns for 300 genes. A heatmap facilitates the grouping of altered genes and sample clusters, but does not convey any spatial relationship between clustered samples. B: A dendrogram generated from hierarchical clustering of the simulated data represented in Figure 2A. A dendrogram is a tree diagram consisting of many U-shaped lines connecting objects to represent hierarchical clusters. In this dendrogram, four clusters of samples are formed based on distinct expression signatures. C: A two-dimensional graphical visualization of principal component analysis (PCA) based on the simulated data shown in Figure 2A. Samples are color-coded based on the four clusters observed by hierarchical clustering in 2B.
Dimensionality reduction minimizes the number of input variables so that coherent patterns of gene expression can be found efficiently 25,34,35. Algorithms such as principal component analysis (PCA) and multidimensional scaling (MDS) both employ this technique for classification procedures 25,34,36,37. PCA visualizes multidimensional datasets by projecting the data into a subspace with 2 or 3 dimensions (Fig. 2C) 34,35,37,38. The three-dimensional graphical display of MDS can be useful for portraying relationships among the data points, but it might be complex to interpret and require subjective judgment.
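As an illustration of the projection step, the sketch below reduces a small expression matrix to two principal components using only the Python standard library. It makes several assumptions that are not from the text: the components are found by power iteration with deflation (numerical libraries typically use a singular value decomposition instead), the data must vary in at least two independent directions, and the function name and sample values are hypothetical.

```python
import random


def pca_2d(rows, n_iter=200, seed=0):
    """Project samples onto their first two principal components (sketch).

    rows: list of samples, each a list of expression values (one per gene).
    Returns a list of (pc1, pc2) coordinates, one per sample.
    """
    n, d = len(rows), len(rows[0])
    # Center each gene (column) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix between genes (d x d).
    cov = [[sum(row[a] * row[b] for row in x) / (n - 1) for b in range(d)]
           for a in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(n_iter):
            # Power iteration step: multiply by the covariance matrix.
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Remove directions already found so the next one is orthogonal.
            for c in components:
                dot = sum(wi * ci for wi, ci in zip(w, c))
                w = [wi - dot * ci for wi, ci in zip(w, c)]
            norm = sum(wi * wi for wi in w) ** 0.5
            v = [wi / norm for wi in w]
        components.append(v)
    # Coordinates of each centered sample along the two components.
    return [tuple(sum(xi * ci for xi, ci in zip(row, c)) for c in components)
            for row in x]


# Four made-up samples measured on three genes (placeholders, not real data).
samples = [
    [1.0, 1.2, 0.1],
    [1.1, 0.9, 0.0],
    [5.0, 5.1, 0.2],
    [5.2, 4.9, 0.1],
]
coords = pca_2d(samples)
```

Plotting `coords` would place the first two samples on one side of the first axis and the last two on the other. The sign of each axis is arbitrary, so only the relative positions of the samples carry meaning.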
Classification analysis may reveal patterns in experimental datasets. An identified pattern may then be evaluated for biological interpretation using tools such as GO and/or Entrez Gene. However, an inherent limitation of pre-processed databases is that their content is subject to the interpretation of the curator. Therefore, further validation should be considered. In a study conducted under the hypothesis that members of the same cluster would share related biological annotations, the majority of the clusters generated by three different clustering algorithms did not correspond well with known biology 39. Furthermore, the different clustering algorithms need to be improved to enhance the consistency of their results 39,40. It is crucial to associate biological functions or regulatory pathways with each identified cluster of genes in order to deduce the biological significance of each sample group 41.
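One common way to test whether an annotation category is over-represented in a cluster of genes is a one-sided hypergeometric test. The sketch below shows that calculation under stated assumptions: it is not claimed to be the exact statistic used by any tool named in this review, the function and parameter names are hypothetical, and the counts in the example are invented.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric P(X >= n_overlap).

    n_universe:  number of genes measured in the experiment
    n_annotated: genes in the universe carrying the annotation of interest
    n_selected:  genes in the cluster being tested
    n_overlap:   selected genes that carry the annotation
    """
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    favorable = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favorable / total


# Invented counts: 20 genes measured, 5 annotated, a cluster of 4 with 3 annotated.
p = enrichment_p_value(n_universe=20, n_annotated=5, n_selected=4, n_overlap=3)
```

With these counts, `p` equals 155/4845, or about 0.032. In practice many categories are tested at once, so a multiple-testing correction would normally be applied to the resulting p-values.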

Construction of Pathway Database
A remarkable number of published articles have collectively yielded thousands of molecular interactions for humans and for model species. The challenge is to extract these individual interactions from the literature and to comprehend the dynamics of the interlocking networks as a whole. In recent years, massive efforts have been devoted to managing, integrating, and interpreting the available scientific information in a meaningful manner (i.e. building interactomes or networks of genes and pathways) 42,43. Three categories of information are essential for the construction of interactome databases: gene and protein sequences, gene and protein biological information, and molecular interaction resources (Fig. 3). The major repositories of gene and protein sequences are listed in Table 1. Examples of nucleotide sequence databases include NCBI GenBank, EMBL, and DDBJ, all of which are part of the International Nucleotide Sequence Database Collaboration, which facilitates data exchange and enhances accuracy 44-47. The major databases for gene and protein biological information are listed in Table 2. Gene Ontology (GO), OMIM, Entrez Gene, and Universal Protein Resource Knowledgebase (UniProtKB) are the foundation for building these hierarchical databases 5,7,48,49. The main publicly available molecular interaction databases are listed in Table 3. Currently, DIP, IntAct, MINT, HPRD, and MIPS all support the Human Proteome Organization (HUPO) Proteomics Standards Initiative Molecular Interaction (PSI-MI) standard format 50-55. This unified data standard represents molecular interaction data in a controlled vocabulary, which facilitates data comparison, exchange, and the linking of queries 51.
The wealth of biological resources can complicate the construction of pathway databases (Fig. 4). When assembling information into a pathway database, developers must be careful to distinguish interactions that are deduced from hypothetical situations from those that have been experimentally confirmed. Within the latter group, care must also be taken to determine whether interactions were confirmed in a single direct experiment or in a high-throughput experiment. Furthermore, when natural language processing (NLP) systems are used to automate the extraction of information from published articles and to identify relationships between gene and protein names or interactions, the output must be reviewed for biological relevance 56,57. NLP is useful as a first-pass tool for mining and extracting the knowledge in the literature. However, the constantly advancing nature of research, the ongoing refinement of the biological knowledge associated with each gene or protein, the incompleteness of annotation databases, and the complexity of entity names in the biological domain often make it difficult for NLP to produce consistently high-quality results.
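The distinctions drawn above (hypothetical versus experimentally confirmed, direct versus high-throughput, and text-mined) can be carried as an explicit evidence label on every interaction record, so that downstream analyses can choose which classes to trust. The sketch below is one hypothetical data model, not the PSI-MI schema or the schema of any database listed in this review; the class names, field names, and records are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    PREDICTED = "predicted"              # deduced, not experimentally tested
    TEXT_MINED = "text-mined"            # extracted by NLP, awaiting review
    HIGH_THROUGHPUT = "high-throughput"  # observed in a large-scale screen
    DIRECT = "direct"                    # confirmed in a single direct experiment


@dataclass(frozen=True)
class Interaction:
    source: str
    target: str
    evidence: Evidence
    reference: str  # placeholder for the supporting citation


def select(interactions, accepted):
    """Keep only interactions whose evidence class is in `accepted`."""
    return [i for i in interactions if i.evidence in accepted]


# Placeholder records; the names and references are invented.
records = [
    Interaction("PROT_A", "PROT_B", Evidence.DIRECT, "ref-1"),
    Interaction("PROT_A", "PROT_C", Evidence.HIGH_THROUGHPUT, "ref-2"),
    Interaction("PROT_B", "PROT_D", Evidence.TEXT_MINED, "ref-3"),
    Interaction("PROT_C", "PROT_D", Evidence.PREDICTED, "ref-4"),
]
experimental = select(records, {Evidence.DIRECT, Evidence.HIGH_THROUGHPUT})
```

Keeping the label on the record, rather than discarding weaker evidence at import time, lets a curator revisit text-mined or predicted interactions once they have been reviewed.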

Descriptions of Specific Pathway Databases
Pathway databases facilitate the data mining process for cancer researchers. The major pathway databases are listed in Table 4. A collection of biological pathway and network databases is summarized in Table 5, including Pathguide: The Pathway Resource List (http://www.pathguide.org) 58. This website is updated regularly, and about 224 biological pathway resources are currently accessible through it. Here, we focus on a subset of databases that are publicly available.

KEGG
The KEGG (Kyoto Encyclopedia of Genes and Genomes) database was established in 1995 and has been one of the most popular knowledge databases to date 59. The KEGG PATHWAY database consists of manually assembled pathway maps based on inspection of the published literature. Pathway maps are grouped into metabolism, genetic information processing, environmental information processing, cellular processes, human diseases, and drug development. Most of the pathways associated with cancer are listed in the environmental information processing section, which is further subdivided into membrane transport, signal transduction, and signaling molecules and interaction. Besides human data, information from other organisms such as chimpanzee, mouse, rat, dog, cow, and pig is also available. KEGG pathway maps can be manipulated through KEGG Markup Language (KGML) files, which provide the graphical information needed to customize pathways.
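Because KGML is XML, a pathway map can be read and customized with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small hand-written fragment. The fragment imitates the KGML layout (`entry`, `graphics`, `relation`, and `subtype` elements) but is illustrative only: it is not copied from a KEGG file, the relation shown is not a claim about the curated pathway, and the helper function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, KGML-like fragment for illustration (not a real KEGG download).
KGML_SNIPPET = """\
<pathway name="path:example" org="hsa" number="00000" title="Illustrative fragment">
  <entry id="1" name="hsa:1956" type="gene">
    <graphics name="EGFR" type="rectangle" x="100" y="50" width="46" height="17"
              fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <entry id="2" name="hsa:2885" type="gene">
    <graphics name="GRB2" type="rectangle" x="200" y="50" width="46" height="17"
              fgcolor="#000000" bgcolor="#BFFFBF"/>
  </entry>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>
"""


def recolor(root, colors):
    """Set the background color of every entry listed in `colors`.

    colors: dict mapping an entry name (e.g. "hsa:1956") to a hex color.
    """
    for entry in root.iter("entry"):
        graphics = entry.find("graphics")
        if graphics is not None and entry.get("name") in colors:
            graphics.set("bgcolor", colors[entry.get("name")])


def relations(root):
    """List (source label, target label, relation type) for each relation."""
    labels = {e.get("id"): e.find("graphics").get("name")
              for e in root.iter("entry")}
    return [(labels[r.get("entry1")], labels[r.get("entry2")], r.get("type"))
            for r in root.iter("relation")]


root = ET.fromstring(KGML_SNIPPET)
recolor(root, {"hsa:1956": "#FF0000"})  # e.g. highlight an up-regulated gene
edges = relations(root)
```

In real KGML files a single entry name may list several gene identifiers separated by spaces, so a production version would need to split the `name` attribute before matching it against expression data.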

The Cancer Cell Map
The Cancer Cell Map (http://cancer.cellmap.org) is the only database that focuses on signaling pathways implicated in cancer. This resource contains ten cancer-related pathways, and each pathway has approximately 100 to 400 interactions. Interactions are manually curated and reviewed for biological validity. Extensive information is provided for each pathway, including the cellular locations of the proteins, the types of physical interactions (molecular interaction, biochemical reaction, catalysis, and transport), and post-translational protein modifications. The original citations, experimental evidence, and links to other databases are also listed. Gene expression data can also be visualized in the context of Cancer Cell Map pathways using the Cytoscape network visualization and analysis software 60.

Human Protein Reference Database
The HPRD (Human Protein Reference Database) contains ten cancer signaling pathways and ten immune signaling pathways, which are graphically visualized in GenMAPP pathway maps 54,61. The HPRD also gives investigators the flexibility to refine their search for proteins of interest by multiple criteria, including molecular class from GO, domain name, motif, site of expression, length of protein sequence, molecular weight, and disease association (e.g. ovarian cancer and breast cancer).

Reactome
Reactome is a publicly available, peer-reviewed resource of human biological pathways 62.

Visualization tools
Cross-talk between pathways can complicate the graphical representation of observed biological interactions. Therefore, visualization tools such as Cytoscape 60 and GenMAPP 61 have been developed to illustrate molecular interactions intuitively.

Cytoscape
Cytoscape is a software tool for the integration of pathways with expression profiles. It allows networks to be queried with several filtering tools and linked to public databases for functional annotation 60. An important feature of Cytoscape is its extensible software framework, which allows users to implement new algorithms and network computations. In addition to its use by the Cancer Cell Map (described above), Cytoscape can also be used in conjunction with other protein interaction databases or genetic interaction databases 63,64. Molecular species are represented as nodes and intermolecular interactions are represented as edges. Different visual properties such as node color, shape, and size can be chosen, and subsets of nodes and edges can be displayed based on criteria selected by the user. Visualization properties and analysis parameters are customizable.
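The node-and-edge model and the attribute-based filtering described above can be illustrated in plain Python. This sketch does not use the Cytoscape API; the attribute name `log2_fold_change`, the threshold of 1.0, the color choices, and the network itself are all assumptions made for the example.

```python
# Hypothetical network: node attributes and typed edges (placeholders, not curated data).
nodes = {
    "PROT_A": {"log2_fold_change": 2.3},
    "PROT_B": {"log2_fold_change": -1.8},
    "PROT_C": {"log2_fold_change": 0.1},
}
edges = [
    ("PROT_A", "PROT_B", "binding"),
    ("PROT_B", "PROT_C", "phosphorylation"),
]


def node_color(log2_fc, threshold=1.0):
    """Map an expression change to a display color (a visual property)."""
    if log2_fc >= threshold:
        return "red"    # up-regulated
    if log2_fc <= -threshold:
        return "blue"   # down-regulated
    return "grey"       # unchanged


def subnetwork(nodes, edges, keep):
    """Return the nodes passing `keep` and the edges whose two ends both pass."""
    kept = {name: attrs for name, attrs in nodes.items() if keep(attrs)}
    kept_edges = [(s, t, kind) for s, t, kind in edges if s in kept and t in kept]
    return kept, kept_edges


colors = {name: node_color(attrs["log2_fold_change"]) for name, attrs in nodes.items()}
changed, changed_edges = subnetwork(
    nodes, edges, keep=lambda attrs: abs(attrs["log2_fold_change"]) >= 1.0
)
```

Here the filter retains PROT_A and PROT_B together with the single edge between them, while PROT_C and its edge are hidden because its expression change falls below the threshold.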

GenMAPP
GenMAPP (Gene Map Annotator and Pathway Profiler; previously called Gene MicroArray Pathway Profiler) is a computer program designed to display gene expression data in the context of biological pathways 61. Based on the quantitative data that are loaded, the program maps genes onto relevant pathways, and the user can set up criteria to color-code the genes accordingly. GenMAPP visualizes data in a file format called "MAPPs", which allows users to organize genes by their functional component. The user can download specific pathways or draw from the archive of MAPPs at www.netpath.org. The MAPPs database is manually curated, with interactions derived from textbooks, review articles, and public pathway databases. GenMAPP also lets users construct and modify pathways, which is not possible when analyzing pre-existing pathway databases such as EcoCyc, MetaCyc, and KEGG. Gene identifiers (IDs) from GenBank, SWISS-PROT, Gene Ontology, or other known databases link each gene object on a MAPP to public databases such as SWISS-PROT or Entrez Gene, which the user reaches by selecting the gene of interest. In addition, GenMAPP displays gene expression levels and provides statistical analysis based on the representation of altered genes in a given pathway MAPP.

Software Tools to Analyze HTP Data
GoMiner 8,9, MAPPFinder 10, and EASE 65 are software tools developed to correlate gene expression changes with GO terms in order to categorize the biological processes, cellular components, or molecular functions that are statistically affected. However, visualization of the pathway networks remains challenging. Many software tools have been developed for microarray researchers to analyze large-scale high-throughput data within the context of biological pathways, including the above-mentioned Cytoscape and GenMAPP. Some of the most commonly used software tools are listed in Table 6.
Here, we describe some of the freely available software tools that provide graphical representations of gene networks.

Pathway Processor
Pathway Processor is designed to visualize whole-genome microarray data in the framework of metabolic networks and provides a statistical measure of the reliability of each differentially expressed gene 66. This program displays data based on information from the KEGG pathway database. Pathway Processor is implemented as two programs: Pathway Analyzer and Expression Mapper. Pathway Analyzer is responsible for the statistical analysis of pathway significance, while Expression Mapper facilitates the visualization of these data on KEGG pathway maps.

Whole Pathway Scope
Whole Pathway Scope (WPS) is a software tool to analyze high-throughput microarray experiments by referencing pathway or gene information from KEGG, BioCarta, and Gene Ontology 67. The internal database also includes information from the Genetic Association Database and MedGene Database to allow users to rapidly identify disease-associated genes and highlight them inside their network diagram or select them for further network manipulation. One of the key features is the ability to view multiple experiments simultaneously and color-code the expression value with its p-value.
In addition, this software allows users to customize their own metabolic pathway and gene groupings with the option of using statistical analysis.

Pathway Explorer
Pathway Explorer is a web-based service available at https://pathwayexplorer.genome.tugraz.at to map expression profiles of genes onto pathway maps extracted from KEGG, BioCarta, and GenMAPP 68. This web-based service reduces the local
Figure 5. Genome-wide integrative analysis to identify pathways disrupted in cancer (multidimensional profiling of a cancer genome). Genome-wide analyses including copy number profiling, epigenetic profiling, and transcription profiling performed on the same cancer sample could narrow down the number of candidate genes, which would in turn help to pinpoint the disrupted pathways involved in cancer.

Future Considerations
The development of various computational tools to interrogate biological databases is accelerating the interpretation of high-throughput genomic studies. However, these new tools pose new challenges, and one must be cautious about the limitations and errors associated with various databases. One reported example involves Enzyme Commission (EC) numbers, four-part codes that annotate enzymatic activities. A partial EC number lacks the fourth digit. When a gene has been assigned only a partial EC number, several pathway databases have inaccurately associated that gene with the whole set of reactions that share the same partial EC number under each orthology group 69. Pathway database users should be aware of the possible inherent problems associated with any database due to the variable quality of the published data. Comprehensive examination of the literature, as well as additional experimental validation, should be used to confirm any findings. Cross-platform integrative analysis of genomics, epigenomics, and transcriptional profiling will offer a deeper understanding of the biological complexity underlying disease processes (Fig. 5) 70. The current challenge is to incorporate these data together for direct comparison, visualization, and analysis in order to identify salient gene candidates 71. Once this is accomplished, the next step will be to place these candidates in the context of their proper signaling pathways for a given cancer type. Ultimately, the software programs used to do this should be intuitive to use, provide accurate information, allow customizable analyses, and offer sophisticated statistical tools. All of these features will be essential for the characterization of disrupted gene networks in cancer. This will set the stage for rational therapeutic selection based on the underlying genetic realities of a specific tumor 38,41.
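The partial EC number pitfall described above can be made concrete with a short sketch. It assumes that a partial EC number is written with "-" in the missing position and that a database matches it as a wildcard; the matching rule, the function name, and the reaction identifiers (R_a to R_d) are hypothetical, and only the EC numbers themselves are real.

```python
def matching_reactions(ec_number, reactions_by_ec):
    """Return the reactions whose EC number is compatible with `ec_number`.

    A "-" in any position of `ec_number` is treated as a wildcard, which is
    the behavior that leads to over-assignment for partial EC numbers.
    """
    query = ec_number.split(".")
    hits = []
    for ec, reactions in reactions_by_ec.items():
        # Every non-wildcard position of the query must match exactly.
        if all(q == "-" or q == part for q, part in zip(query, ec.split("."))):
            hits.extend(reactions)
    return sorted(hits)


# Placeholder reaction identifiers keyed by complete EC number.
reactions_by_ec = {
    "1.1.1.1": ["R_a"],
    "1.1.1.2": ["R_b"],
    "1.1.1.27": ["R_c"],
    "2.7.1.1": ["R_d"],
}

complete = matching_reactions("1.1.1.27", reactions_by_ec)
partial = matching_reactions("1.1.1.-", reactions_by_ec)
```

A gene annotated with the complete number 1.1.1.27 maps to one reaction, whereas the same gene annotated only as 1.1.1.- is linked to three reactions, at least two of which it may not catalyze. Flagging such matches as ambiguous, rather than accepting them silently, is one way a database could expose the uncertainty to its users.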