Evidence of highly regulated genes (in-Hubs) in gene networks of Saccharomyces cerevisiae.

Uncovering interactions between genes, gene networks, is important to increase our understanding of intrinsic cellular processes and responses to external stimuli such as drugs. Gene networks can be computationally inferred from repeated measurements of gene expression, using algorithms, which assume that each gene is controlled by only a small number of other proteins. Here, by extending the transcription network with cofactors (defined from protein-protein binding data) as active regulators, we identified the effective gene network, providing evidence of in-hubs in the gene regulatory networks of yeast. Then, using the notion that in-hub genes will be differentially expressed over several experimental conditions, we designed an algorithm, the HubDetector, enabling identification of in-hubs directly from gene expression data. Applying the HubDetector to 488 genome-wide expression profiles from two independent datasets, we identified putative in-hubs overlapping significantly with in-hubs in the effective gene network.


Introduction
In recent years, several shared features of biological networks, such as metabolic, protein-protein, and genetic networks, have been discovered (Barabási and Oltvai, 2004). Although most nodes (metabolites, proteins or genes) in these networks have few connections, some-referred to as " hubs"-have large numbers of connections. The discovery of hubs has led to superior immunization strategies (Pastor-Satorras and Vespignani, 2002) and improved effi ciency in identifying protein interaction networks (Lappe and Holm, 2004). In addition the biological importance of protein hubs have been demonstrated by Jeong et al. (Jeong et al. 2001), where a correlation between high connectivity and lethality in yeast was found. Moreover, analysis of gene expression data and TF-DNA binding data has revealed that some TFs serve as out-hubs, thus regulating several other genes (Lee et al. 2002;Luscombe et al. 2004;Bergmann et al. 2004).
In contrast, the existence of in-hubs, genes having several incoming regulatory connections, has not yet been determined because of possible structural constraints limiting the number of TF binding sites in gene promoter regions. Analysis of TF-DNA binding data has shown a narrow in-degree distribution in the transcription regulation network (Lee et al. 2002).
Another limitation which also motivates our analysis is that since our current knowledge as represented by these data sources, is most likely incomplete, it would therefore be advantageous to fi nd novel in-hubs directly from gene expression data. However, it has not yet been possible to identify in-hubs directly from gene expression data by using network reconstruction algorithms because these algorithms suffer from a complexity problem (Gardner and Faith, 2005;Tegnér et al. 2003;Tegnér and Björkegren, 2007;Bansal et al. 2007;Han et al. 2004). As an illustration, consider the problem of identifying all incoming connections 3 to a given gene. This corresponds to choosing k predictive genes out of N possible, which can be done in N k ( ) ways. For example, there are ∼10 13 different possibilities for how 10 incoming connections can be wired into a single gene in a 100 gene network. To come to terms with this complexity problem, network reconstruction algorithms as a rule limit the maximal in-degree (k max ) for each gene to a small number (i.e. k max Ͻ 5 or less). Thus, current network identifi cation algorithms do not have the power to detect in-hubs from gene expression data. It is therefore desirable to develop algorithms that can directly detect in-hubs from gene expression data and thereby increase the power of current network identifi cation algorithms.
The paper is organized as follows. In the fi rst part of we assess the existence of in-hubs in the yeast network by integrating transcription and protein-protein binding data into what we refer to as the effective gene network. Then, in the second part we develop an algorithm-the HubDetectorwhich can identify in-hubs as defined by the effective gene network directly from gene expression data. The HubDetector is evaluated in silico and validated using two different gene expression datasets from yeast.

Identifi cation of an effective gene to gene network by data integration
In a gene network the genes are nodes and directed edges represent regulatory interactions between regulators and a target genes. This is in contrast to edges in the transcription factor network since the edges in a gene regulatory network often involve multiple complex interactions involving indirect pathways of protein and metabolites (Brazhnik et al. 2002;Blais and Dynlacht, 2005). One example of protein-protein interactions mediating signals from a regulator via a TF to a target gene is found in the galactose utilization pathway, in which the TF Gal4p is modulated by Gal3p and Gal80p (Blais and Dynlacht, 2005;Ideker et al. 2001). This principle is illustrated by analysis of the MET3 network (Fig. 1a). When only the strict TF network was considered, no gene had more than 16 incoming edges (Fig. 1b). However, when proteins binding to a TF were considered as active cofactors some genes had 40 or more incoming edges. Importantly, most genes had few edges regardless of whether the strict TF network or protein-TF interactions were considered. To further justify that it makes biological sense to include proteins binding to TFs we assessed the fraction of genes annotated to the gene ontology category "transcription regulation activity". Table 1 shows cofactor proteins are known to be involved in transcription regulation in 26.7% of the cases. This is less than the well characterized set of TFs (68.7%) but signifi cantly higher than for all genes (7.1%; P Ͻ 10 -10 ).
To calculate the number of incoming edges to a given gene in the yeast network, we collapsed the protein-protein and transcription networks into what we refer to as the effective gene network ( Fig. 2 and Methods). By defi ning the effective gene network in yeast from these data sources, we could calculate the average number of connections between all yeast verified ORFs (Cherry et al. 1998), the 300 most-regulated genes (in-hubs), and 237 most regulating genes (out-hubs) ( Table 1). In-hubs had an average of 24.4 incoming edges, whereas out-hubs had only slightly more incoming edges than the average for all genes (6.0 versus 4.2). Moreover, 46.4% of the out-hubs, but only 12.6% of the in-hubs, were involved in transcription regulation. Thus, by integrating currently available data of TF and protein (A) A schematic illustration of regulation of Met3, a gene involved in sulphate assimilation. Shown are its fi ve TFs (Met32, Cbf1, and Gcr2, Ume6 and Tye7) (squares) and proteins binding to these TFs (circles) (MacIsaac et al. 2006;Deane et al. 2002;Xenarios et al. 2000). Considering proteins binding to TFs as transcription co-factors result in 23 proteins regulating MET3 expression. (B) The number of incoming regulatory edges for S. cerevisiae genes. and the number of incoming regulatory edges, considering the TF network alone (squares) and with proteins binding to TF infl uencing TF activity (circles). binding, we fi nd evidence for the existence of in-hubs in the regulatory network of yeast. Moreover, outhubs (many of which are TFs) appeared to be mostly distinct from in-hubs.

Design and evaluation of the HubDetector algorithm in silico
In-hubs have been diffi cult to detect directly from gene expression data. There are several explanations for this. Current approaches for identifying networks from gene expression datasets cannot detect in-hubs, since the combinatorial complexity is too great when there are several in-coming edges (Gardner and Faith, 2005;Tegnér et al. 2003;Tegnér and Björkegren, 2007;Bansal et al. 2007;Han et al. 2004). Moreover, our current knowledge of the yeast network is based on available experimental datasets and is therefore incomplete. Moreover, different cellular conditions activate different subsets of in-hubs, leaving some inactive and undetectable (Luscombe et al. 2004). To identify active in-hubs under different cellular conditions directly from gene expression data, without prior knowledge of the architecture of the gene network, we designed an algorithm referred to as HubDetector. The HubDetector algorithm is based on the idea that perturbations, (genetic or changes to the cellular environment) more often affect the transcription of in-hubs than other genes in the network, due to the larger number of incoming edges (Fig. 3).
Using random in-silico networks to generate simulated gene expression data from a well defi ned network structure allowed systematic evaluation of the algorithm. The HubDetector was tested on   (Fig. 4). The HubDetector performance converges after around 300 experiments depending on network structure and noise ( Fig. 4 and Table 2). Moreover, Table 2 shows simulation results when changing network architecture, noise level and perturbation strategy. The HubDetector is robust against alterations in network architecture and noise level. We observe that the performance of the HubDetector is better for lower noise levels. A perturbation strategy which randomly perturbs multiple targets (called "environmental" in Table 2) appear to be benefi cial to both performance and convergence rate. Thus the HubDetector could be applied to datasets with single gene perturbations or environmental states.

Validation of the HubDetector using gene expression data
To further validate the HubDetector we analyzed to two publicly available whole-genome yeast expression datasets. From dataset A (Hughes et al. 2000), we used 273 of 300 gene expression profi les generated from mostly nonlethal deletions. Dataset B (Mnaimneh et al. 2004), generated more recently, consisted of 215 expression profiles from yeast cultures with titratable promoters of genes essential for cell survival. From each dataset, HubDetector generated a list of genes ranked according to their HubScores (see Supplementary Data on-line). The observed relation between the HubScore and the number of regulatory interactions of each gene in the effective network ( Fig. 5a and b) clearly demonstrates that the HubDetector identifi es in-hubs. The correlations between the HubScore and the number of regulatory interactions are 0.18 (P Ͻ 10 -32 ) and 0.11 (P Ͻ 10 -13 ) for expression dataset A and B respectively. The observed correlations were found to be robust against errors in the TF binding data. The two gene expression datasets (A and B) are based on perturbations of essential versus nonessential genes. The propagation of gene activity throughout the networks should therefore differ (Luscombe et al. 2004). The HubDetector was expected to predict two subsets of in-hubs from each dataset. Surprisingly the genes that had a high HubScore in dataset A or B (upper 20th percentile or 903 genes in each dataset) overlapped by a 48.3% (Fig. 5c).

Discussion
In this study we considered proteins binding to transcription factors acting as transcription co-factors and thus regulating the target gene of that TF. The function of proteins binding to TFs acting as functional cofactors were supported by Gene ontology analysis revealing an over-representation of gene ontology function "Transcription regulator activity". Our analysis uncovered a signifi cantly broader in-degree distribution of the target genes, Figure 3. Design of the HubDetector algorithm. HubDetector is based on the rather simple idea that genes that are differentially expressed often in different internal/external conditions are more likely to be in-hubs. The rationale for this hypothesis is illustrated with a simple network (A) consisting of nine nodes with one out-hub (gene E) that regulates genes B, C, F, H, and I and one in-hub (gene B) that is regulated by genes A, C, D, and E. In this network, the HubDetector should detect gene B if gene expression measurements are obtained after other genes in the network are perturbed. Arrows indicate regulatory interactions. The table (B) shows which gene will be differentially expressed (D.E.) given different perturbation targets in the network. Gene B (the in-hub) is differentially expressed most frequently across all targets. To enable the prediction of in-hubs directly from gene expression data we introduced an algorithm, the HubDetector. Validation of the HubDetector using two distinct gene expression datasets independently showed statistically signifi cant correlations between the HubScore and the number of incoming edges in the effective gene network. It should be noted that the genes that the HubDetector identifi es are the genes changing expression most often between different cellular states these, differ from the genes with the highest variance as studied in Bilu and Barkai (2005).
Interestingly, several of the identifi ed in-hubs from the expression data were identical to in-hubs recovered from the integration of protein-protein interactions with TF-DNA binding data. It should be noted the HubDetector also identifi es several putative and novel in-hubs which do not have many incoming edges in the effective network which was constructed in the fi rst part of this work. Rung et al. (2002) proposed a disruption network reconstruction algorithm based on the idea that whenever the expression level of a gene is changed by the deletion of another gene a putative interaction has been identifi ed. Our study reveals that it is possible to interpret some hubs in a disruption network as putative in-hubs. However, in contrast to the disruption network reconstruction, the HubDetector is applicable not only to specifi c perturbations like gene deletions but also to nonspecific perturbations like distinct cellular or environmental states.
Using the HubDetector does not require a complete identifi cation of all interactions within the network. Thus, the HubDetector bypasses the limitations of current high-resolution network identifi cation algorithms since the N k ( ) problem is avoided. It should also be noted that current modular analysis techniques (e.g. Segal et al. (2005)) do not detect in-hubs. An interesting future application of the HubDetector is to use it prior to network reconstruction in order to obtain an estimate of the in-degree distribution in a biological network and predict genes for which a higher k -value (in-degree) is required in the network reconstruction algorithm. Thus, the stage is set to systematically map in-hubs by applying HubDetector to the increasing number of publicly available whole-genome expression datasets.
Out-hubs (many of which are TFs) are important regulators of cellular activities and therefore are considered to be good drug targets (Butt and Karathanasis, 1995). In-hubs are instead highly regulated, maybe acting as sensors of intra-and extra-cellular environments, such as altered growth conditions or unfavorably states (i.e. diseases), and providing this information to the appropriate out-hubs. Thus, in-hubs merit further attention in future studies.   The number of regulatory interactions as calculated from the effective gene networks, shown as a function of the HubScore generated by the HubDetector. HubDetector was applied to two datasets: (A) 273 gene expression profi les of 300 generated from mostly nonlethal gene deletions (Hughes et al. 2000) and (B) 215 expression profi les generated from yeast cultures with titratable promoters of genes essential for cell survival (Mnaimneh et al. 2004). The effective gene network in-hubs were identifi ed as described in Figure 2 and Methods. (C) Venn diagram illustrating the intersections of gene-sets predicted to be in-hubs and the gene-set with most regulators in the effective gene network. The two circles above represent the genes with highest HubScore (top 20%) in dataset A and B respectively. The circle below represent the genes with most in-edges (top 20% ) in the effective gene network.

Gene identifi ers
The analysis in this study is restricted to verifi ed ORFs, symbols and names for 4517 ORFs where downloaded from from Saccharomyces Genome Database (Cherry et al. 1998) in February 2007.

Construction of the effective gene network in yeast
Transcription regulation interaction data were downloaded from http://fraenkel.mit.edu/improved map/orfs by factor.tar.gz in January 2007 (MacIsaac et al. 2006). In our analysis we are using the 4900 edges between 2022 genes described in the file orfs by factor p 0.005 cons2.txt. Protein-protein binding data were collected from the Database of Interacting Proteins (DIP) in March 2007(Xenarios et al. 2000, only including the high-confidence interactions (Deane et al. 2002) resulted in a Protein-Protein interaction (PPI) network involving 5562 interactions between 2310 proteins. We used the edges in the transcription network to defi ne a fi rst set edges in the effective gene network. Then, for a given TF node in the transcription network, we identifi ed all the proteins that bind to that TF. For this TF, a second set of edges are added from each binding protein to each target gene which is regulated by the TF via a primary edge. By repeating this procedure for all TFs an effective gene network was constructed. The effective gene network, constructed from the transcription network and protein-protein datasets referred to above, has 18 949 regulatory edges and 2085 genes connected by at least one edge.

In-silico gene network architecture and computational model
In agreement with previous experimental studies of biological networks, we used a wide out-degree distribution of the in-silico networks ( Barabási and Oltvai, 2004). Two types in-degree distributions where analyzed, power law and gaussian. The construction of networks with power law degree distribution follows the procedure as described by (Wagner, 2002), using τ = 1.8, γ = 1000, and the maximal in-degree and out-degree was 400. Networks with a gaussian in-degree distribution are drawn from the non-nega-tive part of N (-50,30). The network was constructed by randomly connecting nodes approximating the above in-degree and out-degree distribution. Genes are connected with weighted directed edges. The weights representing regulatory strengths are sampled from two normal distributions N [10,3] and N [-10, 3] with equal probability. The diagonal of the connectivity matrix is subtracted by a decay term representing degradation of mRNA. The degradation terms are 10% larger than the largest eigenvalue for the connectivity matrix, thereby ensuring stability of the fi nal system. The gene regulatory dynamics follows a system of linear differential equations: where x(t) is expression vector at time t and A is the connectivity matrix with a i,j as the regulatory strength of gene j on gene i (a ij = 0 if j does not regulate i).

Generation of in-silico gene expression data
Without any external force on the system (P = 0), we obtain the trivial solution (x = 0), which corresponds to the wild-type steady state. To design an algorithm for detecting hubs from experimental perturbation data, we simulate perturbations by changing the stimuli (P) on the right-hand side of the equation. The euclidean length of the perturbation is always 1.0. For "Single gene" perturbations only one element (P i ) is changed while "environmental" perturbations refer to a procedure where a random number of genes are perturbed simultaneously. (Drawn from the non negative part of the normal distribution N[5.0, 2.0] for each experimental profi le.) After perturbation the new steady-state solution is found by solving the linear equations.
White noise is added to the expression values before differentially expressed genes are identifi ed. The noise is normally distributed (around 0) with standard deviation σ noise . Thus the noise level can be varied using different values of σ noise . Knowledge of the noise level enables us to compute a cutoff which will result in a predetermined false positive rate (0.01). Which results in a specifi city in the range 0.6-0.9 for cases presented in this article, however this depends on the actual number of affected genes and σ noise . All simulation results in this article are based on 20 repetitions with different instantiations of the gene network. Table S1. Twenty-seven expression profi les removed from the original expression dataset of Hughes et al. (2000). For detailed description of all 300 expression profi les see on-line material for Hughes et al. 2000. Two experiments perturb the same gene. Only one experiment is kept.
Evidence of Highly Regulated Genes (in-Hubs) in Gene Networks of Saccharomyces Cerevisiae Jesper Lundström, Johan Björkegren and Jesper Tegnér