Prediction of functional class of proteins and peptides irrespective of sequence homology by support vector machines.

Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, the support vector machine (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence derived properties independent of sequence similarity, and has shown promising potential for a wide spectrum of protein and peptide classes, including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described, and the reported prediction performances in the application of these methods are also presented.


Introduction
Functional clues contained in the amino acid sequence of proteins and peptides (Eisenberg et al. 2000; Bock and Gough, 2001; Lo et al. 2005) have been extensively explored for computer prediction of protein function and functional peptides. Sequence similarity (Baxevanis, 1998; Bork and Koonin, 1998; Schuler, 1998), motifs (Hodges and Tsai, 2002), clustering (Enright and Ouzounis, 2000; Enright et al. 2002; Fujiwara and Asogawa, 2002), and evolutionary relationships (Eisen, 1998; Benner et al. 2000) are typical examples of highly successful methods for facilitating functional prediction of proteins and peptides, all of which are primarily based on some form of sequence similarity or clustering. However, these methods tend to become less effective in the absence of sufficiently clear sequence similarities (Eisen, 1998; Rost, 2002; Whisstock and Lesk, 2003). In a comprehensive evaluation of sequence alignment methods against 15,208 enzymes labeled with an International Enzyme Commission (EC) class index, it was found that approximately 60% of the EC classes containing two or more enzymes could not be perfectly discriminated by sequence similarity at any threshold (Shah and Hunter, 1997). Low- and non-homologous proteins of unknown function constitute a substantial percentage, up to 20%~100%, of the open reading frames (ORFs) in many of the currently completed genomes. Therefore, it is desirable to explore other methods that are less dependent on, or independent of, sequence or structural similarity (Smith and Zhang, 1997; Eisenberg et al. 2000).
This article reviews the strategies, performances, current progress and difficulties in applying SVM for predicting various functional classes and interaction profiles of proteins and peptides. Algorithms for representing proteins and peptides by using amino acid sequence derived structural and physicochemical descriptors (Bock and Gough, 2001; Karchin et al. 2002) are also discussed, as are web servers for facilitating the computation of these descriptors and for predicting the functional classes of proteins and peptides by the SVM method.

Functional Classes of Proteins and Peptides
Apart from sequence and structural classes, proteins have been classified into functional classes. Active sites of the members of each class share common structural and physicochemical properties to support the common functionality, which can be explored for predicting the function of proteins from amino acid sequence derived structural and physicochemical descriptors independent of sequence homology. One example is enzyme families. Enzymes represent the largest and most diverse group of all proteins, catalyzing chemical reactions in the metabolism of all organisms. Based on their catalyzed chemical reactions, enzymes can be divided into three levels of functional classes. The first level is composed of 6 super-families (EC1 oxidoreductases, EC2 transferases, EC3 hydrolases, EC4 lyases, EC5 isomerases, and EC6 ligases), the second level contains 63 families (such as EC3.4 hydrolases acting on peptide bonds and EC4.1 carbon-carbon lyases), and the third level contains 254 subfamilies (such as EC2.7.1 phosphotransferases with an alcohol group as acceptor). Active sites of enzymes are inherently reactive environments packed with specific types of amino acid residues and cofactors, and these and other structural features facilitate binding and catalysis of specific types of substrates.
Another example is DNA-binding proteins, which play critical roles in regulating such genetic activities as gene transcription, DNA replication, DNA packaging, and DNA repair (Lewin, 2000). Prediction of DNA-binding proteins is important for studying proteins involved in genetic regulation (Aguilar et al. 2002; Stawiski et al. 2003; Sarai and Kono, 2005). DNA recognition by proteins is primarily mediated by a combination of such structural and physicochemical features as specific DNA binding domains (Bewley et al. 1998; Garvie and Wolberger, 2001), helix structures (Garvie and Wolberger, 2001), minor groove binding architectures (Bewley et al. 1998), asymmetric phosphate charge neutralization (Bewley et al. 1998), conserved amino acids (Luscombe and Thornton, 2002), hydrogen bonds (Luscombe et al. 2001), water-mediated bonds (Fujii et al. 2000; Luscombe et al. 2001), and indirect recognition mechanisms (Steffen et al. 2002). DNA-binding proteins can be further divided into 9 major functional classes plus several smaller ones (such as covalent protein-DNA linkage proteins and terminal addition proteins). The 9 major classes are DNA condensation (for wrapping of DNA around histones), DNA integration (mediating the insertion of duplex DNA into a chromosome), DNA recombination (for cleaving and rejoining DNA), DNA repair, DNA replication, DNA-directed DNA polymerase (catalyzing DNA synthesis by adding deoxyribonucleotide units to a DNA chain using DNA as a template), DNA-directed RNA polymerase (catalyzing RNA synthesis by adding ribonucleotide units to an RNA chain using DNA as a template), repressor (interfering with transcription by binding to specific sites on DNA), and transcription factor.
The third example is transporter families. Transporters play key roles in transporting cellular molecules across cell and cellular compartment boundaries, mediating the absorption and removal of various molecules, and regulating the concentration of metabolites and ionic species (Hediger, 1994; Seal and Amara, 1999; Borst and Elferink, 2002). Specific transporters have been explored as therapeutic targets (Dutta et al. 2003; Joet et al. 2003; Birch et al. 2004) and a variety of transporters are responsible for the absorption, distribution and excretion of drugs (Kunta and Sinko, 2004; Lee and Kim, 2004). Thus functional assignment of transporters is important for facilitating drug discovery and research in genomics, cellular processes and diseases. There are active and passive transporters. Active transporters couple solute transport to the input of energy and can be divided into two classes: ion-coupled and ATP-dependent transporters. Ion-coupled transporters link uphill solute transport to downhill electrochemical ion gradients. ATP-dependent transporters are directly energized by the hydrolysis of ATP and transport a heterogeneous set of substrates. Passive transporters include facilitated transporters and channels, which allow the diffusion of solutes across membranes. These transporters evolve from common themes into families of different architectures (Hediger, 1994; Driessen et al. 2000; Saier, 2000). Transporters are divided into TC families based on their mode of transport, energy coupling mechanism, molecular phylogeny and substrate specificity (Saier, 2000). TC families are classified at four levels (TC class, TC sub-class, TC family, and TC sub-family) as indicated by a specific TC number I.X.J.K.L.
Here I = 1, …, 9 represents each of the 9 TC classes, X = A, B, C, D, E, … represents each of the TC sub-classes that belong to a TC class, J = 1, … represents each of the TC families that belong to a TC sub-class, K = 1, … represents each of the TC sub-families that belong to a TC family, and L = 1, … represents individual transporters under a sub-family.
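As a minimal illustration of this numbering scheme, the levels of a TC number can be pulled apart programmatically. This is a sketch only; the example TC number and the function name are hypothetical and chosen to show the I.X.J.K.L format, not drawn from the TC database.

```python
# Sketch: decompose a TC number of the form I.X.J.K.L into its four
# classification levels plus the individual-transporter index.

def parse_tc_number(tc: str) -> dict:
    """Split a TC number such as '2.A.1.1.1' into its levels."""
    parts = tc.split(".")
    if len(parts) != 5:
        raise ValueError("expected five dot-separated fields: I.X.J.K.L")
    i, x, j, k, l = parts
    return {
        "tc_class": int(i),      # I = 1, ..., 9
        "tc_subclass": x,        # X = A, B, C, ...
        "tc_family": int(j),     # J = 1, ...
        "tc_subfamily": int(k),  # K = 1, ...
        "transporter": int(l),   # L = individual transporter
    }

print(parse_tc_number("2.A.1.1.1"))
```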
The fourth example is lipid-binding proteins, which play important roles in cell signaling and membrane trafficking (Downes et al. 2005), lipid metabolism and transport (Glatz et al. 2002; Haunerland and Spener, 2004), innate immune response to bacterial infections (Bingle and Craven, 2004), and regulation of gene expression and cell growth (Bernlohr et al. 1997). Prediction of the functional roles of lipid-binding proteins is important for facilitating the study of various biological processes and the search for new therapeutic targets.
One of the intensively studied peptide classes is MHC-binding peptides (Bhasin and Raghava, 2004c). Peptide binding to MHC is critical for antigen recognition by T-cells. One of the mechanisms of immune response to foreign or self protein antigens is the activation of T-cells through the recognition by T-cell receptors of specific peptides degraded from these proteins and transported to the surface of antigen presenting cells (Abbas and Lichtman, 2005). Peptides recognized by T-cells are potential tools for diagnosis and vaccines for immunotherapy of infectious, autoimmune, and cancer diseases (Shoshan and Admon, 2004). In many respects, MHC-binding and other protein-binding peptides possess characteristics similar to those of proteins of specific functional classes in that they also share some structural and physicochemical features to facilitate the common function: binding to MHC or other proteins (Matsumura et al. 1992; Zhang et al. 1998; McFarland and Beeson, 2002).

Support Vector Machine Approach for Predicting Functional Classes of Proteins and Peptides
Support vector machines can be explored for the functional study of proteins and peptides by determining whether their amino acid sequence derived properties conform to those of known proteins and peptides of a specific functional class (Cai and Lin, 2003; Cai et al. 2004b; Cai and Doig, 2004; Han et al. 2004b; Dobson and Doig, 2005).
The advantage of this approach is that more generalized, sequence-independent characteristics can be extracted from the sequence derived structural and physicochemical properties of multiple samples that share common functional or interaction profiles, irrespective of sequence similarity. These properties can be used to derive classifiers (Bock and Gough, 2001; Bock and Gough, 2003; Cai and Lin, 2003; Han et al. 2004b; Xue et al. 2004b; Bhasin and Raghava, 2004c; Cai et al. 2004b; Cai and Doig, 2004; Dobson and Doig, 2005; Lo et al. 2005; Martin et al. 2005; Ben-Hur and Noble, 2005) for predicting other proteins and peptides that have the same functional or interaction profiles.
The task of predicting the functional class of a protein or peptide can be considered as a two-class (positive class and negative class) classification problem for separating members (positive class) and non-members (negative class) of a functional or interaction class. SVM and other well established two-class classification-based machine learning methods can then be applied to develop an artificial intelligence system that classifies a new protein or peptide into the member or non-member class; it is predicted to have the functional or interaction profile if it is classified as a member. Sequence-derived structural and physicochemical properties have frequently been used for representing proteins and peptides (Bock and Gough, 2001; Bock and Gough, 2003; Cai and Lin, 2003; Bhasin and Raghava, 2004c; Cai et al. 2004b; Cai and Doig, 2004; Han et al. 2004b; Ben-Hur and Noble, 2005; Dobson and Doig, 2005; Lo et al. 2005; Martin et al. 2005) in the development of SVM and other machine learning classification systems for predicting the functional and interaction profiles of proteins. Figure 1 illustrates the process of using SVM for training and predicting proteins or peptides that have a specific common functional or interaction profile. Proteins or peptides known to have and not to have the profile are represented by separate sets of feature vectors, which are composed of descriptors derived from the sequences of these proteins or peptides for representing their structural and physicochemical properties. These two sets of feature vectors are projected into a multi-dimensional space in which they are separated by a hyper-plane in such a way that those having the profile are on one side and those without the profile are on the other side of the hyper-plane. A new protein or peptide can be predicted to have the same profile if its feature vector is projected on the side of the hyper-plane where other proteins or peptides having the profile are located.
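The training-and-prediction workflow described above can be sketched as follows. This is a minimal illustration using scikit-learn's SVC (an assumption; the studies reviewed here used various SVM implementations), with random numbers standing in for real sequence-derived descriptors.

```python
# Sketch of the two-class SVM workflow: represent members and non-members
# of a functional class as feature vectors, fit a separating hyper-plane,
# then classify a new protein by which side its vector falls on.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dim = 100, 21                          # e.g. 21 CTD elements per property
X_pos = rng.normal(1.0, 1.0, (n, dim))    # members of the functional class
X_neg = rng.normal(-1.0, 1.0, (n, dim))   # non-members
X = np.vstack([X_pos, X_neg])
y = np.array([1] * n + [-1] * n)          # +1 = member, -1 = non-member

clf = SVC(kernel="rbf", C=1.0)            # soft-margin SVM, Gaussian kernel
clf.fit(X, y)

# A new protein is assigned to the side of the hyper-plane on which
# its feature vector is projected.
x_new = rng.normal(1.0, 1.0, (1, dim))
print(clf.predict(x_new))
```

In practice the random arrays would be replaced by descriptor vectors computed from the sequences, e.g. by PROFEAT.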

Representation of Protein and Peptide Sequences
Protein or peptide sequences have been represented by a number of amino acid sequence derived structural and physicochemical descriptors (Bock and Gough, 2001; Karchin et al. 2002). They include amino acid composition, dipeptide composition, sequence autocorrelation descriptors, sequence coupling descriptors, and the descriptors for the composition, transition and distribution of hydrophobicity, polarity, polarizability, charge, secondary structure, and normalized van der Waals volume. Web servers such as PROFEAT (http://jing.cz3.nus.edu.sg/cgi-bin/prof/prof.cgi) and ProtParam (http://www.expasy.org/tools/protparam.html) have appeared for facilitating the computation of these descriptors. CBS Prediction Servers (http://www.cbs.dtu.dk/services/) can be used for computing other sequence derived features such as cleavage sites, nuclear export signals, and subcellular localization.
Amino acid composition is the fraction of each amino acid type in a sequence, f(r) = N_r / N, where r = 1, 2, 3, …, 20, N_r is the number of amino acids of type r, and N is the sequence length. Dipeptide composition is defined as f(r,s) = N_rs / (N − 1), where r, s = 1, 2, 3, …, 20 and N_rs is the number of dipeptides composed of amino acid types r and s (Bhasin and Raghava, 2004a). Autocorrelation descriptors are defined from the distribution of amino acid properties along the sequence (Kawashima and Kanehisa, 2000). The amino acid indices used in these autocorrelation descriptors include hydrophobicity scales (Cid et al. 1992), average flexibility indices (Bhaskaran and Ponnuswammy, 1988), polarizability parameter (Charton and Charton, 1982), free energy of solution in water (Charton and Charton, 1982), residue accessible surface area in tripeptide (Chothia, 1976), residue volume (Bigelow, 1967), steric parameter (Charton, 1981), and relative mutability (Dayhoff and Calderone, 1978). Each of these indices is centralized and normalized before the calculation. The frequently used autocorrelation descriptors include Moreau-Broto autocorrelation descriptors, normalized Moreau-Broto autocorrelation descriptors and Geary autocorrelation descriptors.
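The two simplest descriptor sets above can be sketched directly from their definitions, f(r) = N_r / N and f(r,s) = N_rs / (N − 1); the toy sequence below is illustrative, not from any dataset in this review.

```python
# Sketch: amino acid composition and dipeptide composition descriptors.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def aa_composition(seq: str) -> dict:
    """f(r) = N_r / N for each of the 20 amino acid types."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}

def dipeptide_composition(seq: str) -> dict:
    """f(r,s) = N_rs / (N - 1) for each of the 400 dipeptide types."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total
            for a in AMINO_ACIDS for b in AMINO_ACIDS}

seq = "MKVLAAGLLK"          # toy 10-residue sequence
comp = aa_composition(seq)
print(round(comp["L"], 2))  # 3 occurrences of L in 10 residues -> 0.3
```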
The quasi-sequence-order descriptors are derived from both the Schneider-Wrede physicochemical distance matrix (Schneider and Wrede, 1994; Chou, 2000; Chou and Cai, 2004) and the Grantham chemical distance matrix (Grantham, 1974) between the 20 amino acids. Three descriptors, composition (C), transition (T) and distribution (D), are derived for each of the following physicochemical properties: hydrophobicity, polarity, polarizability, charge, secondary structure, and normalized van der Waals volume (Dubchak et al. 1995; Dubchak et al. 1999). For each property, the constituent amino acids of a protein or peptide are divided into three classes according to their attribute values, such that each amino acid is encoded by one of the indices 1, 2, 3 according to the class it belongs to. For instance, amino acids can be divided into hydrophobic (CVLIMFW), neutral (GASTPHY), and polar (RKEDQN) groups. C represents the number of amino acids of a particular property (such as hydrophobicity) divided by the total number of amino acids in a protein sequence. T characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. D measures the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located, respectively. Overall, there are 21 elements representing these three descriptors: 3 for C, 3 for T and 15 for D.
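The C, T and D descriptors for one property (hydrophobicity) can be sketched as below, using the three-group encoding given above. This is a simplified illustration under stated assumptions: published CTD implementations differ in how the distribution fractions are rounded, and the toy sequence is invented.

```python
# Sketch: composition (C), transition (T) and distribution (D) descriptors
# for the hydrophobicity property; output is the 21-element vector
# (3 C + 3 T + 15 D) described in the text.
GROUPS = {1: set("CVLIMFW"),   # hydrophobic
          2: set("GASTPHY"),   # neutral
          3: set("RKEDQN")}    # polar

def encode(seq):
    # Map each residue to its group index 1/2/3; unknown residues are dropped.
    return [g for aa in seq for g, members in GROUPS.items() if aa in members]

def ctd(seq):
    e = encode(seq)
    n = len(e)
    # C: fraction of residues in each of the three groups
    comp = [e.count(g) / n for g in (1, 2, 3)]
    # T: frequency of transitions between each pair of distinct groups
    pairs = list(zip(e, e[1:]))
    trans = [sum(1 for a, b in pairs if {a, b} == {x, y}) / (n - 1)
             for x, y in ((1, 2), (1, 3), (2, 3))]
    # D: position (as % of chain length) of the first, 25%, 50%, 75%
    # and 100% residue of each group
    dist = []
    for g in (1, 2, 3):
        idx = [i + 1 for i, v in enumerate(e) if v == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, round(frac * len(idx))) if idx else 0
            dist.append(100.0 * idx[k - 1] / n if idx else 0.0)
    return comp + trans + dist

print(len(ctd("MKVLAAGLLKRED")))  # 21 descriptor elements
```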

Algorithms and Software Tools of Support Vector Machines
Let the training data of two separate classes be represented by (x_1, y_1), (x_2, y_2), …, (x_n, y_n), where x_i ∈ R^N is a vector in an N-dimensional space representing various physicochemical and structural properties of a protein or peptide, and y_i ∈ {−1, +1} indicates the class label (+1 for members and −1 for non-members of a functional class). In linear SVM, given a weight vector w = (w_1, w_2, …, w_N)^T and a bias b, it is assumed that these two classes can be separated by two margins parallel to the separating hyper-plane, as illustrated in Figure 2(a), which can be represented as a single inequality:

    y_i (w · x_i + b) ≥ 1,  for i = 1, 2, …, n    (1)

As shown in Figure 2(b), there are a number of separating hyper-planes for an identical group of training data. The objective of SVM is to determine the optimal weight w_0 and optimal bias b_0 such that the corresponding hyper-plane separates the positive set S+ and the negative set S− with a maximum margin and gives the best prediction performance. This hyper-plane is called the Optimal Separating Hyper-plane (OSH), as illustrated in Figure 2(c).

The equation of a hyper-plane can be written as:

    w · x + b = 0    (2)

By geometry, the distance between the two corresponding margins is 2 / ||w||. Therefore, the OSH can be obtained by minimizing ||w||^2 / 2 under the inequality constraints of Eq. (1). This optimization problem can be solved efficiently by introducing Lagrange multipliers α_i ≥ 0:

    L(w, b, α) = ||w||^2 / 2 − Σ_{i=1..n} α_i [ y_i (w · x_i + b) − 1 ]    (3)

The solution to this Quadratic Programming (QP) problem requires that the gradient of L(w, b, α) with respect to w and b vanishes, resulting in the following conditions:

    w = Σ_{i=1..n} α_i y_i x_i    (4)

    Σ_{i=1..n} α_i y_i = 0    (5)

By substituting Eqs. (4) and (5) into Eq. (3), the QP problem becomes the maximization of the following expression:

    W(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j (x_i · x_j)    (6)

under the constraints

    Σ_{i=1..n} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, …, n    (7)

where C is a penalty for training errors in soft-margin SVM and is equal to infinity in hard-margin SVM. The points located on the two optimal margins have nonzero coefficients α_i among the solutions to Eq. (6), and are called Support Vectors (SV). The bias b_0 can be calculated from any support vector x_s with label y_s:

    b_0 = y_s − w_0 · x_s    (8)

After determination of the support vectors and bias, the decision function that separates the two classes can be written as:

    f(x) = sign( Σ_{i∈SV} α_i y_i (x_i · x) + b_0 )    (9)

Nonlinear SVM projects feature vectors into a high-dimensional feature space by using a kernel function K(x, y). The linear SVM procedure is then applied to the feature vectors in this feature space, and a given vector x can be classified by using

    f(x) = sign( Σ_{i∈SV} α_i y_i K(x_i, x) + b_0 )    (10)

A positive or negative value of f(x) indicates that the vector x belongs to the members or non-members of a functional class, respectively. In Eq. (10), the kernel function K(x, y) represents a legitimate inner product in the feature space:

    K(x, y) = φ(x) · φ(y)    (11)

A number of kernel functions have been used in SVM. Examples of the most popular ones are:

    Polynomial: K(x, y) = (x · y + 1)^d    (12)

    Gaussian: K(x, y) = exp( −||x − y||^2 / (2σ^2) )    (13)

A feature vector has a limited number of components, each representing a specific physicochemical, structural or biological quantity. Each quantity is normalized or scaled such that its value is finite. From a practical point of view, x · y is therefore of finite value, which avoids the value of the polynomial kernel reaching infinity.
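The two kernels and the decision function above can be sketched directly. The support vectors, multipliers and bias below are illustrative values rather than a trained model, chosen only to show how the sum over support vectors produces a class label.

```python
# Sketch: polynomial and Gaussian kernels, and the SVM decision function
# f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b0 ).
import numpy as np

def polynomial_kernel(x, y, d=2):
    return (np.dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def decision(x, support_vectors, alphas, labels, b0, kernel):
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b0)

# Toy model: one support vector per class, equal multipliers, zero bias.
sv = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
alphas = [0.5, 0.5]
labels = [1, -1]
print(decision(np.array([0.9, 1.1]), sv, alphas, labels, 0.0, gaussian_kernel))
```

A query point near the positive support vector gets a positive kernel-weighted sum and is labeled +1.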

Methods for Training, Testing and Estimating Generalization Capabilities of Support Vector Machine Classification Systems
Several validation methods have been used for training, testing, and estimating the generalization errors of an SVM model (Bhasin and Raghava, 2004a; Martin et al. 2005; Plewczynski et al. 2005; Lei and Dai, 2006) based on a "re-sampling" strategy (Weiss and Kulikowski, 1991; Shao and Tu, 1995). The commonly used validation methods include N-fold cross validation, leave-one-out, leave-v-out, jack-knifing, and bootstrapping. In N-fold cross validation, samples are randomly divided into N subsets of approximately equal size. N−1 subsets are used as a training set for developing an SVM model, and the remaining one is used as a testing set for evaluating the prediction performance of that model. This process is repeated N times such that every subset is used as a testing set once. The average accuracy of the N SVM models is used for measuring the generalization capability of the SVM method. When N equals the total number of samples, the method is called "leave-one-out", such that every sample is used for testing an SVM model trained on all of the other samples. "Leave-v-out" is a more elaborate and expensive version of the "leave something out" cross-validation that involves leaving out all possible combinations of v samples as a test set. In jack-knifing, samples are distributed and used for training and testing the SVM models in the same way as in the "leave-one-out" method, but the generalization error of the derived SVM models is estimated based on the comparison of the average accuracy of the subsets with that of all sets of these SVM models. In bootstrapping, different combinations of randomly selected subsets of samples are separately used for training SVM models, each of which is tested by using the samples not included in the respective training set.
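The two most common of these schemes, N-fold cross validation and leave-one-out, can be sketched with scikit-learn utilities (an assumed toolkit; the reviewed studies used various implementations). Synthetic descriptor vectors stand in for real protein data.

```python
# Sketch: 5-fold cross validation and leave-one-out validation of an SVM
# on synthetic two-class descriptor data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 1, (30, 10)),    # members
               rng.normal(-1, 1, (30, 10))])  # non-members
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="rbf")

# 5-fold cross validation: average accuracy over the 5 held-out folds
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# Leave-one-out: N models, each tested on the single held-out sample
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(loo.mean())
```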
Moreover, independent evaluation sets have also been used for testing the performance of SVM classification systems (Liu et al. 2005; Wang et al. 2005; Lin et al. 2006c). In using this approach, samples are divided into training, testing, and independent validation sets based on their distribution in protein or peptide descriptor space, which is defined by the commonly used structural and chemical descriptors of proteins or peptides. Samples can be clustered into groups based on their distance in the descriptor space by using such methods as hierarchical clustering (Johnson, 1967). An upper limit r on the largest separation can be used for restricting the size of each cluster. One or more representative samples are randomly selected from each group to form a training set that is sufficiently diverse and broadly distributed in the descriptor space. One or more of the remaining samples in each group are randomly selected to form the testing set. The remaining samples are used as the independent evaluation set, which shows a reasonable level of structural diversity and distinction with respect to samples of other groups.

Table 1. Web-servers for computing the functional class of proteins and peptides by using support vector machines (columns: category; web-server or software; URL). Web-sites of support vector machine software are also given.

The performance of SVM has been measured by using the positive prediction accuracy P+ for proteins that have a specific property and the negative prediction accuracy P− for proteins without that property (Bock and Gough, 2001; Bock and Gough, 2003; Cai and Lin, 2003; Bhasin and Raghava, 2004c; Cai et al. 2004b; Cai and Doig, 2004; Han et al. 2004b; Xue et al. 2004b; Dobson and Doig, 2005; Lo et al. 2005; Martin et al. 2005; Ben-Hur and Noble, 2005). Moreover, an overall accuracy P = (TP + TN)/N, where TP and TN are the numbers of true positives and true negatives respectively and N is the number of proteins or peptides, can also be used to indicate the overall prediction performance. In some cases, P, P+ and P− are insufficient to provide a complete assessment of the performance of a discriminative method (Provost et al. 1998; Baldi et al. 2000). Thus the Matthews correlation coefficient (MCC) has also been used for measuring the performance of support vector machines (Bhasin and Raghava 2004a; Bhasin and Raghava 2004b; Cai et al. 2004b; Han et al. 2004b; Huang et al. 2005; Kumar et al. 2006).
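These performance measures can be sketched from a confusion matrix; the TP/TN/FP/FN counts below are invented for illustration.

```python
# Sketch: positive prediction accuracy P+ (sensitivity), negative
# prediction accuracy P- (specificity), overall accuracy P, and the
# Matthews correlation coefficient (MCC).
import math

def performance(tp, tn, fp, fn):
    p_pos = tp / (tp + fn)                   # P+ : accuracy on members
    p_neg = tn / (tn + fp)                   # P- : accuracy on non-members
    p_all = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy P
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return p_pos, p_neg, p_all, mcc

print(performance(tp=90, tn=180, fp=20, fn=10))
```

MCC is preferred when the member and non-member sets are imbalanced, since a classifier can reach a high overall P by simply favoring the larger class.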

Assessment of the Performance of Support Vector Machine Classification Systems
Performance for predicting functional classes of proteins and peptides
Table 2 summarizes the reported performance of the use of SVM for predicting protein functional classes. The reported P+ and P− values are in the ranges of 25.0%~100.0% and 69.0%~100.0%, with the majority concentrated in the ranges of 75%~95% and 80%~99.9% respectively. Based on these reported results, SVM generally shows a certain level of capability for predicting the functional classes of proteins and protein-protein interactions. In many of these reported studies, the prediction accuracy for non-members appears to be better than that for members. The higher prediction accuracy for non-members likely results from the availability of a more diverse set of non-members than of members, which enables SVM to perform better statistical learning for the recognition of non-members.
The performance of SVM for predicting functional classes of peptides is given in Table 3. Prediction of protein-binding peptides has primarily been focused on MHC-binding peptides (Bhasin and Raghava, 2004c); the reported P+ and P− values for MHC-binding peptides are in the ranges of 75.0%~99.2% and 97.5%~99.9%, with the majority concentrated in the ranges of 93.3%~95.0% and 99.7%~99.9% respectively. These studies have demonstrated that, apart from the prediction of protein functional classes, SVM is equally useful for predicting protein-binding peptides and small molecules.

Performance for predicting functional classes of novel proteins
The performance of SVM for predicting the functional profile of novel proteins has also been evaluated in several studies listed in Table 4. These novel proteins are of two types. The first includes several groups of proteins that have no homologous counterparts in well-established protein databases, and the second contains pairs of homologous enzymes that belong to different functional families. The non-homologous nature of the first type of novel proteins complicates the task of using sequence alignment and clustering methods for determining their functions. On the other hand, the homologous nature of the second type of novel proteins may result in false association of proteins of different functional families if sequence similarity is used as the sole indicator of functional association. Therefore, it is desirable to explore other methods with less or no reliance on homology to complement sequence similarity and clustering methods (Smith and Zhang, 1997; Eisenberg et al. 2000). From Table 4, SVM appears to be capable of correctly predicting 46.3%~76.7% of the novel proteins found in the literature.
The ability of SVM to predict the functional profile of the first type of novel proteins has been attributed to the non-discriminative nature of SVM for selecting class members, and to the use of structural and physicochemical descriptors for representing proteins (Hou et al. 2004; Han et al. 2004a; Cui et al. 2005; Han et al. 2005a; Zhang et al. 2005). In some cases, protein function is determined by specific structural and chemical features at active sites, and these features are shared by distantly related as well as closely related proteins of the same functional property (Schomburg et al. 2002). Some of these function-related features might be captured by residue properties such as hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility (Bull and Breese, 1974; Lin and Timasheff, 1996), which have been incorporated in the descriptors used in the construction of the feature vectors for these proteins. The function of a protein is determined by a variety of factors. Changes such as local active-site mutation, variations in surface loops, and recruitment of additional domains may result in functional diversity among homologous proteins (Todd et al. 2001). While these changes appear to be small at the local sequence level, some aspects of these changes may also be captured by the descriptors associated with hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility.

Performance for predicting proteins with specific structural characteristics
Subgroups of proteins of specific functional classes are known to have common structural features. For instance, a number of RNA-binding proteins have a modular structure and contain RNA-binding domains of 70-150 amino acids that mediate RNA recognition (Mattaj, 1993; Perez-Canadillas and Varani, 2001). Three classes of RNA-binding domains have been documented to bind RNA in a sequence independent manner, and these domains are the RNA-recognition motif (RRM), the double-stranded RNA-binding motif (dsRM), and the K-homology (KH) domain (Perez-Canadillas and Varani, 2001). A fourth class of RNA-binding domain, the S1 RNA-binding domain, has also been found in a number of RNA-associated proteins (Bycroft et al. 1997). These domains have distinct structural features responsible for RNA recognition and binding. Thus the performance of SVM classification of functional classes of proteins can be evaluated by examining whether or not proteins containing one of these domains can be correctly classified into the respective class (Leslie et al. 2004; Kunik et al. 2005; Lin et al. 2006c).
A search of protein family and sequence databases shows that there are a total of 260, 74, 190, and 41 RNA-binding protein sequences known to contain the RRM, dsRM, KH and S1 RNA-binding domain respectively. The majority of these sequences are included in the training and testing set of all RNA-binding proteins. In the corresponding independent evaluation set, there are 35, 16, 93, and 10 sequences containing the RRM, dsRM, KH, and S1 RNA-binding domain respectively. All but one of these protein sequences are correctly classified as RNA-binding by SVM, which shows the capability of SVM. The only incorrectly predicted protein sequence is an HnRNP-E2 protein fragment in the group that contains the KH domain.
The incompleteness of this sequence might partially contribute to its incorrect prediction by SVM.
In another example, some lipid-binding proteins are known to contain lipid-binding domains or motifs (Balla, 2005). Several families of such lipid-binding proteins have been documented, examples of which are the TIM, PP-binding and GCV_H families. These families have distinct structural features responsible for lipid recognition and binding. A search of protein family and sequence databases shows that there are 227, 184, and 139 lipid-binding protein sequences known to contain the TIM, PP-binding or GCV_H domain respectively. The majority of these sequences are included in the training and testing set of all lipid-binding proteins. In the corresponding independent evaluation set, there are 81, 27, and 30 sequences containing the TIM, PP-binding or GCV_H domain respectively. Most of these protein sequences are correctly classified as lipid-binding by SVM, with only 1, 1, and 2 misclassified sequences in the TIM, PP-binding and GCV_H domain families respectively (Lin et al. 2006c). The incorrectly predicted protein sequences are a triosephosphate isomerase (fragment), a putative acyl carrier protein (mitochondrial precursor), a glycine cleavage system H protein (mitochondrial precursor, fragment), and a probable glycine cleavage system H protein 2 (mitochondrial precursor). Most of these incorrectly predicted sequences are fragments. Therefore, sequence incompleteness appears to be a factor that partially contributes to the incorrect prediction of these sequences by SVM.

Table 3. Performance of support vector machine prediction of functional classes of peptides. N+ and N− are the numbers of members and non-members in a class, P+ and P− are the reported prediction accuracies for members and non-members respectively, and P is the reported overall accuracy.
Effect of different sets of protein descriptors on the classification of functional classes of proteins

As shown in Table 2 and Table 3, different sets of protein descriptors have been used in SVM prediction of various functional classes of proteins and peptides, all of which have shown impressive predictive performance (Gao et al. 2005; Li et al. 2006). Nonetheless, there is a need to comparatively evaluate the effectiveness of these descriptor-sets in a single study and to examine whether their combined use helps to improve predictive performance. For this purpose, we tested the performance of seven popular descriptor-sets and two of their combinations in SVM prediction of six different classes of proteins. These sets are amino acid composition (set 1), dipeptide composition (Gao et al. 2005) (set 2), normalized Moreau-Broto autocorrelation (Feng and Zhang, 2000; Lin and Pan, 2001) (set 3), Moran autocorrelation (Horne, 1988) (set 4), Geary autocorrelation (Sokal and Thomson, 2006) (set 5), composition, transition, and distribution of physicochemical properties (Dubchak et al. 1995; Dubchak et al. 1999; Bock and Gough, 2001; Cai et al. 2004a; Han et al. 2004b; Lo et al. 2005; Lin et al. 2006a; Cui et al. 2007a) (set 6), sequence order (Grantham, 1974; Schneider and Wrede, 1994; Chou, 2000; Chou and Cai, 2004) (set 7), the frequently used combination of amino acid composition and dipeptide composition (Gao et al. 2005) (set 8), and the combination of the seven individual descriptor-sets (set 9). The six protein functional classes are enzyme EC 2.4 (NC-IUBMB, 1992), G protein-coupled receptors, transporter TC 8.A (Saier et al. 2006), chlorophyll proteins (Suzuki et al. 1997), proteins involved in lipid synthesis, and rRNA-binding proteins. These classes were selected because of their functional diversity and the level of difficulty in achieving high prediction performance.
The reported SVM prediction performance for these classes tends to be lower than for other classes, which makes them ideal for critically evaluating the effectiveness of different descriptor-sets. The dataset statistics and SVM performance of the nine descriptor-sets are given in Table 5, and the overall performance scores of these descriptor-sets are given in Table 6. The overall performance scores comprise four categories defined by the MCC value of an SVM model: "Exceptional", "Good", "Fair", and "Poor" when the MCC is in the range of >0.9, 0.8-0.9, 0.6-0.8, and <0.6 respectively. Overall, there is no single preferred descriptor-set for all cases. Sets 6, 8, and 9 tend to exhibit higher sensitivity, with the exception of chlorophyll proteins, while sets 1 and 7 tend to be among the lowest ranked. The combined sets 8 and 9 generally give the highest MCC values, again with the exception of chlorophyll proteins, while sets 1 and 7 tend to return the lowest MCC values. These findings are consistent with the results of a reported study suggesting that amino acid composition, polarity, solvent accessibility, and charge are, in order of prominence, more important than other properties for SVM classification of specific protein functional classes (Lin et al. 2006b). Using the entire set of descriptors (set 9) does not necessarily always give better performance, which is consistent with the finding that analysis of the contribution of individual descriptors, and selection of the relevant ones, is highly useful for improving SVM prediction performance (Glen et al. 1989; Xue et al. 1999).
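As an illustration of how the simpler descriptor-sets are computed, the sketch below derives amino acid composition (set 1), dipeptide composition (set 2), and their concatenation (set 8) from a raw sequence; the example sequence is an arbitrary stand-in, and real studies apply such descriptors to curated datasets.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Set 1: fraction of each of the 20 standard amino acids."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Set 2: fraction of each of the 400 ordered amino acid pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = len(seq) - 1
    counts = {p: 0 for p in pairs}
    for i in range(total):
        dp = seq[i:i + 2]
        if dp in counts:
            counts[dp] += 1
    return [counts[p] / total for p in pairs]

def combined_descriptor(seq):
    """Set 8: concatenation of sets 1 and 2 (a 420-dimensional vector)."""
    return aa_composition(seq) + dipeptide_composition(seq)

# Arbitrary example sequence; len(vec) == 420 (20 + 400 components).
vec = combined_descriptor("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Each component is a frequency, so the first 20 entries (set 1) and the remaining 400 entries (set 2) each sum to one for a sequence built from standard amino acids.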

Contribution of individual protein descriptors to the classifi cation of functional classes of proteins
In using SVM for predicting functional classes of proteins, several descriptors have been used to describe the physicochemical characteristics of each protein (Bock and Gough, 2001; Ding and Dubchak, 2001; Cai et al. 2002a; Cai et al. 2002b; Han et al. 2004b). It has been reported that not all descriptors contribute equally to the classification of proteins; some play relatively more prominent roles than others in specific aspects of protein classification (Ding and Dubchak, 2001). It is therefore of interest to examine which descriptors are more important in the classification of proteins. The contribution of individual descriptors to protein classification has been investigated by separately conducting classification using each feature property (Ding and Dubchak, 2001). By using the same method, one finds that, in order of prominence, polarity, hydrophobicity, amino acid composition, and solvent accessibility play more prominent roles than other feature properties in the classification of lipid-binding proteins (Lin et al. 2006c). Polarity and hydrophobicity have been shown to be important for lipid-protein interactions in that lipid-binding sites are located in hydrophobic, low-polarity environments (Lugo and Sharom, 2005). High-affinity lipid-binding sites in some proteins appear to be located at sequence segments with specific amino acid compositions (Hamilton et al. 1986), and specific sequence motifs have been used for predicting lipid-binding proteins (Gonnet and Lisacek, 2002; Eisenhaber et al. 2003; Juncker et al. 2003; Gonnet et al. 2004; Eisenhaber et al. 2004). A study of apolipophorin-III in lipid-free and phospholipid-bound states showed that lipid binding involves increased solvent accessibility due to gross tertiary structural reorganization (Raussens et al. 1996). Therefore, the selected descriptors are consistent with these experimental findings.
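The per-property analysis described above (training a separate SVM on each group of descriptors and comparing cross-validated accuracies) can be sketched as follows. The feature groups and random data here are illustrative placeholders for computed protein descriptors, not the published lipid-binding results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative stand-in data: 200 proteins, each described by named
# groups of descriptor columns (real studies use computed descriptors).
feature_groups = {
    "polarity": rng.normal(size=(200, 21)),
    "hydrophobicity": rng.normal(size=(200, 21)),
    "aa_composition": rng.normal(size=(200, 20)),
    "solvent_accessibility": rng.normal(size=(200, 21)),
}
y = rng.integers(0, 2, size=200)

# Train one SVM per feature group; the cross-validated accuracy of each
# single-group model estimates that group's individual contribution.
scores = {}
for name, X in feature_groups.items():
    clf = SVC(kernel="rbf", gamma="scale")
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()

# Rank the groups by their single-group accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

On real descriptors, the resulting ranking corresponds to the "order of prominence" reported in such studies.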

Analysis of descriptor contributions by using feature selection method
More rigorous feature selection methods (Xue et al. 2004a; Al-Shahib et al. 2005a; Al-Shahib et al. 2005b), such as recursive feature elimination (RFE) (Guyon et al. 2002), can be applied to the SVM classification of functional classes of proteins to select the descriptors most relevant to the prediction of proteins of a particular class (Guyon et al. 2002; Yu et al. 2003). The details of the implementation of this method can be found in the literature (Xue et al. 2004a; Xue et al. 2004b).
The feature selection procedure can be demonstrated by the following illustrative example of the development of an SVM classification system for predicting DNA-binding proteins. This system is trained by using a Gaussian kernel function with an adjustable parameter σ. Sequential variation of σ is conducted against the whole training set to find a value that gives the best prediction accuracy, evaluated by means of 5-fold cross-validation. In the first step, for a fixed σ, the SVM classifier is trained by using the complete set of features (protein descriptors) described in the previous section. The second step involves the computation of the ranking criterion score DJ(i) for each feature in the current set. All of the computed DJ(i) values are subsequently ranked in descending order. The third step involves the removal of the m features with the smallest criterion scores. In the fourth step, the SVM classification system is re-trained by using the remaining set of features, and the corresponding prediction accuracy is computed by means of 5-fold cross-validation. The first to fourth steps are then repeated for other values of σ. After the completion of this procedure, the set of features and the parameter σ that give the best prediction accuracy are selected. A total of 28 features were selected by RFE, which are given in Table 7. In order of prominence, the compositions of specific amino acids, Van der Waals volume, polarity, polarizability, surface tension, secondary structure, and solvent accessibility are found to be important for predicting DNA-binding proteins. Protein-DNA binding is known to involve specific recognition sequences and induced conformational changes (Cheng et al. 1993). It is therefore expected that the combined features of amino acid composition and surface tension are important for characterizing DNA-binding proteins.
DNA binding also involves the spatial arrangement or pre-arrangement of specific groups of amino acids at the binding site (Patel et al. 2006). It is thus not surprising that such properties as polarizability, hydrophobicity, polarity, and surface tension are coupled to the size of the amino acid sequence segment at a DNA-binding site. Many proteins bind DNA via minor-groove interactions between protein non-polar surfaces and DNA hydrophobic sugar clusters (Tolstorukov et al. 2004). As a result, the combined features of hydrophobicity and solvent accessibility are expected to be important for describing these proteins. The usefulness of these 28 selected features can be further tested by constructing an SVM classification system based solely on these features. The prediction accuracies of this new system are 87.2% and 92.6% for DNA-binding and non-DNA-binding proteins respectively, slightly improved over the accuracies of 85.7% and 91.2% obtained by using all features. This suggests that the use of a selected subset of features enhances prediction performance by reducing the noise created by redundant and irrelevant features.
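The four-step RFE loop described above can be sketched as follows. This is a minimal illustration: the ranking criterion DJ(i) of Guyon et al. (2002) is replaced here by a simple per-feature variance score, so the `svm_rfe` function, its parameters, and the synthetic data are stand-ins rather than the published implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def svm_rfe(X, y, m=2, sigma_grid=(0.5, 1.0, 2.0), min_features=4):
    """Sketch of the four-step RFE procedure described in the text.

    For each Gaussian-kernel width sigma, repeatedly (1) train the SVM,
    (2) score each remaining feature, (3) drop the m lowest-scoring
    features, and (4) record cross-validated accuracy; finally return
    the feature subset and sigma giving the best accuracy.  The score
    used here (per-feature variance) is a stand-in for DJ(i).
    """
    best = (0.0, None, None)  # (accuracy, sigma, feature indices)
    for sigma in sigma_grid:
        gamma = 1.0 / (2.0 * sigma ** 2)
        remaining = list(range(X.shape[1]))
        while len(remaining) >= min_features:
            clf = SVC(kernel="rbf", gamma=gamma)
            acc = cross_val_score(clf, X[:, remaining], y, cv=5).mean()
            if acc > best[0]:
                best = (acc, sigma, list(remaining))
            # Rank remaining features; drop the m lowest-scoring ones.
            order = np.argsort(X[:, remaining].var(axis=0))
            remaining = [remaining[i] for i in sorted(order[m:])]
    return best

# Synthetic check: feature 0 carries the class signal and should survive.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=120)
X = rng.normal(size=(120, 10))
X[:, 0] += 2.0 * y
acc, sigma, feats = svm_rfe(X, y)
```

The nested loop over σ and feature subsets mirrors the sequential variation of σ described in the text; a real implementation would substitute the proper DJ(i) criterion in the ranking step.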

Comparison of SVM prediction performance under different kernel functions
Apart from the Gaussian kernel function of sequence-derived physicochemical properties, several other kernel functions have been developed and applied for SVM classification of proteins and DNAs (Jaakkola et al. 1999; Zien et al. 2000; Tsuda et al. 2002; Vert et al. 2003; Vishwanathan and Smola, 2003; Leslie et al. 2003; Liao and Noble, 2003; Ratsch et al. 2005; Kuang et al. 2005). It is of interest to test the usefulness of some of these kernel functions for predicting functional classes of proteins. The string kernel has been extensively used and has shown promising potential for protein and DNA studies (Vishwanathan and Smola, 2003; Ratsch et al. 2005). This kernel function is constructed by comparing sequences of classes of proteins or DNAs and assigning individual weights to amino acids or nucleotides to describe physicochemical or other characteristics of the proteins and DNAs. This kernel function was used to develop three SVM systems for predicting the classes of lipid-degradation, lipid-metabolism, and lipid-synthesis proteins. A spectrum kernel with mismatches (Leslie et al. 2003) was used to generate the string kernel for each protein. Testing results using an independent set of proteins for each class show that the SE is 77.2%, 75.8%, and 77.8%, and the SP is 97.6%, 96.4%, and 94.2%, for each of these classes respectively (Lin et al. 2006c). Thus, comparable prediction performance can be achieved by using a string-kernel SVM, which suggests the usefulness of this and other kernel functions for SVM prediction of functional classes of proteins.
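The simplest member of this kernel family, the exact-match spectrum kernel, can be sketched in a few lines; the mismatch extension used in the study additionally matches k-mers that differ in up to m positions, which is omitted here for brevity, and the toy sequences are arbitrary.

```python
from collections import Counter

def spectrum_features(seq, k=3):
    """Count the k-mers of a sequence (its k-spectrum)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq1, seq2, k=3):
    """Exact-match spectrum kernel: inner product of the two k-mer
    count vectors.  The mismatch kernel of Leslie et al. (2003) extends
    this by also matching k-mers differing in up to m positions."""
    s1, s2 = spectrum_features(seq1, k), spectrum_features(seq2, k)
    return sum(c * s2[kmer] for kmer, c in s1.items())

# Identical sequences share all four of their 3-mers...
print(spectrum_kernel("ACDEFG", "ACDEFG", k=3))  # → 4
# ...while unrelated sequences share none.
print(spectrum_kernel("ACDEFG", "KLMNPQ", k=3))  # → 0
```

A kernel matrix of such pairwise values over a training set can be passed directly to an SVM with a precomputed kernel, which is how sequence-based kernels of this kind are typically deployed.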

Comparison of SVM prediction performance with other machine learning methods
Several other machine learning (ML) methods have been explored for predicting the functional classes of proteins and peptides. These methods include artificial neural networks (ANN), k-nearest neighbors (KNN), decision trees, and hidden Markov models (HMM). They have been used for predicting enzymes (Jensen et al. 2002), receptors, transporters, structural proteins, mitochondrial proteins (Kumar et al. 2006), cell cycle regulated proteins (de Lichtenberg et al. 2003), growth factors, and allergen proteins (Zorzet et al. 2002; Soeria-Atmadja et al. 2004). The reported P+ and P- values of these ML methods are in the ranges of 37.8%~87% and 66.0%~99.9%, with the majority concentrated in the ranges of 60%~85% and 70%~90% respectively. These values are slightly lower than the corresponding SVM values of 75%~95% and 80%~99.9%, suggesting that other ML methods are also useful for predicting the functional classes of proteins and peptides.
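A side-by-side comparison of this kind can be sketched with standard library implementations; the random dataset below is an illustrative stand-in for protein descriptor vectors, so the accuracies it produces are not the reported figures.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in data: 200 samples, 20 descriptor dimensions,
# with a class-dependent shift so the task is learnable.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 20)) + y[:, None] * 0.8

methods = {
    "SVM": SVC(kernel="rbf", gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Five-fold cross-validated accuracy of each method on the same data.
accs = {name: cross_val_score(clf, X, y, cv=5).mean()
        for name, clf in methods.items()}
for name, acc in accs.items():
    print(f"{name}: {acc:.3f}")
```

Evaluating all methods on identical folds of the same dataset, as here, is what makes the reported performance ranges comparable across methods.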

Underlying Diffi culties in Using Support Vector Machines
The performance of SVM critically depends on the diversity of the samples (proteins and peptides) in a training dataset and on the appropriate representation of these samples. The datasets used in many of the reported studies are not expected to be fully representative of all of the proteins, peptides, and small molecules with and without a particular functional and interaction profile. Various degrees of inadequate sampling representation likely affect, to a certain extent, the prediction accuracy of the developed statistical learning models. SVM is not applicable to proteins, peptides, and small molecules for which there is insufficient knowledge about their specific functional and interaction profiles. Searching for information about proteins, peptides, and small molecules known to possess a particular profile, and those that do not, is key to more extensive exploration of statistical learning methods for facilitating the study of protein functional and interaction profiles. Apart from literature sources such as PubMed (Beebe, 2006), databases such as Swiss-Prot (Dorazilova and Vedralova, 1992), GenBank (Benson et al. 2004), PIR-PSD (Barker et al. 1999), Gene Ontology (Chalmel et al. 2005), PDB (Berman et al. 2000), the ENZYME database (Bairoch, 2000), TransportDB (Ren et al. 2004), HMTD (Yan and Sadee, 2000), ABCdb (Quentin and Fichant, 2000), TiPS (Alexander, 1999), GPCRDB (Horn et al. 2003), SYFPEITHI (Rammensee et al. 1999), MHCPEP (Brusic et al. 1996), JenPep (Blythe et al. 2002), MHCBN (Bhasin et al. 2003), FIMM (Schonbach et al. 2000), and the FSSP database (Holm and Sander, 1996) are also useful for obtaining information about protein/peptide functional and interaction profiles.
In the datasets of some of the reported studies, there appears to be an imbalance between the number of samples having a profile and the number without it. The SVM method tends to position the separating hyperplane closer to the side with the smaller number of samples (Veropoulos, 1999), which often leads to reduced prediction accuracy for the class with fewer samples or less diversity than the other class. It is, however, inappropriate to simply reduce the number of non-members to artificially match that of members, since this compromises the diversity needed to fully represent all non-members. Computational methods for re-adjusting the biased shift of the hyperplane are being explored (Brown et al. 2000). Application of these methods may help improve the prediction accuracy of SVM in cases involving imbalanced data.
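One widely available re-adjustment in this spirit is to scale the misclassification penalty inversely with class frequency, which shifts the hyperplane back toward the majority class; scikit-learn's SVC exposes this via its class_weight parameter. The imbalanced dataset below is a synthetic stand-in used only to show the effect on minority-class recall.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 40 members (class 1) vs. 360 non-members,
# with heavy class overlap so the imbalance actually hurts recall.
y = np.array([1] * 40 + [0] * 360)
X = rng.normal(size=(400, 10)) + y[:, None] * 0.6

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

recalls = {}
for weight in (None, "balanced"):
    clf = SVC(kernel="rbf", gamma="scale", class_weight=weight)
    clf.fit(X_tr, y_tr)
    # Recall on the minority (member) class is where weighting helps.
    recalls[weight] = recall_score(y_te, clf.predict(X_te))

print(recalls)  # minority-class recall without and with class weighting
```

Weighting the penalty this way preserves the full diversity of the non-member set, unlike the down-sampling approach that the text cautions against.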
While a number of descriptors have been introduced for representing proteins and peptides (Bock and Gough, 2001; Karchin et al. 2002), most reported studies typically use only a portion of these descriptors. It has been found that, in some cases, selection of a proper subset of descriptors is useful for improving the performance of SVM (Xue et al. 2004a; Al-Shahib et al. 2005a; Al-Shahib et al. 2005b). Therefore, there is a need to explore different combinations of descriptors and to select more optimal descriptor sets for more cases, which can be done by using feature selection methods (Xue et al. 2004a; Al-Shahib et al. 2005a; Al-Shahib et al. 2005b). Efforts have also been directed at improving the efficiency and speed of feature selection methods (Furlanello et al. 2003), which will enable their more extensive application. Moreover, indiscriminate use of existing descriptors, particularly overlapping and redundant ones, may introduce noise as well as extend the coverage of some aspects of these special features. Thus, it may be necessary to introduce new descriptors for systems that have been described by overlapping and redundant descriptors. Investigation of incorrectly predicted samples has also suggested that the currently used descriptors may not always be sufficient for fully representing the structural and physicochemical properties of proteins, peptides, and small molecules (Xue et al. 2004b; Li et al. 2005; Yap and Chen, 2005). These findings have prompted work on developing new descriptors (Bhardwaj et al. 2005).

Concluding remarks
SVM has consistently shown promising capability for predicting functional classes of proteins and peptides. Proper use of descriptors for representing proteins and peptides may help further improve the performance of SVM in predicting their functional profiles. The introduction of new descriptors would better represent characteristics that correlate with novel functional and interaction profiles. Moreover, various feature selection methods may be used to select an optimal set of descriptors for a particular prediction problem. Existing algorithms can be improved, and new algorithms may be introduced, to enhance the performance and accuracy of support vector machines. The prediction capability of SVM can be further enhanced with the increasing availability of biological data and more extensive knowledge about the sequence, structure, transcription, and post-transcriptional processing features that define the functional profiles of proteins and peptides. These efforts will enable the development of SVM into a useful tool for facilitating the study of functional profiles of proteins and peptides, complementing other well-established methods such as sequence similarity and clustering methods.