Characterization of non-trivial neighborhood fold constraints from protein sequences using generalized topohydrophobicity.

Prediction of key features of protein structures, such as secondary structure, solvent accessibility and number of contacts between residues, provides useful structural constraints for comparative modeling, fold recognition, ab-initio fold prediction and detection of remote relationships. In this study, we aim at characterizing the number of non-trivial close neighbors, or long-range contacts of a residue, as a function of its "topohydrophobic" index deduced from multiple sequence alignments and of the secondary structure in which it is embedded. The "topohydrophobic" index is calculated using a two-class distribution of amino acids, based on their mean atom depths. From a large set of structural alignments processed from the FSSP database, we selected 1485 structural sub-families including at least 8 members, with accurate alignments and limited redundancy. We show that residues within helices, even when deeply buried, have few non-trivial neighbors (0-2), whereas beta-strand residues clearly exhibit a multimodal behavior, dominated by the local geometry of the tetrahedron (3 non-trivial close neighbors associated with one tetrahedron; 6 with two tetrahedra). This observed behavior allows the distinction, from sequence profiles, between edge and central beta-strands within beta-sheets. Useful topological constraints on the immediate neighborhood of an amino acid, but also on its correlated solvent accessibility, can thus be derived using this approach, from the simple knowledge of multiple sequence alignments.


Introduction
Among the set of relatively simple principles that govern the three-dimensional structures of globular protein domains (Chothia, 1984), two are of obvious importance: i) the masking of a large part of the main chain polarity through the establishment of hydrogen bonds between the amide protons and carbonyl oxygens (mainly within α-helices and β-sheets) and ii) the hydrophobic effect, underlying the formation of the hydrophobic cores of globular domains. In this context, we showed several years ago that strong hydrophobicity has to be conserved in some key positions of a given fold, which were called "topohydrophobic" positions (Poupon and Mornon, 1998; Poupon and Mornon, 2001). Within a typical globular domain, a third of the amino acids belong to a clear hydrophobic group (VILFMYW), but only half of these strong hydrophobic amino acids occupy "topohydrophobic" positions (Poupon and Mornon, 1998; Poupon and Mornon, 2001), which are mainly located within α and β regular secondary structures.
"Topohydrophobic" positions have noticeable features, as observed from a comprehensive analysis of structural alignments and their associated three-dimensional structures: i) the amino acids in these positions are much more buried than those occupying "non-topohydrophobic" positions (Poupon and Mornon, 1998); ii) the side chains of these amino acids are markedly less dispersed from one domain to another (though belonging to the same fold) than those located at "non-topohydrophobic" positions (Poupon and Mornon, 1998); iii) they constitute a continuous network of positions in close contact, matching well the inner part of the hydrophobic core (Poupon and Mornon, 1998); iv) they are mainly occupied by amino acids constituting the folding nuclei.
Identification of these "topohydrophobic" positions from sequence data alone is possible in practice if an accurate alignment of a small number (e.g. 5 to 8) of sufficiently divergent sequences sharing the same fold (e.g. in the 15-25% sequence identity range) can be performed. From sequence data only, amino acids of crucial importance for the considered fold can thus be highlighted, thereby providing topological constraints at long distance along the sequences, which can be useful in a general way to understand topological features of the protein universe (Lindorff-Larsen et al. 2005).
In the present study, we refine and extend the concept of "topohydrophobic" positions by introducing a generalized topohydrophobic index, which evaluates, at each position of a given sequence alignment, the fraction of amino acids belonging to the hydrophobic group. We then characterize the number of non-trivial close neighbors of each position of a multiple alignment, depending on this generalized topohydrophobic index deduced from current evolutionary profiles and on the associated predicted secondary structure state. The non-trivial close neighborhood of a residue, which can also be defined as its non-local or long-range contacts, is the set of amino acids sufficiently distant in the 1D sequence but close in the tertiary structure of the considered protein domain. Residues known to be in local proximity (e.g. covalent and α or β local chain neighbors) are excluded from this set.
In order to define the foundations for predictive studies, we first perform a comprehensive analysis on the basis of accurate reference alignments, selected from structural databases. Hence, we consider a large set of structural alignments allowing good statistics and focus only on the regular secondary structures that are the building blocks of globular protein domains. Thus, the core blocks defined in this way only include regions aligned with maximal reliability. The topohydrophobic index is based on the natural partition of amino acids into two groups, considering the mean atom depth associated with each kind of amino acid. This value is indeed closely related to the mean hydrophobicity, and provides a clear separation between hydrophobic residues and the other ones.
The present analysis significantly differs from previous estimations of absolute contact numbers of residues from amino acid sequence data (Fariselli and Casadio, 2000; Ishida et al. 2006; Kinjo et al. 2005; Pollastri et al. 2001; Pollastri et al. 2002; Yuan, 2005). Indeed, these studies generally consider all contacts in a large sphere (typical distance cut-off of 12 Å between Cβ atoms), whereas we focus here on the mean local non-trivial neighborhood of a position within both kinds of regular secondary structures (α-helices and β-strands), using multiple alignments and a short distance cut-off of 7 Å between Cα atoms. Consequently, the number of predicted neighbors is considerably smaller, in the range of 0 to 6, instead of typically 0-50, as described in previous works. Our study also differs from those devoted to the prediction of long-range contact maps (e.g. Punta and Rost, 2005), as these do not generally focus on the quantification of these contacts with respect to the secondary structure and to the evolutionary hydrophobicity profile of the considered residue.
We show here that an informative neighborhood of residues can be highlighted from sequence data, which differs between helices (often 0 to 2 such neighbors) and strands (mainly 3 to 6 neighbors). Moreover, a clear multimodal behavior of strands can be observed, with a first main state around three neighbors (tetrahedral arrangement) and another one around six neighbors (two tetrahedra sharing a vertex). This multimodal behavior allows the distinction between central and edge β-strands. Given the high accuracy reached by secondary structure predictors using multiple alignments (e.g. Frishman and Argos, 1997; Jones, 1999; Pollastri and McLysaght, 2005; Rost and Sander, 1995; Thompson and Goldstein, 1997), the present study offers the possibility of acquiring good-quality information for predicting tertiary structures from sequence data only, using a minimal number of parameters.

Datasets and reduction of redundancy
The structural alignments used in this study provide enough data to obtain accurate results, while remaining structurally relevant. Structural alignments performed and/or extensively corrected by human expertise, such as those used for the previous description of "topohydrophobic" positions (Poupon and Mornon, 1998), furnish particularly good data; however, owing to the considerable increase in structural data, such an expert-based procedure is no longer feasible for analysis on a large scale.
Among the main available databases of structural alignments (e.g. BaliBASE (Thompson et al. 1999; Thompson et al. 1999), HOMSTRAD (Mizuguchi et al. 1998), PALI (Balaji et al. 2001), FSSP (Holm and Sander, 1994)), only FSSP (Families of Structurally Similar Proteins) offers a large number of families that include at least 8 members and display enough sequence divergence to be informative. For example, PALI, which uses the SCOP classification (Murzin et al. 1995), only included, at the time of this study, 171 families with 8 members or more. Moreover, this number decreases dramatically when a sequence divergence criterion is added (the sequence identity (SI) between two members of the same family must be less than 50%). FSSP is based on an automatic processing of structural alignments, using a score of structural similarity (Z-score) (Holm and Sander, 1993). The FSSP release we considered contains 2859 sub-families, 2520 of which are composed of at least 8 members and thus satisfy the selection criteria on work positions, as defined below (Fig. 1). The amount of data is large, as these 2520 alignments include 403 500 sequences, built from 26 577 different amino acid chains. Many chains are therefore present in several sub-families, particularly owing to the presence of the same globular folds within multi-domain proteins. This redundancy has to be reduced before any analysis.
To that aim, we use two criteria: the level of sequence identity (SI) and the structural alignment quality (Z). One expects, as a main feature, that the structural quality is on average markedly better within regular secondary structures (α-helices and β-strands) than within coil regions. Hence, we do not consider loops and linker regions, in which alignments are known to be often of poor quality or even meaningless.
i) Sequence identity (SI). Among families, a pairwise sequence identity (SI) cut-off of 90% dramatically reduces the number of amino acid chains considered, from 26 577 to 5055. A more stringent SI threshold (50%) still retains 3519 different sequences. We consider this value a good compromise between the amount of informative data and an acceptable level of redundancy. Meanwhile, the number of families with at least 8 members decreases only slightly (2520 for the initial dataset, 2431 for SI = 90% and 2406 for SI = 50%). Figure 1A shows that the mean pairwise identity on work positions within each sub-family is indeed low (8.3%), giving evidence for low redundancy, while good structural superimposition is maintained (Fig. 1B).
ii) Structural alignment quality (Z) (Holm and Sander, 1994). Similarly, a compromise has to be found between the amount of data and their structural relevance. Among several thresholds, we choose a low value of Z = 4 for the multiple alignment quality (this value is calculated with respect to the leader sequence of the family). Indeed, higher values such as Z ≥ 10 reduce the number of sub-families with at least 8 members to 549, whereas Z ≥ 4 retains 1721 sub-families. Figure 1B illustrates the actual distribution of Z values (the mean is 7.3), which are in the range of Z-scores between pairs of native-state structural homologues (typically >5 (Dietmann et al. 2002)).
Combining both thresholds (SI = 50% and Z ≥ 4), we obtain a database of 1721 sub-families of at least 8 members, including a total of 98 436 sequences, of which 2876 are distinct. Figure 2A summarizes this process (steps 1 to 3).
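The two thresholds can be illustrated with a minimal sketch. It is not the procedure actually applied to FSSP: the greedy order of selection, the function names (pairwise_identity, filter_family) and the input layout (gapped aligned strings, one Z-score per member relative to the leader) are assumptions made for illustration.

```python
from typing import List, Tuple

GAP = "-"

def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over the columns where both sequences are aligned."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def filter_family(members: List[Tuple[str, str, float]],
                  si_max: float = 0.5,
                  z_min: float = 4.0,
                  min_members: int = 8) -> List[Tuple[str, str, float]]:
    """members: (identifier, aligned sequence, Z-score versus the leader).

    Keeps members with Z >= z_min and, greedily in input order (an assumption),
    those sharing less than si_max identity with every member already kept.
    Returns an empty list when fewer than min_members survive.
    """
    kept: List[Tuple[str, str, float]] = []
    for ident, seq, z in members:
        if z < z_min:
            continue
        if all(pairwise_identity(seq, other) < si_max for _, other, _ in kept):
            kept.append((ident, seq, z))
    return kept if len(kept) >= min_members else []
```

A greedy pass depends on the input order; a different selection strategy could retain a slightly different set of members for the same thresholds.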
Step 4 considers a composition identity (CI) threshold between families (0.5, 0.5) (see below and Fig. 2B).
iii) Composition identity between families. On average, each amino acid chain appears in 35 sub-families. Two sub-families may thus contain identical members. This redundancy also has to be reduced as much as possible. To that aim, we compute the composition identity CI_ij for each pair (F_i, F_j) of the N sub-families and consider that they are related if CI_ij > D. We then build all the subgroups of related sub-families and, within each subgroup, we eliminate the most common sequences in related families in order to decrease their composition identity to new, acceptable CI_ij values. This is done until all remaining sub-families in the subgroup are unrelated. Note that if the number of sequences in a given sub-family falls below 8, the sub-family is discarded. Moreover, eliminating sequences in sub-families that belong to different subgroups may create new composition similarities between those sub-families.
We therefore perform successive cycles, decreasing the threshold D from 0.8 to the final value of 0.5. During this procedure, we only discard 200 sub-families and 100 amino acid sequences, while two thirds of the redundant sequences (approximately 66 000) are eliminated. Figure 2B illustrates the convergence of this process, which leads to a dataset of 1485 sub-families (31 327 sequences and 2727 distinct amino acid chains) with at least 8 members (mean 20) and sharing no more than (0.5, 0.5) composition identity (Fig. 1C). In a given family, pairwise sequence identity is necessarily less than 50% and is generally much lower (Fig. 1A), and members have a reliable structural alignment quality (Z ≥ 4) with respect to the leader sequence of the family (mean 7.3, Fig. 1B).
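The iterative reduction can be sketched as follows. The text does not give the formula for the composition identity, so this sketch assumes that CI is the pair of fractions of shared members relative to each family, and that two families are related when either fraction exceeds D. The intermediate threshold (0.65), the choice of which family loses a member and all function names are likewise hypothetical; only the end points 0.8 and 0.5 and the minimum of 8 members come from the text.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple

def composition_identity(fi: Set[str], fj: Set[str]) -> Tuple[float, float]:
    """Assumed definition: shared members as a fraction of each family."""
    if not fi or not fj:
        return 0.0, 0.0
    shared = len(fi & fj)
    return shared / len(fi), shared / len(fj)

def reduce_redundancy(families: Dict[str, Set[str]],
                      thresholds=(0.8, 0.65, 0.5),
                      min_members: int = 8) -> Dict[str, Set[str]]:
    """Lower the threshold D step by step; at each step, remove shared chains
    until no pair of families is related, discarding families that become too small."""
    fams = {name: set(members) for name, members in families.items()}
    for d in thresholds:
        while True:
            pair = next(((a, b) for a, b in combinations(sorted(fams), 2)
                         if max(composition_identity(fams[a], fams[b])) > d), None)
            if pair is None:
                break  # all remaining families are unrelated at this threshold
            a, b = pair
            # Chain occurring in the largest number of families among the shared ones.
            counts = Counter(s for members in fams.values() for s in members)
            victim = max(sorted(fams[a] & fams[b]), key=lambda s: counts[s])
            # Remove it from the larger family (an arbitrary illustrative choice).
            target = a if len(fams[a]) >= len(fams[b]) else b
            fams[target].discard(victim)
            if len(fams[target]) < min_members:
                del fams[target]
    return fams
```

Each pass removes either one chain or one family, so the loop always terminates; the result satisfies the final threshold for every remaining pair.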
The original FSSP alignments are reformatted to record the following information: sub-family name and PDB accession number of the leader sequence, number of members (≥8), PDB accession numbers of these members, associated structural FSSP Z indexes, alignment length, corresponding aligned sequences and aligned secondary structures (assigned through DSSP). In addition, 3D coordinates of α-carbons and solvent accessibilities calculated by DSSP are reported for each residue. Figure 3 shows a typical file for a family of eight members.

Amino acid classes.
The large dataset of reliable multiple alignments constituted here nevertheless remains far too small to consider the twenty different amino acids at each work position. Clustering the amino acids into a limited number of classes is thus necessary. Usually, three to six classes may be rationally defined (e.g. VILFMYW for the strong hydrophobic class, mainly present within the internal sides of regular secondary structures, GPDSN as main loop-forming residues and ARCQTEKH for the intermediate class (Hennetin et al. 2003)). Here, we consider a simple partition into two classes, derived from a continuous scaling of the 20 amino acids with respect to their mean atom depth, as defined from a representative set of globular proteins. Mean atom depth indeed allows the sorting of the 20 amino acids into two distinct groups: IVFLWMCYA (G1) and HTGSPNRQDEK (G2) (Fig. 4). This classification shows good agreement with mean amino acid burying values, defined through Voronoï tessellations on representative sets of globular domains (Soyer et al. 2000). The two main groups G1 (mainly hydrophobic amino acids) and G2 (mainly neutral and hydrophilic amino acids) gather 44 and 56% of the total number of amino acids, respectively. The amino acids of group G1 are similar to those that were considered hydrophobic by other studies dedicated to long-range contacts (e.g. Punta and Rost, 2005).
Figure 2 (caption fragment). Step 1; 90% sequence identity threshold. Step 3; structural Z-score threshold ≥ 4. Step 4; composition identity between families ≤ (0.5, 0.5). B. The three-step CI redundancy elimination (see text): number of different chains (solid line), total number of chains (dotted line).

Work positions
We call "work positions" those positions of a multiple alignment at which at least 8 amino acids are aligned. Using this absolute number, rather than a relative proportion of all aligned sequences, allows the handling of representative subsets of these alignments, while ignoring positions in which gaps are predominant.

Generalized topohydrophobic index
Each work position is characterized by the percentage of its amino acids that belong to the G1 group. We call this value the generalized topohydrophobic index, or y1, because it records the proportion of hydrophobic amino acids (G1) occupying the position. Distributions of the y1 parameter are plotted as histograms with bins of width 1/8, in reference to the minimal number of amino acids (8) that have to be present in a work position for it to be considered.
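As a minimal sketch of this definition, the index can be computed column by column from an alignment. The G1 and G2 groups and the minimum of 8 residues are taken from the text; the function names, the '-' gap symbol and the rounding of y1 to the nearest multiple of 1/8 are assumptions made for illustration.

```python
from typing import List, Optional

G1 = set("IVFLWMCYA")      # mainly hydrophobic group
G2 = set("HTGSPNRQDEK")    # mainly neutral and hydrophilic group
MIN_RESIDUES = 8           # minimum number of aligned residues for a work position

def topohydrophobic_index(column: str) -> Optional[float]:
    """Return y1 for one alignment column, or None if it is not a work position.

    Gaps and any symbol outside the 20 standard amino acids are ignored.
    """
    residues = [r for r in column.upper() if r in G1 or r in G2]
    if len(residues) < MIN_RESIDUES:
        return None
    return sum(r in G1 for r in residues) / len(residues)

def bin_index(y1: float) -> float:
    """Assign y1 to the nearest multiple of 1/8 (assumed binning rule)."""
    return round(y1 * 8) / 8

def alignment_indices(alignment: List[str]) -> List[Optional[float]]:
    """y1 for every column of an alignment given as equal-length gapped strings."""
    return [topohydrophobic_index("".join(col)) for col in zip(*alignment)]
```

With exactly 8 aligned residues, y1 is already a multiple of 1/8; the binning only matters for larger families.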

Major secondary structure
Figure 3. A sub-family example. A. Sequence and secondary structure alignment file. The sub-family "1mai", whose leader sequence is the PH domain of phospholipase C delta (PDB code 1mai), includes eight members. B. Superimposition of the PH folds of 1mai and 1bak (Z-score 8.3), according to the FSSP alignment shown in A. 53 Cα atoms belonging to the seven strands and to the C-terminal helix have been superimposed (RMSD 1.59 Å). The superimposed segments of these two sequences share 19% identity (13% over the entire domain). This superimposition is typical of this sub-family and is representative of the whole bank.
We choose to take into account only work positions in which the same secondary structure is sufficiently conserved (in more than x% of the aligned residues). Figure 5A shows the number of work positions as a function of this threshold x. We consider that x ≥ 75% offers an acceptable compromise, ensuring that work positions are structurally relevant with respect to secondary structure conservation while keeping enough data to perform a large-scale study. Figure 5B shows the distribution of work positions in the different secondary structures as a function of the generalized topohydrophobic index y1.
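The conservation filter described in this section can be sketched as follows. The sketch assumes that the DSSP assignments have already been reduced to three one-letter states (H, E, C) and that a conservation of exactly 75% is accepted; the function name and the gap symbol are hypothetical.

```python
from collections import Counter
from typing import Optional

def major_secondary_structure(states: str, threshold: float = 0.75) -> Optional[str]:
    """Return the dominant state (H, E or C) of a work position, or None.

    states: one secondary structure letter per aligned residue; '-' marks a gap.
    A state is returned only if it covers at least `threshold` of the residues.
    """
    observed = [s for s in states if s != "-"]
    if not observed:
        return None
    state, count = Counter(observed).most_common(1)[0]
    return state if count / len(observed) >= threshold else None
```

For a work position with 8 residues, the threshold corresponds to at least 6 residues in the same state.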

Mean solvent accessibility of a work position
Relative accessibilities are computed from the absolute accessibilities provided by DSSP. The standard accessible surfaces (in Å²) of residues are derived from the canonical G-X-G configuration calculations of Shrake and Rupley (Shrake and Rupley, 1973).

Non-trivial neighbors
Figure 4 (caption fragment). [...], plotted in decreasing order of mean atom depth, show two distinct groups of amino acids: on the one hand, the mainly hydrophobic ones (44% of the total number of amino acids in the bank) and, on the other hand, the neutral and hydrophilic ones (56% of the amino acids). Histidine, which lies at the frontier between these two groups, was also shown to be the most indifferent amino acid with respect to its α, β or coil state.
The non-trivial neighborhood of an amino acid can be described from the known atomic coordinates. Two amino acids are defined as non-trivial neighbors if their Cα atoms are separated by less than 7 Å (Tudos et al. 1994) and if they are separated in sequence by more than 6 residues (Fig. 6). The mean number of neighbors for a work position is defined as the average number of non-trivial neighbors of the amino acids belonging to that position. An even better way to describe the amino acid neighborhood, which is independent of a cut-off value, would have been to use weighted Voronoï tessellations (Angelov et al. 2002; Dupuis et al. 2005; Dupuis et al. 2004; Soyer et al. 2000). However, this description is prohibitively time-consuming and thus out of scope for a large-scale study.
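The neighbor definition translates directly into a short sketch. The 7 Å cut-off and the sequence separation of more than 6 residues come from the text; the function names are hypothetical, and the sketch assumes a single chain without breaks, so that list indices can stand for sequence positions.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]

def non_trivial_neighbors(ca_coords: Sequence[Coord],
                          max_dist: float = 7.0,
                          min_separation: int = 6) -> List[int]:
    """Number of non-trivial neighbors of each residue of a chain.

    Residues i and j are neighbors if their C-alpha atoms are closer than
    max_dist (in angstroms) and |i - j| > min_separation.
    """
    n = len(ca_coords)
    counts = [0] * n
    for i in range(n):
        # j starts at i + 7 so that the sequence separation exceeds 6.
        for j in range(i + min_separation + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < max_dist:
                counts[i] += 1
                counts[j] += 1
    return counts

def mean_neighbors(counts_at_position: Sequence[int]) -> float:
    """Average neighbor count over the residues aligned at one work position."""
    return sum(counts_at_position) / len(counts_at_position)
```

The double loop is quadratic in chain length, which is acceptable for single domains; a spatial grid would be preferable for very large structures.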

Dataset
A set of benchmark alignments is selected as described in the Methods section, in order to estimate the number of long-range (or non-trivial) contacts of amino acids with respect to the generalized topohydrophobic index deduced from the multiple sequence alignment and to the associated secondary structure. The dataset considered here includes 1485 sub-families (31 327 sequences and 2727 distinct amino acid chains) with at least 8 members (mean number 20) and sharing no more than (0.5, 0.5) composition identity, a parameter that was introduced in order to avoid redundancy between sub-families. In a given family, pairwise sequence identity is necessarily less than 50% and almost always far lower (mean 8.3%), and the members have a reliable structural alignment quality (Z) of at least 4 (mean 7.3) with respect to the leader sequence of the family. It is worth noting that not all proteins sharing the same fold and fulfilling the sequence identity and structural alignment quality criteria described above are clustered into a single family. Some of the sub-families described above are subsets of proteins possessing at least one domain with a given fold. This distribution into several sub-groups depends directly on the initial FSSP dataset and on the selection procedure. For example, some members of the family shown in Figure 3 (family 1mai, Pleckstrin Homology (PH) fold) are found in eight other families with a PH fold domain. However, the alignments cover well the known universe of globular domains, and are thus representative of the structural conservation and diversity within proteins.
We analyze the main features of "work positions" in multiple alignments (see definition in the Methods section) for which more than 75% of the residues share the same secondary structure. As structural superimpositions and secondary structure assignments were performed automatically, local mismatches may occur. However, these mismatches only constitute a marginal fraction of the final alignments obtained after filtering of the initial dataset. Only 8% of the 97 000 retained work positions exhibit more than one H/E discrepancy; these thus only constitute background noise, which does not appreciably modify the main results of this study. The good quality of solvent accessibility predictions, which are directly performed on our filtered database of structural alignments (see below) and are similar to results obtained with other methods (Pascarella et al. 1998; Thompson and Goldstein, 1996), further supports the overall structural relevance of work positions.
The partition of amino acids into two groups, G1 (IVFLWMCYA) and G2 (HTGSPNRQDEK), as introduced in the Methods section, and the distribution of group compositions in steps of 1/8 lead to 9 distinct topohydrophobic y1 values (0, 0.125, …, 1), which can describe a work position. Combining y1 with X, the major secondary structure (X = helix, strand or coil), 27 classes of work positions (X, y1) can thus exist. The 27 classes are generally well represented in the bank. The least populated classes are the limit cases, consisting of fully hydrophilic strands (Strand, 0) and fully hydrophobic coils (Coil, 1) (462 and 296 work positions, respectively; Fig. 5B). We principally consider the 18 classes of work positions associated with regular secondary structures (X = H or E; 60 021 and 36 830 work positions, respectively).
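The class count can be checked with a few lines. This is only an enumeration of the (X, y1) combinations named in the text; the one-letter state codes and variable names are assumptions.

```python
from itertools import product

Y1_VALUES = [k / 8 for k in range(9)]   # 0, 0.125, ..., 1
STATES = ("H", "E", "C")                # helix, strand, coil

# Every combination of major secondary structure and binned y1 value.
CLASSES = list(product(STATES, Y1_VALUES))

# Classes associated with regular secondary structures only.
REGULAR_CLASSES = [(x, y1) for x, y1 in CLASSES if x in ("H", "E")]
```

Here CLASSES holds the 27 possible classes and REGULAR_CLASSES the 18 helix and strand classes on which the analysis concentrates.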

Positions within helices
Relative solvent accessibility
Figure 7A illustrates the behavior of the mean relative solvent accessibility in helix work positions within multiple alignments, as a function of the generalized topohydrophobic index y1, ranging from 0 to 1. As expected, the mean relative accessibility to solvent diminishes when y1 increases. We also consider the individual behaviors of G1 and G2 residues. We observe that the G1 and G2 values depend on the y1 value of the work positions, and both diminish when y1 increases. The curves for the two groups are nearly parallel, with the G1 mean values smaller, as expected, than the G2 ones. The distribution of mean relative accessibilities around the mean values shown in Figure 7A is illustrated in Figure 8A. For very low y1 values (low hydrophobicity), the mean relative accessibilities follow a Gaussian-like distribution centered on 0.45 and, as y1 increases, this curve collapses towards the origin, with a mean below 0.1 for 95% of the 1977 totally hydrophobic work positions (y1 = 1). For y1 = 0 (fully neutral or hydrophilic positions), a small peak, indicated by a star, reveals the existence of buried positions. It likely corresponds to salt bridges, and more generally to pairs of side chains in mutually neutralizing polar contacts within globular cores. This observation moreover provides indirect biophysical support for the data quality of the FSSP-derived bank.

Number of non-trivial close neighbors
The number of non-trivial close neighbors (Fig. 8B) mirrors the behavior of the relative accessibility (Fig. 8A). The number of non-trivial neighbors of work positions within helices increases as hydrophobicity rises from y1 = 0 to y1 = 1, but is rarely greater than 2, even for completely buried positions (mean accessibility < 0.1) within the internal sides of helices. This mainly results from the fact that, in such configurations, the close neighborhood is principally occupied by trivial neighbors, which restricts the free space available for external residues, and from the convex, roughly cylindrical geometry of α-helices, with a large dispersion of side chains. Both the G1 and G2 groups show this increase in the number of non-trivial neighbors (Fig. 7B). Work positions with high hydrophobicity within helices mainly establish contacts with other helices (Fig. 7C). Moreover, these contacts mainly involve G1 amino acids within the hydrophobic core (data not shown). Figure 8C illustrates such a situation.

Positions within strands
A similar investigation was performed for work positions associated with β-strands (Figs. 9 and 10). The most striking result for β-strands is a strong increase in the number of non-trivial first neighbors and a clearly multimodal distribution observed for almost all y1 values, and in particular for the less hydrophobic ones (low y1 values). The weakly populated mode, centered on approximately one neighbor, is likely associated with highly external positions at the extremity of some strands. The two other modes (near 3 and 6 neighbors) are likely to correspond to external (edge) and internal (central) positions of strands within β-sheets, respectively. Indeed, the second mode (around 3) mainly relies on the architecture of β-strands within sheets, where side chains in positions i, i + 1, i + 2 of one strand occupy a roughly equilateral triangle. This triangle constitutes the basis of the interaction with another amino acid j of a neighboring strand, linked to the "i" strand through canonical main chain H-bonds. These four residues constitute a more or less deformed tetrahedron (distance between Cβ ~6.2 Å), which represents the basic unit of compact packing of similar-sized spheres (Fig. 10C). The third mode (around 6) mainly corresponds to a geometry with two tetrahedra (one strand sandwiched by two others) sharing a vertex, which has 6 first non-trivial neighbors (Fig. 10C). Many deviations from this ideal scheme occur and tend to flatten the Gaussian distribution. As for helices, the number of non-trivial neighbors increases with the hydrophobicity of a work position (Fig. 9B), and the non-trivial neighbors of strand residues are very often found within other strands (Fig. 9C). The present study quantifies this behavior and offers the opportunity to gain information on the probable participation of an amino acid in an internal or external strand position from multiple sequence alignments alone.
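To make the three modes concrete, the following sketch assigns a mean neighbor count to the nearest of the approximate mode centers quoted above (1, 3 and 6). This nearest-center rule is purely illustrative: it is neither the Gaussian deconvolution nor the decision-tree predictor used in this study, and the labels and function name are hypothetical.

```python
# Approximate mode centers taken from the text; labels are illustrative.
MODES = {
    "strand extremity": 1.0,   # weakly populated mode, about one neighbor
    "edge": 3.0,               # one tetrahedron
    "central": 6.0,            # two tetrahedra sharing a vertex
}

def nearest_mode(mean_neighbors: float) -> str:
    """Label of the mode whose center is closest to the observed mean neighbor count."""
    return min(MODES, key=lambda name: abs(MODES[name] - mean_neighbors))
```

For instance, a strand work position with a mean of 3.4 non-trivial neighbors falls in the edge mode, and one with 5.8 in the central mode.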

Influence of fold classes
The dataset is large enough to estimate the putative influence of fold classes on some parameters. Four main classes, as described in the SCOP classification (Murzin et al. 1995), were considered (all-α (297 sub-families), all-β (370 sub-families), α/β (530 sub-families) and α + β (131 sub-families)). One can expect that differences in the tertiary structures between the four fold classes are reflected in the level of hydrophobic contacts involving residues of the G1 group, in particular in positions with a high topohydrophobic index (y1 = 1). Indeed, one can observe that the mean number of non-trivial neighbors belonging to the G1 group for strand work positions with a high topohydrophobic index is appreciably higher for the α/β class than for the three others (4.51 versus 4.02 (α), 3.25 (β) and 3.79 (α + β); Fig. 11). This is all the more noticeable since the total number of non-trivial neighbors of strand work positions with a topohydrophobic index of 1 is rather constant (Table 1). One hypothesis to explain such a behavior is that a larger number of fully hydrophobic work positions with a structural role exist in the α/β and even α classes, but this remains to be further investigated. Furthermore, one can note that better performance of programs for the prediction of long-range contacts is reported by at least two studies for this same α/β class (MacCallum, 2004; Punta and Rost, 2005).

Discussion
The prediction of non-trivial neighborhood, or long-range contacts, from protein sequences is of particular interest to improve comparative modeling and to enhance fold recognition and ab-initio fold prediction. It can also help to detect remote relationships between protein sequences and to solve experimental structures. Contact prediction methods have received much attention during the last decade and often combine the evolutionary information available from multiple alignments and the prediction of secondary structures. They can be roughly classified into two non-exclusive categories: statistical correlated mutations approaches (see for example Halperin et al. 2006; Kundrotas and Alexov, 2006) and machine-learning approaches (see for example Punta and Rost, 2005). While most methods aim at predicting contact maps, several other approaches have been developed to estimate the total number of contacts (Fariselli and Casadio, 2000; Ishida et al. 2006; Kinjo et al. 2005; Pollastri et al. 2001; Pollastri et al. 2002; Yuan, 2005), but these generally define large coordination numbers, including the trivial neighborhood, and rarely link these numbers to the topological and evolutionary features of the region that includes the residue concerned.
Figure 10 (caption fragment). Using a Gaussian approximation to deconvolute the overall profile highlights the multimodal distribution of strand neighbors. Three modes (1, 2 and 3) are present: ~1.2, 3.3 to 4.5 and 5.6 to 6.5 mean neighbors, respectively. C. First two views. Common tetrahedron found between the Cβ atoms of residues i, i + 1, i + 2 of a strand and another residue in an adjacent strand. The example shown in two orthogonal views is from 1mai (S98, I99, V100 and V75). The mean tetrahedron edge size is 6.3 Å. Last view. Two tetrahedra sharing a vertex: i, i + 1, i + 2 of a strand and j, j + 1, j + 2 of another one, which sandwich a residue. The example shown is also taken from 1mai (V75, R76, M77/L108, D109, L110/S98; mean edge size of 5.9 Å).
Our analysis outlines the relationship between the mean number of non-trivial neighbors and a topohydrophobic index, which relies on the mean hydrophobicity of a position within a multiple alignment of sequences, as a function of the secondary structure. The topological data we collected here might be used in a predictive perspective, as secondary structures can currently be predicted with good accuracy using multiple alignments (see for example Rost and Sander, 1993). As noticed in earlier studies (Punta and Rost, 2005), the performance of the various estimations that can be made on long-range contacts directly depends on the quality of the evolutionary profiles, which have to be large and to contain divergent sequences to furnish accurate information.
The original result of this study is that different behaviors relative to non-trivial neighbors can be observed for helix and for strand residues and, among strands, for central and edge β-strands. Starting from these observations, the prediction of the topological nature of β-strands can be approached using classification methods such as decision trees (see Supplementary data 1). Briefly, using parameters such as the length of the strand, its mean hydrophobicity and the periodicity of G1 and G2 residues, combined with the topohydrophobic index, decision trees lead to an accuracy of 80% for the prediction of edge/central positions within β-sheets (Supplementary data 1). Although it is difficult to compare methods using different datasets for training and prediction, this approach appears to achieve a prediction accuracy similar to the one obtained by Siepen and coworkers, which is based on the use of support vector machines (SVM) and decision trees.
The use of the topohydrophobic index, combined with information on the nature of secondary structures, the group (G1 or G2) to which the residue belongs, and environmental parameters describing the local periodicity, also allows the relative solvent accessibility of a residue occupying a work position to be predicted in two- or three-state models (exposed, intermediate and buried; see Supplementary data 2). In the ideal case, when secondary structures are "known", solvent accessibility predictions using this methodology led to a Q2 of 79% (16% threshold) versus 75% for other methods tested on the same dataset and based on neural networks or probability profiles/support vector machines, and to a Q3 of 65% versus 58% for the same other methods (9%-36% thresholds). On the one hand, the accurate prediction of solvent accessibility using generalized "topohydrophobicity" provides additional constraints on informative positions of a sequence (the work positions). On the other hand, these results further support the intrinsic quality of the dataset used for this study.
The present analysis sheds light on important geometrical and topological parameters that can help to understand protein sequence-fold relationships. It appears of particular interest that the dichotomy (hydrophobicity-hydrophilicity) between only two nearly equally populated classes of amino acids provides a very simple way to derive useful and often accurate topological data for protein fold recognition.

Thompson, J., Plewniak, F. and Poch, O. 1999. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27:2682-90.
Thompson, J.D., Plewniak, F. and Poch, O. 1999. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15:87-8.
Thompson, M.J. and Goldstein, R.A. 1996. Predicting solvent accessibility: higher accuracy using bayesian statistics and optimized residue substitution classes. Proteins, 25:38-47.
Thompson, M.J. and Goldstein, R.A. 1997. Predicting protein secondary structure with probabilistic scheme of evolutionarily derived information. Protein Sci., 6:1963-75.
Tudos, E., Fiser, A. and Simon, I. 1994. Different sequence environments of amino acid residues involved and not involved in long-range interactions in proteins. Int. J. Pept. Protein Res., 4:205-8.
Yuan, Z. 2005. Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics, 6:248.
Characterization of Non-Trivial Neighborhood Fold Constraints from Protein Sequences using Generalized Topohydrophobicity Guillaume Fourty, Isabelle Callebaut and Jean-Paul Mornon

Supplementary Data 1
Use of decision trees for predicting the edge/central nature of β-strands, as a function of the topohydrophobic index and of the predicted secondary structures
Using sequence data from work positions in multiple alignments and the J4.8 implementation of the C4.5 program (Quinlan, 1993) to derive decision trees, we aimed at predicting the topological nature of strands (central or edge). We adopted the following strategy:

Dataset
We used information provided by DSSP on beta partners (BP) and we only considered "complete" strands (undamaged by the DSSP assignment and the FSSP automatic multiple alignment procedure). On the leader sequences of the 1485 sub-families, we identified 7541 central strands and 7886 edge strands (49% and 51% of the total strands, respectively). We observed that 75% of amino acids possessing fewer than four non-trivial neighbors belong to edge strands, while 83% of amino acids possessing more than five non-trivial neighbors belong to central strands. This can be related to the canonical neighborhood of one or two tetrahedral configurations, as commented in the main text. Among the 15407 strands selected above, 8018 possess at least one well-defined work position, which can thus be used to predict their nature (central or edge). This dataset was used to provide training and cross-validation data.

Selected attributes for decision tree classification
We used only a few attributes in order to obtain a simple classification tree (and therefore simple rules) and to easily discriminate their influence in the prediction process. The parameters were the length $L$ of the strand, expressed in amino acid units; the strand hydrophobicity $H$, defined as $H = \frac{1}{L}\sum_{i=1}^{L} h_i$ ($h_i = 1$ for G1 amino acids and $h_i = 0$ for G2 ones); the polar periodicity of the strand $P$, defined as $P = \frac{1}{L-1}\sum_{i=1}^{L-1} p_i$ ($p_i = 0$ if $h_i = h_{i+1}$ and $p_i = 1$ if $h_i \neq h_{i+1}$); and the strand charge $C$, defined as $C = \frac{1}{L}\sum_{i=1}^{L} C_i$ ($C_i = 1$ for D, E, K, R, H and $C_i = 0$ for other amino acids). These parameters can be extended to the mean values $H_m$, $P_m$, $C_m$ for aligned sequences within sub-families. From multiple sequence alignments, we also introduced a simple additional parameter: the mean topohydrophobicity, $T_m$, which is the mean of the $y_1$ indexes when several work positions are present in the considered strand.
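As an illustration only, the strand attributes defined above can be sketched in a few lines of Python. The function names are hypothetical, and the G1 set used here (VILFMYW) is a placeholder assumption: the actual G1/G2 partition is the one defined in the main text from mean atom depths.

```python
# Placeholder two-class alphabet (assumption): the real G1/G2 partition is
# defined in the main text; this set only makes the sketch runnable.
G1_PLACEHOLDER = set("VILFMYW")
CHARGED = set("DEKRH")  # residues counted with C_i = 1 in the definition of C


def strand_attributes(strand, g1=G1_PLACEHOLDER):
    """Return (L, H, P, C) for one strand given as a one-letter sequence."""
    length = len(strand)
    if length < 2:
        raise ValueError("P needs at least two residues (division by L - 1)")
    h = [1 if aa in g1 else 0 for aa in strand]
    hydrophobicity = sum(h) / length
    # p_i = 1 when two consecutive residues belong to different classes
    periodicity = sum(a != b for a, b in zip(h, h[1:])) / (length - 1)
    charge = sum(aa in CHARGED for aa in strand) / length
    return length, hydrophobicity, periodicity, charge


def mean_attributes(aligned_strands, g1=G1_PLACEHOLDER):
    """Average H, P and C over aligned strands, giving (H_m, P_m, C_m)."""
    rows = [strand_attributes(s, g1)[1:] for s in aligned_strands]
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))
```

A perfectly alternating strand gives P = 1, whereas a uniformly hydrophobic or uniformly polar strand gives P = 0. Gapped positions in the aligned strands are not handled in this sketch.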
The predictive power of this approach should be compared to the baseline level of a random prediction (50%) or to that of always predicting the major class (edge β-strands), at 51%. Table S1 shows the results for the leader strands of the considered sub-families, using various decision trees built with single parameters or combinations of them. After the length $L$, the hydrophobicity $H$ is the most determinant parameter. With only two parameters, a decision tree is already efficient at distinguishing central and edge strands, as shown in Figure S1, and gives 77% of good predictions. The use of the strand length $L$ and of two parameters deduced from multiple sequence alignments, $H_m$ and $T_m$, leads to nearly 80% of good predictions (Table S2). This combination seems to be the best one, although it involves only a few basic data. As is often the case, it is difficult to precisely compare these results with other approaches dealing with the same topic, as many features differ. However, prediction accuracy appears to reach the same level as in a previous study. This analysis uses secondary structure elements (β-strands) defined using DSSP from experimental structures. Accuracy should be reduced when starting from secondary structure predictions, although a good level of secondary structure prediction accuracy can now be reached using predictive tools such as PSI-PRED (Jones, 1999).

Supplementary Data 2
Use of decision trees for predicting the relative accessibility as a function of the topohydrophobic index and of the predicted secondary structures
The estimation of the number of non-trivial neighbors described in this study is based on divergent and accurate multiple sequence alignments, explored through a highly simplified alphabet made of only two amino acid classes, G1 and G2 (see main text). We similarly addressed the prediction of the relative solvent accessibility of a residue in two- or three-state models. First, in order to calibrate the process, we considered that the secondary structures are known, i.e. we used the DSSP assignments based on 3D coordinates. Then, we used this approach to predict the burying of selected positions within the multiple alignments, assuming that secondary structure predictions in these positions are accurate.

Selected attributes for the decision tree classification
To describe a residue of the leader sequence occupying a work position, we used the secondary structure state (H or E), the generalized topohydrophobic parameter $y_1$ deduced from the multiple sequence alignment, the group parameter G1 (0 or 1) of the considered residue and four environment parameters $\mathrm{Env}_i$ ($i = 1, \dots, 4$), associated with the positions at distance $i$ from the considered position $n$: $\mathrm{Env}_i = (G_{n-i} + G_{n+i})/2$. $\mathrm{Env}_i$ can thus take three values (1.0, 0.5 or 0.0), describing the local periodicity. Gaps and 5 amino acids at each extremity of the sequence were discarded. As for the prediction of edge and central β-strands, these attributes were completed by those derived from multiple alignments, which are the SSM (Secondary Structure - Major state) and the topohydrophobic index $y_1$. Building decision trees with those 7 attributes is time consuming when studying the whole FSSP-derived database. In order to overcome this difficulty, we used a reduced bank of 270 multiple alignments derived from the 1485 sub-families of the whole bank. These 270 leader sequences include non-redundant SCOP folds with a total of 77 108 amino acids and 16 000 (H or E) work positions.
Figure S2. Prediction of relative solvent accessibility. Evolution of the level of good predictions (dotted line) and of the fraction of predicted residues (solid line), as a function of the occupancy of work positions. The accessibility threshold is fixed at 16% and the secondary structure conservation at 75%.
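The environment parameters can be illustrated with a minimal Python sketch. It assumes that the leader sequence has already been translated into a list of 0/1 group values (1 for G1, 0 for G2) and that positions are indexed from 0; the function name and the way excluded positions are reported (`None`) are hypothetical choices, and gap handling is left out.

```python
def environment_parameters(g, n, max_shift=4, discard=5):
    """Return [Env_1, ..., Env_4] for position n of a 0/1 class sequence g.

    g[k] is 1 for a G1 residue and 0 for a G2 residue. Positions within
    `discard` residues of either extremity are skipped, as in the text.
    """
    if n < discard or n >= len(g) - discard:
        return None  # too close to an extremity of the sequence
    # Env_i averages the group values found i residues before and after n
    return [(g[n - i] + g[n + i]) / 2 for i in range(1, max_shift + 1)]
```

For a strictly alternating sequence, the odd shifts (Env1, Env3) and the even shifts (Env2, Env4) take opposite extreme values, which is how these parameters capture a period-2 pattern.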
• Influence of the work position occupancy on prediction. To evaluate the influence of the available data in a work position, we used a Q2 index in a two-state model with a classical relative solvent accessibility threshold of 16%. Figure S2 shows that, as expected, the level of good predictions increases with the occupancy of a work position and is quite satisfactory above 8 to 10 members per work position. For the time being, the minimal level of the major secondary structure is kept at 75% for each considered work position, as described in the Material and Methods section.
• Influence of the major secondary structure threshold on prediction. We fixed the minimal work position occupancy at 8 and let the major secondary state range from 33% to 100%. Figure S3 shows the link between these parameters and confirms that a level of 75% for the major secondary structure threshold constitutes an acceptable compromise for a large-scale study. When work position occupancy and major secondary structure conservation are high, predictions are better but remain applicable only to a reduced set of work positions. The pair (1, 33%) leads to 74% of good predictions for 100% of H or E positions. In contrast, (35, 95%) leads to 87% of good predictions but only for 3% of H or E positions. (8, 75%) and (10, 80%) give 77% and 79% of good predictions, respectively, for 40% and 27% of H or E positions. All these predictions are performed through a 10-fold cross-validation procedure on the whole bank of 16 000 residues occupying a work position.
• Two-state prediction. Figure S4 shows the decision tree built for a two-state model with a threshold at 16% of relative solvent accessibility and work positions (10, 80%). It led to 79% of good predictions. Clearly, Env1 and Env3 have a minor influence compared with Env2 and Env4, which are tuned to the natural periodicity of strands and helices, respectively.
• Three-state prediction. Using the same definition of work positions (10, 80%) and a three-state model with classical 9% and 36% thresholds of relative solvent accessibility, a similar process leads to a Q3 of 65% good predictions.
• Comparison with previous approaches. In order to evaluate the predictive power of our approach, in the case where the secondary structure is assumed to be "ideally" known (i.e. by automatic assignment based on experimental atomic coordinates), we compared it to results obtained on banks composed of between 111 and 421 structures by other sophisticated approaches using neural networks (NN), Bayesian statistics (Thompson and Goldstein, 1996) and probability profiles/support vector machines (PP). For example, for a 16% threshold, Q2 values are around 75% for NN and PP and around 79% for our approach; for a 9%-36% three-state model, Q3 values are close to 58% for all these methods and around 65% for our approach.
It is worth noting that our approach does not predict all positions (it focuses on available work positions, see Material and Methods for a definition) [...] of residues correctly predicted by RAPT (and not by PHD) is higher than the reverse situation (Table S3). Thus, within work positions, RAPT provides on average better prediction results than PHD. Although addressing a limited number of residues, it takes advantage of simple decision rules, which are easily interpretable.
Figure S3. Prediction of relative solvent accessibility. Evolution of the level of good predictions (dotted line) and of the fraction of predicted residues (solid line) as a function of the secondary structure conservation.
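The two- and three-state models and the Q2/Q3 indexes used throughout this supplementary section can be made concrete with the following Python sketch. The function names are hypothetical, and assigning a value exactly equal to a threshold to the more exposed state is an assumption of this sketch, not a convention stated in the text.

```python
def accessibility_state(rsa, thresholds=(16.0,)):
    """Map a relative solvent accessibility (in %) to a state index.

    With one threshold (e.g. 16) the states are 0 = buried, 1 = exposed.
    With two thresholds (e.g. 9 and 36) they are 0 = buried,
    1 = intermediate, 2 = exposed. Boundary handling (>=) is an assumption.
    """
    return sum(rsa >= t for t in thresholds)


def q_index(predicted, observed):
    """Percentage of positions whose predicted state equals the observed one.

    This is Q2 for a two-state model and Q3 for a three-state model.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)
```

The same `q_index` function serves for both models, since only the number of possible states changes; it is computed only over the positions that are actually predicted, here the work positions.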