Universal Features for the Classification of Coding and Non-coding DNA Sequences.

In this report, we revisited simple features that allow the classification of coding sequences (CDS) from non-coding DNA. The spectrum of codon usage of our sequence sample is large and suggests that these features are universal. The features that we investigated combine (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine, Guanine, Adenine probabilities in 1st, 2nd, 3rd position of triplets, respectively, (iv) the product of G and C probabilities in 1st and 2nd position of triplets. These features are a natural consequence of the physico-chemical properties of proteins and their combination is successful in classifying CDS and non-coding DNA (introns) with a success rate >95% above 350 bp. The coding strand and coding frame are implicitly deduced when the sequences are classified as coding.


Introduction
Since amino acids are encoded by codons, which are triplets of nucleotides (A, C, G or T, i.e. Adenine, Cytosine, Guanine and thymine, respectively), coding DNA is necessarily a multiple of three nucleotides. Therefore, should a stretch of DNA start and end with stop codons (TAA, TAG, TGA) separated by a whole number of nucleotide triplets, the question arises as to whether this DNA stretch is coding or not. Hereafter, we will refer to these DNA stretches as "open reading frames" (ORF).
ORFs are expected to be shorter in DNA sequences with AT (Adenine + Thymine) levels 50% for the obvious reason that A and T are more frequent in stop codons than G. Since there are three stop codons and 61 amino acid codons, (3:61) a stop codon occurs with a probability of approximately one in twenty (1:20). Furthermore, given three base pairs per codon, this should lead to one stop codon every sixty base pairs, in which A, C, G or T are equally likely to occur. Therefore, one would expect the ORF size to be around 60 bp. Of course, the frequency of stop codons may vary significantly depending upon the local nucleotide composition (see below in the section of Results). However, one could say that the probability of an ORF being a coding sequence increases with its size. Most proteins are larger than 100 codons (300 bp) and their ORFs should be, therefore, relatively easy to classify. Unfortunately, the coding sequences (CDS) of eukaryotes are split up by the non-coding DNA of introns leaving coding stretches (exons) 300 bp.
The physico-chemical constraints on proteins induce specific usage of nucleic triplets that can be efficiently detected by Markov Models. 1 Investigating the evolutionary origin of the genetic code, Ikehara et al 2 showed that it may have originated from a four-amino acid system, the GNC code. This GNC code (G for Guanine, N for any of the 4 nucleotides, C for Cytosine) is able to encode GADV-proteins (G for Glycine, A for Alanine, D for Aspartic acid, V for Valine) with appropriate three-dimensional structures, being water soluble globular proteins (hydropathy, α-helix, β-sheet, and β-turn) and also having catalytic activities. 3 According to Ikehara et al, 2 this primitive code would have evolved first in a code with 16 codons and ten amino acids, the so called SNS (S for strong: G or C) and then in the RNY (R for purines, Y for pyrimidines) ancestral codon suggested by Shepherd. 4 Consequently, the coding DNA is characterized by at least two fundamental features: (i) the absence of the in-frame stop codon and (ii) a higher purine frequency in 1st position of codons 4 that we called the 'purine bias' (Rrr).
Next, we investigated the contribution both of Rrr bias and also of stop codon distribution in the classification of coding vs non-coding ORFs. Our methodology is designed for the diagnosis of coding ORFs in small DNA sequences in the size range 200 to 1000 bp with the assumption that they contain a single coding region. Larger sequences where multiple coding regions are expected would need to be investigated with a sliding window. The procedure involves four steps: (i) Extracting all ORFs from the six frames of a given DNA fragment. (ii) Attributing a putative coding strand to these ORFs. (iii) Eliminating those ORFs without the purine bias of CDSs. (iv) Selecting the largest of these ORFs and declaring it as CDS. To eliminate false positives due to very small ORFs, we filtered them out by setting a minimal size threshold. Consequently, ORFs are simply classified as non-coding when they do not match the Rrr bias above a given size threshold.
Exploring CDSs and introns among six model species covering the whole spectrum of codon usage in eukaryotes, we found that the strand diagnosis is 95% at 350 bp and that the success rate of the coding diagnosis is 98%. However, we found that 18% of the CDSs whose size is 350 bp may not be detected. Tightening up our classification for "true" coding DNA is possible, but would affect the number of ORFs effectively retrieved.

Coding features
We revisited the contribution of purines to coding sequences (CDS) by computing the relative frequency of the four nucleotides Adenine, Cytosine, Guanine and Thymine (A, C, G and T) in the three positions of triplets and the six frames (the three frames on both plus and minus strands). All relative frequencies of this study were calculated as the ratio of a given occurrence to the number of contiguous triplet N = n/3 where n is the nucleotide number in the sequence. The relative nucleotide frequencies were denoted P i with i ∈{A, C, G, T}. The contribution of purines (A and G) was evaluated in the three positions j ∈{1, 2, 3} of triplets by computing both the sum (P A1 + P G1 , P A2 + P G2 , P A3 + P G3 that we noted AG1, AG2, AG3, respectively, in the following) and the product (P A1 P G1 , P A2 P G2 , P A3 P G3 ) of their relative frequencies over the six frames k ∈{-1, -2, -3, +1, +2, +3}. We also computed the relative frequency of stop codons TAA, TAG, TGA (P STOP ) and the product of relative frequencies of C, G and A in the three consecutive positions of triplets, i.e. (P C1 P G2 P A3 ), (P G1 P A2 P C3 ), and (P A1 P C2 P G3 ), over the six frames.
Using the frequencies just described, we set up five features for the diagnosis of coding ORFs as follows: (i) The quantity f 1 = 1-P STOP . If we consider the example of a coding sequence, f 1 is equal to 1 in frame k = +1 since there is no in-frame stop codon within the coding frame of a coding sequences and since we defined the ORF as a DNA stretch between two stop codons separated by a whole number of nucleotide triplets, or alternatively as a DNA stretch between a sequence extremity and a stop codon separated by a whole number of nucleotide triplets. By contrast, f 1 is expected 1 in non-coding frames because there is no constraint against stop codons in these frames. The value of f 1 in non-coding frames is expected to decrease with the size of the coding sequence at a rate that is proportional to its AT level. (ii) We also found that the statements P C1 P G2 P A3  P G1 P A2 P C3 and P C1 P G2 P A3  P A1 P C2 P G3 are generally true (93% of the cases) in frame k = +1. Therefore, the features f 2 = 1-P C1 P G2 P A3 and f 3 = P G1 P A2 P C3 -P C1 P G2 P A3 + P A1 P C2 P G3 -P C1 P G2 P A3 are also positive and maximum in most coding frames (see below). (iii) As stated above, the coding sequences are characterized by a purines bias. 4 therefore, one has P A1 P G1  P A2 P G2 and P A1 P G1  P A3 P G3 in frame k = +1 and the quantity f 4 = P A1 P G1 -P A2 P G2 + P A1 P G1 -P A3 P G3 should be positive and have its maximum in frame k = +1. (iv) A significant proportion of GC-rich CDSs are deprived of a stop codon on more than one frame over large sequence sizes (300 bp). However, most CDSs with GC  55% are also P G1 P C1  P G2 P C2 (see below). We took this into account by calculating the feature f 5 = P G1 P C1 -P G2 P C2 .
The procedure of coding ORF diagnosis described here involves the following steps: (i) the diagnosis of the coding strand, (ii) the identification of the ORFs that have a purine bias similar to that of CDSs and (iii) the extraction of the largest of these putative coding ORFs.

Strand classification
We tested the success rate of coding strand classification on the 5' and 3' sides of CDSs. For this, we extracted sequence pieces whose sizes varied between 50 and 600 bp from both CDS extremities. We then calculated the quantity S = f 1 + f 2 for all ORFs over the six frames of each of these CDS pieces. The sequences corresponding to frames k = -1, -2, -3 (the minus strand) were converted in their equivalent k = +1, +2, +3 in order to evaluate all sequences in their 5'-3' orientation. An ORF from the plus strand was considered potentially coding when the maximum of S was found for a frame of the plus strand, i.e. frames +1, +2, +3. Similarly, an ORF from the minus strand was considered potentially coding when the maximum of S was found for a frame of the minus strand, i.e. frames -1, -2, -3. When an ORF from the plus strand corresponded to the maximum of S for a frame of the minus strand, i.e. -1, -2, -3, and vice versa, the ORF was eliminated from the list.

Coding vs. non-coding classification
The ORFs selected as described above must then be confirmed for their coding potential. We classified a sequence as coding or non-coding (intron) by scoring the purine bias. For this, we calculated the maximum of the quantity C = f 1 + f 3 + f 4 over the six frames k. When C was higher than a threshold the sequence was classified coding, when lower, non-coding. The threshold value was found to be 1.05. We slightly improved the success rate of C in GC-rich sequences by calculating the maximum of the quantity C = f 1 + f 3 + f 4 + f 5 over the six frames when the GC level of the sequence was 55%, otherwise we calculated the maximum of the quantity C = f 1 + f 3 + f 4 , as described above.

Minimum OrF size for coding diagnosis
Considering a DNA sequence, its largest ORF (LORF) is not necessarily the coding one. For instance, considering the sequence of an expressed sequence tag (EST) from the 3' end of a cDNA, an ORF in the 3' UTR (non-coding) can be larger than the piece of coding sequence that it contains. However, the largest ORF among the ORFs that are classified coding (LcORF) has a higher probability of being actually coding. Here, we consider "coding" ORFs to be those with the Rrr bias of CDSs. Thus, LcORF, (i.e. the largest of the ORFs with Rrr bias) has higher probably to represent the actual coding ORF of a DNA segment. ORFs containing around 150 to 200 bp with Rrr bias are relatively common in introns. Intronic LcORFs are therefore a potential source of false positives. We investigated their size distribution in comparison to that of LORFs among the six frames of introns. The comparison of LORF and LcORF distributions is informative concerning the gain in sensitivity that is achieved by taking the Rrr bias into account for coding ORF diagnosis. Of course, the strategy of selecting the LcORF as the only coding ORF candidate eliminates the possibility of detecting coding ORFs that would overlap on the plus and minus strand. This has been done deliberately to simplify the experimentation and does not alter our conclusions.

Algorithm
The procedure outlined above can be summarized in the following algorithm: 1. Load the sequence, 2. Scan the three frames in the "+" and "-" (the complementary) strands, 3. Construct a table with the ORFs of the three frames by splitting the corresponding sequence according to stop codons for "+" e "-" strands, 4. For each strand, scan the ORF table and: • measure the ORF size, • if the ORF is larger than the selected size threshold:  calculate the f 1 , f 2 , f 3 , f 4 , and f 5 in the six frames of the ORF under analysis,  search among the six frames the one that corresponds to the maximum of S,  if the maximum occur for a frame 3, the strand is declared "+",  continue if the strand is declared "+", otherwise analyze the next ORF, 05, the ORF is declared "coding", 5. Chose the largest (LcORF) among "+" and "-" ORFs.

sequence material
Given that this study tends to be a reference case, we built datasets with CDSs of six model species covering the complete range of GC levels in 3rd positions of codons (GC3) and sequence complexity in eukaryotes. We chose GC3 as a criterion to evaluate codon usage diversity. Because of degeneration in the genetic code affecting 3rd position of codons, it is here that both variation in GC and also codon usage is the most extensive. Codon usage has been proven to interfere with the efficiency of gene prediction. It is the main factor explaining why algorithms based on machine learning must be trained. Therefore, a fundamental issue in gene prediction concerns the degree of codon variation which exists between species, as seen in these reference sequences. To avoid interferences with false positives of predicted genes, we filtered out the CDSs that were not experimentally validated through a peer reviewed publication in order to avoid the possible contribution of annotation errors. Among the species considered here, Plasmodium falciparum (CDS = 197, GC3 = 0%-30%) is extremely GC-poor 5 while Chlamydomonas reinhardtii (CDS = 102, GC3 = 60%-100%) is extremely GC-rich. 6 these two species stand at opposite ends of the spectrum of eukaryote GC3 variation. Arabidopsis thaliana (CDS = 1,206, GC3 = 25%-65%) has a genome whose GC level 7 is representative of core dicots 8 while Oryza sativa (CDS = 401, GC3 = 25%-100%) is a species representative of Gramineae. The common ancestor of this plant family underwent a transition of nucleotide composition. 8,9 The consequence of this transition is that the genes of this species are shared in two classes with two different codon usages. This feature confounds gene prediction in this species. 9,10 D. melanogaster (CDS = 1,262, GC3 = 40%-85%) is a species that also underwent a transition of nucleotide composition among insects. 11 Finally, H. sapiens (CDS = 1,199, GC3 = 30%-90%) is representative of warm-blooded vertebrates. 12 Because of the transition of nucleotide composition that occurred in mammals, the genes of H. sapiens are shared in five different classes. 13 Another important sequence feature for the purpose of gene prediction is sequence entropy 14 since its increase may lead to decrease the level of the Rrr bias.
Complete nuclear CDSs of the above species were retrieved from GenBank (release 137, August 15th, 2003) and filtered according to Carels et al 9 in order to eliminate redundancy and potentially false positive or doubtful genes resulting from wrong in silico predictions. The sequences all started with ATG and ended with a stop codon and none included in-frame internal stop codon.
We also built datasets of CDS fragments (frame + 1) of the six model species with fixed sizes of 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600 bp extending from (i) the first ATG until the desired sequence size and from (ii) the 3' side (next to the stop codon, but excluding it).
We tested the success rate of exon/intron classification with the CDS samples just described and the samples of intron sequences of A. thaliana

Results
According to Shepherd, 4 we found that the purine level is the highest, on average, in the 1st position of codons of all six species (data not shown). Therefore, we denoted this purine bias by Rrr. However, the difference between the product of purine probabilities in 1st and 2nd positions was higher than that between the sum (%) of these probabilities.
The product of purine probabilities was, on average, P A1 P G1 = 0.09 and P A2 P G2 = 0.05. Both values are remarkably conserved among distant species whatever their average GC level (Fig. 1). Two peaks of purine distribution in 3rd position of codons were found for rice (Fig. 1A). One, centered on P A3 P G3 = 0.015, is characteristic of extremely GC-rich genomes such as C. reinhardtii (Fig. 1E). The second peak centered on P A3 P G3 = 0.050, is common to the other genomes (Table 1). Table 1 shows that the product of purine probabilities in 3rd codon position is close to 0 for extremely GC-rich CDSs. Despite its extremely high AT composition, P. falciparum also shows the Rrr bias (Fig. 1F). The Rrr bias promotes purine compensation between the three positions of codons ( Table 2). The intensity of these compensations changes according to the species. It is interesting to note that in contrast to A, G does not show correlation between 1st and 2nd positions of codons in any of the six species.
In agreement to Figure 2, P A1 and P G1 are relatively constant over species except in P. falciparum where both purines obviously compensate each other. The absence of correlation between P A1 and P G1 in H. sapiens and D. melanogaster (Table 2) is not surprising since their distributions overlap closely. The correlation between P A1 and P A2 is more surprising since they also overlap closely. This shows that the correlation can be significant over a very small range of variation in base composition. By contrast, the absence of correlation between P G1 and P G2 is surprising since the relationship between these two bases is such that P G2 is lower than P G1 in every species. The difference between P G1 and P G2 is larger than that between P A1 and P A2 (Fig. 2). We also found negative correlations between P A1 P G1 and GC3 (-0.37), on the one hand, and between AG1 and GC3 (-0.35), on the other hand. The major contribution to these correlations is due to A1 since the correlation between P A1 and GC3 was -0.57 while that between P G1 and GC3 was 0.20. The negative correlation of purines in 1st position of codons and GC3 shows that the purine bias Rrr tends to be weaker for GC-rich genes. Other interesting regularities that can be derived from Figure 2 are that P C1 , P G2 and P A3 are lower than their respective probabilities in other positions of codons. A3 is clearly compensated by C3 as appears from negative correlation between A3 and C3 (r = -0.9, data not shown). This is shown at Figure 3 where the overlap between P C1 P G2 P A3 , P G1 P A2 P C3 and P A1 P C2 P G3 is only 7% of the CDSs of the six species considered together. This property of CDS is the consequence of the Rrr bias. It is essential for the diagnosis of the coding strand in GC-rich sequences. However, it must be used in combination with stop codon distribution to allow sufficient success rate (see below).
The bias in stop codon distribution introduced by the coding frame is not satisfactory for a secure diagnosis of the coding strand when GC-rich CDSs are small (Fig. 4). The success rate of coding strand diagnosis using stop codons only depends on the average level of AT. Short GC-rich sequences (O. sativa and C. reinhardtii) can be deprived of stop codon in non-coding frames as well. Therefore, the quantity S = f 1 + f 2 allows much more accurate coding strand diagnosis S = f 1 (Fig. 5).
However, the power of this simple function for the classification of exons and introns is low (data not shown). We found a solution to this problem by measuring the asymmetry introduced by the Rrr bias. The asymmetry of GC-poor CDSs (GC  55%) can be scored with the quantity C = f 1 + f 3 + f 4 . When CDSs are GC-rich (GC  55%) as occurs in O. sativa and C. reinhardtii, a success rate higher by 4%-5% (data  (Figs. 6, 7). Figure 6 shows the performance of the classification of introns and CDSs with increasing sequence size. Three different intron sources were plotted in Figure 6: A. thaliana, H. sapiens and D. melanogaster. The intron distribution of A. thaliana is the most homogeneous among the three and, therefore, A. thaliana is the species with the highest success rate of intron/exon classification among the three species tested. For the purpose of clarity, we group the CDSs of the six species all together. The overlapping area (Fig. 6) concerns the sequences for which the intron/exon classification cannot be trusted. The classification threshold can be chosen according to two strategies: optimize the error rate or maximize true positives. Considering Figure 6, the plain vertical line is for the threshold at 1.05 (see also Fig. 7). With a threshold of 1.05, the proportion of exons that are classified as introns (false negatives) is 10% at 200 bp and 7% at 600 bp. On the other hand, the proportion of introns that are classified as exons (false positives) is between 8% (A. thaliana) and ~15% (H. sapiens, D. melanogaster) at 200 bp and between 0 and 3% at 600 bp (Fig. 7). The error due to false positives decreases more rapidly than that due to false negatives.

not shown) is obtained with the quantity
We found that the largest ORF (LORF) in introns of A. thaliana, D. melanogaster and H. sapiens are between 200 and 250 bp, on average (Fig. 8). The distribution of the largest ORFs showing the purine bias (LcORF) peaks at 100 bp in all three species and trails off towards ~300 bp in Arabidopsis and Drosophila. In humans, the LcORF distribution trails until ~400 bp (the bar at 500 bp in the LORF distribution most probably indicating the dataset contamination by CDSs. According to this speculation, the contamination rate could be as high as 8%). If we consider 2.5% as an acceptable rate of false positives in intron/exon classification, LcORFs   from A. thaliana can be considered coding in 97% of the cases provided that they are 300 bp (Fig. 8).
The size threshold for LcORFs of A. thaliana under the success rate of 95% is ~240 bp, which results in a gain of ~60 bp in sensitivity. According to the same criteria, the size threshold above which LcORF classification is reached with a 95% success rate is (i) between 150 and 200 bp for P. falciparum and C. reinhardtii, (ii) 300 bp for D. melanogaster and (iii) 350 bp for H. sapiens.

Discussion
The methodology presented here is an attempt to understand the features of coding sequences that allow their classification independently of the species.
We investigated a set of model species that cover the entire range of codon usage and sequence complexity in eukaryotes. The unicellular Plasmodium falciparum is extremely rich in AT while Chlamydomonas reinhardtii is, by contrast, extremely rich in GC. This warrants the coverage of the complete codon usage. Arabidopsis thaliana has an average base composition that is representative of the dicots and monocot plant species. Rice is representative of the Gramineae family that has the particularity of having two gene classes one with a codon usage typical of angiosperms in general and one that is extremely GC-rich as in C. reinhardtii. Drosophila melanogaster and Homo sapiens are two species that demonstrate a compositional transition in their respective common ancestor. 11,12 For this reason, they are expected to be more heterogeneous in their sequences.
Despite the enormous genetic distance between these species, we found a common model for their coding sequence (CDS). The model is based on the stop codon distribution and on the purine bias (Rrr) in CDSs. The purine bias has been claimed to be a universal feature of CDSs 4 that could help to classify them in the process of gene finding. However, the purine bias has also the corollary that P C1 P G2 P A3 reaches its minimum value in the coding frame of CDSs. As far as we know, this feature has not been described before, but it is essential for the successful diagnosis of CDSs using the purine bias as proposed by Shepherd. 4 the P C1 P G2 P A3 bias results from the nucleotide compensations that occur in the CDSs with the effect of generating a higher abundance of purine in 1st position of codons than in the two other positions (Rrr). The compensation occurs in such a way that A1 is more abundant in AT-rich (P. falciparum) and G1 is more abundant in GC-rich (C. reinhardtii) genomes. This is obvious from the negative correlation (-0.57) between A1 and GC3 and from the positive correlation between G1 and GC3 (0.20). However, whether AT-rich or GC-rich, G is more abundant in 1st than in 2nd position of codons. 2 This can be regarded as a remnant of the GNC ancestral codon. 2 this feature is essential since it is conserved in P. falciparum. However, in the particular case of this species a substantial number of codons take A1 in place of G1. The absence of correlation between P G1 and P G2 by contrast to the correlation between (i) P A1 and P A2 and (ii) P A1 P G1 and P A2 P G2 suggests that different constraints act on A and G. Reasons for this can be found in the universal correlation. 15 Actually, Rrr is a feature that allows the measure of codon asymmetry in CDSs as does the CSF function. 16 The reason for codon asymmetry in CDSs is not trivial. There is the same number of RNN and YNN codons in the genetic code. The larger frequency of Rrr in CDSs is due to the proteomic code. To sum up, it is the consequence of constraints acting on secondary and 3D protein structures. 17 When used alone, the purine bias Rrr allows coding frame detection with only ~84% success rate (data not shown). The most important source of frame confusion is from frame -1. An explanation for this is found in Biro's review. 17 Complementary codons often encode complementary amino-acids that are involved in 3D protein folding. The balance of sense and antisense codons is close to the equilibrium, which justifies an error rate of ~15% on the coding frame diagnosis by Rrr. For this reason, Rrr should be used only for the coding diagnosis and not for the strand diagnosis.
In AT-rich sequences, the bias of stop codon(s) distribution among frames is sufficient to allow the elimination of most frame ambiguities in sequences 350 bp. In GC-rich sequences (0.55% GC), the introduction of the condition P G1 P C1  P G2 P C2 in combination to the P C1 P G2 P A3 and stop codon(s) biases is necessary. The probability of stop codons is too low in GC-rich ORFs ~350 bp to allow unambiguous frame diagnosis. Fortunately, P C1 P G2 P A3 compensates for this lack of specificity. In addition, the condition P G1 P C1  P G2 P C2 combined with the conditions P A1 P G1  P A2 P G2 and P A1 P G1  P A3 P G3 compensates for the negative correlation between A1 and GC3 with the consequence that the success rate of exon/intron classification remains at a high level.
The purine bias induced by the physico-chemical properties of proteins is sufficient to classify CDSs from introns with a success rate 95% above 350 bp. the threshold of 95% success rate is found at lower ORF size in AT-rich sequences. This suggests a positive correlation between the exon size and their GC level. This correlation has been detected in plants 18 and vertebrates. 19 The different success rates of exon/intron classification between A. thaliana, on one hand, and H. sapiens, D. melanogaster, on the other hand, are apparently due to intrinsic difference of base composition. The difference of GC level between introns and exons was found to be higher, on average, in A. thaliana (5% to 15%-30%), 7 than in H. sapiens, 20 D. melanogaster (5%). In addition, the vast majority of plant introns are GC-poor, 18 which is not the case in H. sapiens and D. melanogaster.
The features analyzed in this study allow an improvement to the sensitivity of exon vs intron classification by 50 to 150 bp at small ORF sizes compared to other methods, i.e. the Average Mutual Information from Grosse et al 14 and the CSF function from Nikolaou and Almirantis, 16 which claim to be independent of codon usage, and which do not need a training step. However, the substantial difference is that these aforementioned methods predict neither the strand nor the coding frame. In consequence, we believe that our method could be helpful in the extraction of coding ORFs from ESTs and/or from metagenomic reads. It could also help in the preparation of training set for ab initio gene prediction with machine learning algorithms in those genomes for which little information is available.