Simulation of Chordate Intron Evolution Using Randomly Generated and Mutated Base Sequences

Introns are well known for their high variation not only in length but also in base sequence. The evolution of intron sequences has aroused broad interest in the past decades. However, very little is known about the evolutionary pattern of introns due to the lack of efficient analytical method. In this study, we designed 2 evolutionary models, that is, mutation-and-deletion (MD) and mutation-and-insertion (MI), to simulate intron evolution using randomly generated and mutated bases by referencing to the phylogenetic tree constructed using 14 chordate introns from TF4 (transcription factor–like protein 4) gene. A comparison of attributes between model-generated sequences and chordate introns showed that the MD model with proper parameter settings could generate sequences that have attributes matchable to chordate introns, whereas the MI model with any parameter settings failed in doing so. These data suggest that the surveyed chordate introns have evolved from a long ancestral sequence through gradual reduction in length. The established methodology provides an effective measure to study the evolutionary pattern of intron sequences from organisms of various taxonomic groups. (C++ scripts of MD and MI models are available upon request.)


Introduction
Introns are nucleotide sequences interrupting the coding regions (exons) in a gene, which are frequently seen in eukaryotic protein-coding genes. The evolution of intron sequences has aroused broad interest in the past decades. However, 2 contrary hypotheses, namely, the "introns-early" versus the "introns-late" theories, put forward to explain the evolutionary mechanisms of introns are still under debate. Introns-late theory proposes that introns are an innovation of eukaryotes and intron gain has been a continuous process during the evolution of eukaryotes. 1,2 This theory is supported by the facts that all current prokaryotic genes are free of spliceosomal introns, and intron number and length in eukaryotes increase with the complexity of organisms. [2][3][4] Introns-early theory holds that introns already existed in ancient ancestor prokaryotes and intron loss allowed the current organisms to have intronless or intronpoor genomes. [5][6][7] This theory is supported by the facts that the ancestral eukaryotic forms contained intron-rich genomes [8][9][10] and the evolution of eukaryotic genes primarily involves intron loss with only a few episodes of intron gain. [11][12][13][14] In recent years, more and more data have been obtained to favor the introns-early theory. [8][9][10][11][12][13] However, the proponents of introns-late theory have still not been persuaded. The main reasons, as we understand, come from 2 aspects. One is that no evidence has been obtained to prove the existence of introns in ancestral bacterial protein-coding genes. The other is that introns-early theory cannot explain why intron number and length increase with the complexity of eukaryotic organisms. It must be confessed that addressing these 2 concerns is confronted with great difficulty. First, all ancestral bacteria are not available today. Thus, no ancient bacterial gene samples are available for examination about intron existence. Yet, this problem could be circumvented to some extent by searching and examining horizontally transferred bacterial genes harbored in eukaryotes. Using this approach, we have found an intron-containing bacterial gene harbored in sea anemone, which suggests possible existence of introns in ancestral bacterial genes. 15 Second, the lengths and base sequences of introns vary greatly across various organisms. With a group of introns having different lengths, currently there is no measure to find out whether they are evolved from a longer ancestral sequence through gradual reduction in length or from a shorter ancestral sequence through gradual increase in length. Phylogenetic analysis has been widely used to infer evolutionary patterns of gene and protein sequences. However, it is inefficient for studying the evolutionary pattern of introns because the phylogenetic tree formed by intron sequences generally has very poor statistical support, based on which no evolutionary pattern can be inferred to explain how intron sequences have evolved.
Sequence simulation is an important measure for evolutionary studies. A considerable number of statistical models and methods have been developed and tested for inferring the evolutionary relationship of nucleotide and protein sequences. [16][17][18] The established models are effective in simulating evolution of real sequences, 19,20 in establishing databases of simulated protein alignments, 21 and in exploring early events in the ecological differentiation of bacterial genomes. 22 However, these methods were mostly developed for simulation of gene or protein sequences which possess high conservatism. They are not suitable for simulation of intron sequences that have undergone high number of 2 Evolutionary Bioinformatics nucleotide mutation and large pieces of nucleotide deletion or insertion. Therefore, in this study, we designed and constructed 2 types of evolutionary models, that is, mutation-and-deletion (MD) and mutation-and-insertion (MI), to simulate the evolution of introns from a chordate gene using randomly generated and mutated sequences. Then, we compared attributes of modelgenerated sequences with real chordate introns. The results show that the MD model with proper parameter settings could generate sequences that have attributes matchable to chordate introns, whereas the MI model with any parameter settings failed in doing so. These results suggest that the surveyed chordate introns should have evolved from a longer ancestral sequence through gradual reduction in length. The established methodology provides an effective measure to study the evolutionary pattern of introns from organisms of specific taxonomic groups.

Chordate introns and their attributes
The gene segment coding for bHLH (basic helix-loop-helix) motif of TF4 (transcription factor-like protein 4) has only 1 phase zero intron in chordates. Exon sequences flanking this intron are highly conserved. This intron was selected as the research subject to ensure that the introns of different chordates come from the common ancestor. This intron is 112 to 1975 bases long in the 14 species chosen to represent various classes of chordates (Table 1). These 14 intron sequences were aligned using Muscle program 23 first and then loaded into MEGA 5.2 software 24 to generate the original phylogenetic tree ( Figure 1A) using maximum likelihood (ML) algorithm.
Afterward, MEGA 5.2 and the constructed ML tree were used to determine 5 attributes of these chordate introns, namely, L MSA (length of multiple sequence alignment), R K2+I (ratio of transition to transversion under K 2+I parameter model 25 ), D ̅ (overall mean distance), SE D ̅ (standard error of the overall mean distance), and TS ML (topology score of the constructed ML tree), which were found to be 2139 bases, 1.53, 1.084, 0.089, and 28, respectively. Detailed steps for determining these attributes are described in Supplementary File 1.

Simulation of intron evolution
To simulate intron evolution, we assume that the 14 chordate introns are evolved from 1 common ancestral sequence (AS 1 ) through gradual reduction/increase in length accompanied by base mutation ( Figure 1A). The AS 1 sequence was generated using a C++ program, in which 4 integers (1, 2, 3, and 4) were generated at random and were referred to bases A, G, T, and C, respectively. After a sequence of demanded length was obtained, its first and last 2 bases were replaced by GT and AG to mimic the structure of an intron. Starting from the AS 1 sequence, the formation of each chordate intron can be divided into separate stages. For instance, intron Cm376 (an intron of 376 bp from the elephant shark, Callorhinchus milii) was evolved from AS 2 , and AS 2 itself was evolved from AS 1 . The evolutionary models we constructed have 5 adjustable parameters, that is, L AS1 (length of ancestral sequence 1), L AS12 (length of ancestral sequence 12), M 1 (mutated bases per 1 branch length), L I/D (length of bases inserted or deleted each time), and M I/D (number of bases mutated each time). Parameters L AS1 , L AS12 , and

Determination of L AS12 length
As shown in Figure 1A, there are 12 ancestral sequences for the 14 chordate introns. Among them, AS 2 to AS 4 , AS 6 , and AS 8 to AS 12 are directly ancestral to certain chordate introns. Because aS indicates ancestral sequence.

4
Evolutionary Bioinformatics our evolutionary models assume consecutive deletion or insertion during evolution, the last ancestral sequence (L AS12 ) should have a valid length to ensure simulation of intron evolution.
That is to say, in the MD model, AS 2 to AS 4 , AS 6 , and AS 8 to AS 12 must be longer than the correspondent introns evolved from them. As such, the minimum value of L AS12 was set to 2000 bases (Table 2) to ensure validity of using it to simulate the formation of intron Am1975 (the longest among 14 chordate introns). And in the MI model, AS 2 to AS 4 , AS 6 , and AS 8 to AS 12 must be shorter than the correspondent introns evolved from them. Thus, the maximum value of L AS12 was set to 140 bases ( Table 2) to ensure that AS 9 is shorter than Rt112 (the shortest among 14 chordate introns).

R value of constructed models
The transition to transversion ratio (R) is adjustable in MD and MI models. However, a fixed value is given to it prior to model running in accordance with the attributes of intron sequences for study. This value is set to 1.5 in both MD and MI models because the R value of the 14 chordate introns is 1.53 as revealed by model testing (see the first paragraph of "Materials and Methods" section).

Statistical analysis
Statistical analyses were performed using SPSS software (version 17.0). For data from orthogonally designed models, "univariate" analysis was conducted to evaluate the effects of various factors on attributes of model-generated sequences.
Significance of variance was analyzed through comparing the main effect of each factor without investigating the interaction between factors. Duncan post hoc test was used to conduct multiple comparisons for observed means, which were then plotted to view the optimal level of each factor. For data from optimized MD and MI models, "independent samples t test" was conducted to compare the difference of attributes between chordate introns and model-generated sequences.

Design and construction of evolutionary models
The ML tree obtained from previous step ( Figure 1A) was referenced for the design and construction of evolutionary models. In this tree, the first ancestral sequence (AS 1 ) was evolved to form AS 2 to AS 12 , which were then evolved to form the 14 chordate introns. The formation of each ancestral sequence or chordate intron sequence was considered as the result of both base mutation and base deletion/insertion, which are related to evolutionary distance shown above or below each branch of the tree. In other words, each sequence in the tree (except AS 1 ) was evolved from its specific AS after a target number of bases was mutated and a target length of bases was deleted or inserted. Nevertheless, base mutation and base deletion/insertion are a continuous process, meaning that a target number of bases to be mutated and a target length of base deletion/insertion should be completed at consecutive stages. Therefore, we have designed and constructed evolutionary models to simulate base mutation and base deletion/insertion alternately before the target number of bases is mutated and the target length of bases is deleted/inserted. In addition, the evolutionary models were designed to receive parameters in batch so that a set of 14 sequences could be generated at one time. Detailed flow charts for the construction of MD and MI models are shown in Figures S1 and S2, respectively. The constructed MD model assumes that the 14 chordate introns were evolved from a fairly long AS 1 sequence through a gradual reduction in length accompanied by base mutation, whereas the constructed MI model assumes that the 14 chordate introns were evolved from a fairly short AS 1 sequence through a gradual increase in length accompanied by base mutation. Both programs were compiled with C++ language.

Orthogonal tests of MD and MI models
The MD or MI model we constructed has 5 adjustable parameters. To understand the influence of different values of these parameters on attributes of the 14 model-generated sequences, we have designed tests using L 16 (4*5) orthogonal table ( Table 2)

Parameter optimization for running the MD model
In Figure 2 (Table 3). Among the 5 attributes of sequences generated by MD 17 , D ̅ value is significantly lower than that of the 14 chordate introns at P < .05 level. Here, it is to be noted that the attributes of 14 chordate introns are also presented as "M ± SD" (n = 10), which are obtained from allowing each of the 14 chordate intron sequences to mutate by only 1 base (Table 3). This was an adjustment after we became aware of the inadequacy in using 28 points as the target value for TS ML . As we have found, after each of the 14 chordate intron sequences is allowed to mutate by only 1 base, the resultant TS ML will generally drop to below 16. So, we repeated such 1-base mutation to the 14 chordate introns (using another C++ program compiled by us) for 10 times and thus obtained their attributes in the "M ± SD" form.
From Figure 2, we can see that D ̅ value is significantly affected by M 1 and L I/D . So, in test MD 18 , we used 2500 bases  (Table 3). Thus, it is concluded that the MD model with proper parameter setting can generate sequences with attributes matchable to chordate introns.  (Table 3). However, 3 attributes (L MSA , D ̅ , and TS ML ) of the model-generated sequences are significantly different from those of chordate introns. So, in test MI 18 , we changed M I/D to 11 to 20 bases while keeping other parameters unchanged. The model-generated sequences have no significant difference in L MSA but have significantly higher D ̅ value and significantly lower TS ML value. After this, we have tried various values for all 5 parameters to run the MI model (tests MI 19 -MI 24 ). It was found that 2 or 3 of the 5 attributes of sequences generated by tests MI 19 -MI 24 are always significantly different from those of the 14 chordate introns (Table 3). Thus, it is concluded that the MI model could not generate sequences with all attributes matchable to chordate introns.

Discussion
Introns are well known for their high variation not only in length but also in base sequence. So far, no common sequence structures/features have been found in introns except those containing transposable elements 26,27 and microRNAs. 28 The unavailability of common sequence features makes it very difficult to study the evolutionary pattern of introns through phylogenetic analysis because intron sequences generally yield a phylogenetic tree with very poor statistical support. This is true even when the introns are from the same location of a gene in closely related organisms. For example, the bootstrap values in ML tree shown in Figure 1A are ranging from 3 to 56 (for . With such low bootstrap support, no evolutionary pattern can be inferred for these intron sequences. As such, it is not clear whether these introns have evolved from a longer ancestral sequence or from a shorter ancestral sequence. In this study, we designed 2 evolutionary models to simulate the evolution of chordate introns. The obtained data (Table 3) demonstrate that the 14 chordate introns should have evolved from a longer ancestral sequence through gradual reduction in length accompanied by base mutation, that is, in a mutation-and-deletion pattern. According to our simulation, the 14 chordate introns probably had a common ancestral sequence of 6000 bp. This ancestral sequence could have undergone 11 to 30 bases mutation and 31 to 50 bases deletion alternately to yield the intron sequences currently existing in various chordate species, and a transition to transversion ratio of 1.5 occurred in base mutation. Although the above simulation is highly dependent on parameter settings, it does provide an effective measure for inferring the evolutionary pattern of intron sequences. At least, it can tell us whether an MD or an MI model better describes the evolutionary pattern of surveyed introns.
So far, studies about evolution of intron sequences have mainly focused on the presence or absence of introns in certain genes across various organisms. Through investigating a large number of intron-gain and intron-loss events, it has been revealed that the evolution of eukaryotic genes primarily involves intron loss. [11][12][13][14] While previous studies have made it clear that intron number has been reduced, it remains unclear whether intron length has been increased or decreased during intron evolution. Our present work provides an example of intron length reduction during the evolution of chordate TF4 gene. Theoretically, the established methodology can be used to study the evolution of introns at different numbers and lengths and from different organisms because the intron sequences for study are only used to construct the original ML tree (eg, Figure 1A), which is then used as a roadmap to set  *, **, and *** indicate significant difference from independent t test compared with FCis at P < .1, P < .05, and P < .01 level, respectively.

Evolutionary Bioinformatics
parameter values accordingly in testing the constructed evolutionary models. Prior to this study, the same models had been applied to simulate the evolution of 11 insect introns with lengths ranging from 299 to 3026 bp. The statistical result showed that the MD model could generate sequences with attributes matchable to insect introns (unpublished data). Our simulations to evolution of introns from other taxonomic groups including fishes, birds, and invertebrates are ongoing. Primary data obtained so far show that the MD model is more likely the pattern followed by intron evolution in these organisms. It is anticipated that more examples of intron length reduction will be found using this approach, showing that introns have been reduced not only in number but also in length during evolution. Then, if intron number and length have been substantially reduced during evolution, why do intron number and length increase with the complexity of eukaryotic organisms? This is the main supportive fact of introns-late theory that has not been given a rational explanation from introns-early theory. Here, we propose our explanation to this question: higher organisms are not as efficient as lower organisms in reducing number and length of introns. This explanation is consistent with frequent intron loss in yeast, 29 higher intron retention rate in human than in fruit fly and nematode, 30 reductive evolution of genomes in complex archaeal ancestor, 31 and intron-rich ancestor of eukaryotes. 32 This higher efficiency in lower organisms could be due to higher reproduction rates of lower organisms than higher organisms because frequent genome replication provides more chances for genome streamlining. 33,34 Therefore, more and longer introns existing in higher organisms are not the result of continuous intron gain, but are the result of low efficiency in reducing intron number and length. This may be considered new evidence to support the introns-early theory and to persuade the proponents of introns-late theory.

Conclusions
In this study, through designing and constructing MD and MI evolutionary models to simulate the evolution of 14 chordate introns, we found that these chordate introns should have evolved from a longer sequence through gradual reduction in length accompanied by random base mutation. Although successful simulation seems to be highly dependent on parameter settings for the constructed models, it does provide an effective measure to infer the evolutionary pattern of introns, especially in view of intron length variation. The established methodology is expected to facilitate more studies on evolutionary pattern of intron sequences from organisms of various taxonomic groups.

Author Contributions
G-DW, J-MM, Q-LH, and YW performed the overall analysis and wrote the article. YW and ZZ designed and constructed models. YW, QY, and K-PC proposed and conceived the work. All the authors discussed and put forward views on the results and approved the final article.

Supplemental material
Supplemental material for this article is available online.