Major Revisions in Arthropod Phylogeny Through Improved Supermatrix, With Support for Two Possible Waves of Land Invasion by Chelicerates

Deep phylogeny involving arthropod lineages is difficult to recover because the erosion of phylogenetic signals over time leads to unreliable multiple sequence alignment (MSA) and subsequent phylogenetic reconstruction. One way to alleviate the problem is to assemble a large number of gene sequences to compensate for the weakness in each individual gene. Such an approach has led to many robustly supported but contradictory phylogenies. A close examination shows that the supermatrix approach often suffers from two shortcomings. The first is that MSA is rarely checked for reliability and, as will be illustrated, can be poor. The second is that, to alleviate the problem of homoplasy at the third codon position of protein-coding genes due to convergent evolution of nucleotide frequencies, phylogeneticists may remove or degenerate the third codon position but may do it improperly and introduce new biases. We performed an extensive reanalysis of one such “big data” set to highlight these two problems, and demonstrated the power and benefits of correcting or alleviating them. Our results support a new group with Xiphosura and Arachnopulmonata (Tetrapulmonata + Scorpiones) as sister taxa. This favors a new hypothesis in which the ancestor of Xiphosura and the extinct Eurypterida (sea scorpions, of which many later forms lived in brackish or freshwater) returned to the sea after the initial chelicerate invasion of land. Our phylogeny is supported even by the original data when processed with a new “principled” codon degeneration. We also show that removing the 1673 codon sites with both AGN and UCN codons (encoding serine) in our alignment can partially reconcile discrepancies between nucleotide-based and AA-based trees, partly because two sequences, one with AGN and the other with UCN, would be identical at the amino acid level but quite different at the nucleotide level.
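The serine-codon filtering mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (mixed_serine_sites, drop_codon_sites) and the toy sequences are hypothetical, and the input is assumed to be an in-frame codon alignment of equal-length sequences. Under the standard genetic code only AGU and AGC (AGY) encode serine, so the sketch uses AGY by default and leaves the AG family configurable as an assumption.

```python
# Hypothetical sketch: find codon columns that contain serine codons from
# both families (UCN and AGY) and drop them from a codon alignment.
UCN = {"UCU", "UCC", "UCA", "UCG"}
AGY = {"AGU", "AGC"}  # assumption: standard genetic code


def mixed_serine_sites(seqs, ag_family=AGY):
    """Return 0-based codon indices where both serine families occur."""
    n = len(seqs[0])
    if n % 3 or any(len(s) != n for s in seqs):
        raise ValueError("sequences must be aligned, in frame, equal length")
    # Compare in RNA notation so that T and U are treated alike.
    norm = [s.upper().replace("T", "U") for s in seqs]
    sites = []
    for i in range(0, n, 3):
        column = {s[i:i + 3] for s in norm}
        if column & UCN and column & ag_family:
            sites.append(i // 3)
    return sites


def drop_codon_sites(seqs, sites):
    """Return the sequences with the listed codon columns removed."""
    drop = set(sites)
    return [
        "".join(s[i:i + 3] for i in range(0, len(s), 3) if i // 3 not in drop)
        for s in seqs
    ]


# Toy data (not from the paper): codon 1 mixes TCA/TCG with AGC.
toy = ["ATGTCAAAA", "ATGAGCAAA", "ATGTCGAAG"]
sites = mixed_serine_sites(toy)
filtered = drop_codon_sites(toy, sites)
```

In the toy alignment the second codon column is flagged because it holds both TCA/TCG and AGC, which are identical at the amino acid level but differ at two or three nucleotide positions.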

between the first nine species and the last two species, with the two Archeognatha species (PsaARCHEO for Pedetontus saltator and MbaARCHEO for Machiloides banksi) and a copepod (A369COPE for Acanthocyclops vernalis) being different. However, the last codon in red (Fig. S1) is a lysine codon in all sequences, and the second last is a threonine codon in all but one sequence (A369COPE). The evidence of homology is strong among these codon sites, so they should be aligned as shown in the bottom of Fig. S1.
A similar situation is shown in the top panel of Fig. S2, where the alignment from Regier et al. (2010) introduced an alignment artefact that increases the distance between the first pycnogonid (TorPYCNO for Tanystylum orbiculare) and the three other pycnogonid species. The 3-nt deletion in the first sequence (TorPYCNO) is clearly misplaced: the alignment in the bottom of Fig. S2 scores higher under any reasonable scoring scheme. For example, we may evaluate the two MSAs in Fig. S2 by the sum-of-pairs (SP) criterion (Lipman, et al. 1989; Gupta, et al. 1995; Stoye, et al. 1997; Reinert, et al. 2000; Althaus, et al. 2002) without penalizing shared gaps. With a match score of 2, transitions and transversions penalized by -1 and -2, respectively, and a gap penalty of -3, we obtain an SP of 169 for the top MSA and of 248 for the bottom MSA in Fig. S2. Thus, the bottom MSA is better than the top MSA. In particular, for the top MSA, the alignment score for TorPYCNO and AeliPYCNO is 13 and that for AeliPYCNO and Col2PYCNO is 29, suggesting that AeliPYCNO is more closely related to Col2PYCNO than to TorPYCNO. In contrast, for the bottom MSA, the alignment score for TorPYCNO and AeliPYCNO increases to 45, and that between AeliPYCNO and Col2PYCNO remains unchanged (29, because shared gaps are not penalized), suggesting that AeliPYCNO is more closely related to TorPYCNO than to Col2PYCNO, which is consistent with other parts of the MSA.
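The SP criterion with the scoring scheme above can be sketched as follows. This is an illustrative sketch rather than the code used for the reported scores: the function names are hypothetical, the toy alignment is not the one in Fig. S2, and the gap penalty is assumed to apply to every column in which exactly one of the two sequences has a gap (the text does not say whether the penalty is per position or per gap event).

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def column_score(a, b, match=2, transition=-1, transversion=-2, gap=-3):
    """Score one aligned pair of characters (A, C, G, T/U or '-')."""
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    if a == "-" and b == "-":
        return 0  # shared gaps are not penalized
    if a == "-" or b == "-":
        return gap  # assumption: linear, per-column gap penalty
    if a == b:
        return match
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return transition
    # Any other mismatch (including ambiguity codes) counts as a transversion.
    return transversion


def pairwise_score(s1, s2, **scores):
    """Alignment score of two rows taken from the same MSA."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    return sum(column_score(a, b, **scores) for a, b in zip(s1, s2))


def sum_of_pairs(msa, **scores):
    """Sum of pairwise alignment scores over all pairs of rows."""
    return sum(pairwise_score(s1, s2, **scores)
               for s1, s2 in combinations(msa, 2))


# Toy alignment (not from Fig. S2).
toy_msa = ["ACGT-A", "ACGTTA", "ATG--A"]
toy_sp = sum_of_pairs(toy_msa)
```

For the toy alignment the three pairwise scores are 7, 2 and -1, giving an SP of 8. Comparing two candidate MSAs of the same sequences then amounts to comparing their SP values, and the pairwise scores show which pairs gain or lose from a realignment.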
Because of the high divergence among arthropod sequences, some parts of the MSA were deemed unalignable by Regier et al. (2010) and removed from the translated amino acid sequences before the final phylogenetic analysis, e.g., the shaded segment in Fig. S3a. This deletion is unnecessary because sequence homology is identifiable, as shown in Fig. S3b.
Deleting phylogenetically significant signals reduces phylogenetic resolution. However, the deletion of unalignable segments by Regier et al. (2010) is not consistent. While the shaded segment in Fig. S3a is deleted (Regier, et al. 2010, their Supplemental file nature08742-s4AA.nex), the undesirable alignment in Fig. 1a remains in their degenerated sequence file (nature08742-s3Degen1.nex) used to generate their main results in their Figure 1. Thus, their nucleotide sequences and amino acid sequences are not quite comparable. We took the space to show these contrasts because Regier et al. (2010) is not the only paper with sequence alignment problems. Phylogeneticists often implicitly assume that phylogenetic distortion introduced in sequence alignment will be negligible relative to the true phylogenetic signals that remain (which may be true in most cases, but not always).

II. Measure the degree of sequence alignment improvement
We realigned the 68 gene segments in Regier et al. (2010) with MAFFT (Katoh, et al. 2009) and MUSCLE (Edgar 2004a, b). These two programs generally produce better multiple sequence alignments (MSAs) than Clustal (Thompson, et al. 1994).