Estimating the relative order of speciation or coalescence events on a given phylogeny.

The reconstruction of large phylogenetic trees from data that violates clocklike evolution (or as a supertree constructed from any m input trees) raises a difficult question for biologists-how can one assign relative dates to the vertices of the tree? In this paper we investigate this problem, assuming a uniform distribution on the order of the inner vertices of the tree (which includes, but is more general than, the popular Yule distribution on trees). We derive fast algorithms for computing the probability that (i) any given vertex in the tree was the j-th speciation event (for each j), and (ii) any one given vertex is earlier in the tree than a second given vertex. We show how the first algorithm can be used to calculate the expected length of any given interior edge in any given tree that has been generated under either a constant-rate speciation model, or the coalescent model.


Introduction
A fundamental task in evolutionary biology is constructing evolutionary trees from a variety of data.These constructed trees show the ancesteral relationship between the species.
Not only the relationship between species is of interest, but also the time between speciation events.When constructing an evolutionary tree from a set of molecular data which satisfies the molecular clock, the edge lengths can be interpreted as a time scale.In many cases, no time scale is obtained when constructing a tree though: • Often, molecular data does not satisfy the molecular clock and so the edge lengths do not represent a time scale.
• Trees can be constructed from morphological data or non-standard molecular data like gene order.This does not provide any edge lengths.
• Having several different trees, one can combine them and construct a 'supertree'.Even though there may have been time scales on the original trees, most supertree methods return a tree without a time scale.
For those trees, we still want to find edge lengths representing the time between speciation events.In this paper, we will estimate the edge lengths from the shape of the tree.The method works for trees which evolved under the Yule model [Yule, 1924, Edwards, 1970, Harding, 1971, Page, 1991].Under the Yule model, in each point of time, each species is equally likely to split.Minor changes to the method for the Yule model give us an edge length estimation for trees under the popular coalescent setting [Nordborg, 2001].
An example for a tree with unknown edge lengths is the primate supertree T p recently published in [Vos and Mooers].Figure 1 shows a part of T p .The primate tree is a supertree on 218 species and was constructed with the MRP method (Matrix Representation using Parsimony analysis, see [Baum, 1992, Ragan, 1992]).Since for most of the interior vertices, no molecular estimates were available, the edge lengths for the tree were estimated.In [Vos and Mooers], 10 6 rank functions on T p were drawn uniformly at random.For each of those rank functions, the expected time intervals, i.e. the edge lengths, between vertices were considered (the expected waiting time after the (n − 1)th event until the nth event is 1/n).The authors of [Vos and Mooers] concluded their paper by asking for an analytical approach to the estimation of the edge length, which we will provide below.
In order to estimate the edge lengths, we developed the algorithms RankProb and Compare.Those algorithms answer questions like: Was speciation event with label 76 in the primate tree (see Fig. 1) more likely to be an early event in the tree or a late event?What is the probability that 76 was the 6th speciation event?Was it more likely that speciation event 76 happened before speciation event 162 or 162 before 76?
The algorithms work for trees where every labeled history is equiprobable.This class of model, which includes the Yule model and the coalescent model, has been popular in macroevolutionary studies [Nee andMay, 1997, Zhaxybayeva andGogarten, 2004].Note that the algorithms here are the same for the Yule model and the coalescent model, whereas the edge length estimation has minor differences for the two models.
Figure 1: Part of the primate supertree.Figure 4 -13 are some subtrees, for details see [Vos and Mooers].
The algorithms RankProb, Compare and an algorithm for obtaining the expected rank and variance for a vertex were implemented in Python, see [Gernhard, 2006].

Probability distribution of the rank of a vertex
Let T be a rooted phylogenetic tree [Semple and Steel, 2003] with |V | = n leaves.The set of interior vertices of T shall be V .For a binary tree, we have The function r is called a rank function for T .A vertex v with r(v) = i is said to have rank i.Note that r induces a linear order on the set V .Further, define r(T ) := {r : r is a rank function on T }.We are interested in the distribution of the possible ranks for a certain vertex, i.e. we want to know the probability of r(v) = i for a given v ∈ V .If every rank function on a given tree is equally likely, we have In the algorithm, we will use the formula [Semple and Steel, 2003] where n v is the number of leaves below v.Note that Equation 2 holds for binary and nonbinary trees.
Examples of stochastic models on phylogenetic trees where each rank function is equally likely include: • The Yule model has the probability distribution which is the uniform distribution [Edwards, 1970, Brown, 1994]. • The coalescent model has the same probability distribution on rooted binary ranked trees as the Yule model.So P[r|T ] is the uniform distribution [Aldous, 2001].
• For some sets of trees (e.g.those drawn from the uniform model [Pinelis, 2003], also known as PDA model), no rank function is induced.If one assumes that all rank functions are equally likely on these trees, one can apply Equation 1 to such trees as well.

A polynomial-time algorithm
The following algorithm calculates the probability distribution of the rank of a vertex v in a rooted binary phylogenetic tree T .The idea of the algorithm is the following (cf.Figure 2).Label the vertices on the path from v to the root ρ by v = x 1 , . . ., x n = ρ.Let T m be the subtree of T containing the vertex x m and all its descendants.Let α Tm,v (i) be the number of rank functions on the tree T m where v has rank i.The values α Tm,v (i . The α-values in the fraction have a lot of factors in common which cancel out.In the following algorithm, we calculate α-values without the unnecessary terms instead, αTm,v (i).We have Algorithm: RankProb(T , v) Input: A rooted binary phylogenetic tree T and an interior vertex v. Output: 1: Denote the vertices of the path from v to root ρ with (v = x 1 , x 2 , . . ., x n = ρ).2: Denote the subtree of T , consisting of root x m and all its descendants, by Figure 3) end for 11: end for 12: for i = 1, . . ., | VT | do 13: Proving the correctness and runtime of RankProb makes use of the following two observations.Remark 1.Let A i be a set containing n i elements with a linear order, i ∈ {1, 2}.There are n1+n2 n1 possible linear orders on A 1 ∪ A 2 which preserve the linear order on A 1 and A 2 .This follows from the observation that the number of such linear orders on A 1 ∪ A 2 is equivalent to the number of ways of choosing n 1 elements from n 1 + n 2 elements, which is n1+n2 i αT ,v (i) which proves the theorem.
The proof is by induction over m.
So it remains to verify that the term ( * ) returns the right values for α T k ,v (i).Assume that the vertex v is in the (i − j − 1)-th position in T k−1 (with i − j − 1 > 0) for some rank function r T k−1 and v shall be in the i-th position in T k .Now combine the linear order in the tree T k−1 induced by r T k−1 with a linear order in T ′ k−1 induced by r T ′ k−1 to get a linear order on T k .The first j vertices of T ′ k−1 must be inserted between vertices of T k−1 with lower rank than v so that v ends up to be in the i-th position of the tree T k .Count the number of possible way to do this as follows.The tree This follows again from Remark 1.The number of rank functions

by the induction assumption. Multiplying all those possibilities gives
The value |{r : r(v) = i, r ∈ r(T )}| is then the sum over all possible j which establishes the correctness of the algorithm.
All that remains is to verify the runtime.Note that the combinatorial factors n k for all n, k ≤ | V | can be calculated in advance in quadratic time, see Remark 2. In the algorithm, those factors can then be obtained in constant time.
The most time consuming part of the algorithm is line 13.Adding up all calculations needed for obtaining α The last inequality holds since the vertices of the T ′ m , m = 1, . . ., n − 1, are distinct.Therefore, the runtime is quadratic.
Remark 4. With P[r(v) = i] from Theorem 3, the expected value µ r(v) and the variance σ 2 r(v) for r(v) can be calculated by Remark 5.The algorithm RankProb can be generalized to non-binary trees [Gernhard, 2006].The runtime is again quadratic.
3 Application of RankProb -Estimating edge lengths

The Yule model
A very common stochastic model for rooted binary phylogenetic trees with edge lengths is the continuous-time Yule model [Edwards, 1970].As in the discrete Yule model, at every point in time, each species is equally likely to split and give birth to two new species.The expected waiting time for the next speciation Figure 4: Labeling the tree for estimating the edge lengths.
event in a tree with n leaves is 1/n.That is, each species at any given time has a constant speciation rate (normalized so that 1 is the expected time until it next speciates).Assume that the primate tree T p evolved under the continuous-time Yule model.In [Gernhard, 2006], the tree shape of T p (i.e. the tree without edge lengths) under the discrete Yule model is tested against the uniform model and accepts the Yule model.
Here, we describe how to estimate the edge lengths for a tree which is assumed to have evolved under the continuous-time Yule model.
Let (u, v) be an interior edge in T with u the immediate ancestor of v. Let X be the random variable 'length of the edge (u, v)' given that T is generated according to the continuous-time Yule model.
The expected length E[X] of the edge (u, v) is given by Since, under the continuous-time Yule model, the expected waiting time for the next speciation event is 1/n it follows that: It remains to calculate the probability . This is equivalent to counting all the possible rank functions where r(u) = i and r(v) = j.The subtree T v consists of v and all its descendants.The tree T u equals the tree T where all the descendants of v are deleted, i.e. v is a leaf in T u , see Fig. 4.
The number of rank functions on T u is |r(T u )|.The probability P[r(u) = i] can be calculated with RankProb(T u , u).So the number of rank functions in The number of rank functions on T v is |r(T v )|.Let any linear order on the trees T u and T v be given.Combining those two linear orders into an order, r, Remark 6. Equation 4 enables the estimation of the length of every interior edge.For pendant edges, the approach above gives no definite answer.All we know is that the time from the latest interior vertex, which has rank n − 1, until today is expected to be at most 1/n where n is the number of leaves.
Suppose that the growth process is stopped as soon as the n− 1-st speciation event occurs.In this case the expected length X of a pendant edge below an interior vertex v is: The expected depth of vertex v from the first branchpoint is: So the depth Y of the leaf in question from the first branchpoint has expectation independent of v: In other words, assigning to each edge of a given tree topology its expected length gives a tree which obeys a molecular clock.
Remark 7. Often, an inferred tree has vertices with more than two descendants, i.e. there is lack of resolution due to, e.g.confliciting data.Our calculation for the expected edge length assumes a binary tree though.
However, the expected edge length may be calculated for each possible binary resolution of the supertree.Assume the supertree T has the possible binary resolutions T 1 , . . ., T m .For an edge (u, v) in T where u is the immediate ancestor of v, the expected edge length is calculated in the trees T i for i = 1, . . ., m.The expected edge length in T i is denoted by e i for i = 1, . . ., m.Note that if u is a vertex with more than two descendants in T then v is in general not a direct descendant of u in T i .The value e i in resolution T i is then the sum of all expected edge lengths on the path from u to v in T i .
Calculate the expected edge length E[X] of (u, v) in the supertree T by rank function on T ρ1 , there is the same number of linear extensions to get a rank function on the tree T .So it is sufficient to calculate the probability P u<v in T ρ1 .If ρ 1 = u then u is an ancestor of v in T , so return P u<v = 1.If ρ 1 = v then v is an ancestor of u in T , so return P u<v = 0. Now assume that ρ 1 = u and ρ 1 = v.The run of RankProb calculates the probability P[r(u) = i] in the tree T u and P[r(v) = i] in T v for all i.Next, combine those two linear orders.Assume that r(v) = i and that j vertices of T u are inserted before v. Inserting j vertices of T u into the linear order of T v before v is possible in i−1+j j ways (see Remark 1).Putting the remaining vertices in a linear order is possible in ways.The probability that the vertex u is among the j vertices which have smaller rank than v is P[r(u) ≤ j] = ucum(j).There are |r(T u )| possible linear orders on T u and |r(T v )| possible linear orders on T v .The number of linear orders where vertex v has rank i in T v , v has rank i + j in T ρ1 and r(u) < i + j therefore equals Adding up the p ′ for each i and j gives the number of linear orders where u has smaller rank than v.
Combining a linear order on T v with a linear order on T u is possible in • ucum(j).This shows that Compare works correct.
Since RankProb has quadratic runtime, Compare also has quadratic runtime.

Figure 2 :
Figure 2: Labeling the tree for the algorithm RankProb