Bulletin of Mathematical Biology Vol. 51, No. 5, pp. 597-603, 1989. Printed in Great Britain.
0092-8240/8953.00 + 0.00
Pergamon Press plc Society for Mathematical Biology
AN OPTIMAL ALGORITHM TO RECONSTRUCT FROM ADDITIVE DISTANCE DATA
TREES
JOTUN J. HEIN
Center for Molecular Genetics, UCSD, La Jolla, CA 92093, U.S.A. (E. mail: H~CHEM6 UCSO.EDU)
In this article the question of reconstructing a phylogeny from additive distance data is addressed. Previous algorithms used the complete distance matrix of the n OTUs (Operational Taxonomic Unit), that corresponds to the tips of the tree. This used O(n2) computing time. It is shown that this is wasteful for biologically reasonable trees. If the tree has internal nodes with degrees that are bounded an O(n* log(n)) algorithm is possible. It is also shown if the nodes can have unbounded degrees the problem has n 2 as lower bound.
Introduction.
In taxonomy and molecular evolution the problem of reconstructing a tree from distance data is very central. If it is possible to find a tree that matches the distance data exactly, i.e. the distance between two O T U (Operational Taxonomic Units) equals the length in the tree of the path that connects them, the distance function is called tree additive (Buneman, 1971). The O T U s are assumed to be sequences, but this is not essential. It can be shown (Waterman et al., 1978) that such a distance matrix uniquely determines the tree. It is also clear, however, that the tree can be constructed with m u c h less than the complete distance matrix. If the topology of the tree is known, the tree can be determined with at most 2 n - 3 distances (maximum n u m b e r of edges), where n is the number of sequences. The problem addressed in this article is: what is the least information necessary to determine a tree, if the phylogeny is unknown? This question is practically motivated by situations where acquiring a pairwise distance is limiting or computationally expensive. Here we consider unrooted trees, only tips will be labelled and all edges are assigned lengths. The degree of a node is the number of incident edges. A tip has degree one. Nodes of degree two will be ignored, since they contain no information about the branching pattern in the tree. A central class of trees, binary trees, has nodes with degree one or three only. These trees correspond to the naturally arising class, that involves only bifurcations, and where the position of the root is not specified. Given three sequences si, sj and sk, the exact dimensions (edge lengths) of the tree can be determined in terms of the three possible distances between them. Let p be the internal node in the tree. The tree is considered fully described if the
598
J.J. HEIN
distances to p from all three sequences are determined. The dimensions of the tree are easily f o u n d as (see Fig. la):
d(si, p) = (d(si, sj) + d(si, sk ) - d(si, sk ))/2 d(sj, p)= (d(sj, si) + d(sj, s k ) - d(si, sk))/2,
(1)
d(sk, p)= (d(sk, sj) + d(sk, si)-d(sj, si) )/2. (a)
k
I
J
P
(b] s2
)
S3
5 1
S4 S5
2 3
1
S7
s6 s8
Figure 1. Part (a) illustrates the relationship between three sequences and an ancestral point. The position of P can be found in terms of the three pairwise distances using equation (1). Part (b) is a tree with 8 tips used to illustrate a sequential algorithm used to reconstruct the tree. The position of s8 is not known and must be found by asking questions about the distance from s8 to different reference sequences.
These equations can be used to m a k e an a l g o r i t h m t h a t reconstructs the whole tree by successively a d d i n g a sequence to a g r o w i n g tree a n d each time d e t e r m i n i n g its position by suitable c o m p a r i s o n s to sequences in the tree a n d using e q u a t i o n (1) ( W a t e r m a n et al., i977).
AN OPTIMAL ALGORITHM TO RECONSTRUCT TREES
599
To be more specific: assume that a tree of size k - 1 (number of tips), Tk_ 1, has been constructed, and now sequence sk is to be added. Pick out two reference sequences, sl and s2, that are tips in Tk_ 1. If the distances to these to sk are calculated, equation (1) can be used to determine exactly where on the path between sl and s2 the branch leading to sk should be attached and also how long it must be. If this point on the path between si and sj does not coincide with the point of attachment of a sub-tree to the path, then the position of sk in the tree has been determined. If the point coincides with a sub-tree, then sk must be located in that sub-tree somewhere, and an additional reference sequence s3 is chosen at some tip in that sub-tree. One reference point is already known, since the distance to the root of the sub-tree is known. Equation (1) is then used to determine where on the path between the root and s3 sk should be attached. If the point coincides with a sub-tree, the comparisons continue, otherwise sk has been correctly positioned. This procedure is illustrated on the tree in Fig. lb. First choose two arbitrary sequences, say s2 and s4, as first reference sequences. Their lengths to s8 are 22 and 24 respectively. Equation (1) will determine that s8's attachment point to the path between s2 and s4 coincides with the sub-tree containing s5, s6 and s7. The distance from the attachment point and s8 is 12. Now s6 is chosen as reference point in the subtree and it is found to have a distance of 6 to s8. Again, equation (1) determines that s8 must be in a sub-tree containing s7 and the distance from the attachment-point of this sub-tree to s8 is 5. Lastly, using s7 as reference point with a distance 3 to s8 the correct position of s8 has been determined. Using this procedure reconstructed the position of s8 in the tree without using all distances involving s8. The problem addressed is how to find an optimal version of this principle. It must be noted that it is not possible to make an algorithm that a priori states a subset of the (n*(n-1)/2) distances that is sufficient. To see this assume dij is not in the set of the n * ( n - 1)/2 pairwise distances, then any tree with i a n d j as brothers cannot be reconstructed. Thus any algorithm must use previously calculated distances, to decide that later ones are uninformative. The Algorithm. The sequential algorithm above used arbitrarily chosen reference sequences to position a new sequence. The basic operation in positioning sk is to take the tree in two tips and determine in which of the subtrees hanging from this path sk is located and if in none of them position sk is positioned by itself by use of equation (1). The problem is to choose a series of paths to discriminate effectively between different possibilities fr positioning sk. The most straightforward approach would be always to take the path with the most nodes on it, since this allows discrimination between the most sub-trees. The sub-trees hanging from a path
600
J.J. HEIN
can have very different sizes and in Fig. 2 a tree is shown where the number of comparisons needed is proportional to the square root of k. The construction of the example is guided by always locating many tips in one sub-tree, so the distinguishing between many sub-trees becomes less powerful. The complexity for this longest path algorithm would be 0(n3/2).
4 6
5 3
Figure 2. n 3/2 Tree for longest path algorithm. An edge with i at its tip will be 2 shorter than the ege with i - 1 at its tip. Any edge will be assumed to have tips attached to them reasonably regularly spaced, so the length of a path is proportional to the number of nodes on it. This is pure for the sake of illustration, since the relevant length of a path is the number of nodes on it and not the sum of the lengths of the edges constituting it.
It is possible to construct a search strategy that positions sk in O(log(k)) time if the degrees of nodes cannot grow unbounded. Several very different algorithm are possible, but only the best, the deepest point algorithm, will be given. First a measure of the complexity of a tree relevant for the positioning of sk will be presented. The measure of relevance here is the smallest number of comparisons needed to guarantee the positioning of a new sequence. This complexity can be calculated recursively. Each node will be associated a depth. The node will have one edge pointing towards the deepest point in the tree. The other sub-trees, TI, T 2 , . . . , Ti, will be regarded as descendants of the node. The tree hanging from the node is called T. The depth of the root of Tis defined as the maximum number of comparisons needed to position sk, if it is known that sk belongs to T. Assume reference sequences are chosen down in T1, then in T2 etc. as shown in Fig. 3. When a reference sequence is chosen in the correct sub-tree, it is unnecessary to investigate the rest of the sub-trees. Thus, the most comparisons needed will be max{depth(Tk)+k-1). The reason 1 is subtracted is it was assumed that it was known sk belonged in T, which implies that one comlzarison had already been made. The ordering of the sub-trees was arbitrary and the above quantity is minimized if they are ordered after decreasing depth.
AN OPTIMAL ALGORITHM TO RECONSTRUCT TREES
601
Deeper part of tree
Z,
TI
Ti
Figure 3. Upward in this small tree is toward the deepest point in the tree and downward is toward external parts of the tree. The tree hanging from this node is called T. It has i descendants. In summary the complexity of the complete tree can be calculated as follows. (i) Assign all tips the value 0 and all others are undetermined. (ii) Internal nodes are tentatively assigned numbers by the above recursion. The new nodes assigned with lowest numbers are assigned these as their depths in the tree. This is repeated until all nodes in the tree has been assigned depths. (iii) In the end the last assigned node will flank an edge. To satisfy the initial condition of the recursion a reference sequence must be chosen in both of the corresponding sub-trees separated by this edge. Thus the complexity of the complete tree will be the value of the highest depth assigned plus two. The result of the procedure on a tree is shown in Fig. 4a. This tree thus has complexity 6, and exhaustive investigation easily demonstrates that it is the smallest number of comparisons needed to guarantee the positioning of an extra sequence. The tree with highest complexity is the star tree with only one internal node is shown in Fig. 4b. If it has n tips it will have complexity n. The trees of lowest complexity for a given n is what has been called a caterpillar (Hendy et al., 1984) and is shown in Fig. 4c. A new sequence can be positioned using three comparisons, and then the resulting tree would not be a caterpillar. The tree of degree three of highest complexity 6 with the fewest tips is shown in Fig. 4d. This can be generated by following growth process. Start with one isolated node, three edges buds from this giving rise to first generation, second generation is generated by letting each node from first generation having two descendants. If the tree to be reconstructed can be of any type the complexity of the algorithm will be O ( n 2 ) . Let T k be the star tree. Then it will take k comparisons to position sk. If Tk+ t is also a star tree then it will take k + 1 comparisons to position the next sequence. In short all possible comparisons must be made. (If it was known that T, would be a star tree, then T, could be reconstructed with n comparisons in total.)
602
J.J. HEIN
(a)
(c)
11111 I (d)
\i
"~f
,(
"~
Figure 4. (a) An arbitrary tree with depths labelled to its nodes. Deepest node is 4, which gives an overall complexity of 6. (b) The most complex tree with n tips is the star tree with only one internal node of degree n. (c) The least complex tree with n tips is the caterpillar. (d) The tree of order 3 and complexity 5 with the least nodes.
It will now be shown that if the degree of internal nodes are bounded an n , log(n) algorithm is possible. Let S(d) be the number of tips in the smallest tree that has a node with depth d and all nodes have degrees less than m. This node will have the highest degree and there will only be one such node. Using the previous recursion S(d) will fulfill the equation: S(d) = m i n ( S ( d - i) 9 i)
1 < i
(2)
with the initial condition S(d) = d < m. The reason is that any minimal tree will have minimal sub-trees as descendants. Second, the complexity of the
AN OPTIMAL ALGORITHM TO RECONSTRUCTTREES
603
descendant will be identical. This is a simple consequence of the recursion determining depths. Assume minimum is attained with thejth sub-tree, then all sub-trees with lower index must have the same depth, otherwise Twould not be of minimal size. Similarly, Tj must also be the last sub-tree otherwise the rest of the sub-trees could be removed without diminishing the depth of T. The only thing that must be decided is how many descendants should Thave. Ifd > m - 1 then m - 1, otherwise d. Thus the tree of a given complexity, C, and smallest number of tips can be obtained by a growth process similar to the process that generated the tree in Fig. 4d: starting with an isolated node, let give this ( ( C - 2) mod m) times. The growth of S(d) is exponential. Thus, the growth of highest complexity as a function of number of tips will be logarithmic. Thus, the total complexity of the tree construction algorithm will be O(n. log(n)).
Discussion. An algorithm was presented that reconstructs a tree from additive distance data in a very economical fashion. Real data are rarely perfectly tree additive. A tree reconstruction algorithm for real data was developed (Hein, 1988) that used only a small fraction of the distance matrix and thus runs much faster, by using the principles described in this article. The robustness necessary against non-tree-additivity was obtained by using several reference sequences and trying several sub-trees, before deciding in which subtree sk belongs. Mathematically rigorous analysis of this algorthm is not feasible, since its performance would depend much on the data. I tried to find the average performances of the deepest point and the longest path algorithms if a tree of order three was randomly chosen, but with little success. Asymptotically, trees with high complexity are fewer than tree with low complexity, but if this trend is strong enough to make the average performance better than the worst case performance is unknown . . . . . .
LITERATURE Buneman, P. 1971. "The Recovery of Trees from Measures of Dissimilarity." In: Mathematics in the Archaeological and Historical Sciences, F.R. Hodson, D.G. Kendall and P. Tautu (Eds), pp. 387-395. Edinburgh University Press. Hein, J.J. 1988. "A Fast Tree Reconstruction Method." Molec. biol. Evol., submitted. Hendy, M.D., C.H.C. Little and D. Penny. (1984. "Comparing Trees with Pendant Vertices Labelled." S l A M J. appl. Math. 44, 1054-1065. Waterman, M.S., T.F. Smith, M. Singh and W.A. Beyer. 1977. Additive Evolutionary Trees." J. theor. Biol. 64, 199-213.
Received 25 October 1988 Revised 22 March 1989