Time optimal left to right construction of position trees

In the following paper we are presenting a new algorithm for the on-line construction of position trees. Reading a given input string from left to rig...

3 downloads 43 Views 666KB Size

Download PDF

Acta Informatica24, 461474 (1987) 9 Springer-Verlag 1987

Time Optimal Left to Right Construction of Position Trees M. Kempf, R. Bayer, and U. Gfintzer TechnischeUniversit/itMiincben,Institut ffir Informatik,Postfach202420, D-8000 Miinchen, Federal Republic of Germany

Summary. In the following paper we are presenting a new algorithm for the on-line construction of position trees. Reading a given input string from left to right we are generating its position tree with the aid of the general concept of infix trees. An additional chain structure within the trees, called tail node connection, enables us to construct the tree within the best possible time (proportional to the number of nodes).

1. Introduction In 1973 Weiner [7] introduced a new interesting auxiliary data structure, the bi-tree, to facilitate a number of important text editing tasks - particularly the string matching problem. Later, the question was resumed by Aho, Hopcroft and Ullman [1], who called the structure a position tree. The position tree for a given text is a trie index spelling out for every position the shortest substring starting at that position and occuring nowhere else in the text. Although this auxiliary structure yields an efficient solution for multiple pattern matching (with complexity proportional to the length of the patterns), which compensates for all additional space requirements of the trie index, there remains a considerable obstacle to practical application: How to find timeefficient methods to construct the position tree ? A number of construction algorithms have been proposed in the literature. The first observation was that processing the input string from the right to the left guarantees the existence of the corresponding intermediate position trees and causes only slight changes in every step. Constructions presented in [7] and [1] have achieved - by means of additional space expensive information - the minimal time complexity O(N), where N is the number of edges of the position tree. In 1976, McCreight [6] introduced a similar text representation structure, the suffix tree, and inverted the string processing direction for the first time:

462

M. Kempf et al.

the positions of the input string were processed more naturally from the left to the right. On the other hand, he paid less attention to on-line reading of the given input, concentrating on the space complexity of the constructed tree. Later, in [5] and [2], on-line solutions were proposed: step by step the interim position tree is extended according to the prefix of the input text read so far. The term "on-line" refers to the processing direction of the input and to the fact that every new input symbol sets off a well defined set of local actions, which do not depend on the right part of the text still to be read. While Majster and Reiser in I-5] still needed a quite complicated bookkeeping, in [2] the former solutions were considerably disentangled by allowing the interim trees to be partial position trees. However, these on-line algorithms did not reach the optimal time complexity; the time spent was proportional to the sum of the length of all paths from the root to the leaves. After a brief explication of the basic notions in Sect. 2, the following sections are aiming at a synthesis of the methods and ideas described above. Its result, the new on-line algorithm for position tree construction (presented in Sect. 5) runs in time and space O(N). Infix trees (Sect. 3) are a useful theoretical concept to describe the progress of the construction ([3]) and thus to prove the correctness of the algorithm. An additional threading of the trees, introduced by the first author in [4], connecting every node with its "tail node" (cf. Sect. 4) is the key to the time improvement.

2. Position Trees and Position Identifiers

For the convenience of the reader we briefly reiterate the basic concepts and conventions: An alphabet I is a finite set of symbols a, b, c. . . . The length n of a string x = x l ...xneI* is denoted by Ixl=n. ~ denotes the empty string of length 0. A position in a string of length n is a natural number between 1 and n; therefore contains no position. A string y = y l ...y,, occurs in x~...xn at position i if y=xi...x~+,,_~; y is then called a substring or infix of x; e is an infix that occurs at every position in x; y is a prefix of x, if it occurs at the first position of x. Furthermore, an I-tree T is a labelled directed tree, whose edges carry labels from the alphabet I, unique in the sense that different edges leaving a node v have distinct labels. Thus, for every ~t~I there is at most one node w connected to v by an edge leaving v and carrying label ~; let us call w the oc-son of v. The path string P(v) of a node v, is the string listing the edge labels starting from the root of T and ending in v. The depth d(v) of a node v is the length of its path string, i.e. d(v)= IP(v)l.

Definition 2.1. (i) A string y is called the position identifier (or substring identifier) for position i in a string x if y is the shortest non empty infix of x that occurs at position i only; if existent, it is denoted by s (i). 1 Changes in the text require the correction of its position tree. Efficient updating methods also including the strength of tail nodes will be the subject of a forthcoming study by the first author

Time Optimal Left to Right Construction of Position Trees

463

(ii) Let x be a string of length n where the position identifiers exist for all n positions: The position tree for string x is a n / - t r e e T, which exactly reflects the position identifiers of x, as follows: T has n leaves, say #t, . . . , f , , such that for i = 1..... n leaf #i is labelled by " i " and the path string P(fi) equals the position identifiers s(i) for position i in x. For sake of simplicity we assume in the following that the text x consists of at least two positions, i.e. n > 2 . 2 We note some basic properties of position identifiers and position trees: Proposition 2.2. (i) Monotony Lemma: Is(i + 1)1> Is(i)l - 1, i.e. s(i + 1) cannot end at a position before the last position of s(i). (Assume that s(i)=xi...x~ and s(i+ 1)=xi+l...x~_l: We know by the definition of s(i) that xg...xj_l occurs at least twice in x; consequently x~+l...xj_ ~ cannot identify position i + 1.) (ii) No position identifier can be prefix of another one. (Assume the contrary: If s(i) is a prefix of s(j) for i#:j, then s(i) occurs a second time in x, namely at position j, a contradiction to the fact, that by definition s(i) must be unique throughout the string x.) (iii) All n position identifiers for string x = x l . . . x . exist if and only if x,~:xl for i= l ..... n - 1 . 3 (Indeed, if x, = x~ for some i < n, then position n cannot be identified by a substring. If x,~-x~, then every position i is at least identified by x~...x,; s(i) is obtained by shortening x~...x, at the right end as long as the condition of exclusive occurence at position i is not violated.) (iv) Every leaf in a position tree has a brother node. (Again assume the contrary: If leaf fi has no brother and v is its father, then the path string P(v) already suffices to identify position i within x.) As a consequence of (ii) and (iii), we can be assured that one and only one position tree does in fact exist for each string x = x ~ . . . x , where x , 4 x ~ for i--1 ..... n - 1 . Figure 1 represents the position tree for x = ababc.

1

3

Fig. 1. The position tree for ababc 2 The position tree for x = c t e I reflecting s ( 1 ) = e is trivial 3 This condition can be fulfilled in any case by automatically adding an endmarker at the right end of the text

464

M. K e m p f et al.

F o r a detailed account how to apply position trees in order to solve efficiently a variety of text processing problems see [7] (p. 8), [1] (ex. 9.18 and 9.19) and [-5] (pp. 785-786).

3. Infix Trees As before let X=X1...XnEI* such that n > 2 and assume x,=l=xl for i = 1..... n - 1 in order to guarantee the existence of the position tree for x. The new position tree construction method will be developed starting from a general construction scheme by stepwise refinements. If, e.g., the string is processed from left to right, then the last symbol xi of a prefix xl...x~ may equal x i for some j < i and thus there is no position tree for the partial string x~...x~ read so far. Therefore we cannot insist on all intermediate trees generated during our algorithm being genuine position trees. But the path strings in the intermediate trees should at least correctly list some infix strings of x and the positions where these infixes occur. This leads us to the more general class of infix trees defined below.

Definition 3.1. A n / - t r e e T is called an infix tree for x, if the following holds: (i) The vertices of T may be labelled by the string positions 1, ..., n such that every position occurs exactly once in the tree, and every leaf carries a label. (ii) If vertex v carries label i, then the path string P(v) is an infix of x occuring at position i; (i.e. P(v) is a prefix of xl xi+ 1...x,.) Note, that leaves must carry at least one position label, whereas interior nodes may carry zero, one or more such labels. As e occurs at every position, the root may be marked by each label. In the following the node v which carries position label i will also be denoted by m(i). As Fig. 2 one of the many infix trees for x = ababc is depicted. E.g. node v carries the labels 1 and 3; consequently re(l)= m(3)= v.

tl a

c

V

2 Fig. 2. An infix tree for ababc

Time Optimal Left to Right Construction of Position Trees

465

Definition 3.2. A node v of an infix tree T for x with P(v)=yl...ya is called "indispensable" if v is the root, or Yl...Yd- ~ occurs at least twice in x. T itself is called " e c o n o m i c a l " if all its nodes are indispensable. All nodes in the infix tree of Fig. 2 are indispensable, except for the leaf m(2). The latter one could be dispensed with, because the substring ba doesn't occur twice in x. If v is indispensable, so are its ancestors. Hence a position tree is economical, because all its leaves are indispensable. Obviously, if node v (v=~root) is indispensable, then P(v)=yl...ya occurs at least at position i and Yl...Yd-1 occurs at i and at some other position j in x; i.e. P(v) must be a prefix of s(i). Hence, for every node v of an economical infix tree (for x) there exists a node v' of the position tree (for x) such that

P(v)=P(v'). Thus, economical infix trees are approximations of the position tree by "smaller trees". We will use this approximation concept for constructing the position tree. In order to describe a termination criterion for the construction process we define the following:

Definition 3.3. An infix tree T for x is called " s e p a r a t i n g " if for all i, je{1, ..., n} with i~=j the node re(i) is neither an ancestor of re(j) n nor re(j) itself. The infix tree of Fig. 2 is not a separating one; r e ( l ) = m(3); m(4) is an ancestor of m(j) f o r j = 1, 2, 3, 5. If T is separating, the path string P(m(i)) can be no prefix of P(m(j)) and vice versa. I.e. P(m(i)) occurs only at position i; s(i) must be a prefix of P(m(i)). Additionally, let T be economical: As re(i) is indispensable, P(m(i)) is a prefix of s(i). Therefore P(m(i))=s(i) for i = 1..... n. The position tree for x is the only infix tree being economical as well as separating. Hence in order to construct the position tree it suffices to produce an economical separating infix tree. One of the possible reasons for an economical infix tree T not yet being separating is the existence of an inner node v carrying some label k. Then P(v) does not suffice to identify position k. (Indeed, there must be a leaf u strictly below v. By the definition of infix trees, u must carry some label j and hence P(v) occurs also at position j.) Hence P(o)=Xk...Xk+d(v)_ 1 m u s t be lengthened. If v has an Xk +dtv)-son s (i.e. s is e-son of v, and c~ occurs at position k + d(v)), then move label k from v down to s along the already existing edge. On the other hand, if v has no Xk+dtv)-son, then it must be created eventually, because this son node is indispensable, and it must take over label k. Furthermore, the only other possible reason for T not being separating is the fact, that there exists a leaf v carrying more than one label, e.g. the two labels j and k. Again P(v) is not sufficient to separate positions j and k, and at least an indispensable Xj+dt~)-son S taking over label j from v must be appended to v. After each one of these operations T is still an economical infix tree for x, but it contains more separation information. In this way we could generate the whole position tree. But if we do so, i.e. if we isolate successively the n

466

M. Kempfet al.

labels by pushing them down the tree, creating new nodes whenever needed, n

then the resulting algorithm will use up O(L) steps, where L--- ~ Is(i)l is the total path length of the position tree. i= 1 But there are other ways to construct the position tree incrementally, which we should not preclude by premature design decisions. Consider Fig. 3 that shows another infix tree of ababc. Instead of pushing down one of the labels 3, 4 or 5, one can make the tree grow at node m(2) by appending an a-son carrying label 2, which is indispensable. Thus we should not exclude a priori the operation of appending a node somewhere, as long as it is indispensable.

b

2

1 Fig. 3. Another infix tree for ababc

Thus we have to consider the following two elementary operations to generate the whole position tree: (i) Moving down a label from an inner node along an already existing edge. (ii) Appending an indispensable node (and marking it with a correct label). At least one of these operations is always possible, as long as T is not yet separating. (That is: If there are two labels j and k with k lying above j, then at least one new leaf has to be created to separate positions j and

k.) Both operations can be executed only a finite number of times, the first one at most L times, where L is the total path length of the final position tree, the second one exactly N times, where N is the number of edges in the position tree. Since we are aiming at complexity O(N), which in general is smaller than O(L), we must concentrate on operations of type (ii). Only that provides a chance of reaching complexity 0 (N). One might argue that it could be rather expensive to search for nodes to append an indispensable son to. To see a possibility to avoid this problem consider Fig. 3: If the algorithm (to be designed) has already extracted enough information from x to " k n o w " that aba is a prefix of s(1), then it should know already that at least ba is a prefix of s(2) (cf. 2.2(i)). Hence, without further search, an a-son can be appended at m(2) and then marked with label 2. This seemingly naive thought shall be exploited in Sect. 4, where the question is addressed, how the information gained while trying to identify a position k can also be used for constructing s(i) for i = k + l , k + 2 . . . . in spite of the fact that labels i and k lie on different paths of the tree.

Time Optimal Left to Right Construction of Position Trees

467

Keeping in mind these complexity considerations, as a first step we present the following algorithm template that terminates and yields the position tree: Start with the smallest infix tree T for x, consisting only of the root carrying labels 1, ..., n. While T is not yet separating do the following: omOVedown an interior label, if possible, append an indispensable node. The operation " m o v e down a label" shall help us to search for new places where an indispensable node can be appended. By using a special relation among the tree nodes (cf. the next section) we apply this operation n times only, n = Ix ], and therefore the searching process for indispensable nodes does not affect the intended complexity O (N).

4. Tail Nodes and TN-Completeness In this section an additional threading for infix trees is introduced.

Definition 4.1. Let v, w be nodes of an infix tree T. w is called tail node of v if P(w)=tail(P(v)) (where t a i l ( ~ y ) = y for ~ I , y~I*), or v and w are both equal to the root of T. Note that not every node of T has a tail node, but a node allows for at most one tail node. If existent, the tail node of v is denoted by TN(v). Then TN is a partial function from the set V of vertices of Tinto itself. Consider again our running example x = ababc. Fig. 4 exhibits an infix tree for x, where all nodes have a tail node, except for leaf re(l). The tail node threads are depicted by broken arrows.

I

1 Fig. 4. A threaded infix tree for ababc

468

M. Kempf et al.

Definition 4.2. An infix tree T is called TN-complete, if all its nodes have a tail node. If T is TN-complete, then for all nodes v of T the sequence

(*)

v, TN(v), TN(TN(v)), ...

exists and leads of the root of T, because d(TN(v))=d(v)-i as long as v is not the root of T. The sequence (*) is called the tail node chain of u. L e m m a 4.3. Every position tree is TN-complete.

Proof. Let x be a string and T its position tree. Then the root of T is its own tail node and simultaneously the tail node of all its sons. Let v be a node of T with d ( v ) > l . We have to show that v has a tail node, There are aEI and y~I* with l y [ > l such that P(v)=ay. ay is a prefix of s(i) for some i. By Proposition 2.2 (i) we know that s(i + 1) cannot end before the last position of s(i); therefore, y must be a prefix of s(i+ i). Thus string y must lead from the root to some ancestor w of#i+x such that y=P(w), i.e. w = TN(v). Corollary 4.4. I f node v of a TN-complete infix tree T is indispensable, so is each node in the tail node chain of v.

Proof As we have shown in the last section, if v is indispensable, then P(v) also represents a path string of the position tree for x. The latter is TN-complete and economical. N o w we shall use the auxiliary tail node structure for reducing the search for appendable nodes. N o d e v has been added to T only if it is indispensable. Therefore the whole tail node chain of v qualifies as candidates for appending. Moreover, we want to prevent several labels to be concentrated on a leaf during tree construction, i.e. except at the start, the only reason for T not yet being separating is the existence of an inner node carrying some label. If label k has been moved down to a leaf v with label j, then we know that an xa~v)+fson s taking over the label j has to be appended to v immediately and, because s is indispensable, all nodes in its tail node chain that are not yet contained in T also have to be appended. Starting with the smallest TN-complete infix tree (consisting only of the root carrying labels 1. . . . . n), appending missing indispensable nodes and always restoring TN-completeness afterwards yields the following refinement of the while-statement of our template: While T is not yet separating do: move down some label k residing on node u to its son v, if possible; if v is a leaf (carrying labels j and k now), append an indispensable node s to v and restore the TN-completeness of T,

I

Odd some other indispensable node and restore the TN-completeness of T.

In this way we aggregate several incarnations of the body of the previous loop. Thereby we do not only avoid the iterated testing of the while-condition, but - what is much more important - also we often need not search where to append new nodes. That is, if s is an indispensable a-son of v, then TN(s) is an indispensable a-son of TN(v).

Time Optimal Left to Right Construction of Position Trees

469

In addition, if s carries label j, then label j + 1 may be moved to TN(s) without violating the definition of infix trees. Consider again Fig. 4: Assume that node s=m(1) has been appended last, thereby violating TNcompleteness. The latter is restored by completing the tail node chain of s as far as necessary. A new a-son for the tail node of the father of s taking over the label 2 is introduced. At this point the procedure terminates because the upper part of the tail node chain of s (the a-son of the root and the root) already exists. The result is shown in Fig. 5. If ~'~!

//~4,5

TN(s)

S

ds / Fig. 5. TN-complete infix tree

By appending a new (indispensable) c-son m(3) to v and completing its tail node chain up to the root with the new leaves m(4) and m(5), we get, without further search, the (TN-complete) position tree for ababc. Compare the complexity considerations in the last section: Exploiting the tail node structure we never had to move down, in the present example, label 4 to the b-son of the root. Appending a node and its whole tail node chain can be specified most naturally by recursion as follows: The procedure C O M P L E T E takes as input a node w and a label p, appends to w an xa~w)+p-son s and marks it with p, under the precondition that s is an indispensable node and legitimately carries label p, i.e. under the precondition that P(w)= xp...xr+a~w )_ 1 and that P(w) occurs twice in x. In order to restore the TN-completeness of T, the tail node chain of s has to be completed. If TN(w) has no xa~w)+p-son, this son must be created and marked with label p + 1, i.e. C O M P L E T E (TN(w), p + 1) has to be invoked recursively. After the missing son has been created, it must be threaded to s as the tail node of s. Only if w is already the root, the root itself must become the tail node of s.

470

M. K e m p f et al.

The global variables c v and e k serve to save the parameters of the innermost call, i.e. the situation at the upper end of the inserted part of the tail node chain, because it shall turn out to be practical in some cases to continue the construction process at that node where the insertion of tail nodes stopped, and with label ek.

procedure C O M P L E T E (node w, posno p) {global result variables: node c v, posno ek}; ((create new Xd~w)+p-son S of W)); ((transmit label p to node s)); cv,=TN(w); ck,=p + 1; if (( T N (w) has no xd ~w)+p-son )) then CO M P L E T E (TN (w), p + 1); if w = root then TN(s)..=w else T N (s),= Xdtw)+p-son of T N (w). From now on v shall be used as a node variable. By our last template it is ensured that, except for the root at the beginning of the tree construction, the leaves of T carry exactly one label when the while-condition is tested. With C O M P L E T E we can write the while-statement as follows: While T is not yet separating do: let v point at an inner node carrying some label k (if the tree consists only of its root, v:=root): if v has an Xdtv)+k-SOn t, then move down label k from v to t and let v,=t and, if v is a leaf with labels j and k, then C O M P L E T E (v,j); or

C O M P L E T E (v, k), if v has no Xd,v)+k-SOn.

5. Algorithm POTREE: On-Line Construction of Position Tree T Although the TN-threading is the necessary precondition for the final gain of efficiency, the templates of the last section also permit inefficient solutions unnecessary moves of inner labels (cf. Fig. 5: label 4 could be moved to the b-son of the root). We still had postponed the design decision which node v (and which label k) to process next. Now we adopt the following strategy: The input string x is processed strictly from left to right. Given the (economical) interim infix tree T~-I (say) of prefix x l . . . x i - 1 , we extend it to T~ by reading xi and taking into consideration the whole "identifying information" of the new prefix xl...xl. The prefixes show a quite natural behaviour. Lemma 5.1. Let Xx...x i be a prefix of x. There is an index k < i + l such that positions 1, ..., k - 1 admit a position identifier (within xl...xi) whereas positions k, ..., i don't. Proof. For i = 0 the lemma holds vacuously with k = 1. Therefore let i > 1. Assuming that Xk...Xi would occur twice, but x~...xi with k < j < i identifies position j, we would get a contradiction. If k = i + 1, then xi=t=x~,j=l ..... i - 1 . We shall try to choose an instance of the more general construction templates of Sect. 4, such that the following invariants hold:

Time Optimal Left to Right Construction of Position Trees

471

Given a prefix xl...xi-~ the interim infix tree T~_ 1 carries labels 1, ..., k - 1 "isolated" on leaves. Label k, k 1, and has satisfied the invariants. Then reading xi leads to one of the following two cases (corresponding to our last template):

Case 1. v has an xi-son: We move down the label k and redirect v to its xi-son; if v is a leaf now, marked by j j, who transmit their labels to their new a-sons. Recursion terminates at the first inner node t in the tail node chain of v, because then t must already have an a-son s. (Indeed, p + 1 < k , and m(p+ 1) must lie in the subtree with root s=a-son(t).) Designate the tree resulting from this process by T~. Then it is easy to see that T~ satisfies the above invariants.

T3

T2

T4

4,5 (Cf.

I

Fig.5)

2

I Fig. 6. The interim infix trees for ab, aba, abab

For an example of case 1, consider the left part of Fig. 6 that shows T2, the interim infix tree of prefix ab of x = ababc, where k = 3. Reading x 3 = a we move down label 3 to the a-son of the root which is now pointed at by v. Immediately, we add to v a new b-son s taking over label 1 and complete its tail node chain by threading s to the already existing 4 xj+d~v~is taken from a buffer with direct access where the actual prefix xl...xi is stored

472

M. Kempf et al.

b-son of the root. This yields the tree in Fig. 6 designated by T3. The next step of processing, x 4 = b, is documented in Fig. 4 and 5. Notice that " m o v i n g down the label k" has highly implicit effects. Although the labels k + 1. . . . , n remain at the root of T~ the infix information for positions k + 1, k + 2 ..... i is contained in the tail node chain of node v. C O M P L E T E uses this fact. The effect will become clearer in the following, when, by completion of a tail node chain, new position labels "suddenly j u m p " from the root onto leaves. Case 2. v has no x~-son: We create the new (indispensable) x~-son, a leaf carrying label k. At the current input level i position k is uniquely identified. The following incarnations of C O M P L E T E , as long as possible, then append new leaves labelled by k + 1, k + 2 . . . . along the tail node chain of node v. This process terminates if in T~_ 1 a node u in the tail node chain of v already has an x~-son (Case 2.1), or if new xi-sons have been added all the way up to the root (Case 2.2). In Case 2.1 redirect v to u (the value of u can be determined from the global variable cv); if p > k was the last new leaf label, then p + 1 is a legitimate new label for v; let k : = p + 1 ( p + 1 is saved in the global variable ck). As v already has an x~-son, at node v with label k we can directly change to the movedownoperation and possible tree expansion of case 1 without reading the next symbol x~+l. Again we combine two incarnations of the while-loop body in the algorithm templates of the last section. Both incarnations put together preserve the invariants. In Case 2.2 we know that c k = i + 1 and c v = r o o t ; we also know that symbol x~ never occured in X l . . . x i _ 1 . Then, putting v:=cv and k : = c k re-establishes the invariants for the interim tree T~just created. F o r an example of this case, let us return to Fig. 5 that shows the situation after reading of X l . . . x g = a b a b . N o d e v with P ( v ) = a b reflects the prefix of the not yet existing identifier for k = 3. Incrementing the input level i to 5 and processing x5 = c yields, by three incarnations of C O M P L E T E , the final T N complete position tree as Fig. 1 shows it (without TN-threads).

In all cases the tree expansion at node v with label k has to (and can) be continued by reading x~+ 1 as long as i < n. Indeed, for the trees T~the condition i < n is equivalent to the loop criterion "T~ not separating" of our previous templates. (Proof If i(n, then there remains at least label n at the root of T~ and hence T~ cannot be separating. On the other hand, if i = n, then x~ never occured in Xl...x~_~ and thus we are in Case 2.2. k = i + 1 = n + 1 tells us that labels i ..... n are isolated on leaves and therefore T, is separating.) Thus the while-loop boils down to a simple for-loop: for i from 1 to n. The algorithm is organized in such a way that whenever C O M P L E T E (w, p) is called, label p could be found either at node w or at the root of the corresponding intermediate infix tree. Since labels at the root carry no information, and the node carrying label k is pointed at by variable v (as long as k isn't isolated on a leaf yet), we omit these labels in the algorithm.

Time Optimal Left to Right Construction of Position Trees

473

Now, moving label k along an already existing edge simply means redirection of v and moving a label p to a new leaf s means not only marking s by p, but also cancelling label p from its previous place if label p existed already in T. Therefore the second statement of the body of the procedure C O M P L E T E (cf. Sect. 4) ((transmit label p to node s)) may be more precisely implemented as follows: ((mark node s by label p)); ((delete label p from the father w of s, if it existed there)). Now we are ready to present the algorithm in a PASCAL-like notation and to determine its time complexity. Given. A text X=Xl...x,, n > 2 , x , # x i for i = 1, ..., n - 1. Data structure. A tree node w comprises: TN(w) {pointer to tail node of w}; d(w) {depth of w}; mark(w) {position label, if w is a leaf; else skip} ; List of pointers to sons of w and corresponding edge labels. Potree:

-((create root, thread it to itself)); v.'=root; k.'= 1; for i:= 1 to n do

read(xi); if ((v has xi-son)) then M O V E D O W N else I COMPLETE(v, k); v:=cv; k,=ck; if k < i + 1 then M O V E D O W N procedure M O V E D O W N I v:=xi-son(v); if ((v is a leaf marked by j)) then COMPLETE(v,j) Concerning the program structure and the time costs of POTREE we observe that the for-loop is executed exactly n times, procedure M O V E D O W N at most n times. C O M P L E T E is called as often as new indispensable nodes have to be added to the interim infix tree. Thus, the total time is kept linear in N = number of nodes of T. Qed. Acknowledgements. The authors should like to thank the anonymous referees for careful reading and helpful suggestions.

474

M. Kempf et al.

References 1. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The design and analysis of computer algorithms. AddisonWesley Reading 1974 2. Bayer, Rudolf: Datenstrukturen. Fernuniversit/it Hagen, Fachbereich Mathematik und Informatik 1981 3. Gfintzer, Ulrich: Datenstrukturen und Datenorganisation. Vorlesung, Technische Universit~it Mfinchen 1984 4. Kempf, Michael: Effiziente Links-Rechts-Konstruktion yon Positionsb/iumen. Diplomarbeit, Technische Universit~it M~inchen, Institut fiir Informatik 1984 5. Majster, Mila E.; Reiser, Angelika: Efficient online construction and correction of position trees. SIAM J. Comput. 9, 785-807 (1980) 6. McCreight, Edward M.: A space-economical suffix tree construction algorithm. J. Assoc. Comput. Mach. 23, 262-272 (1976) 7. Weiner, Peter: Linear pattern matching algorithms. IEEE 14th Annual Symposium on Switching and Automata Theory, pp. 1-11 (1973)

Received February 5, 1986 / February 3, 1987

Time optimal left to right construction of position trees

Recommend Documents