Methodol Comput Appl Probab (2010) 12:775–795 DOI 10.1007/s11009-010-9192-9
A New Method of Approximating the Probability of Matching Common Words in Multiple Random Sequences

George Haiman · Cristian Preda
Received: 21 September 2009 / Revised: 2 August 2010 / Accepted: 3 August 2010 / Published online: 14 August 2010 © Springer Science+Business Media, LLC 2010
Abstract In this paper we consider R independent sequences of length T formed by independent, not necessarily uniformly distributed letters drawn from a finite alphabet. We first develop a new and efficient method of calculating the expectation E(N_R) = E(N_R(m, T)) of the number N_R(m, T) of distinct words of length m which are common to R such sequences. We then consider the case of four uniformly distributed letters. We determine a b_R = b_R(m, T) ≥ 0 such that the interval [E(N_R) − b_R, E(N_R)] contains the probability p_R = P(N_R ≥ 1) that there exists a word of length m common to the R sequences. We show that b_R ≈ 0.07 E(N_R) if R = 3 and b_R ≤ 0.05 E(N_R) if R ≥ 4. Thus, for unusual common words, i.e. such that p_R is small, E(N_R) provides a very accurate approximation of this probability. We then compare numerically the intervals [E(N_R) − b_R, E(N_R)] with former approximations of p_R provided by Karlin and Ost (Ann Probab 16:535–563, 1988) and Naus and Sheng (Bull Math Biol 59(3):483–495, 1997).

Keywords Matching a common word · Genetic sequences · Longest success run · 1-dependent sequences · Poisson approximation

AMS 2000 Subject Classification Primary 60E05; Secondary 60J10
G. Haiman (B) · C. Preda
Laboratoire de Mathématiques Paul Painlevé, UMR 8524 CNRS, Université Lille 1, Bât. M2, Cité Scientifique, 59655 Villeneuve d'Ascq Cedex, France
e-mail: [email protected]
C. Preda
e-mail: [email protected]
1 Introduction

The problem of detecting unusualness of matching common words in multiple sequences is important in studying similarities between nucleotide and protein sequences from several biological sources. Given R independent sequences of T letters drawn from a finite alphabet, let P_R(m, T) denote the probability that there exists a word of length m (a sequence of m consecutive letters) common to the R sequences. In practice, if a common word of length n is observed, it will be considered "unusual" if n ≥ m, where m is such that P_R(m, T) = 0.05 or 0.01. Under very general hypotheses, Karlin and Ost (1988) (see also references therein) have developed asymptotic results for P_R(m, T). In the following we suppose that the R sequences are identically distributed and formed by i.i.d. letters. Let the alphabet have σ letters (typically, σ = 4) and let p_k be the probability of letter k, k ∈ {1,...,σ}. Under these hypotheses, the results of Karlin and Ost (1988) state that for T sufficiently large and m of order R log T / (−log λ),
P_R(m, T) ≈ 1 − e^{−(1−λ) T^R λ^m},   (1.1)

where λ = Σ_{k=1}^{σ} (p_k)^R.
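For orientation, Eq. 1.1 is immediate to evaluate numerically. The following minimal sketch (the function name is ours) reproduces, for instance, the value 0.04036 reported in the "K. and O. app." column of Table 8 below for m = 10, R = 2, T = 240 and four uniform letters:

```python
import math

def karlin_ost_approx(m, T, R, probs):
    """Karlin-Ost approximation (Eq. 1.1) of P_R(m, T).

    probs: letter probabilities p_1, ..., p_sigma.
    """
    lam = sum(p ** R for p in probs)      # lambda = sum_k p_k^R
    return 1.0 - math.exp(-(1.0 - lam) * T ** R * lam ** m)

print(karlin_ost_approx(10, 240, 2, [0.25] * 4))   # about 0.04036
```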
Naus and Sheng (1997) (see also references therein) have developed other approximations and have compared them with Karlin and Ost's (1988) results for a practical range of m, T and R using simulated data. It appears that their approximations are in general very accurate and better than Eq. 1.1, particularly when R ≥ 3. The best Naus–Sheng approximation is based on a two-step heuristic: a first step provides an approximation Ê(N_R(m, T)) of the expectation of the number N_R(m, T) of distinct words of length m common to the R sequences; the second step consists in considering that the distribution of N_R(m, T) is approximately Poisson, thus leading to the approximation
P_R(m, T) ≈ 1 − P(N_R(m, T) = 0) ≈ 1 − e^{−Ê(N_R(m,T))}.   (1.2)
In this paper we first develop an efficient method for the exact calculation of E(N_R(m, T)) for a large range of m, T and R. Assuming that the alphabet has four uniformly distributed letters, we then prove that E(N_R(m, T)) approximates P_R(m, T) very closely provided E(N_R(m, T)) is small (typically less than 0.05) and R ≥ 3. Since 1 − e^{−λ} ∼ λ as λ → 0, this justifies the Naus–Sheng approximation (Eq. 1.2).

Our article is organized as follows. In Section 2 we present our method for the exact computation of E(N_R(m, T)). This method is based on a new and fast algorithm which computes, for each word ω of length m, the probability π_T(ω) that this word appears in a sequence of length T of i.i.d. letters. Then,

E(N_R(m, T)) = Σ_{ω∈W} (π_T(ω))^R,   (1.3)

where W is the set of all distinct words of length m over the alphabet. Related work concerning the computation of π_T(ω) and E(N_R(m, T)) was done by Robin and Daudin (1999), Rahmann and Rivals (2000) and Nuel (2008). However, their results are not of practical use in the present framework (see Sections 2.1 and 2.2). We apply this new algorithm to calculate E(N_R(m, T)) for several distributions of letters and values of R, m and T.

In Section 3 we consider the case of four uniformly distributed letters. We determine a b_R = b_R(m, T) ≥ 0 such that the interval [E(N_R) − b_R, E(N_R)] contains the probability p_R = P(N_R ≥ 1) that there exists a word of length m common to the R sequences. We show that b_R ≈ 0.07 E(N_R) if R = 3 and b_R ≤ 0.05 E(N_R) if R ≥ 4. Thus, for unusual common words, i.e. such that p_R is small, E(N_R) provides a very accurate approximation of this probability. We then compare numerically the intervals [E(N_R) − b_R, E(N_R)] with former approximations of p_R provided by Karlin and Ost (1988) and Naus and Sheng (1997).
2 Computation of the Expected Number of Common Words

Let L_1,...,L_T be a sequence of T letters drawn from a finite alphabet Ω of σ ≥ 2 letters. We suppose that {L_n}_{n=1,...,T} are independent, identically distributed and that P(L_1 = l) > 0 for all l ∈ Ω. We further suppose that T = (K + 1)m, where K and m are fixed integers. Let ω = (l_1,...,l_m), l_i ∈ Ω, i = 1,...,m, be a fixed word of length m. We first want to calculate π_T(ω) = P(ω appears in L_1,...,L_T).

2.1 Computation of π_T(ω)

For k = 1,...,Km + 1 let

ν_k = 1 if (L_k,...,L_{k+m−1}) = ω, and ν_k = 0 otherwise,

and define X_n = X_n(ω), n = 1,...,K, as

X_n = max{ν_k | (n−1)m + 1 ≤ k ≤ nm}.   (2.1)

We first observe that {X_n}_n forms a 1-dependent stationary sequence of Bernoulli r.v.'s and

π_T(ω) = 1 − P(X_1 = 0,...,X_K = 0).   (2.2)

Put

α_n = α_n(ω) := P(X_1 = 0,...,X_n = 0) and β_n = β_n(ω) := P(X_1 = 1,...,X_n = 1), n ≥ 1.   (2.3)
Then (see Haiman 1981) the α_k's are related to the β_k's by the recurrence formulas

α_1 = 1 − β_1,  α_2 = α_1 − β_1 + β_2,
and, for k ≥ 3,

α_k = α_{k−1} − α_{k−2}β_1 + α_{k−3}β_2 + ... + (−1)^k α_1 β_{k−2} + (−1)^{k+1} β_{k−1} + (−1)^{k+2} β_k.   (2.4)

Our method consists in calculating the β_k, k = 1,...,K, and then deducing π_T(ω) = 1 − α_K by the above recurrence formulas.

2.1.1 Computation of the β_k, k = 1,...,K

Put

M_k = (L_k,...,L_{k+m−1}), k = 1,...,Km + 1.   (2.5)

Let

π(s) = π_ω(s) = P(M_1 ≠ ω,...,M_{s−1} ≠ ω, M_s = ω), s = 1,...,m,   (2.6)

and for any n ≥ 2 and (s_1,...,s_n) ∈ {1,...,m}^n let

π(s_1,...,s_n) = P(∩_{i=1}^{n} E_i(s_i)),

where E_i(s_i) = (M_{(i−1)m+1} ≠ ω,...,M_{(i−1)m+s_i−1} ≠ ω, M_{(i−1)m+s_i} = ω). We then have

β_1 = Σ_{s=1}^{m} π(s)   (2.7)

and, for n ≥ 2,

β_n = Σ_{(s_1,...,s_n)∈{1,...,m}^n} π(s_1,...,s_n).   (2.8)

Proposition 2.1 Let

μ(s, l) = μ_ω(s, l) = π(s, l)/π(s), s, l = 1,...,m.   (2.9)

Then, for n ≥ 2,

β_n = Σ_{s=1}^{m} Σ_{l=1}^{m} π(s) μ^{n−1}(s, l),   (2.10)

where μ^k is the k-th power of the matrix μ.
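Proposition 2.1 and the recurrence (2.4) are straightforward to implement. The following minimal sketch (function and variable names are ours) assumes that the vector (π(1),...,π(m)) and the matrix μ for a given ω are available, as computed below:

```python
import numpy as np

def betas(pi, mu, K):
    """beta_1, ..., beta_K:  beta_n = sum_{s,l} pi(s) mu^{n-1}(s,l)  (Eqs. 2.7, 2.10)."""
    out = [pi.sum()]                 # beta_1 = sum_s pi(s)
    v = pi.copy()
    for _ in range(2, K + 1):
        v = v @ mu                   # v holds the row vector pi^T mu^{n-1}
        out.append(v.sum())
    return out

def alphas(beta, K):
    """alpha_1, ..., alpha_K via the recurrence (2.4)."""
    a = [1.0 - beta[0]]                          # alpha_1 = 1 - beta_1
    if K >= 2:
        a.append(a[0] - beta[0] + beta[1])       # alpha_2 = alpha_1 - beta_1 + beta_2
    for k in range(3, K + 1):
        s = a[k - 2]                             # alpha_{k-1}
        for j in range(1, k - 1):                # terms (-1)^j alpha_{k-1-j} beta_j
            s += (-1) ** j * a[k - 2 - j] * beta[j - 1]
        s += (-1) ** (k - 1) * beta[k - 2] + (-1) ** k * beta[k - 1]
        a.append(s)
    return a

# pi_T(omega) is then obtained as:
# pi_T = 1.0 - alphas(betas(pi, mu, K), K)[-1]
```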
Proof Observe that for any n ≥ 2 and (s_1,...,s_n) ∈ {1,...,m}^n,

P(E_n(s_n) | ∩_{i=1}^{n−1} E_i(s_i)) = P(M_{m+1} ≠ ω,...,M_{m+s_n−1} ≠ ω, M_{m+s_n} = ω | M_{s_{n−1}} = ω) = π(s_{n−1}, s_n)/π(s_{n−1}).   (2.11)

Thus, by induction,

π(s_1,...,s_n) = π(s_1) μ(s_1, s_2) ... μ(s_{n−1}, s_n),   (2.12)

which implies Eq. 2.10.
From formula (2.10) we see that, in order to apply the method, we have to calculate for each ω ∈ W the values of π_ω(s) and π_ω(s, l), from which we deduce μ_ω(s, l). Consider first the expression of π(s) in Eq. 2.6. One can see that, depending on the overlapping structure of ω, some of the events (M_t = ω), 1 ≤ t < s, are automatically incompatible with (M_s = ω) (we call them "incompatible ancestors" of (M_s = ω)).

Let for example m = 8, ω = ABBBBBBB. This word, which has autocorrelation (we further say correlation; see Guibas and Odlyzko 1981) v(ω) = (10000000), is non-overlapping. Thus, for any 1 ≤ t < s, (M_t = ω) is incompatible with (M_s = ω) and π_ω(s) = P(M_1 ≠ ω,...,M_{s−1} ≠ ω, M_s = ω) = P(M_1 = ω). Notice that more than 50% of all words of a given length are non-overlapping (Guibas and Odlyzko 1981).

Let now ω = ABBABBAB. This word has correlation v(ω) = (10010010) and, if we take s = m = 8, the only compatible ancestors of (M_8 = ω) are the events (M_5 = ω) and (M_2 = ω). Thus,

π_ω(m) = P(M_1 ≠ ω,...,M_7 ≠ ω, M_8 = ω) = P(M_2 ≠ ω, M_5 ≠ ω, M_8 = ω).   (2.13)

Next, (M_5 ≠ ω, M_8 = ω) = {(L_5, L_6, L_7) ≠ (ABB), M_8 = ω} and (M_2 = ω) ⊂ {(L_5, L_6, L_7) = (ABB)}. Thus,

π_ω(m) = P{(L_5, L_6, L_7) ≠ (ABB), M_8 = ω} = [1 − P(L_1 = A)P(L_2 = B)P(L_3 = B)] P(M_1 = ω).

Observe that 2, 5 and 8 in Eq. 2.13 are the positions of 1's when reading the bits of the correlation vector from right to left. Also, 8 − 5 = 5 − 2 = 3 is the "fundamental period" q(ω) of ω, which, for this type of word, completely determines the values of π_ω(s):

• π_ω(s) = P(M_1 = ω) if 1 ≤ s ≤ 3 = q;
• π_ω(s) = [1 − P{(L_1, L_2, L_3) = (ABB)}] P(M_1 = ω) if 4 ≤ s ≤ 8.

Let finally ω = AABAABAA, which has correlation v(ω) = (10010011). Here

π_ω(m) = P(M_1 ≠ ω,...,M_7 ≠ ω, M_8 = ω) = P(M_1 ≠ ω, M_2 ≠ ω, M_5 ≠ ω, M_8 = ω) = P(M_1 ≠ ω, M_5 ≠ ω, M_8 = ω).
The fundamental period is again 8 − 5 = 5 − 2 = 3. But since 5 − 1 = 4 ≠ 3, the event (M_1 = ω) is not included in the event (M_5 = ω, M_8 = ω). Both words AABAABAA and ABBABBAB are called "periodic" or "overlapping". However, we say that AABAABAA is "tail overlapping" (t.o.) whereas the previous word ABBABBAB is "not t.o.". Put, for any n and u_1,...,u_n ∈ Ω,

P_L(u_1,...,u_n) := P(L_1 = u_1,...,L_n = u_n) = Π_{i=1}^{n} P_L(u_i).

For a fixed word ω = (l_1,...,l_m) of length m, put P(ω) := P_L(l_1,...,l_m). Then, for ω = AABAABAA we have:

• π_ω(s) = P(ω) if 1 ≤ s ≤ 3 = q;
• π_ω(s) = [1 − P_L(AAB)] P(ω) if 4 ≤ s ≤ 7; and
• π_ω(8) = [1 − P_L(AAB) − P_L(AABAABA)] P(ω).

We see that for overlapping words (i.e. such that at least one correlation bit other than the first equals 1) the value of π_ω(s) changes with s only at certain points related to the correlation v(ω), as follows. Let

1 ≤ a_1 < ... < a_{τ(ω)} = m − q   (2.14)

be the consecutive positions of 1's when reading the bits of v(ω) from right to left, where q is the fundamental period (we say period) of ω. Let r = m − [m/q]q and observe that if r > 0 then r ∈ {a_1,...,a_τ}. Denote by b_i, i = 1,...,σ(ω), those of the a_i's, kept in the same order, such that 1 ≤ a_i < q and, if r > 0, a_i ≠ r. If the set {b_1,...,b_{σ(ω)}} is not empty we say that ω is tail overlapping. In the previous examples, ω = ABBABBAB is not t.o., with τ = 2, a_1 = 2, a_2 = 5, whereas ω = AABAABAA is t.o., with τ = 3, a_1 = 1, a_2 = 2, a_3 = 5, σ = 1 and b_1 = a_1 = 1.

With these notations, the following Proposition 2.2 generalizes the method of calculating π(s) described in the previous examples.

Proposition 2.2 Let ω = (l_1, l_2,...,l_m).

(i) If ω is not overlapping then

π(s) = P(ω), s = 1,...,m.   (2.15)
(ii) If ω is periodic with period q and not t.o., then

π(s) = P(ω) if 1 ≤ s ≤ q,
π(s) = [1 − P_L(l_1,...,l_q)] P(ω) if q + 1 ≤ s ≤ m.   (2.16)

(iii) If ω is t.o., then

π(s) = P(ω) if 1 ≤ s ≤ q,
π(s) = [1 − P_L(l_1,...,l_q)] P(ω) if q + 1 ≤ s ≤ m − b_σ,
π(s) = [1 − P_L(l_1,...,l_q) − Σ_{i=0}^{j} P_L(l_1,...,l_{m−b_{σ−i}})] P(ω) if m − b_{σ−j} < s ≤ m − b_{σ−j−1}, j = 0,...,σ − 2,
π(s) = [1 − P_L(l_1,...,l_q) − Σ_{i=0}^{σ−1} P_L(l_1,...,l_{m−b_{σ−i}})] P(ω) if m − b_1 < s ≤ m.   (2.17)
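The case analysis above can be implemented compactly. The sketch below (helper names ours) recovers v(ω), q, r and the b_i's and evaluates π_ω(s), treating cases (i)–(iii) of Proposition 2.2 uniformly, under our reading of the case boundaries in Eq. 2.17:

```python
import numpy as np

def correlation(word):
    """Autocorrelation vector v(word): entry d is 1 iff the word matches itself shifted by d."""
    m = len(word)
    return [1 if word[d:] == word[:m - d] else 0 for d in range(m)]

def pi_vector(word, probs):
    """pi_omega(s), s = 1..m, per Proposition 2.2."""
    m = len(word)
    v = correlation(word)
    a = sorted(m - d for d in range(1, m) if v[d])   # overlap lengths a_1 < ... < a_tau
    q = m - a[-1] if a else m                        # period (q = m if non-overlapping)
    r = m % q
    b = [x for x in a if x < q and x != r]           # b_1 < ... < b_sigma (tail overlaps)
    def prefix_prob(t):                              # P_L(l_1, ..., l_t)
        return float(np.prod([probs[c] for c in word[:t]]))
    p_word = prefix_prob(m)
    pi = np.empty(m)
    for s in range(1, m + 1):
        c = 1.0
        if s > q:
            c -= prefix_prob(q)
        c -= sum(prefix_prob(m - bi) for bi in b if s > m - bi)
        pi[s - 1] = c * p_word
    return pi

# omega = AABAABAA, uniform letters: pi(8) = [1 - P_L(AAB) - P_L(AABAABA)] P(omega)
probs = {ch: 0.25 for ch in "ABCD"}
print(pi_vector("AABAABAA", probs))
```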
Proof The proof is given in Haiman and Preda (2010). However, the key elements are contained essentially in the previous examples.
Remark 2.1 From Proposition 2.2 one can see that when the letters are uniformly distributed, the calculation of π(s) depends on ω only through a_1,...,a_{τ(ω)}, i.e. through the correlation. Thus, it does not depend on the number of letters of Ω (see Guibas and Odlyzko 1981).

In order to calculate μ_ω(s, l) we now similarly want to decompose {1,...,m}² into subsets on which π_ω(s, l) is constant, where the subsets depend on the previous characterization, by the a_i, i = 1,...,τ, of the overlapping structure of ω. Again, when ω is non-overlapping the situation is quite simple. For s, l = 1,...,m,

π(s, l) = P{(M_1 ≠ ω,...,M_{s−1} ≠ ω, M_s = ω) ∩ (M_{m+1} ≠ ω,...,M_{m+l−1} ≠ ω, M_{m+l} = ω)} = P{(M_s = ω) ∩ (M_{m+l} = ω)}.

Thus π(s, l) = μ(s, l) = 0 if 1 ≤ l ≤ s − 1, whereas π(s, l) = π(s)P(ω) = P(ω)² if s ≤ l ≤ m, and then μ(s, l) = π(s, l)/π(s) = P(ω).

When ω is periodic the situation is more complicated. In the above expression of π(s, l), the events (M_1 ≠ ω,...,M_{s−1} ≠ ω) and (M_{m+1} ≠ ω,...,M_{m+l−1} ≠ ω) can be replaced by only those corresponding to compatible ancestors of (M_s = ω) and (M_{m+l} = ω), respectively:

– if the event (M_s = ω) is not compatible with (M_{m+l} = ω), then π(s, l) = μ(s, l) = 0;
– if (M_s = ω) is compatible with (M_{m+l} = ω), then one has to check whether it is compatible with some of the compatible ancestors of (M_{m+l} = ω), and then π(s, l) = π(s) × μ(s, l), where μ(s, l) ≠ 0.
The general formulas for μ(s, l) are given in the following Proposition 2.3, the proof of which is given in Haiman and Preda (2010).

Proposition 2.3 (i) If ω is not overlapping,

μ(s, l) = 0 if 1 ≤ l ≤ s − 1, and μ(s, l) = P(ω) if s ≤ l ≤ m.   (2.18)
(ii) If ω is periodic with period q:

– if 1 ≤ l ≤ q:

• if s − 1 ≥ l, then

μ(s, l) = P(L_1 = l_{a_i+1},...,L_{m−a_i} = l_m) if s = a_i + l for some 1 ≤ i ≤ τ, and μ(s, l) = 0 otherwise,

with a_1,...,a_τ defined in Eq. 2.14;

• if s − 1 < l, μ(s, l) = P(ω).

– if q < l:

• if s − 1 ≥ l, μ(s, l) = 0;

• if 1 ≤ l − q ≤ s − 1 < l, then

μ(s, l) = [1 − P(L_1 = l_{a_i+1},...,L_{q−a_i} = l_q)] P(ω) if s = a_i + l − q for some a_i with 1 ≤ a_i ≤ q, and μ(s, l) = P(ω) otherwise.

In particular, if ω is not t.o., then

μ(s, l) = [1 − P(L_1 = l_{r+1},...,L_{q−r} = l_q)] P(ω) if r ≠ 0 and s = r + l − q, and μ(s, l) = P(ω) otherwise or if r = 0;

• if 1 ≤ s − 1 < l − q:

a) if ω is not t.o., or if ω is t.o. and l ≤ m − b_σ, then μ(s, l) = [1 − P(L_1 = l_1,...,L_q = l_q)] P(ω);

b) if ω is t.o. and l > m − b_σ, put b_0 = 0 and, for j = 0,...,σ − 1, let S(i) = {s_v | s_v = a_v + b_{σ−i} + l − m, v = 1,...,τ}, i = 0,...,j, and S = ∪_{i=0}^{j} S(i).

– if s ∈ S, let I(s) = {i | s ∈ S(i)}. Then for each i' ∈ I(s) we have s = a_{v(i')} + b_{σ−i'} + l − m, and

μ(s, l) = [1 − P(L_1 = l_1,...,L_q = l_q) − Σ_{i'∈I(s)} P(L_1 = l_{a_{v(i')}+1},...,L_{m−a_{v(i')}−b_{σ−i'}} = l_{m−b_{σ−i'}})] P(ω);

– if s ∉ S, μ(s, l) = [1 − P(L_1 = l_1,...,L_q = l_q)] P(ω).
Corollary 2.1 If ω is non-overlapping then

β_n = (P(ω))^n C_{n+m−1}^{n}, n ≥ 1,   (2.19)

where C_n^k, k ≤ n, stands for the binomial coefficient.
Proof By Eq. 2.18 we have μ(s, l) = P(ω) ν(s, l), with

ν(s, l) = 0 if 1 ≤ l ≤ s − 1, and ν(s, l) = 1 if s ≤ l ≤ m,

and by induction it can be seen that

ν^n = [ C_{n−1}^{n−1}  C_n^{n−1}      C_{n+1}^{n−1}  ...  C_{n+m−2}^{n−1}
        0              C_{n−1}^{n−1}  C_n^{n−1}      ...  C_{n+m−3}^{n−1}
        0              0              C_{n−1}^{n−1}  ...  C_{n+m−4}^{n−1}
        ...
        0              ...            ...            0    C_{n−1}^{n−1}  ],  n ≥ 1,

i.e. ν^n(s, l) = C_{n−1+l−s}^{n−1} if l ≥ s, and 0 otherwise. Thus, since π(s) = P(ω), s = 1,...,m, using Pascal's triangle formula we obtain

β_n = Σ_{s=1}^{m} Σ_{l=1}^{m} π(s) μ^{n−1}(s, l) = (P(ω))^n C_{n+m−1}^{n}, n ≥ 1.
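Corollary 2.1 can be sanity-checked numerically against the matrix route of Proposition 2.1 — a small illustrative sketch (uniform letters, a non-overlapping word; parameters are ours):

```python
from math import comb

import numpy as np

m, K = 8, 6
p_word = 0.25 ** m                    # P(omega) for, e.g., omega = ABBBBBBB
nu = np.triu(np.ones((m, m)))         # nu(s,l) = 1 if l >= s, else 0
pi = np.full(m, p_word)               # pi(s) = P(omega) for non-overlapping omega
v = pi.copy()
for n in range(1, K + 1):
    beta_matrix = v.sum()             # beta_n via Eq. 2.10
    beta_closed = p_word ** n * comb(n + m - 1, n)   # Eq. 2.19
    assert abs(beta_matrix - beta_closed) <= 1e-9 * beta_closed
    v = v @ (p_word * nu)             # advance pi^T mu^{n-1} -> pi^T mu^n
print("Eq. 2.19 checked for n = 1, ...,", K)
```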
Computational Algorithm

Let ω be a fixed word of length m. From the above definitions and propositions, our algorithm for computing π_T(ω) is composed of the following steps:

1) Test whether ω is overlapping or not. If ω is overlapping (periodic), determine the characteristics q, r, τ, a_1,...,a_τ, σ and b_1,...,b_σ.
2) Use Proposition 2.2 to calculate β_1, and Propositions 2.1 and 2.3 to calculate β_2,...,β_K.
3) Calculate α_K using the recurrence formula (2.4).

Remark 2.2 When the letters are uniformly distributed, π(·) and the entries of μ depend only on the autocorrelation vector v(ω). Thus, the value of π_T(ω) is the same for all ω's having a given correlation.

An Application Related to the Longest Success Run in Bernoulli Trials

Let Ω = {0, 1}, with 0 = "failure" and 1 = "success", and consider the word of length m, ω = (1,...,1). Let P(L_i = 1) = p, P(L_i = 0) = 1 − p = q, 0 < p < 1. It may easily be seen that

π_ω(s) = p^m if s = 1, and π_ω(s) = q p^m if 1 < s ≤ m,
and

μ_ω = [ p^m      q p^m  q p^m  ...    ...  q p^m
        p^{m−1}  0      q p^m  ...    ...  q p^m
        p^{m−2}  0      0      q p^m  ...  q p^m
        ...
        p^2      0      ...    ...    0    q p^m
        p        0      ...    ...    ...  0      ].   (2.20)
Let V(T) denote the length of the longest success run in T Bernoulli B(1, p) trials. The following exact formula of Bateman (1948) is of practical use only for small T/m (see also Haiman 2007):

P(V(T) ≥ m) = Σ_{j=1}^{[T/m]} (−1)^{j+1} [ p + (T − jm + 1) q/j ] C_{T−jm}^{j−1} p^{jm} q^{j−1}.
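Bateman's formula transcribes directly (a sketch; the function name is ours), and for T = (K + 1)m it can be cross-checked against 1 − α_K computed by the recurrence method, as noted next:

```python
from math import comb

def bateman(T, m, p):
    """P(V(T) >= m) by Bateman's (1948) exact formula; practical for small T/m."""
    q = 1.0 - p
    total = 0.0
    for j in range(1, T // m + 1):
        term = (p + (T - j * m + 1) * q / j) * comb(T - j * m, j - 1)
        total += (-1) ** (j + 1) * term * p ** (j * m) * q ** (j - 1)
    return total

print(bateman(30, 5, 0.5))   # chance of a success run of length >= 5 in 30 trials
```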
Next, if T = (K + 1)m, K ∈ N, we have

P(V(T) ≥ m) = 1 − P{ω does not appear in L_1,...,L_T} = 1 − α_K,

and our Proposition 2.1 provides a new and efficient method to compute the above probability (see the numerical examples at the end of the section).

2.2 Computation of E(N_R(m, T))

The expectation E(N_R(m, T)) of the number of distinct words of length m common to R independent sequences of length T is related to the π_T(ω)'s by the formula

E(N_{R,T}) = Σ_{ω∈W} (π_T(ω))^R = Σ_{ω∈W} (1 − α_K(ω))^R.   (2.21)
Guibas and Odlyzko (1981) have established recurrence formulas providing, for each m, the list of all binary vectors v_j, j = 1,...,ν(m), which are autocorrelations of words of length m (this list does not depend on the size σ of the alphabet). They also provide a recursive method of computing, for each m, the population size |v_j|, j = 1,...,ν(m), of all words having autocorrelation v_j. Thus, for uniformly distributed letters, denoting by α_K(v_j) the common value of α_K for the words having autocorrelation v_j, we have

E(N_{R,T}) = Σ_{j=1}^{ν(m)} |v_j| (1 − α_K(v_j))^R.   (2.22)

For example (see the next section), if σ = 4 and m = 15, there are 4^15 = 1,073,741,824 distinct words, to which corresponds a list of 57 autocorrelation vectors. In order to calculate E(N_{R,T}) by formula (2.22) one only needs to calculate ν(15) = 57 values of α_K, i.e. one per class of ω's having the same autocorrelation. When the letters are not uniformly distributed one has to calculate α_K(ω) for all ω ∈ W. From a practical point of view, this supposes a very efficient method of computing α_K(ω), particularly for large m and K (see the numerical examples and comments below).
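For small m and T, both E(N_R(m, T)) and P_R(m, T) can also be estimated by direct simulation, which provides an independent sanity check on any implementation (a sketch; parameters are illustrative):

```python
import random

def simulate(m, T, R, sigma=4, trials=20000, seed=0):
    """Monte Carlo estimates of E(N_R(m,T)) and P_R(m,T) for uniform letters."""
    rng = random.Random(seed)
    tot_common, hits = 0, 0
    for _ in range(trials):
        word_sets = []
        for _ in range(R):
            seq = [rng.randrange(sigma) for _ in range(T)]
            word_sets.append({tuple(seq[t:t + m]) for t in range(T - m + 1)})
        common = set.intersection(*word_sets)
        tot_common += len(common)
        hits += bool(common)
    return tot_common / trials, hits / trials

print(simulate(m=5, T=60, R=2))   # (estimate of E(N_2), estimate of P_2)
```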
Table 1 Exact π_T(ω) values for ω = ADBCADBCAD and several values of T and distributions of letters

P(A)  P(B)  P(C)  P(D)  | T      | π_T(ω)
0.25  0.25  0.25  0.25  | 100    | 0.0000855
0.25  0.25  0.25  0.25  | 10,000 | 0.0094521
0.25  0.25  0.25  0.25  | 60,000 | 0.0553946
0.23  0.27  0.23  0.27  | 100    | 0.0000828
0.23  0.27  0.23  0.27  | 10,000 | 0.0091486
0.23  0.27  0.23  0.27  | 60,000 | 0.0536950
0.20  0.30  0.20  0.30  | 100    | 0.0000697
0.20  0.30  0.20  0.30  | 10,000 | 0.0077104
0.20  0.30  0.20  0.30  | 60,000 | 0.0454170
0.10  0.40  0.40  0.10  | 100    | 0.0000023
0.10  0.40  0.40  0.10  | 10,000 | 0.0002553
0.10  0.40  0.40  0.10  | 60,000 | 0.0015321
2.3 Numerical Examples

2.3.1 Examples of Computation of π_T(ω)

Example 1 Let ω = ADBCADBCAD (here Ω = {A, B, C, D}). This word has autocorrelation 1000100010, period q = 4, and is not tail overlapping. Using our algorithm we calculate the exact values of π_T(ω) for T = 100, 10,000 and 60,000 (K = 9, 999 and 5,999, respectively) and several distributions of the letters, including the uniform one. The results are presented in Table 1.

However, for large values of m and K this calculation may need quite a long computer time when one wants to calculate π_T(ω) for all ω ∈ W. In this case one can use Haiman (1999, Theorem 1), which provides the following approximation of α_K = 1 − π_T(ω), depending only on β_i, i = 1,...,4:

α_K ≈ (1 − β_2 + 2β_3 − 3β_4 + β_1² + 6β_2² − 6β_1β_2) × (1 + β_1 − β_2 + β_3 − β_4 + 2β_1² + 3β_2² − 5β_1β_2)^{−K},   (2.23)

with a relative error of about 88Kβ_1³. In Table 2 we compare the above exact values of π_T(ω) for ω = ADBCADBCAD and T = 60,000 with their approximations π̂_T(ω) obtained using formula (2.23).

Example 2 Let V(T) denote the length of the longest success run in T Bernoulli trials. Using the matrix μ_ω in Eq. 2.20, we calculate P(V(T) ≥ m) = π_T(ω) for m = 20, m = 40 and several values of p = P(success). The results are presented in Table 3, where the values of T are chosen such that P(V(T) ≥ m) is close to 0.05 and 0.95.
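Eq. 2.23 transcribes directly into code (a sketch; the function name is ours):

```python
def alpha_K_approx(b1, b2, b3, b4, K):
    """Haiman (1999) approximation of alpha_K from beta_1..beta_4 (Eq. 2.23)."""
    num = 1 - b2 + 2 * b3 - 3 * b4 + b1 ** 2 + 6 * b2 ** 2 - 6 * b1 * b2
    base = 1 + b1 - b2 + b3 - b4 + 2 * b1 ** 2 + 3 * b2 ** 2 - 5 * b1 * b2
    return num * base ** (-K)      # relative error about 88 * K * b1**3
```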
Table 2 Exact and approximated values of π_T(ω), ω = ADBCADBCAD and T = 60,000

P(A)  P(B)  P(C)  P(D)  | π_T(ω)    | π̂_T(ω)   | Error
0.25  0.25  0.25  0.25  | 0.0553946 | 0.0553821 | 0.0000785
0.23  0.27  0.23  0.27  | 0.0536950 | 0.0536823 | 0.0000454
0.20  0.30  0.20  0.30  | 0.0454170 | 0.0454282 | 0.0000932
0.10  0.40  0.40  0.10  | 0.0015321 | 0.0015422 | 0.0000237
Table 3 Some exact values for P(V(T) ≥ m)

m  | T           | p   | P(V(T) ≥ m)
20 | 113,620     | 0.5 | 0.050159
20 | 6,620,020   | 0.5 | 0.950156
20 | 660         | 2/3 | 0.050680
20 | 37,420      | 2/3 | 0.950449
40 | 116 × 10^9  | 0.5 | 0.050131
40 | 68 × 10^11  | 0.5 | 0.950953
40 | 1,920,040   | 2/3 | 0.050758
40 | 110,800,040 | 2/3 | 0.950518
2.3.2 Example of Computation of E(N_R(m, T))

Let Ω = {A, B, C, D} and P(A) = P(B) = P(C) = P(D) = 0.25. For m ∈ {10, 15} we first determine, using the Guibas and Odlyzko (1981) recurrence formulas, the correlation vectors v_j and their population sizes |v_j|. As an example, Table 4 presents these elements for m = 15.
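For small m, the correlation vectors and their populations can alternatively be obtained by brute-force enumeration, in place of the Guibas–Odlyzko recurrences (a sketch; feasible for small m only):

```python
from collections import Counter
from itertools import product

def correlation(word):
    """v(word) as a 0/1 tuple: entry d is 1 iff the word overlaps itself shifted by d."""
    m = len(word)
    return tuple(1 if word[d:] == word[:m - d] else 0 for d in range(m))

m, sigma = 6, 4
populations = Counter(correlation(w) for w in product(range(sigma), repeat=m))
for v, size in sorted(populations.items(), reverse=True):
    print("".join(map(str, v)), size)
# The list of realizable vectors v_j does not depend on sigma; the sizes |v_j| do.
```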
Table 4 The 57 autocorrelation vectors v_j and their populations |v_j| for words of length m = 15

v_j             | |v_j|       | v_j             | |v_j|
100000000000000 | 738,478,848 | 100000001000100 | 576
100000000000001 | 247,205,292 | 100000001000101 | 108
100000000000010 | 50,071,296  | 100000001000111 | 36
100000000000011 | 15,710,604  | 100000001001001 | 240
100000000000100 | 12,576,048  | 100000001010101 | 36
100000000000101 | 3,131,592   | 100000001111111 | 12
100000000000111 | 982,692     | 100000010000001 | 15,108
100000000001000 | 2,949,120   | 100000010000011 | 948
100000000001001 | 982,740     | 100000010000101 | 192
100000000001010 | 195,840     | 100000010000111 | 60
100000000001111 | 61,428      | 100000010001001 | 60
100000000010000 | 736,560     | 100000010010011 | 12
100000000010001 | 233,244     | 100000100000100 | 3,024
100000000010010 | 49,104      | 100000100000101 | 744
100000000010011 | 12,264      | 100000100000111 | 240
100000000010101 | 12,228      | 100000100001111 | 12
100000000011111 | 3,840       | 100001000010000 | 720
100000000100000 | 181,248     | 100001000010001 | 228
100000000100001 | 60,672      | 100001000010010 | 48
100000000100010 | 11,520      | 100001000010011 | 12
100000000100011 | 3,840       | 100001000010101 | 12
100000000100100 | 3,024       | 100010001000100 | 192
100000000100101 | 756         | 100010001000101 | 36
100000000101010 | 768         | 100010001000111 | 12
100000000111111 | 240         | 100100100100100 | 48
100000001000000 | 45,312      | 100100100100101 | 12
100000001000001 | 14,928      | 101010101010101 | 12
100000001000010 | 3,072       | 111111111111111 | 4
100000001000011 | 960         |                 |
Table 5 Exact values for E(N_R(m, T))

m  | R | K      | T = (K+1)m | E(N_{R,T})
10 | 2 | 23     | 240        | 0.050438
10 | 3 | 380    | 3,810      | 0.049635
15 | 2 | 489    | 7,350      | 0.050106
15 | 3 | 24,999 | 375,000    | 0.045710
We then compute for each correlation class the probability π_T(ω), where ω is a representative of the class. Using formula (2.21) for R ∈ {2, 3} and m ∈ {10, 15}, we obtain the exact values of E(N_R(m, T)) presented in Table 5. Other numerical examples with uniformly distributed letters are presented within, and at the end of, the next section. Tables 6 and 7 present exact values of E(N_R) for m = 10 and m = 15 respectively, R ∈ {2, 3} and several distributions of letters.

Notice that, for uniformly distributed letters, Rahmann and Rivals (2000) construct an algorithm of "coefficient extraction" which provides good approximations of the α_K's. Using their algorithm they also apply the results of Guibas and Odlyzko (1981) to obtain numerical approximations of E(N_{R,T}) for R = 1 and R = 2. However, this quite complicated and computationally time-consuming method cannot be applied when the letters are not uniformly distributed. Another exact method of computing π_T(ω) can be obtained from Robin and Daudin (1999). However, if the letters are not uniformly distributed and m and T are large (as in our examples), this method is not adapted either to calculating π_T(ω) for all ω ∈ W. Notice that our method can easily be extended to calculate E(N_R) when the sequences have different lengths T_1,...,T_R and are not identically distributed.
3 Approximation of P_R(m, T) by E(N_R(m, T))

In this section we suppose that the letters L_1,...,L_T, T = (K + 1)m, are drawn from an alphabet of four letters and are uniformly distributed. For each ω ∈ W = Ω^m let

Z_ω = 1 if ω appears in all R sequences, and Z_ω = 0 otherwise.
Table 6 Exact values for E(N_R(m, T)), m = 10

P(A)  P(B)  P(C)  P(D)  | R | T     | E(N_R)
0.25  0.25  0.25  0.25  | 2 | 240   | 0.050438
0.25  0.25  0.25  0.25  | 3 | 3,810 | 0.049635
0.23  0.27  0.23  0.27  | 2 | 240   | 0.062572
0.23  0.27  0.23  0.27  | 3 | 3,810 | 0.082672
0.20  0.30  0.20  0.30  | 2 | 240   | 0.085676
0.20  0.30  0.20  0.30  | 3 | 3,810 | 0.122567
0.10  0.40  0.10  0.40  | 2 | 240   | 1.548562
0.10  0.40  0.10  0.40  | 3 | 3,810 | 3.287721
Table 7 Exact values for E(N_R(m, T)), m = 15

P(A)  P(B)  P(C)  P(D)  | R | T       | E(N_R)
0.25  0.25  0.25  0.25  | 2 | 7,350   | 0.050106
0.25  0.25  0.25  0.25  | 3 | 375,000 | 0.045710
0.23  0.27  0.23  0.27  | 2 | 7,350   | 0.056843
0.23  0.27  0.23  0.27  | 3 | 375,000 | 0.071045
0.20  0.30  0.20  0.30  | 2 | 7,350   | 0.067546
0.20  0.30  0.20  0.30  | 3 | 375,000 | 0.182311
0.10  0.40  0.10  0.40  | 2 | 7,350   | 0.0943522
0.10  0.40  0.10  0.40  | 3 | 375,000 | 4.543662
We then have

N_R = N_R(m, T) = Σ_{ω∈W} Z_ω.

Let P_R(m, T) denote the probability that there exists a word of length m common to the R sequences of length T. We then have

P_R(m, T) = P(N_R ≥ 1) = P(Σ_{ω∈W} Z_ω ≥ 1).
Since we are concerned with small E(N_R) and quite large m's, |W| = 4^m is large and one can expect the distribution of N_R to be close to that of a Poisson r.v. However, the Bernoulli r.v.'s Z_ω do not satisfy conditions allowing us to apply Chen–Stein or other Poisson-type approximation results. Notice that

P_R(m, T) ≤ E(N_R) = Σ_{ω∈W} P(Z_ω = 1).

Therefore, in the following we determine a B_R = B_R(m, T) ≥ 0 such that

E(N_R) − B_R ≤ P_R(m, T), R ≥ 2,

with B_R small with respect to E(N_R). Let ω and ω' be two distinct words of length m and let

ζ_T(ω, ω') = {ω and ω' appear in (L_1,...,L_T)} = {∃ t, t' ∈ {1,...,T−m+1} | (L_t,...,L_{t+m−1}) = ω and (L_{t'},...,L_{t'+m−1}) = ω'}.   (3.1)

Let N_R^c = #{distinct couples of distinct words common to the R sequences}. We then have, on one hand,

E(N_R^c) = (1/2) Σ_{(ω,ω')∈W², ω≠ω'} [P(ζ_T(ω, ω'))]^R,

and on the other hand,

E(N_R^c) = P(N_R = 2) + 3P(N_R = 3) + Σ_{k=4}^{∞} C_k^2 P(N_R = k).   (3.2)
Thus,

E(N_R) − E(N_R^c) = P(N_R = 1) + P(N_R = 2) + Σ_{k=4}^{∞} (k − C_k^2) P(N_R = k),

and since k − C_k^2 < 0 for k ≥ 4, we have

E(N_R) − E(N_R^c) ≤ P(N_R ≥ 1).   (3.3)
The values B_R such that E(N_R) − B_R ≤ P_R(m, T) = P(N_R ≥ 1), R = 2, 3, are provided by the following Proposition 3.1. Let

π̄_{T/2}(ω) = P(ω appears in L_1,...,L_{((K+1)/2+1)m})

and

π*_T = max_{ω∈W} π_T(ω) and π̄*_{T/2} = max_{ω∈W} π̄_{T/2}(ω).   (3.4)

Proposition 3.1 If m ≥ 10, we have:

(i) E(N_2^c) ≤ B_2, where

B_2 = [0.338 + (4/3)² (m−1)²/4^m] (π*_T)² 4^m + [−5.54 + 7.556m] (π*_T)² π̄*_{T/2} 4^m + [1/2 + (m+1)/4] (π*_T)² (π̄*_{T/2})² 4^{2m}.   (3.5)

(ii) E(N_3^c) ≤ B_3, where

B_3 = [0.0669 + 5.33(m−1)/4^m] (π*_T)³ 4^m + [5.17 + 10.67(m−1)²/4^m] (π*_T)³ π̄*_{T/2} 4^m + [28.029 + 16(m−1)] (π*_T)³ (π̄*_{T/2})² 4^m + (1 + 3 × 4^m) (π*_T)³ (π̄*_{T/2})³ 4^m.   (3.6)

Remark 3.1 The values of π*_T and π̄*_{T/2} are obtained via the computations of π_T(ω) and π̄_{T/2}(ω) of Section 2. It may easily be seen that in Eq. 3.5 (π*_T)² 4^m ≈ E(N_2) and that, since m ≥ 10, the quantities (π*_T)² π̄*_{T/2} 4^m and (π*_T)² (π̄*_{T/2})² 4^{2m} are negligible with respect to (π*_T)² 4^m. Thus

B_2 ≈ [0.338 + (4/3)² (m−1)²/4^m] E(N_2).

Similarly, since (π*_T)³ 4^m ≈ E(N_3),

B_3 ≈ [0.0669 + 5.33(m−1)/4^m] E(N_3).
Proof The details of the proof are given in Haiman and Preda (2010). The idea is the following.
Let ω and ω' be two distinct words and denote by (ω b ω') the event "ω appears before ω' in L_1,...,L_T". Then

E(N_R^c) = (1/2) Σ_{(ω,ω'), ω≠ω'} (P(ω b ω') + P(ω' b ω))^R.

In order to obtain our majorants of E(N_R^c) we first majorate the terms P(ω b ω'), ω ≠ ω'. For fixed ω ≠ ω', our majorants depend on how ω and ω' overlap each other. We first decompose W² into disjoint sets Λ_k, k = 1,...,m. For k = 1,...,m−1, Λ_k is the set of couples (ω, ω'), ω ≠ ω', such that, when we shift ω' under ω from right to left, the longest word of consecutive characters that are directly over each other and match has length m − k (we call this word the "longest common word"). Λ_m contains the remaining couples (ω, ω'), ω ≠ ω', i.e. those that do not have any sub-word in common. Thus, clearly |Λ_k| ≤ 4^{m+k}. It can be seen (Guibas and Odlyzko 1981) that the proportion of words of length ≥ 7 with basic period ≥ 4 is ≥ 0.9987. Then, for k = 1, 2, 3, the set Λ'_k formed by the couples (ω, ω'), ω ≠ ω', such that their longest common word has length m − k and period > k is included in Λ_k and satisfies |Λ_k − Λ'_k| ≤ 4^{m+k}(1 − 0.9987). With these notations we then obtain the following Lemma 3.1, which is the key of the proof.
Lemma 3.1 (a) If m ≥ 10 and k = 1, 2, 3, one can construct subsets Λ'_k of Λ_k such that |Λ_k − Λ'_k| ≤ 4^{m+k}(1 − 0.9987) and:

– if (ω, ω') ∈ Λ'_k, then

P(ω b ω') ≤ (1 + 1/(3 × 4^5)) (1/4^k) π*_T + π*_T π̄*_{T/2};   (3.7)

– if (ω, ω') ∈ Λ_k − Λ'_k, then

P(ω b ω') ≤ (4/3)(1/4^k) π*_T + π*_T π̄*_{T/2}.   (3.8)

(b) If m ≥ 4 and (ω, ω') ∈ Λ_k, 4 ≤ k ≤ m − 1, then inequality (3.8) is also satisfied. Furthermore, if (ω, ω') ∈ Λ_m,

P(ω b ω') ≤ π*_T π̄*_{T/2}.   (3.9)
Proof See Haiman and Preda (2010).
Sketch of Proof of Proposition 3.1 (i) We have

E(N_2^c) = (1/2) Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω') + P(ω' b ω))².

Thus

E(N_2^c) = S_1 + S_2,   (3.10)

where

S_1 = Σ_{(ω,ω')∈W², ω≠ω'} P²(ω b ω') and S_2 = Σ_{(ω,ω')∈W², ω≠ω'} P(ω b ω') P(ω' b ω).   (3.11)
Next,

S_1 = Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} P²(ω b ω') + Σ_{k=4}^{m−1} Σ_{(ω,ω')∈Λ_k} P²(ω b ω') + Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ_k−Λ'_k} P²(ω b ω') + Σ_{(ω,ω')∈Λ_m} P²(ω b ω')
= T_1 + T_2 + T_3 + T_4.   (3.12)

We then majorate each term T_i, i = 1,...,4. Using inequality (3.7) we have

T_1 ≤ Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} (1 + 1/(3×4^5))² (π*_T)² (1/4^{2k})
  + 2 Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} (1 + 1/(3×4^5)) (1/4^k) (π*_T)² π̄*_{T/2}
  + Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} (π*_T)² (π̄*_{T/2})²
= T_{11} + T_{12} + T_{13}.
Next,

T_{11} ≤ (1 + 1/(3×4^5))² (π*_T)² Σ_{k=1}^{3} |Λ'_k| / 4^{2k},   (3.13)

and since |Λ'_k| ≤ |Λ_k| ≤ 4^{m+k} we get

T_{11} ≤ (1 + 1/(3×4^5))² (1/4 + 1/16 + 1/64) (π*_T)² 4^m = (1 + 1/(3×4^5))² (21/64) (π*_T)² 4^m.   (3.14)

Notice that (1 + 1/(3×4^5))² (21/64) = 0.328, which is close to the term 0.338 in Eq. 3.5. The difference 0.01 comes (see Haiman and Preda 2010) from the majorations of the terms T_2, T_3 and T_4. Next, using similar arguments we obtain

S_2 ≤ (4/3)² ((m−1)²/4^m) (π*_T)² 4^m + (56/9)(m−1) (π*_T)² π̄*_{T/2} 4^m + 4^{2m} (π*_T)² (π̄*_{T/2})²,   (3.15)

whose dominating term (4/3)² ((m−1)²/4^m) (π*_T)² 4^m provides the contribution (4/3)²(m−1)²/4^m in Eq. 3.5.
Sketch of Proof of Proposition 3.1 (ii) We have

E(N_3^c) = (1/2) Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω') + P(ω' b ω))³ = S_1 + 3S_2,

where here

S_1 = Σ_{(ω,ω')∈W², ω≠ω'} P³(ω b ω')

and

S_2 = Σ_{(ω,ω')∈W², ω≠ω'} P²(ω b ω') P(ω' b ω).

Next, decomposing S_1 as in Eq. 3.12, with P²(ω b ω') replaced by P³(ω b ω'), we similarly obtain S_1 = T_1 + ... + T_4, where

T_1 = Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} P³(ω b ω').
Using again inequality (3.7), we similarly majorate T_1 by a sum T_{1,1} + T_{1,2} + T_{1,3} + T_{1,4}, where the dominating term is

T_{1,1} = Σ_{k=1}^{3} Σ_{(ω,ω')∈Λ'_k} (1/4^{3k}) (1 + 1/(3×4^5))³ (π*_T)³.

Next,

T_{1,1} ≤ (1 + 1/(3×4^5))³ (π*_T)³ Σ_{k=1}^{3} (1/4^{3k}) |Λ'_k| ≤ (1 + 1/(3×4^5))³ (1/4² + 1/4⁴ + 1/4⁶) (π*_T)³ 4^m = 0.0667 (π*_T)³ 4^m,

which is almost the term 0.0669(π*_T)³ 4^m in Eq. 3.6. The difference 0.0002(π*_T)³ 4^m is due to the contributions of the terms T_2, T_3 and T_4. As for B_2, the term 5.33((m−1)/4^m)(π*_T)³ 4^m is the contribution of the dominating term when majorating S_2.
3.1 Majoration of E(N_R^c) when R ≥ 4

We have

E(N_R^c) = (1/2) Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω') + P(ω' b ω))^R
= Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω'))^R + (1/2) Σ_{l=1}^{R−1} C_R^l Σ_{(ω,ω')∈W², ω≠ω'} P^l(ω b ω') P^{R−l}(ω' b ω).
If we again want to majorate E(N_R^c) using Lemma 3.1, the calculations become more and more complicated. However, we can see that, similarly to the cases R = 2 and R = 3, under the application conditions (i.e. E(N_R) small) we will have

E(N_R^c) ≈ Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω'))^R.

But here it is sufficient to apply only inequalities (3.8) and (3.9), and then

Σ_{(ω,ω')∈W², ω≠ω'} (P(ω b ω'))^R = Σ_{k=1}^{m} Σ_{(ω,ω')∈Λ_k} (P(ω b ω'))^R
≤ Σ_{k=1}^{m−1} Σ_{(ω,ω')∈Λ_k} [(4/3)(1/4^k) π*_T + π*_T π̄*_{T/2}]^R + Σ_{(ω,ω')∈Λ_m} (π*_T π̄*_{T/2})^R
≈ Σ_{k=1}^{m−1} Σ_{(ω,ω')∈Λ_k} [(4/3)(1/4^k) π*_T]^R ≤ (4/3)^R (1/(4^{R−1} − 1)) 4^m (π*_T)^R.   (3.16)
Thus, if R ≥ 4,

E(N_R^c) ≤ B_R with B_R ≈ (4/3)^R (1/(4^{R−1} − 1)) 4^m (π*_T)^R.   (3.17)

If R = 4 the right-hand side of Eq. 3.17 is approximately 0.05 × 4^m × (π*_T)^4. In the following we give, for m = 10 and m = 15, numerical examples of the interval [E(N_R) − B_R, E(N_R)] containing P(N_R ≥ 1) when E(N_R) ≈ 0.01 or 0.05. As mentioned in Remark 3.1 we have

B_2 ≈ [0.338 + (4/3)² (m−1)²/4^m] (π*_T)² 4^m ≈ 0.338 E(N_2),
B_3 ≈ [0.0669 + 5.33(m−1)/4^m] (π*_T)³ 4^m ≈ 0.0669 E(N_3),

and from Eq. 3.17, if R ≥ 4,

B_R ≤ 0.05 × 4^m × (π*_T)^R ≈ 0.05 E(N_R).
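These simplified bounds translate directly into code (a sketch; function names are ours):

```python
def b_factor(R, m):
    """Approximate ratio B_R / E(N_R) (four uniform letters; Remark 3.1, Eq. 3.17)."""
    if R == 2:
        return 0.338 + (4 / 3) ** 2 * (m - 1) ** 2 / 4 ** m
    if R == 3:
        return 0.0669 + 5.33 * (m - 1) / 4 ** m
    return (4 / 3) ** R / (4 ** (R - 1) - 1)   # about 0.05 at R = 4, smaller beyond

def interval(e_nr, R, m):
    """Interval [E(N_R) - B_R, E(N_R)] that contains P(N_R >= 1)."""
    return e_nr * (1 - b_factor(R, m)), e_nr

print(interval(0.05043, 2, 10))
```

For the m = 10, R = 2, T = 240 row of Table 8 this gives approximately [0.0334, 0.0504]; the full bound (3.5) used in Table 8 gives the slightly larger B_2 = 0.01802.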
Table 8 Bounds for P(N_R ≥ 1)

m  | R | T = (K+1)m | B_R     | E(N_R) − B_R | E(N_R)  | K. and O. app.
10 | 2 | 110        | 0.00329 | 0.00623      | 0.00953 | 0.00861
10 | 2 | 240        | 0.01802 | 0.03241      | 0.05043 | 0.04036
10 | 3 | 2,240      | 0.00076 | 0.00928      | 0.01005 | 0.00953
10 | 3 | 3,810      | 0.00474 | 0.04489      | 0.04963 | 0.04606
15 | 2 | 3,300      | 0.0032  | 0.00675      | 0.01005 | 0.00757
15 | 2 | 7,350      | 0.01760 | 0.03250      | 0.05106 | 0.03703
15 | 3 | 226,515    | 0.00072 | 0.00935      | 0.01007 | 0.00940
15 | 3 | 375,000    | 0.00388 | 0.04182      | 0.04570 | 0.04197
Table 9 Exact values for E(N_R) and comparison with Karlin and Ost's (1988) and Naus and Sheng's (1997) approximations

m  | R  | T = (K+1)m  | E(N_R)  | K. and O. app. | N. and S. app.
10 | 5  | 36,700      | 0.05042 | 0.05337        | 0.04897
10 | 8  | 136,010     | 0.05021 | 0.08057        | 0.04901
10 | 10 | 215,000     | 0.05022 | 0.12865        | 0.04901
15 | 5  | 9,270,000   | 0.05030 | 0.05000        | 0.04896
15 | 8  | 56,400,000  | 0.05470 | 0.06032        | 0.04922
15 | 10 | 104,500,005 | 0.05051 | 0.07859        | 0.04926
3.2 Numerical Examples

For m = 10 and m = 15 we first, using the algorithms of Section 2.2, determine the values of K (T = (K + 1)m) such that E(N_R) = E(N_{R,T}) is close to 0.01 and 0.05. We then calculate π*_T and π̄*_{T/2}, and then B_2 and B_3. The results are presented in Table 8, where we also give Karlin and Ost's (1988) corresponding approximations of P(N_R ≥ 1), quoted as "K. and O. app."

In Table 9 we compare, for R ∈ {5, 8, 10} and m ∈ {10, 15}, the values of E(N_R) with Karlin and Ost's (1988) and Naus and Sheng's (1997) approximations of P(N_R ≥ 1), quoted as "K. and O. app." and "N. and S. app." respectively. As in Table 8, we first determine K such that E(N_R) = E(N_{R,T}) is close to 0.05.

The above results show that the Naus and Sheng (1997) approximation of P(N_R ≥ 1) provides good results also for large values of m and T (their evaluation by simulation was made for m ≤ 5 and T ≤ 222). The results of Table 8 confirm that the Karlin and Ost (1988) asymptotic approximation deteriorates as the number of sequences increases, even for large m and T.

Acknowledgements We wish to thank an anonymous referee for his valuable comments and suggestions.
References

Bateman GI (1948) On the power function of the longest run as a test for randomness in a sequence of alternatives. Biometrika 35:97–112
Guibas LJ, Odlyzko AM (1981) Periods in strings. J Comb Theory Ser A 30:19–42
Haiman G (1981) Valeurs extrêmales de suites stationnaires de variables aléatoires m-dépendantes. Ann Inst Henri Poincaré 17:303–330
Haiman G (1999) First passage time for some stationary processes. Stoch Process Their Appl 80:231–248
Haiman G (2007) Estimating the distribution of one-dimensional discrete scan statistics viewed as extremes of 1-dependent stationary sequences. J Stat Plan Inference 137(3):821–828
Haiman G, Preda C (2010) Matching common words in multiple random sequences. Technical Report, Laboratoire Paul Painlevé, Université Lille 1
Karlin S, Ost F (1988) Maximal length of common words among random sequences. Ann Probab 16:535–563
Naus JI, Sheng K-N (1997) Matching among multiple random sequences. Bull Math Biol 59(3):483–495
Nuel G (2008) Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J Appl Probab 45:226–243
Rahmann S, Rivals E (2000) Exact and efficient computation of the expected number of missing and common words in random texts. In: Sankoff D, Giancarlo R (eds) Proceedings of the 11th symposium on combinatorial pattern matching (CPM 2000). Springer, Berlin, pp 375–387
Robin S, Daudin JJ (1999) Exact distribution of word occurrences in a random sequence of random letters. J Appl Probab 36:179–193