Molecular Biology, Vol. 39, No. 3, 2005, pp. 372–386. Translated from Molekulyarnaya Biologiya, Vol. 39, No. 3, 2005, pp. 420–436. Original Russian Text Copyright © 2005 by Laskin, Kudryashov, Skryabin, Korotkov.
BIOINFORMATICS UDC 577.212.2
Latent Periodicity of Serine/Threonine and Tyrosine Protein Kinases and Other Protein Families
A. A. Laskin1,2, N. A. Kudryashov2, K. G. Skryabin1, and E. V. Korotkov1,2
1 Bioengineering Center, Russian Academy of Sciences, Moscow, 117312 Russia
e-mail: [email protected]
2 Moscow Physical Engineering Institute, Moscow, 115409 Russia
Received October 26, 2004
Abstract—A method of noise decomposition has been developed. This method allows for the identification of a latent periodicity with symbol insertions and deletions that is specific for all or most amino acid sequences belonging to the same protein family or protein domain. The latent periodicity has been identified in catalytic domains of 85% of serine/threonine and tyrosine protein kinases. Similar results have been obtained for 22 other protein families. The possible role of latent periodicity in protein families is discussed. Key words: latent periodicity, alignment, profile, repeat, protein kinase
INTRODUCTION The development of mathematical methods and algorithms for studying the organization of symbol sequences is becoming increasingly important as more and more amino acid and nucleotide sequences are being identified [1–4]. What information can be obtained from symbol sequences with the use of modern mathematical approaches? The answer to this question determines the possibility of extracting biologically important information from genetic texts, understanding gene evolution and evolutionary genomic rearrangements, constructing a dynamic model of cell genetic regulation, and producing artificial proteins with predetermined characteristics. The analysis of the periodicity of a symbol sequence is one of the methods of studying its organization. The analysis of the periodicity of a symbol sequence has an obvious biological meaning, because multiple tandem duplications of DNA fragments followed by base substitutions, as well as insertions and deletions of symbols, may underlie gene and genome evolution [5–7]. If a periodicity in enzyme active centers were to be detected, this would indicate that the genes encoding these proteins resulted from simple repetition of comparatively short DNA fragments in the evolutionary past. We also may presume that this periodic structure of amino acid sequences of the active sites of proteins plays a role in the stabilization of the conformation of protein globules. Dynamic programming [8–14] or Fourier-transform-based algorithms [15–24] are usually used to detect periodicity in amino acid sequences. Earlier,
we developed a new mathematical method based on information decomposition (ID) to detect periodicity in symbol sequences [25–29]. The main idea of this approach is that the information content of any symbol sequence can be decomposed into mutually nonoverlapping components. Each of these components contains mutual information between the symbol sequence studied and an artificial periodic sequence with a definite period length. The dependence of mutual information on period length can be represented as a plot that is the information analog of a self-correlation function but also has specific properties [29]. The ID makes it possible to avoid some limitations inherent in dynamic programming and Fourier transform [29]; moreover, it can reveal latent periodicity, i.e., periodicity that other mathematical methods developed to date cannot detect. However, similar to the Fourier transform, the ID method in its present form cannot identify statistically significant latent periodicity in the presence of numerous insertions and deletions of symbols. Therefore, a considerable portion of the latent periodicity that could be found in amino acid sequences remains undetected by both ID and all other currently used algorithms and approaches. In this case, a combination of ID and modified profile analysis may be the simplest method for revealing latent periodicity with symbol insertions and deletions. In this combination, the ID method may serve for identifying the latent periodicity and constructing a latent-periodicity matrix [29], which may be used afterwards for determining the weights of each amino acid residue for
0026-8933/05/3903-0372 © 2005 Pleiades Publishing, Inc.
LATENT PERIODICITY OF PROTEIN KINASES
each location of the period. After this, modified profile analysis can reveal latent periodicity in all amino acid sequences from the Swiss-Prot databank determined by the specific matrix of weights, but now in the presence of amino acid insertions and deletions. These results, in turn, can be used for reconstructing the original weight matrix in order to improve the specificity and selectivity of the search for latent periodicity. The first purpose of this study was to develop the noise decomposition (ND) method. For many protein families, periodicity may be disturbed by symbol insertions and deletions, so that only cyclic alignment allows the latent periodicity inherent in these families to be revealed. The ND method also makes it possible to discriminate between different types of latent periodicity that have the same period length. Here, we demonstrate that our algorithm can discriminate between two similar types of latent periodicity detected in serine/threonine and tyrosine protein kinases. The second purpose of this study was to find a latent periodicity specific for serine/threonine and tyrosine protein kinases. First, we detected latent periodicity by the ID method in only seven protein kinases. The use of ND, iterative profile analysis, and cyclic alignment allowed us to modify the original periodicity matrix and detect the latent periodicity determined by this matrix in as many as 1215 protein kinases from the Swiss-Prot databank. These results indicate that the detected latent periodicity is specific for serine/threonine and tyrosine protein kinases and makes it possible to identify serine/threonine and tyrosine protein kinases in the Swiss-Prot databank with a probability higher than 80%, with false identification being almost completely precluded. The third purpose of our study was to demonstrate that latent periodicity of various lengths and types can also be detected in many other protein families. 
To demonstrate this, we applied the ID and ND methods to several protein families from the Swiss-Prot databank. Here, we discuss these results and hypothesize that latent periodicity reflects the evolutionary origin of proteins via numerous tandem duplications.
METHODS AND ALGORITHMS
First, let us define the form of symbol-sequence periodicity that may be called latent, or hidden. In the general case, a periodicity may be called latent if it cannot be identified at a statistically significant level by determining homology between repeats. The homology between repeats is often specified by calculating self-correlation functions [17] or by using weight matrices for amino acid pairs, such as the PAM and BLOSUM matrices [11–14]. In these matrices, weights are always higher for two identical amino acids than for two different ones.

Consider a set of sequences $S^1, S^2, \dots, S^N$ with the same length L. We will estimate the similarity between these sequences. For this purpose, consider the alignment of these sequences without insertions or deletions:

$$
\begin{array}{cccc}
S^1_1 & S^1_2 & \dots & S^1_L \\
S^2_1 & S^2_2 & \dots & S^2_L \\
\vdots & \vdots & & \vdots \\
S^N_1 & S^N_2 & \dots & S^N_L
\end{array}
$$

The weight of this alignment can be represented as a sum of position weights:

$$ W = \sum_{i=1}^{L} W_i. \tag{1} $$

Here, $W_i \equiv W_i(S^1_i, \dots, S^N_i)$ is a function estimating the degree of similarity between the amino acids $S^1_i, \dots, S^N_i$. In dynamic programming, this value is usually calculated on the basis of the weights of amino acid pairs:

$$ W_i = \sum_{\alpha} \sum_{\beta > \alpha} P(S^{\alpha}_i, S^{\beta}_i), \tag{2} $$

where P is a matrix of the weights of amino acid coincidence, for example, a PAM or a BLOSUM matrix. Then, Eq. (2) can be written in the form of the following sum:

$$ W_i = \frac{1}{2} \sum_{l,k} m(i,l)\,\bigl(m(i,k) - \delta_{lk}\bigr)\,P(l,k), \tag{3} $$

where l and k are types of amino acid, $\delta_{lk}$ is the Kronecker delta, and m(i, l) is the number of times the amino acid of type l occurs at position i of the period. Earlier, we suggested another measure of similarity based on information theory, which we termed information content [25–29]:

$$ W'_i = \sum_{l=1}^{20} m(i,l) \ln \frac{K\,m(i,l)}{x(i)\,y(l)}, \tag{4} $$

where $K = NL$, $x(i) = \sum_{l=1}^{20} m(i,l)$, and $y(l) = \sum_{i=1}^{L} m(i,l)$. These measures are different; therefore, the alignment may have a large weight if the informational similarity measure (Eqs. (1) and (4)) is used and a small weight if the measure based on sequence homology (Eqs. (1) and (3)) is used, and vice versa. However, large or small weights determined from Eqs. (1), (3) and (1), (4) are themselves
LASKIN et al.
Table 1. The matrix used for simulating an artificial sequence with a latent periodicity of seven amino acids

Position of the period | Set of amino acids found at the given position of the period
1 | AWDVPGLM
2 | RKLPWV
3 | DWYEVH
4 | KYRGSLFH
5 | SDTMHKIC
6 | DQVHPRMK
7 | RHFQINKD

Note: At each position of the period, the frequencies of amino acids used are equal to one another.
meaningless, especially if measures that have different distribution functions are compared. The two measures can only be compared in terms of probabilities or values unequivocally related to them. We used the Monte Carlo method to calculate the probability that the latent periodicity of the amino acid sequence was determined by stochastic factors. For this purpose, we randomly shuffled the amino acid sequence analyzed and determined the weights from Eqs. (1), (3) and (1), (4). We performed this procedure 500 times to obtain the arithmetic means and variances of W and W'. Afterwards, we introduced the value Z calculated as

$$ Z = \frac{W - E(W)}{\sqrt{D(W)}}, \tag{5} $$

where E(W) and D(W) are the mean and variance of W over the shuffled sequences.
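The calculation of W' (Eqs. (1) and (4)) and of Z (Eq. (5)) can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes that a single sequence is folded at a trial period (so that each period plays the role of one aligned row), uses 0-based period positions, and takes the standard deviation of the shuffled weights as the denominator of Z. The function names `information_content` and `z_score` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

def information_content(seq, period):
    """W' of Eqs. (1) and (4) for one sequence folded at the given period."""
    K = len(seq)
    m = Counter((t % period, aa) for t, aa in enumerate(seq))  # m(i, l)
    x = Counter(t % period for t in range(K))                   # x(i): symbols at position i
    y = Counter(seq)                                            # y(l): symbols of type l
    # Only nonzero counts are stored in m, so the logarithm is always defined.
    return sum(c * math.log(K * c / (x[i] * y[aa])) for (i, aa), c in m.items())

def z_score(seq, period, n_shuffles=500, seed=0):
    """Z of Eq. (5); the null distribution comes from random shuffles of seq."""
    rng = random.Random(seed)
    observed = information_content(seq, period)
    symbols = list(seq)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(symbols)  # keeps the amino acid composition unchanged
        null.append(information_content("".join(symbols), period))
    spread = statistics.pstdev(null)
    if spread == 0.0:
        return 0.0  # degenerate case (assumption): e.g. a one-letter sequence
    return (observed - statistics.fmean(null)) / spread
```

For a perfectly periodic test string such as `"ACDEFGH" * 20`, `information_content(..., 7)` equals 140 ln 7, and the corresponding Z value is far above any reasonable significance threshold.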
The higher Z, the lower the probability that the periodicity of the analyzed amino acid sequence is determined by stochastic factors alone. Let us designate this probability α for the Z value calculated by Eqs. (3), (1), and (5) and β for the Z value calculated by Eqs. (4), (1), and (5). Let us assume that we found a latent periodicity if the probability α is comparatively large and is higher than a certain threshold value, while the probability β is comparatively small and is lower than this threshold value. The threshold probability is usually chosen in such a way that it shows whether or not the periodicity found is statistically significant. Therefore, periodicity may sometimes be obvious if the informational measure calculated by Eq. (4) is used but remain undetected if the weight is calculated by Eq. (3) based on the matrix of the weights of amino acid pairs. Earlier, we demonstrated that this latent periodicity was present in many biologically important sequences [25–29].

Let us illustrate the idea of latent periodicity by an example. Let us construct an amino acid sequence of length K that contains a latent periodicity seven amino acid residues in length and does not display any noticeable homology between individual periods. To construct this amino acid sequence, we use Table 1. We generate the amino acid sequence in such a way that, at each position of the period, the amino acids from the corresponding set in Table 1 are found with the same probability. The positions of the amino acid sequence that are equal to t = i + 7 × n are generated with the use of the set of amino acids shown in Table 1 for the ith position of the period. Here, i assumes values 1, 2, …, 7; n = 0, 1, 2, …; and t varies from 1 to K. This method can be used to create a sufficient number of amino acid sequences containing the given latent period. One of these sequences is shown below.
ARDKSDRWKWYSQHDLYRDVFVPDRSDHPKEGTQQGWVSMHIDPYLM
DFGRHFSHQVKERSDNWKEHHPKDKVFMRHWKHRSDHLWDSHMDDP
WHKKKLWHHMVRALWGIRFMLHFHHFLRHLDKHDVDLCQDRLWRSHD

Obviously, this sequence lacks a homogeneous tandem periodicity, because there is no distinct homology between any seven-amino-acid periods. On the other hand, the sequence contains a latent periodicity, because only a specific amino acid set (shown in Table 1) is present at each position of the period. Let us show the characteristics of this sequence in terms of probability. To do this, we estimated the probabilities α and β. First, we calculated the alignment weight (W) by Eqs. (3) and (1) with the use of the BLOSUM50 matrix of the weights of amino acid pairs and by Eqs. (4) and (1) with the use of mutual information [29]. Then, we generated 500 sequences with the same amino acid composition via random shuffling. For each random sequence, we also determined the weight W by Eqs. (3) and (1) and by Eqs. (4) and (1). After that, we calculated the arithmetic mean and variance and determined Z values using Eq. (5). For the sequence shown above, the calculated Z values were 3.2 and 8.8. We used a normal approximation of the distribution of Z to estimate the probabilities α and β at 0.02 and lower than $10^{-12}$, respectively. The obtained α and β values showed that the algorithm of the search for periodicity based on the search for homologies (e.g., with the use of PAM250 matrices of the weights of amino acid pairs) was unable to identify a statistically significant periodicity in the sequence analyzed.
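The construction of such an artificial sequence from Table 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' generator: positions are 0-based (so position t uses set number t mod 7), and the names `PERIOD_SETS` and `generate_latent_periodic` are hypothetical.

```python
import random

# Amino acid sets of Table 1, one per position of the seven-residue period.
PERIOD_SETS = ["AWDVPGLM", "RKLPWV", "DWYEVH", "KYRGSLFH",
               "SDTMHKIC", "DQVHPRMK", "RHFQINKD"]

def generate_latent_periodic(length, sets=PERIOD_SETS, seed=None):
    """Draw each residue uniformly from the set assigned to its period position."""
    rng = random.Random(seed)
    return "".join(rng.choice(sets[t % len(sets)]) for t in range(length))

if __name__ == "__main__":
    # A 141-residue sequence, the same length as the example above.
    print(generate_latent_periodic(141, seed=1))
```

Because the sets at different positions overlap and each set contains six to eight residues, two periods generated in this way usually share few identical residues, which is why homology-based weights see little signal.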
Detection of Periodicity in the Swiss-Prot Databank
Figure 1 shows the principle of the search for latent periodicity with symbol insertions and deletions. Note that the original profile for iterative search was taken from our latent-periodicity database that was created using the ID method when searching for latent periodicity in the Swiss-Prot database [27]. The ID method allowed us to detect $12 \times 10^{3}$ amino acid regions with a periodicity of different lengths and types. About 20% of the regions found contained homologous periodicity, i.e., periodicity for which the α value was lower than $10^{-6}$. This periodicity is easily identifiable by the RADAR [13] and REPRO [30] programs. In this study, we only dealt with several cases of homologous periodicity and only in order to perform comparative analysis. We focused on studying the periodicity for which α > $5 \times 10^{-3}$ and β < $10^{-6}$. We also used the RADAR and REPRO programs in our analysis; however, these programs could not detect the periodicity with the above values of α and β. In this study, we used the amino acid sequences with latent periodicity detected by means of ID to obtain the latent-periodicity matrix that, in turn, served as the original data set for the search for a periodicity similar to the former one but containing deletions and insertions of symbols. For this purpose, we selected the cases of latent periodicity with the same length found in functionally equivalent domains of different proteins. For all these cases, we constructed the latent-periodicity matrix [27] that showed the number of amino acids found in each position of the period. Then, all latent-periodicity matrices were pooled into a single matrix, which was used to calculate a position–weight matrix (profile).

The Noise Decomposition Method
The position–weight matrix is a profile showing the weight of each symbol for each position of the period. The profile usually results from the multiple alignment of amino acid sequences that have a common characteristic; it is mainly used to search for similarities between sequences contained in databanks [31]. The common characteristic may be, e.g., membership of the amino acid sequences in the same protein family (the common biological function), identical or similar secondary or tertiary structures, or the presence of statistically significant alignments between the sequences. However, we may also consider the inverse problem, namely, how the results of the profile analysis of the databank can be used to create a new profile. The new profile must possess the following properties:
(1) sensitivity; i.e., it should permit detecting at least most (ideally, all) amino acid sequences that have a common property;
[Figure 1: flowchart. Box labels, in order: Database of the profiles of latent periodicity detected using ID; Scanning the Swiss-Prot databank with the use of cyclic alignment; Determining true and false alignments; Weighting the sequence by pairwise alignment; Recalculation of the latent-periodicity profile; Stopping iterations if the number of proteins with latent periodicity has not changed; Database of amino acid sequences with latent periodicity.]
Fig. 1. Scheme of the search for amino acid sequences with latent periodicity with the use of iterative profile analysis.
(2) optimality; i.e., the weight of these amino acid sequences should be as large as possible, and the probability of finding a sequence lacking the common property should be as low as possible; and
(3) specificity; i.e., one should find as few amino acid sequences lacking the common property as possible (ideally, such sequences must not be found at all).
The difference between points (2) and (3) is that point (2) deals with a statistical model of databank sequences (usually, in the form of a long sequence with certain probabilities of symbols or groups of symbols), whereas point (3) deals with a real databank. As we will show below, the difference between optimality and specificity is essential. If we create a new profile based on the result of a search in a databank and repeat this process, then we can construct a new profile after each scanning, i.e., iteratively alter the original profile. This procedure only makes sense if the iterative method is asymptotically stable, i.e., if every new profile becomes sufficiently similar to the previous one after a certain number of iterations. This condition is met if the profile is calculated as [32]:

$$ W_{i,j} = C \ln \frac{p_{i,j} + \varepsilon}{f_i + \varepsilon}, \tag{6} $$
where $W_{i,j}$ is the element of the position–weight matrix for symbol i at position j; $p_{i,j} = m(i, j)/y(j)$ is the proportion of symbols of type i at position j of the period; $f_i$ is the proportion of symbols of type i in the scanned sequence, namely, in the databank; and ε is a small number for excluding zero values of the variables. We took ε to be $10^{-5}$. The scale coefficient C in the formula can be chosen arbitrarily (the multiplication of all weights by a certain factor does not change the alignment path or the statistical value of the alignment provided that the insertion and deletion weights are multiplied by the same coefficient). Therefore, we may take a sufficiently large coefficient C in order to accelerate calculations by using weights in the form of integers in the computer program. If C is sufficiently large, this choice will not substantially affect the accuracy. In terms of information theory, the calculation of the new profile on the basis of the results of profile analysis is similar to the isolation of a signal from an information channel where noise is present [33, 34]. It is obvious that certain motifs corresponding to frequent structural or functional units and resembling the profile used will be prevalent among the results of the profile analysis of the databank (compared to the results of scanning random texts). Profile analysis is a very sensitive method; therefore, it is likely that some of these motifs will be found in evolutionarily remote proteins. If we include these proteins in the multiple alignment for deducing the new profile, the amino acid sequences of these proteins may shift the original profile toward frequent structural or functional motifs. Probably, this will considerably change the original profile matrix [32]. If we search for the common profile of a family of proteins or protein domains, this may yield unexpected results known as profile wandering. Information theory has a means to solve this problem.
The main idea is that the sequences that may be regarded as noise should be identified, and these results should be used to separate the sequences regarded as the signal from those regarded as noise. Thus, we classify each amino acid sequence found in the Swiss-Prot databank with the use of profile analysis as either signal (the class of true alignments) or noise (the class of false alignments). In turn, noise may be subdivided into two parts. The first part comprises the sequences of “uncorrelated noise,” which have nothing in common with the protein family that we deal with. The uncorrelated noise is described by amino acid probabilities $f_i$. The second part is “correlated noise” comprising undesirable sequences significantly similar to the profile studied that are detected at a statistically significant level along with amino acid sequences from the protein family studied. For the purposes of our study, the main difference between correlated and uncorrelated noises is that the distribution of amino acids among the positions of the original profile for correlated noise is nonrandom, i.e., correlated noise is position-specific. Let us define, for each amino acid sequence k from the set of correlated noise, the periodicity matrix $q^k_{i,j}$ showing the number of amino acids of type i at position j of the period. Summing the two types of noise, we obtain

$$ \pi_{i,j} = c_0 f_i + c_1 q_{i,j}, \tag{7} $$

where

$$ q_{i,j} = \frac{\sum_{k} q^k_{i,j}}{\sum_{i} \sum_{k} q^k_{i,j}}. \tag{8} $$

Taking this into account, Eq. (6) may be rewritten for the elements of the new position-specific matrix as

$$ W_{i,j} = C \ln \frac{r_{i,j}}{\pi_{i,j}}. \tag{9} $$
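Eqs. (7)–(9) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: matrices are plain nested lists indexed as [amino acid type i][period position j]; the small constant ε of Eq. (6) is carried over into Eq. (9) to avoid taking the logarithm of zero (Eq. (9) as printed has no ε); the default C = 100 is arbitrary; and the function names are hypothetical.

```python
import math

def column_normalize(counts):
    """Divide every entry by its position (column) total; empty columns stay zero."""
    n_sym, n_pos = len(counts), len(counts[0])
    totals = [sum(counts[i][j] for i in range(n_sym)) for j in range(n_pos)]
    return [[counts[i][j] / totals[j] if totals[j] else 0.0 for j in range(n_pos)]
            for i in range(n_sym)]

def noise_decomposed_profile(r, f, q_matrices, c0=0.8, c1=0.2, C=100.0, eps=1e-5):
    """Position-weight matrix of Eq. (9).

    r[i][j]    signal frequencies of symbol i at position j
    f[i]       background (uncorrelated-noise) frequencies
    q_matrices per-sequence count matrices q^k of the correlated noise
    """
    n_sym, n_pos = len(r), len(r[0])
    pooled = [[sum(q[i][j] for q in q_matrices) for j in range(n_pos)]
              for i in range(n_sym)]
    q = column_normalize(pooled)                      # Eq. (8)
    profile = []
    for i in range(n_sym):
        row = []
        for j in range(n_pos):
            noise = c0 * f[i] + c1 * q[i][j]          # Eq. (7)
            row.append(C * math.log((r[i][j] + eps) / (noise + eps)))  # Eq. (9)
        profile.append(row)
    return profile
```

A symbol whose signal frequency at a position merely equals the combined noise frequency receives a weight of zero, so residues that are common both in the family and in the correlated noise no longer pull the profile toward the noise motifs.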
In Eq. (7), $c_0$ and $c_1$ are the relative contributions of the two types of noise, i.e., their contributions to the resultant noise. We chose the coefficients $c_0$ and $c_1$ on the basis of the results of the search for the latent periodicity of serine/threonine and tyrosine protein kinases. The coefficients $c_0$ and $c_1$ varied from 0.99 to 0.01 and from 0.01 to 0.99, respectively. We determined the values of $c_1$ and $c_0$ that yielded the lowest number of amino acid sequences in the set of correlated noise. We demonstrated that the volume of correlated noise decreased only insignificantly at $c_0$ > 0.95, whereas both the position–weight matrix and the sensitivity of the method changed significantly at $c_0$ < 0.5. Our calculations showed that $c_0$ and $c_1$ values of 0.8 and 0.2, respectively, were optimal in terms of preserving the sensitivity of finding amino acid sequences from the original protein family and decreasing the volume of the correlated-noise set. We calculated $r_{i,j}$ on the basis of the pairwise global alignment of all sequences in the sample of true alignments. We designated the weight of the alignment of two sequences (k and l) as S(k, l) and used this weight to calculate T(k) by the equation

$$ T(k) = \sum_{l} \max\left(0, \frac{S(k,l)}{\max\bigl(S(k,k), S(l,l)\bigr)}\right). \tag{10} $$

The index l runs over the entire set of sequences in the sample of true alignments. The elements in the sum (10) may assume the following values: the element of the sum for l = k is always 1 (each sequence is similar to itself), the elements of the sum for unrelated sequences are equal to 0, and the elements of the sum for similar sequences assume values between 0 and 1. As a result, T(k) = 1 if there are no sequences similar
to the sequence with index k, T(k) = N if all N sequences in the sample of true alignments are identical, and T(k) is between 1 and N if there are similar sequences. For each amino acid sequence k from the set of true alignments, we calculated the periodicity matrix $M^k$, and these matrices were summed with weights of 1/T(k). Thus, we calculated the weighted periodicity matrix for the entire set of amino acid sequences from the set of true alignments as

$$ m(i,j) = \sum_{k} m^k(i,j)/T(k), \tag{11} $$
where the index k runs over all amino acid sequences from the set of true alignments, $m^k(i,j)$ is an element of the matrix $M^k$, and m(i, j) is an element of the matrix M. Then, we calculated the values of $r_{i,j}$ as

$$ r_{i,j} = \frac{m(i,j)}{\sum_{i} m(i,j)}. \tag{12} $$

The results of these calculations allowed us to take into account the possible multiple presence of any type of sequences in the set of true alignments when iteratively constructing the new profile.

Iteration Profile Analysis and Cyclic Alignment
The iteration procedure was performed as follows. First, we used the periodicity matrix determined by the algorithm described under Detection of Periodicity in the Swiss-Prot Databank to calculate $W_{i,j}$ by Eq. (6). Then, we used the obtained position–weight matrix in the cyclic alignment procedure [35]. We performed the cyclic alignment of all amino acid sequences in the Swiss-Prot databank with the use of the position–weight matrix calculated by Eq. (6) and selected all statistically significant results (Z > 6.0). After that, we divided the results obtained into the classes of true and false alignments and used Eqs. (7)–(12). Two approaches were used for the division of the results into classes: the classification based on key words of the databank and that based on the results of sequence clustering. The Swiss-Prot databank contains some information on the amino acid sequences accumulated in it. To obtain information on the proteins identified by cyclic alignment, we used their descriptions (field DE), key words (field KW), and the table of properties (field FT). On the basis of this information, we divided the results obtained into two classes: true and false alignments. If information on the amino acid sequences found was absent in the databank, we used clustering for the subdivision of these sequences into classes. Protein kinases were clustered as follows. First, we used global alignment to perform pairwise comparisons of the sequences found [36]. After that, the matrix of distances between the sequences was constructed using the formula Distance(A, B) = (AlignmentScore(A, A) + AlignmentScore(B, B)) / 2 – AlignmentScore(A, B). This distance matrix was used to divide the sequences found into two large classes. After that, we estimated to what extent the resultant two classes were related to the two types of protein kinases and found a correlation between the classes and types of about 90%. Therefore, both classes included serine/threonine and tyrosine protein kinases, but the degree of their separation reached 90%. We concluded from the data obtained that information from the Swiss-Prot databank was preferable to cluster analysis for the discrimination between true and false alignments (in other words, the signal and noise).

When the classes were formed, we calculated the values of $m^k(i,j)$ and $q^k_{i,j}$, i.e., the frequencies of amino acid residues at different positions of the period for the corresponding amino acid sequences from the sets of true and false alignments, similarly to $N_i$ corresponding to the number of sequences in each set. We took $f_i$ values to be equal to the amino acid frequencies in the complete Swiss-Prot databank. Then, we used Eqs. (7), (8), (10)–(12), and (9), in that order, to obtain new values of $W_{i,j}$, i.e., a new profile matrix.
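The sequence weighting of Eqs. (10)–(12) and the distance formula used for clustering can be sketched in Python as follows. The sketch assumes that the pairwise global alignment weights are already available as a square matrix S with positive self-alignment weights S[k][k], and that count matrices are nested lists indexed as [amino acid type i][period position j]; the function names are hypothetical, and the alignment itself is not implemented here.

```python
def redundancy(S):
    """T(k) of Eq. (10) from a square matrix S of pairwise alignment weights."""
    n = len(S)
    return [sum(max(0.0, S[k][l] / max(S[k][k], S[l][l])) for l in range(n))
            for k in range(n)]

def weighted_frequencies(matrices, T):
    """r[i][j] of Eq. (12) from per-sequence count matrices weighted by 1/T(k)."""
    n_sym, n_pos = len(matrices[0]), len(matrices[0][0])
    # Eq. (11): down-weight sequences that have many close relatives.
    m = [[sum(M[i][j] / t for M, t in zip(matrices, T)) for j in range(n_pos)]
         for i in range(n_sym)]
    # Eq. (12): normalize over amino acid types at each position.
    totals = [sum(m[i][j] for i in range(n_sym)) for j in range(n_pos)]
    return [[m[i][j] / totals[j] if totals[j] else 0.0 for j in range(n_pos)]
            for i in range(n_sym)]

def distance(S, a, b):
    """Distance(A, B) = (Score(A, A) + Score(B, B)) / 2 - Score(A, B)."""
    return (S[a][a] + S[b][b]) / 2 - S[a][b]
```

With three identical sequences, every T(k) equals 3, so the three copies together contribute as much to the weighted matrix as a single sequence would; this is the intended protection against over-represented subfamilies.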
The new $W_{i,j}$ values were used to repeat the scanning of the Swiss-Prot databank as we did earlier [35] and to obtain a new set of amino acid sequences with Z > 6.0. After that, we repeated the procedure of the choice of true and false alignments described above in this section and recalculated the $W_{i,j}$ values. We repeated this iterative procedure until the set of alignments after an iteration was the same as before this iteration (a 95% or higher coincidence). The results of our experiments demonstrated that five to eight iterations were always enough to reach the 95% or higher level of coincidence. We performed cyclic alignment and estimated the statistical significance of alignment using the Monte Carlo method as we did earlier [35].

RESULTS
First, we searched for previously known types of tandem repeats to test our approach. For this purpose, we used ankyrins and leucine-rich repeats. The original profile for these repeats (with periods of 33 and 24 amino acid residues, respectively) was obtained using the ID method. The result is shown in Table 2 (nos. 1 and 2).

Table 2. The list of protein families where various types of latent periodicity with symbol insertions and deletions have been found

No. | Protein family | Period length, amino acid residues | Number of proteins with latent periodicity found in the family
1 | Proteins with ankyrin repeats | 33 | 203
2 | Proteins with leucine-rich repeats | 24 | 261
3 | Acetolactate synthases | 25 | 28
4 | Granule-bound glycogen synthases | 41 | 11
5 | tRNA synthases | 17 | 71
6 | Acyl and adenyl transferases, EIF | 6 | 136
7 | Potassium channels | 25 | 55
8 | Acyl-transferring protein synthases, active center | 25 | 124
9 | Dethiobiotin synthases | 30 | 15
10 | Various GTP-binding proteins | 23 | 270
11 | Tryptophan-2-monooxygenases | 28 | 6
12 | Various CoA-related proteins, probably, CoA-binding center | 30 | 42
13 | Aspartyl proteases | 41 | 54
14 | MHC I antigens | 12 | 148
15 | Methyl transferases | 31 | 30
16 | ACC oxidases | 32 | 17
17 | Polymerase core | 13 | 103
18 | Interleukin-12, growth factors | 14 | 12
19 | ATP synthases, active center | 14 | 82
20 | Hemagglutinins | 15 | 92
21 | Hedgehog proteins | 16 | 59
22 | Actins | 17 | 256
23 | Lyases | 10 | 428
24 | Pyridoxal phosphate–binding proteins | 16 | 76

Note: The first two classes are shown as an example that ND identifies previously known repeats. Cyclic alignment as a means for detecting amino acid residues is shown on the website http://bioinf.narod.ru/periodicity/new.

Our approach allowed us to detect, in the Swiss-Prot databank, 146 out of 150 amino acid sequences previously known to contain at least three ankyrin repeats (we assumed that a sequence contains a periodicity if there are at least three repeats). In addition to these 146 sequences, we found 57 other amino acid sequences containing at least three previously unknown ankyrin repeats each. Thus, we identified 203 amino acid sequences containing three or more ankyrin repeats each; i.e., we found more ankyrin repeats than were previously known. Only one false alignment (in a protein with the code P50938) was identified in the entire Swiss-Prot databank, and it was related to the distinct periodicity of this amino acid sequence. We also detected 261 out of 270 known sequences containing leucine-rich repeats, with false alignment being absent altogether. As in the previous case, we identified many previously unknown leucine-rich repeats. For example, we found an additional (fourth) repeat, which had not been found by the methods used previously, in protein P09661 (involved in the formation of the U2 small nuclear ribonucleoprotein). We tested this using the PFam and SMART databases, as well as the programs that are used for detecting multiple tandem repeats, such as REP [14], REPRO [30], and RADAR [13]. The 3D structure of this protein is known, and the presence of the fourth leucine-rich repeat is obvious. Another example was found in protein P16473 (a hormonal receptor), in which we identified ten leucine-rich repeats, whereas other algorithms permitted the identification of only six repeats. The simulated 3D structure of this protein confirms that the identified region entirely consisted of leucine-rich repeats. Thus, the use of ID and cyclic alignment makes it possible to detect new copies even for the families of multiple tandem repeats that have already been studied in detail. We also detected previously unknown copies in many other known families of multiple tandem repeats. We do not show these data
[Figure 2: plot of Z (vertical axis) against period length (horizontal axis, 0 to 50) for KPC1_ASPNG.]
Fig. 2. The ID of the amino acid sequence from locus KPC1_ASPNG (residues 954–1056). The complete protein kinase domain of this protein contains amino acid residues 770 to 1030.
789abcdefghi123456789abcdefghi123456789abcdefghi123456789abcdefghi123456789abcde
WWAFGVLIYQMLLQQSPFRGEDEDEIYDAILADEPLYPIHMPRDSVSILQKLLTREPELRLGSGPTDAQEVMSHAFFRNI

fghi123456789abcdefghi1
NWDDIYHKRVPPPFLPQISSPTD
The upper sequence shows the positions of the 18-amino-acid period. The numbers and letters show the positions of the period. The periodicity matrices for serine/threonine and tyrosine protein kinases are available at the website http://bioinf.narod.ru/periodicity. The lower sequence is the amino acid sequence from locus KPC1_ASPNG in the Swiss-Prot databank.
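The position labels in the upper row of Fig. 2 follow a simple rule: positions 1 to 9 are written as digits and positions 10 to 18 as the letters a to i, and, when there are no insertions or deletions, the label advances by one per residue and wraps around after 18. A minimal Python sketch of this labelling is given here; the function name `period_labels` and its `start` argument are illustrative and are not taken from the paper.

```python
PERIOD_SYMBOLS = "123456789abcdefghi"  # positions 1-18 as printed in Fig. 2


def period_labels(n_residues, start=1, symbols=PERIOD_SYMBOLS):
    """Label n_residues consecutive residues with their period positions.

    Assumes an alignment without insertions or deletions; `start` is the
    1-based period position assigned to the first residue.
    """
    period = len(symbols)
    return "".join(symbols[(start - 1 + k) % period] for k in range(n_residues))


# The KPC1_ASPNG region of Fig. 2 spans residues 954-1056 and begins at
# period position 7.
labels = period_labels(1056 - 954 + 1, start=7)
print(labels[:30])  # 789abcdefghi123456789abcdefghi
```

Two residues that lie a multiple of 18 apart receive the same label, which is how residues from different repeats are assigned to one column of the periodicity matrix.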
here, because the study of homologous periodicity was beyond the scope of our study. Originally, we identified seven protein kinases characterized by latent periodicity with a period of 18 amino acids in the absence of deletions or insertions of symbols (Table 3, Fig. 2). All protein kinases except for that with the identifier M3KA_HUMAN (which has a dual specificity) were from the serine/threonine class. The periodicity was located in the catalytic domains of protein kinases. This suggests that the 18-amino-acid periodicity is a characteristic feature of the protein kinase active center. However, two points should be demonstrated before this conclusion can be drawn. First, the periodicity, even in the presence of symbol insertions and deletions, should be characteristic of most proteins from the protein kinase family. Second, we should demonstrate that serine/threonine and tyrosine protein kinases are characterized by different periodicity matrices, even if they are similar in some respects. The division of protein kinases into two classes should improve the sensitivity and selectivity of the search for serine/threonine and tyrosine protein kinases. We used the frequency of the occurrence of each amino acid at each position of the 18-amino-acid period, averaged over all seven cases of latent periodicity found, and calculated the original position–weight matrix using Eq. (6). This matrix was used to scan the Swiss-Prot databank (release 41) using cyclic alignment [35, 37]. The weight for opening an insertion or
Table 3. Amino acid sequences of protein kinases from the Swiss-Prot databank where regions with latent periodicity without insertions or deletions have been found

Swiss-Prot ID | Start of the region | End of the region | Protein description in the Swiss-Prot database
KDBE_SCHPO | 400 | 478 | Putative serine/threonine protein kinase C22E12.14C (EC 2.7.1.-)
KEMK_MOUSE | 85 | 181 | Putative serine/threonine protein kinase EMK (EC 2.7.1.-)
KPC1_ASPNG | 954 | 1056 | Protein kinase, C-like (EC 2.7.1.-)
KPCL_RAT | 526 | 565 | Protein kinase C, ETA type (EC 2.7.1.-) (NPKC-ETA) (PKC-L)
M3KA_HUMAN | 97 | 181 | Mitogen-activated protein kinase 10 (EC 2.7.1.-)
CC22_XENLA | 85 | 148 | Cell-division-controlling protein, homolog 2 (EC 2.7.1.-) (P34 protein kinase)
CC2_CARAU | 88 | 160 | Cell-division-controlling protein, homolog 2 (EC 2.7.1.-) (P34 protein kinase) (cyclin-dependent kinase 1) (CDK1)
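The first step toward the position–weight matrix, counting how often each amino acid occurs at each of the 18 period positions in the ungapped regions, can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, a single region (KPC1_ASPNG from Fig. 2) is used instead of all seven regions of Table 3, and the conversion of frequencies into weights by Eq. (6) is not reproduced here.

```python
from collections import Counter

PERIOD = 18


def position_counts(regions, period=PERIOD):
    """Count residues at each period position.

    `regions` is a list of (sequence, start) pairs, where `start` is the
    1-based period position of the first residue and the region is assumed
    to contain no insertions or deletions.
    """
    counts = [Counter() for _ in range(period)]
    for sequence, start in regions:
        for k, residue in enumerate(sequence):
            counts[(start - 1 + k) % period][residue] += 1
    return counts


def position_frequencies(counts):
    """Convert per-position counts into relative frequencies."""
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({aa: n / total for aa, n in c.items()} if total else {})
    return freqs


# One region only, for illustration: KPC1_ASPNG residues 954-1056 (Fig. 2),
# which begins at period position 7.
kpc1 = ("WWAFGVLIYQMLLQQSPFRGEDEDEIYDAILADEPLYPIHMPRDSVSILQKLLTREPELRLGSGP"
        "TDAQEVMSHAFFRNINWDDIYHKRVPPPFLPQISSPTD")
counts = position_counts([(kpc1, 7)])
freqs = position_frequencies(counts)
print(counts[6].most_common(1))  # most frequent residue at period position 7
```

Passing all seven regions of Table 3 to `position_counts` would pool their counts before the frequencies are taken, which corresponds to the averaging described in the text.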
Table 4. The use of ND for identifying serine/threonine protein kinases

Profile type | Serine/threonine protein kinases | Tyrosine protein kinases | Other proteins
Original profile found using ID | 63 | 41 | 18
Original profile after five iterations without the use of ND | 615 | 109 | 58
Original profile after one ND procedure, c1 = 0.1 | 554 | 29 | 69
Original profile after one ND procedure, c1 = 0.25 | 528 | 2 | 61
Original profile after one ND procedure, c1 = 0.5 | 407 | 0 | 80
Original profile after one ND procedure, c1 = 0.75 | 229 | 0 | 99
Original profile after one ND procedure, c1 = 1 | 101 | 0 | 214
Original profile after five ND procedures, c1 = 0.25 | 866 | 4 | 33

Note: The results of the search using different profile matrices are shown. The choice of different c1 coefficients (c0 = 1 – c1) led to different results of the search for serine/threonine protein kinases. The value c1 = 0.25 was chosen as optimal for protein kinase recognition.
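The discrimination values quoted in the text (5.64 and 264) can be reproduced from Table 4 if the coefficient of discrimination is taken to be the number of serine/threonine protein kinases found divided by the number of tyrosine protein kinases found. That definition is an assumption inferred from the numbers, not one stated explicitly in the paper; the short Python sketch below simply carries out the arithmetic.

```python
import math

# (serine/threonine kinases found, tyrosine kinases found), from Table 4
table4 = {
    "ID profile": (63, 41),
    "five iterations, no ND": (615, 109),
    "one ND, c1 = 0.1": (554, 29),
    "one ND, c1 = 0.25": (528, 2),
    "one ND, c1 = 0.5": (407, 0),
    "one ND, c1 = 0.75": (229, 0),
    "one ND, c1 = 1": (101, 0),
    "five ND, c1 = 0.25": (866, 4),
}


def discrimination(ser_thr, tyr):
    """Assumed definition: ratio of serine/threonine hits to tyrosine hits."""
    return ser_thr / tyr if tyr else math.inf


for name, (ser_thr, tyr) in table4.items():
    print(f"{name:24s} {discrimination(ser_thr, tyr):8.2f}")
```

For c1 of 0.5 and above, no tyrosine protein kinases are found and the ratio is unbounded, but the number of serine/threonine protein kinases detected falls sharply, which is the loss of sensitivity mentioned in the text.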
Table 5. The results of searching the Swiss-Prot databank (release 41) for given amino acid sequences displaying similarity to the latent-periodicity matrix

Matrix type | Serine/threonine protein kinases | Tyrosine protein kinases
Total number of protein kinases in the Swiss-Prot databank (release 41) | 1116 (62 of them have a double kinase activity) | 348
Total number of protein kinases with Z > 6.0 | 903 | 312
False alignments | 37 | 11
Other types of protein kinase found among false alignments | 4 (tyrosine kinases) + 1 (an unidentified kinase) | 4 (serine/threonine kinases) + 6 (tyrosine-like kinases)

deletion and the weight for continuing an insertion or deletion were chosen in such a way as to ensure the best sensitivity and specificity of the iterative search. For this purpose, we tested the weight of opening in the interval from 1.0 to 10.0 at a step of 0.2 and the weight of continuing a deletion or insertion in the interval from 0.03 to 1.0 at a step of 0.1. The optimal sensitivity and specificity were obtained if the weight for continuing an insertion or deletion was 0.7; so we used this value in our study. This scanning yielded about 100 statistically significant amino acid sequences from the sets of serine/threonine and tyrosine protein kinases cyclically similar to the profile used. It was demonstrated earlier that the catalytic domains of serine/threonine and tyrosine protein kinases were highly homologous to one another at the amino acid level and had similar 3D structures [38]. Although this similarity between serine/threonine and tyrosine protein kinases at the amino acid level may hinder the formation of independent profiles, we tried to form two periodic profiles according to two types of protein kinase catalytic domain. We divided the amino acid sequences found into two classes and formed two
new position-specific matrices with the use of Eqs. (7)–(8), (10)–(12), and (9) in such a way that serine/threonine protein kinases were regarded as true alignments and tyrosine protein kinases were regarded as correlated noise (or false alignments) for the first matrix; for the second matrix, the assumptions were inverse. The use of ND yielded two position–weight matrices, which were then optimized using iteration analysis in order to find as many serine/threonine and tyrosine protein kinases as possible while preserving the search specificity to the maximum possible extent. We performed iterations until the profile became stable (usually, five to eight iterations were enough). Table 4 shows the results of ND at different coefficients c0 and c1 for the set of serine/threonine protein kinases. As seen from the table, if the ND method was not used, the number of detected serine/threonine protein kinases increased approximately tenfold after five iterations (from 63 to 615); however, the discrimination between the two types of protein kinases, measured as the ratio of detected serine/threonine to tyrosine protein kinases, reached only 5.64 (615/109). The use of Eq. (9) considerably decreased the number of identified tyrosine protein kinases; however, the high values of the coefficient c1 considerably changed the profile and decreased its sensitivity and specificity. As noted above, the optimal
Table 6. Examples of cyclic alignments for two serine/threonine and two tyrosine protein kinases

Description of the amino acid sequence where the region with latent periodicity was found | Swiss-Prot access code | Z | Coordinates of the start and end of the sequence

Serine/threonine protein kinase CBK1 | P53894 | 10.7 | 449–610
Alignment:
23456789abcdefg--------hi123456789abcdefghi123456----789abcdefghi------12--345----6789a
DVTRFYMA-ECILAIETIHKLGFIHRDIKPDNILIDIRGHIKLSDFGLSTGFHKTH-DSNYYKKLLQQDEATNGISKPGTYNANTTD

bcdefghi123456789abcdefghi123456789abcdefghi123--456789abcdefghi123456789abcdefghi123456
TANKRQTMV---VDSISLTMSNRQQI--QT--WR---KSRRLMAYSTVGTPDYIAPEIFLYQ-GYGQECDWWSLGAIMYECLIGWPPF

Serine/threonine protein kinase, cGMP-dependent protein kinase | Q03042 | 16.4 | 505–655
Alignment:
cdefghi123456789abcdefghi123456789ab-cdefg---hi12345--6789abcdefghi123456789ab--cdefghi1
SERHIMLSSRSPFI----CRLYRTFRDEKYVYMLLEACMGGEIWTMLRDRGSFEDNAA----QFIIGCVL----QAFEYLHARGIIYR

23456789abcdefghi123456789abcdefghi123456789abcdefghi123456789abcdefghi123456
DLKPENLMLDERGYVKIVDFGFAKQIGTSSKTWTFCG-TPEYVAPEIILNKGH-DRAVDYWALGILIHELLNGTPPF

Growth factor NT-3 receptor precursor, TrkC tyrosine kinase | Q16288 | 16.3 | 663–764
Alignment:
56789abcdefghi123456789abcdefghi123456789abcdefghi--1--23456789abcdefg-hi12--3456789abcd
ASGMVYLASQHFVHRDLATRNCLVGANLLVKIGDFGMSRDVYSTDYYRLFNPSGNDF----CI-----WCEVGGHTMLPIRWMPPESI

efghi123456789abcdefghi1
MYRKFTTE-SDVWSFGVILWEIFT

Tyrosine protein kinase, Yes transforming protein | P00527 | 16.4 | 370–457
Alignment:
56789abcd-efghi123456789abcdefghi123456789abcd-efgh-i123456789abcdefghi1-23456789abcdefghi12
ADGMAY-IERMNYIHRDLRAANILVGDNLVCKIADFGLAR-LIEDNEYTARQGAKFPIKWTAPEAALYGRFTIK--SDVWSFGILLTELVTK

Note: The upper sequence in each alignment is an 18-amino-acid profile. The lower sequence in each alignment is the amino acid sequence where periodicity has been found. The corresponding latent-periodicity matrices are available on the website http://bioinf.narod.ru/periodicity.
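The insertions and deletions in a cyclic alignment such as those of Table 6 can be counted directly from the two rows: a run of "-" in the profile row marks residues that have no period position, and a run of "-" in the sequence row marks period positions that have no residue. The Python sketch below counts such runs for the P00527 alignment. Whether the per-region averages reported in the paper count runs or individual gap symbols is not stated, so counting runs is an assumption, and the function name is illustrative.

```python
import re


def gap_runs(row):
    """Return the lengths of the runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


# Alignment of P00527 (residues 370-457) from Table 6, with spaces removed.
profile = ("56789abcd-efghi123456789abcdefghi123456789abcd-efgh-i"
           "123456789abcdefghi1-23456789abcdefghi12")
sequence = ("ADGMAY-IERMNYIHRDLRAANILVGDNLVCKIADFGLAR-LIEDNEYTARQG"
            "AKFPIKWTAPEAALYGRFTIK--SDVWSFGILLTELVTK")

assert len(profile) == len(sequence)  # both rows cover the same columns

insertions = gap_runs(profile)   # residues without a period position
deletions = gap_runs(sequence)   # period positions without a residue
print(len(insertions), len(deletions))  # 4 3
```

Removing the gap symbols from the sequence row leaves 88 residues, in agreement with the coordinates 370–457 given in the table.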
c0 was about 0.75 (c1 = 0.25); in this case, the coefficient of discrimination between the two types of kinases after one iteration increased to 264 (528 serine/threonine versus 2 tyrosine protein kinases in Table 4). Table 5 summarizes the final results of the iteration analysis. In both cases, we found latent periodicity in more than 80% of protein kinases in each of the two families. We obtained a specificity of 95% and a level of kinase discrimination of 99%. Table 6 shows the results of alignments. The resultant profiles and alignments are available at the website http://bioinf.narod.ru/periodicity. After the study of protein kinases, we started scanning other protein families for latent periodicity. Note that not all of the periodicities that we detected by ID could be found using ND in the same protein families: we were able to find only 70% of the periodicities identified by ID. In the remaining 30% of cases, latent periodicity was specific for a few proteins from the family or a few protein families with different biological functions. These results indicate that about 30% of the periodicities detected by the ID method were not specific for the protein family where they had been found. Subsequent use of cyclic profile analysis for studying latent periodicity identified by ID allowed us to detect 22 more protein families where the periodicity was present in most (>75%) proteins. Table 2 shows the list of the protein families. The respective profiles for this periodicity are available at the website http://bioinf.narod.ru/periodicity/new. We also calculated the number of detected amino acid sequences that did not belong to the protein family where the periodicity was originally found. For all protein families shown in Table 2, this number was no more than 7% of the total number of proteins in the corresponding family; for about one-third of them, this number was zero.
We assumed that the presence of these sequences may have been caused by incomplete annotation of the Swiss-Prot databank or the existence of additional functions in previously known proteins.

DISCUSSION

The concept of latent periodicity and the approach to the search for it were first presented in [25] and used in subsequent studies [26–29]. However, the functional role of the detected latent periodicity and its correlation with protein structure remain open questions. Here, we report the first data showing that latent periodicity exists in a family of amino acid sequences fulfilling the same biological functions. We obtained this result by developing the ND method and cyclic alignment, which allowed us to detect, at a statistically significant level, a given type of hidden periodicity (identified by the ID method) in the presence of deletions and insertions of symbols. The use of the ND method and iterative analysis did not break the latent periodicity originally detected
by the ID method in protein kinases and other proteins. This is illustrated by Fig. 3, where the ID of several amino acid sequences of protein kinases after the use of ND is shown. The periodicity of 18 amino acid residues is distinctly seen from the plot of ID. Such spectra can be obtained for all other amino acid sequences of protein kinases (as well as all amino acid sequences of other protein families that were aligned with the corresponding profile). Protein kinases, i.e., the proteins whose function is to transfer the phosphoric acid residue from ATP to proteins, play a key role in signal transmission within cells. The protein kinase class is subdivided into many subfamilies within which the homology is 90% or higher. The homology between families is usually lower than 30%. Two main families of protein kinases, the structurally similar serine/threonine and tyrosine protein kinases (named after the amino acid residues that they phosphorylate), are especially numerous; there are also protein kinases with a dual specificity [39]. Other protein kinases phosphorylating other amino acid residues in proteins have also been discovered. The analysis of these protein kinases by means of ID has not shown latent periodicity in them; nor have they displayed the presence of the latent periodicity that we found in serine/threonine and tyrosine protein kinases. We presume that the numerous, large deletions and insertions in these protein kinases precluded the finding of latent periodicity in their structure. It is known [40, 41] that 12 subdomains can be identified within the protein kinase catalytic domain, where the periodic sequences found by us are located. These subdomains, on the one hand, are highly conserved evolutionarily and, on the other hand, approximately correspond to elements of the spatial structure of the catalytic center that are practically identical in all protein kinases whose secondary structure has been studied [42, 43].
[Figure 3: four panels, (a) to (d), each plotting Z (vertical axis) against period length (horizontal axis) after alignment and before alignment.]
Fig. 3. The ID of some amino acid sequences of serine/threonine and tyrosine protein kinases after and before alignment relative to the periodicity matrix. The respective amino acid sequences aligned relative to the periodicity matrix are shown in Table 6. As seen from the figure, ID did not reveal statistically significant latent periodicity in the amino acid sequences before the alignment (the statistical significance of all period lengths was lower than 4.0). After the alignment relative to the cyclic profile, the main 18-amino-acid period is distinct in all four samples. Periods that are multiples of 18 amino acids are less significant statistically than the main 18-amino-acid period. (a) P53894; (b) Q03042; (c) Q16288; (d) P00527.
Subdomains I–IV are responsible for binding ATP and form antiparallel β structures. Subdomains VIa–XI bind the substrate and initiate the transfer of the phosphate ion to the protein being phosphorylated. They are somewhat more variable, which ensures the specificity of the enzyme, and their secondary structure mainly consists of α-helical regions. The subdomains are separated from one another by less conserved regions that are usually identified with loop structures, with more and less conserved regions alternating; the period of this alternation is close to 18 amino acids. The periodic regions that we found by cyclic alignment in different protein kinases are about 100 amino acid residues in length and are located in subdomains VIb, VII, VIII, and IX. They contain elements important for protein kinase function, such as the catalytically active aspartic acid residue (in subdomain VIb) and the activation loop (between subdomains VII and VIII). Many amino acid residues within these domains are crucial for preserving the structure and function of the catalytic center. In addition to the aforementioned aspartic acid residue, this group of functionally important amino acid residues includes those of valine (interacting with the adenine of ATP), lysine (interacting with phosphate ion), asparagine and aspartic acid in subdomain VII (binding the inhibitory and activating magnesium
ions), aspartic acid in subdomain IX (stabilizing the catalytic loop), as well as several residues responsible for the formation of ion bonds with other parts of the catalytic core and the regulation of enzyme activity via their phosphorylation or self-phosphorylation [44]. Note that, for example, the aforementioned aspartic acid residues are located 18 and 36 amino
acid residues apart (the distances are indicated for mammalian cAMP-dependent protein kinase A, which usually serves as a model of the structure of this enzyme class; for other proteins studied, the values are the same or close to these) and, hence, occupy the same position in the period (position 2), although they fulfill different functions. Thus, the period found was about the same size as one subdomain. To test this finding, we compared the result of cyclic alignment with the subdomain structure in the case of protein kinase A (sequence P05132). We found that the boundaries of subdomains lay between positions 14 and 15 of the period. Thus, there is a distinct relationship between the periods and subdomains. Earlier, it was hypothesized that tyrosine protein kinases evolved from serine/threonine kinases through extraction of the nucleotide sequences which code for catalytic domains from their genes by intron insertion followed by the insertion of these transposable elements, slightly altered by mutations, into the genes of other proteins possessing kinase activity (they were termed kinase precursors) [45, 46]. One of the consequences of these events would be an increased variability of the lengths of the catalytic domains of tyrosine protein kinases, because the restrictions on insertions and deletions that, in the genes of functional proteins, are determined by the necessity to preserve their function are absent in the case of transposable elements. If so, our results confirm this suggestion, because their analysis showed that insertions and deletions in tyrosine protein kinases were almost twice as frequent as in serine/threonine ones (on average, 5.96 versus 3.05 per periodic region detected); in other words, we observed considerable deviations from perfect periodicity in them. The limitations of our ND method are of special interest. First, note that we used ND in combination with ID, which served to find periodicity matrices for ND, in this study. 
This means that ND can detect, in the presence of symbol insertions and deletions, latent periodicity of only the same type as has been originally detected by ID in the absence of insertions or deletions [29]. Therefore, ND can detect both latent periodicity and explicit periodicity if it has previously been detected using ID. This indicates that ND used in combination with ID may miss a homologous periodicity characterized by numerous deletions and insertions of amino acid residues, because such a periodicity may be missed by the ID method [29]. From this viewpoint, RADAR [13], REPRO [30], and similar programs may better detect a homologous periodicity in the presence of numerous amino acid deletions and insertions than the combination of ID and ND. However, the search for homologous tandem repeats, even in the presence of amino acid deletions and insertions, has been analyzed in detail; so it was
beyond the scope of our study. Here, we tried to demonstrate that the latent periodicity originally detected by ID in several proteins from the given family was common for all or most proteins from this family. The ID–ND tandem was used precisely to find the fuzzy latent periodicity that could not be detected by any method or algorithm developed previously. In addition, note that ND can be also used for studying any profile. From this viewpoint, it is an entirely independent algorithm. Second, we restricted the sizes of the possible insertions and deletions to the length of the period studied. Calculations in the case of large insertions and deletions require a more powerful computer cluster than that used in this study. This restriction on the lengths of insertions and deletions explains why we did not find latent periodicity in 100% of proteins from different families. Probably, some proteins from a given family contain longer amino acid insertions and deletions than we could take into account in this study, or the periodicity diverged to such an extent that even our new approach did not allow us to detect it. Third, statistical experiments demonstrated that our estimation of the statistical significance of alignment with the use of Eq. (5) was only correct for alignments longer than 40 amino acids [47]. This means that, if this estimation of statistical significance is used, the ND method permits the identification of either tandem repeats or dispersed repeats for which the length of the amino acid sequence occupied by them is more than 40 amino acid residues. Therefore, this algorithm may not always detect short dispersed repeats (shorter than 40 amino acids). However, we developed the ND method exclusively to search for multiple latent tandem repeats (with three or more repeats in an alignment), where the alignment length is always more than 40 amino acid residues, and this restriction is immaterial for the results obtained. 
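To make the idea of aligning a sequence against a cyclic profile more concrete, the following Python sketch computes the best local score of a sequence against a profile whose last position is followed by its first position again. It is a simplified illustration and not the implementation used in this study: a single linear gap penalty replaces the separate weights for opening and continuing an insertion or deletion, the three-position toy profile and all names are hypothetical, and no statistical significance is estimated.

```python
def cyclic_local_score(sequence, weights, gap=3.0):
    """Best local alignment score of `sequence` against a cyclic profile.

    weights[j] maps a residue to its score at profile position j; position
    len(weights) - 1 is followed by position 0. The same linear penalty
    `gap` is charged for every skipped residue or skipped profile position.
    """
    period = len(weights)
    prev = [0.0] * period  # best scores after the previous residue
    best = 0.0
    for residue in sequence:
        cur = []
        for j in range(period):
            match = prev[(j - 1) % period] + weights[j].get(residue, 0.0)
            skip_residue = prev[j] - gap  # residue left unaligned
            cur.append(max(0.0, match, skip_residue))
        # Skipped profile positions can chain around the cycle, so the
        # row is swept twice to propagate them past the wrap-around point.
        for _ in range(2):
            for j in range(period):
                cur[j] = max(cur[j], cur[(j - 1) % period] - gap)
        best = max(best, max(cur))
        prev = cur
    return best


# Toy profile of period 3 that rewards the motif A, B, C.
toy = [
    {"A": 2.0, "B": -1.0, "C": -1.0},
    {"A": -1.0, "B": 2.0, "C": -1.0},
    {"A": -1.0, "B": -1.0, "C": 2.0},
]
print(cyclic_local_score("ABCABCABC", toy))  # 18.0
print(cyclic_local_score("ABCABABC", toy))   # 13.0
```

In the second call one C is missing from the middle repeat; the best alignment skips one profile position at the cost of one gap penalty and keeps the remaining eight matches. Skipping a whole cycle of profile positions only returns to the same position at a cost, so a run of skipped profile positions never usefully exceeds the period length in this sketch; runs of skipped residues are not limited here.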
We attempted to detect periodicity in the amino acid sequences of the protein families shown in Table 2 with the use of the RADAR [13] and REPRO [30] programs. None of these computer programs detected any periodicity in the protein families shown in Table 2. It is still unclear why protein families contain latent periodicity; however, we can suggest several possible explanations for this phenomenon. One of them is that catalytic domains were originally much shorter than now. However, duplications that occurred in the course of evolution may have formed the structure with a higher catalytic activity. It is known that the presence of repeats in the DNA nucleotide sequence increases the probability of replication errors in the given site, thereby facilitating the formation of additional tandem repeats. We believe that the number of repeats increased as long as it provided an advantage, i.e., caused an increase in the catalytic activity of the precursor of currently existing protein kinases. The mutations that occurred afterwards led to the formation of
a more closed domain structure and a further increase in catalytic activity; at the same time, they further blurred the periodicity and eliminated homology between individual repeats. Then, we can regard the observed latent periodicity as traces of the protein formation that occurred at early evolutionary stages [5–7]. Latent periodicity may also be important for stabilizing the spatial conformation of proteins and their normal folding. Protein folding is known to be controlled by chaperone proteins binding to the growing polypeptide chain [48, 49]. This binding is not strictly specific; however, the electric charge and hydrophobicity of amino acids in the binding site may determine some specificity [50, 51]. We believe that the periodic distribution of these characteristics along the sequence promotes the homogeneous distribution of chaperones, and this homogeneity is necessary (or preferable) for more rapid or more correct protein folding. In many cases, we can observe structurally determined latent periodicity, i.e., the periodicity in which different preferences for the formation of elements of secondary structures correspond to different positions in the period. For example, a period may consist of two parts, one of which is prone to the formation of α helices and the other, β sheets [35]. As a result, the periodic motif itself can determine the spatial organization of protein domains or one-domain proteins. It cannot be determined how common latent periodicity is for structural and functional units of proteins until a database of latent-periodicity profiles and their possible structural and functional interpretation is created. Apparently, its creation will require considerable computational capacities and the development of new methods of searching for latent periodicity. In this study, we did not attempt to find latent periodicity in all protein families. This is a rather technical problem requiring large computational capacities.
Our goal was merely to show that the combination of ND and ID makes it possible to detect latent periodicity in most or all members of a protein family where ID originally detects latent periodicity in only a few members. In our opinion, we have managed to demonstrate this; we found latent periodicity in the amino acid sequences where it had not been detected by any method of analysis used previously.

ACKNOWLEDGMENTS

This study was supported by the Ministry of Industry and Education of the Russian Federation within the framework of a governmental contract (project no. 01.106.0002) and MNTTs (project no. 1379).

REFERENCES

1. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Rapp B.A., Wheeler D.L. 2000. GenBank. Nucleic Acids Res. 28, 15–18.
2. Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Lombard V., Lopez R., Parkinson H., Redaschi N., Sterk P., Stoehr P., Tuli M.A. 2001. The EMBL nucleotide sequence database. Nucleic Acids Res. 29, 17–21.
3. Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., et al. 2000. The genome sequence of Drosophila melanogaster. Science. 287, 2185–2195.
4. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. 2001. The sequence of the human genome. Science. 291, 1304–1351.
5. Ohno S. 1970. Evolution by Gene Duplication. Berlin: Springer.
6. Ohno S., Epplen J.T. 1983. The primitive code and repeats of base oligomers as the primordial protein-encoding sequence. Proc. Natl. Acad. Sci. USA. 80, 3391–3395.
7. Ohno S. 1984. Repeats of base oligomers as the primordial coding sequences of the primeval earth and their vestiges in modern genes. J. Mol. Evol. 20, 313–321.
8. Heringa J. 1994. The evolution and recognition of protein sequence repeats. Comput. Chem. 18, 233–243.
9. Heringa J. 1998. Detection of internal repeats: How common are they? Curr. Opin. Struct. Biol. 8, 338–345.
10. Heringa J. 1998. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Curr. Opin. Struct. Biol. 8, 338–345.
11. Benson G. 1999. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580.
12. Benson G. 1997. Sequence alignment with tandem duplication. J. Comput. Biol. 4, 351–367.
13. Heger A., Holm L. 2000. Rapid automatic detection and alignment of repeats in protein sequences. Proteins. 41, 224–237.
14. Andrade M.A., Ponting C.P., Gibson T.J., Bork P. 2000. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. 298, 521–537.
15. Taylor W.R., Heringa J., Baud F., Flores T.P. 2002. A Fourier analysis of symmetry in protein structure. Protein Eng. 15, 79–89.
16. Lobzin V.V., Chechetkin V.R. 2000. Order and correlations in genomic DNA sequences. The spectral approach. Usp. Fiz. Nauk. 170, 57–81.
17. Dodin G., Vandergheynst P., Levoir P., Cordier C., Marcourt L. 2000. Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences. J. Theor. Biol. 206, 323–326.
18. Jackson J.H., George R., Herring P.A. 2000. Vectors of Shannon information from Fourier signals characterizing base periodicity in genes and genomes. Biochem. Biophys. Res. Commun. 268, 289–292.
19. Rackovsky S. 1998. Hidden sequence periodicities and protein architecture. Proc. Natl. Acad. Sci. USA. 95, 8580–8584.
20. Chechetkin V.R., Lobzin V.V. 1998. Nucleosome units and hidden periodicities in DNA sequences. J. Biomol. Struct. Dynamics. 15, 937–947.
21. Coward E., Drablos F. 1998. Detecting periodic patterns in biological sequences. Bioinformatics. 14, 498–507.
22. Voss R.F. 1992. Evolution of long range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 25, 3805–3808.
23. Silverman B.D., Linsker R. 1996. A measure of DNA periodicity. J. Theor. Biol. 118, 295–300.
24. McLachlan A.D. 1993. Multichannel Fourier analysis of patterns in protein sequences. J. Phys. Chem. 97, 3000–3006.
25. Korotkov E.V., Korotkova M.A. 1995. DNA regions with latent periodicity in some human clones. DNA Sequence. 5, 353–358.
26. Korotkov E.V., Korotkova M.A., Tulko J.S. 1997. Latent sequence periodicity of some oncogenes and DNA-binding protein genes. CABIOS. 13, 37–44.
27. Korotkova M.A., Korotkov E.V., Rudenko V.M. 1999. Latent periodicity of protein sequences. J. Mol. Modelling. 5, 103–115.
28. Chaley M.B., Korotkov E.V., Skryabin K.G. 1999. Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples. DNA Res. 6, 153–163.
29. Korotkov E.V., Korotkova M.A., Kudryashov N.A. 2003. Method of information decomposition of symbolical texts. Phys. Lett. A. 312, 198–210.
30. George R.A., Heringa J. 2000. The REPRO server: Finding protein internal sequence repeats through the Web. Trends Biochem. Sci. 25, 515–517.
31. Gribskov M., McLachlan A.D., Eisenberg D.B. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA. 84, 4355–4358.
32. Karlin S., Dembo A., Kawabata T. 1990. Statistical composition of high scoring segments from molecular sequences. Ann. Stat. 18, 571–581.
33. Schmidt J.P. 1998. An information theoretic view of gapped and other alignments. Proc. Pac. Symp. Biocomput. 561–572.
34. Wilbur W.J., Neuwald A.F. 2000. A theory of information with special application to search problems. Comput. Chem. 24, 33–42.
35. Laskin A.A., Chalei M.B., Korotkov E.V., Kudryashov N.A. 2003. Identification of NAD-binding sites in amino acid sequences of different proteins. Mol. Biol. 37, 663–674.
36. Needleman S.B., Wunsch C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
37. Bairoch A., Apweiler R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 25, 45–48.
38. Taylor S.S., Radzio-Andzelm E., Hunter T. 1995. How do protein kinases discriminate between serine/threonine and tyrosine? Structural insights from the insulin receptor protein-tyrosine kinase. FASEB J. 9, 1255–1266.
39. Kentrup H., Becker W., Heukelbach J., Wilmes A., Schurmann A., Huppertz C., Kainulainen H., Joost H.G. 1996. Dyrk, a dual specificity protein kinase with unique structural features whose activity is dependent on tyrosine residues between subdomains VII and VIII. J. Biol. Chem. 271, 3488–3495.
40. Hanks S.K., Quinn A.M., Hunter T. 1988. The protein kinase family: Conserved features and deduced phylogeny of the catalytic domains. Science. 241, 42–52.
41. Hunter T. 1991. Protein kinase classification. Methods Enzymol. 200, 33–37.
42. Taylor S.S., Knighton D.R., Zheng J., Ten Eyck L.F., Sowadski J.M. 1992. Structural framework for the protein kinase family. Annu. Rev. Cell Biol. 8, 429–462.
43. Goldsmith E.J., Cobb M.H. 1994. Protein kinases. Curr. Opinion Struct. Biol. 4, 833–840.
44. Taylor S.S., Radzio-Andzelm E. 1994. Three protein kinase structures define a common motif. Structure. 2, 345–355.
45. Kruse M., Muller I.M., Muller W.E. 1997. Early evolution of metazoan serine/threonine and tyrosine kinases: Identification of selected kinases in marine sponges. Mol. Biol. Evol. 14, 1326–1334.
46. Muller W.E., Kruse M., Blumbach B., Skorokhod A., Muller M.I. 1999. Gene structure and function of tyrosine kinases in the marine sponge Geodia cydonium: Autapomorphic characters in Metazoa. Gene. 238, 179–193.
47. Chaley M.B., Korotkov E.V., Kudryashov N.A. 2003. Latent periodicity of 21 bases typical for MCP II gene is widely present in various bacterial genes. DNA Sequence. 14, 37–52.
48. Ruddon R.W., Bedows E. 1997. Assisted protein folding. J. Biol. Chem. 272, 3125–3128.
49. Thulasiraman V., Yang C.F., Frydman J. 1999. In vivo newly translated polypeptides are sequestered in a protected folding environment. EMBO J. 18, 85–95.
50. Knarr G., Modrow S., Todd A., Gething M.J., Buchner J. 1999. BiP-binding sequences in HIV gp160. Implications for the binding specificity of BiP. J. Biol. Chem. 274, 29850–29857.
51. Takenaka I.M., Leung S.M., McAndrew S.J., Brown J.P., Hightower L.E. 1995. Hsc70-binding peptides selected from a phage display peptide library that resemble organellar targeting sequences. J. Biol. Chem. 270, 19839–19844.