Aligning protein sequence and analysing substitution pattern using a class-specific matrix
295
Aligning protein sequence and analysing substitution pattern using a class-specific matrix HAI SONG XU, WEN KE REN, XIAO HUI LIU and XIAO QIN LI* College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100124, China *Corresponding author (Email,
[email protected]) Aligning protein sequences using a score matrix has became a routine but valuable method in modern biological research. However, alignment in the ‘twilight zone’ remains an open issue. It is feasible and necessary to construct a new score matrix as more protein structures are resolved. Three structural class-specific score matrices (all-alpha, allbeta and alpha/beta) were constructed based on the structure alignment of low identity proteins of the corresponding structural classes. The class-specific score matrices were significantly better than a structure-derived matrix (HSDM) and three other generalized matrices (BLOSUM30, BLOSUM60 and Gonnet250) in alignment performance tests. The optimized gap penalties presented here also promote alignment performance. The results indicate that different protein classes have distinct amino acid substitution patterns, and an amino acid score matrix should be constructed based on different structural classes. The class-specific score matrices could also be used in profile construction to improve homology detection. [Xu H S, Ren W K, Liu X H and Li X Q 2010 Aligning protein sequence and analysing substitution pattern using a class-specific matrix; J. Biosci. 35 295–314] DOI 10.1007/s12038-010-0033-3
1.
Introduction
Previous studies have repeatedly confirmed the rule that proteins with similar sequences have similar structures and/or common ancestors. These findings have led sequence comparison to become an important method for biologists to understand structural or homological information from proteins that have only known sequences. This fact has continually been proven true as more and more proteins are sequenced without their structures being resolved. The technique in comparing protein sequences generally has two major parts: a score matrix used to score amino acid pairs of the alignments, and an algorithm used to generate
alignments based on the score. Many different matrices have been suggested since Schwartz and Dayhoff issued their matrix (Schwartz and Dayhoff 1978). Among these matrices (Schwartz and Dayhoff 1978; Risler et al. 1988; Henikoff and Henikoff 1992; Rice and Eisenberg 1997), the log-odds ratio introduced by Schwartz and Dayhoff in the evaluation of amino acid substitution probability remains the most commonly used. It has been argued that any kind of log-odds score matrix has an implicit target distribution for paired amino acids and that in the construction of an optimal score matrix for identifying or aligning a particular sort of region, the implicit target frequencies of the score matrix must have no statistical difference with the frequencies of that region (Karlin and Altschul 1990). Thus, in order to obtain optimal
Keywords. Amino acid clustering; score matrix; secondary structure context; secondary structure-specific; sequence alignment; substitution pattern Authors’ contributions: XQL and XHL did some preliminary work on sequence data collection and classification. WKR participated in part of the programming work. HSX conducted most of the programming work, the statistical analysis and result analysis. HSX drafted the manuscript. The whole work was conceived by HSX, WKR, XHL and XQL and was supervised by XQL. All the authors read and approved of the final manuscript. Abbreviations used: DSSP, Database of Secondary Structure assignment Program; GEP, gap extension penalty; GOP, gap opening penalty; HMM, hidden Markov model; PD, pattern distance; SCOP, structural classification of proteins Supplementary tables and figures pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/ jbiosci/June2010/pp295-314/suppl.pdf http://www.ias.ac.in/jbiosci
J. Biosci. 35(2), June 2010, 295–314, © Indian Academy Sciences 295 J. Biosci.of35(2), June 2010
296
Hai Song Xu et al.
alignments, it is important to select a proper training dataset while constructing a score matrix. Two aspects must be considered while preparing the training dataset of a score matrix. First of all, all substitutions counted from the training set must represent ‘real’ substitutions that occur in protein structures. That is, an amino acid substitution pair must be located at the corresponding positions of structures. Second, all substitutions generated by the training set must be capable of being represented properly by just one score matrix. They must likewise have the same substitution pattern to some extent. With due regard to the first aspect, the counting of substitutions based on structural alignment results is a solution, especially when the sequences are distantly related. Some score matrices based on structure alignment have performed fairly well (Johnson and Overington 1993; Prlic et al. 2000). For the second aspect, a question may be raised regarding at which level (family, superfamily or class of proteins) or using which kind of definition of structural environment should matrices be constructed. Numerous efforts have been made to understand this aspect. Some studies have tried to construct the matrix from a larger sequence dataset (Gonnet et al. 1992; Jones et al. 1992) to reduce bias from the training set. Other studies, including those by Overington et al. (1990, 1992), have studied the influence of different structural environments (‘amino acid type’, ‘secondary structure’, ‘side-chain accessibility’, and ‘hydrogen bond formation’) on amino acid substitution patterns based on the idea that substitution patterns are influenced by the local structural environment. Studies (Rice and Eisenberg 1997; Shi et al. 2001; Gelly et al. 2005; Huang and Bystroff 2006) that have been conducted after those mentioned above have tried to construct score matrices based on different definitions of the structural environment to improve alignment. Besides the studies mentioned above, some studies have constructed matrices based on different protein families (Fan 2002; Vilim et al. 2004), while assuming that different protein families should have different substitution patterns. Apart from the efforts exerted in the construction of optimized matrices, many studies have also tried to modify the alignment algorithm to improve sequence alignment. However, these modified alignment algorithms have still been based on the dynamic programming algorithm. The difference between these modified algorithms and traditional dynamic programming is that the modified algorithms penalize different sequence regions differently (Shi et al. 2001; Baussand et al. 2007) and/or score different sequence regions with different matrices (Shi et al. 2001; Gelly et al. 2005; Huang and Bystroff 2006). Other approaches using profiles (Gribskov et al. 1987; Tatusov et al. 1994; Tang et al. 2003; Zhou and Zhou 2005) J. Biosci. 35(2), June 2010
or hidden Markov model (HMM)-based profiles (Krogh et al. 1994; Bystroff et al. 2000) could be viewed as combined methods that construct matrices based on environmentspecific sites of family-specific proteins and generate alignment using modified algorithms. Although many efforts have been made to improve sequence alignment, it remains a challenging issue in the ‘twilight zone’ (Doolittle 1981). Compared with traditional score matrices (e.g. PAM and BLOSUM series), methods using environment-specific matrices (Shi et al. 2001; Huang and Bystroff 2006) could improve alignment by scoring amino acid substitutions in diverse environments differently. However, the construction of environment-specific matrices sometimes suffers from the problem of sparse data (Huang and Bystroff 2006). Moreover, substitution patterns can sometimes be different within the same local environment (see Discussion section). The good performances of familyspecific matrices and methods using profiles or the HMM suggest that the construction of matrices at the protein family level can improve alignment. However, not all protein families have enough sequences to be able to construct a family-specific matrix or profile. As has been reviewed by Dunbrack (2006), generalized score matrices (e.g. BLOSUM series) are still needed and useful in pairwise alignment and/or some steps of generating profiles. Following these arguments, we are interested in the discovery of an optimal alignment that could be achieved using a generalized score matrix with a more consistent substitution pattern. To maintain the generalization of the matrix while improving the consistency of the substitution pattern, the construction of a matrix at the protein class level appears to be a good option. A class-specific matrix is more general as compared with a family-specific matrix. Moreover, a classspecific matrix would be more consistent than traditional matrices (e.g. BLOSUM series). Different protein classes (i.e. α, β and α/β) have different secondary structure components. Some different substitution patterns may exist between different protein classes (here we regard α+β protein as a simple additive of α and β proteins). This would make traditional matrices (constructed at the all-protein level) less consistent than a matrix constructed at the protein class level. The use of a class-specific matrix would also maintain the simplicity of the alignment algorithm if the homology relation of sequences aligned is confirmed and the class of one of the sequences is known. Some secondary structure-specific matrices based on all proteins (i.e. proteins of different classes) have been used to construct the profiles of protein families (Lüthy et al. 1991). The transmembrane-specific (i.e. based on a special subset of all-alpha proteins) substitution matrix has also been constructed for homology searching of transmembrane protein (Ng et al. 2000). In our study, we construct three
Aligning protein sequence and analysing substitution pattern using a class-specific matrix class-specific score matrices (ALPHASUM, BETASUM and AFBETASUM) based on the structure alignment of low identity (<25%) all-alpha, all-beta, and alpha/beta proteins, respectively, to test whether an optimal matrix could be built at the class level. The low identity level is likewise set to deal with the alignment in the ‘twilight zone’ (Doolittle 1981), which continues to be a challenging issue. We also use a special subset (including up–down helical bundle proteins only) of the all-alpha training dataset to construct another matrix, ABUNDLESUM, in order to compare the alignment accuracy of ALPHASUM with a lower-level matrix and to determine whether constructing a matrix at an all-alpha level is reasonable. The performance test is conducted between class-specific matrices and the four other published matrices (BLOSUM30, BLOSUM62, Gonnet250 and HSDM) using the training datasets and some low-identity sequences extracted from BAliBASE (Thompson et al. 1999). To determine the kind of amino acid substitution pattern differences that exist between different protein classes, the substitution patterns of different protein datasets are also analysed based on the classification of amino acids using the corresponding matrices. We also construct secondary structure-specific matrices based on different protein classes in order to investigate the consistency of the substitution patterns in the same or different protein classes. 2. 2.1
Methods
Training datasets of class-specific matrices
A training dataset used to construct ALPHASUM (an allalpha protein-specific matrix) was selected from ASTRAL1.65 (Brenner et al. 2000). The ASTRAL database was constructed to augment the SCOP database (Brenner et al. 1996). It contains classified information of protein domains in SCOP and PDB-style files with coordinates for each SCOP domain. In order to deal with the alignment problem in the ‘twilight zone’, an identity cut-off value (25%) was set to filter domain sequences from ASTRAL. By this means, a subset of ASTRAL-1.65 containing 4357 proteins was obtained. According to the definition of protein class in SCOP, this subset was divided into four main parts: allalpha, all-beta, alpha/beta and alpha+beta. The all-alpha part of the subset containing 1022 sequences was used to construct the all-alpha training dataset. The 1022 all-alpha protein subset was grouped into ‘blocks’ similar to that done in previous research (Henikoff and Henikoff 1992) according to the family definition of SCOP (Brenner et al. 1996). To ensure that the substitutions counted from the alignment represented ‘real’ substitutions that occurred in protein structures, we did not group the subset according to the superfamily definition, although members of the superfamily have similar structures and
297
the counted number of substitution pairs would increase by doing this. Moreover, structures in a superfamily cannot always be aligned very well. After the badly aligned structures belonging to families were excluded or adjusted, an all-alpha training dataset containing 580 sequences was obtained and grouped into 175 ‘blocks’. The structures of these sequences were also grouped correspondingly. In this all-alpha training dataset, the biggest ‘block’ contained 22 sequences and the smallest contained just two sequences. The alignments of the grouped sequences were then obtained by aligning the corresponding structures. The structural alignment algorithm chosen was MUSTANG (Konagurthu et al. 2006). MUSTANG refers to a multiple structural alignment algorithm, making it suitable for aligning the ‘blocks’. In addition, MUSTANG performs more reliably on distantly related proteins. After all the 175 ‘blocks’ were aligned, substitution pairs were counted and a structure-derived matrix was calculated from it. The training datasets of BETASUM (an all-beta proteinspecific matrix), AFBETASUM (an alpha/beta proteinspecific matrix), and ABUNDLESUM (an up–down helical bundle protein-specific matrix) were constructed in the same manner as ALPHASUM. The training dataset of BETASUM contained 701 sequences and was grouped into 149 ‘blocks’. The training dataset of AFBETASUM contained 726 sequences in 176 ‘blocks’, while the training dataset of ABUNDLESUM contained 109 sequences in 31 ‘blocks’. Calculation of matrix
2.2
The statistical method we used to generate class-specific matrices was similar to that in earlier research (Henikoff and Henikoff 1992) except that: (1) the training dataset was aligned according to structure alignment; (2) the clustering step was omitted; (3) the probability of occurrence of amino acid was directly calculated from the training dataset; and (4) the entries of the matrix were in the unit of 0.1 bit. Substitution pairs were counted from each ‘block’, and the observed probability of occurrence for each amino acid i, j pair was: 20
i
qij = fij / ∑ ∑ fij
(1)
i =1 j =1
where fij is the total number of the amino acid pairs i, j counted from the ‘blocks’. Assuming a random protein sequence model, the expected probability of occurrence eij for each amino acid pair i, j is calculated by: ⎧ ⎪pi2 , i=j eij = ⎪ ⎨ ⎪ 2p p , i ≠ j, ⎪ ⎪ ⎩ i j
(2)
where pi, pj are the occurrence probability of the ith and jth amino acid in the training dataset, respectively, and could J. Biosci. 35(2), June 2010
Hai Song Xu et al.
298
be calculated from ni (the number of times the amino acid i occurs in the training dataset) directly by: 20
pi = ni / ∑ n j .
(3)
j =1
After acquiring the observed probability of occurrence qij for each i, j pair and using the expected probability of occurrence eij for each pair i, j as a control, a log-odds ratio matrix is then calculated in 0.1 bit units, with each entry sij equal to: sij = 10 × log2 (qij / eij ).
(4)
The relative entropy H of the matrix is calculated in 0.1 bit units by: 20
i
H = ∑ ∑ qijsij .
(5)
i =1 j =1
2.3
Performance test
Matrices or profiles constructed based on different protein families always perform better than traditional matrices (e.g. PAM series and BLOSUM series). In this paper, we constructed matrices based on special training datasets, the protein class level of which was between family level and all-protein level. Thus, the question that emerged was whether these matrices could perform better than those that were ‘lower’ and ‘higher’. In order to answer the first aspect of this question, we compared ALPHASUM with ABUNDLESUM (a matrix based on the up–down helical bundle subset of the all-alpha training dataset) with the up– down helical bundle subset. To answer the second aspect, the class-specific matrices were tested from two aspects. First, we tested it to determine whether the training datasets were better represented by ALPHASUM, BETASUM and AFBETASUM than four other published matrices (BLOSUM30, BLOSUM62, Gonnet250 and HSDM). Second, we chose some aligned results from BAliBASE (Thompson et al. 1999) as test sets and compared the alignment accuracy of class-specific matrices with the four other matrices on the test sets. The two BLOSUM matrices (Henikoff and Henikoff 1992) were chosen because they are popularly used. BLOSUM30 was chosen because it clusters at 30% identity, which is close to the 25% of the three class-specific matrices. BLOSUM62 was chosen because its performance is the best among the BLOSUM matrices. Gonnet250 (Gonnet et al. 1992) was chosen because its performance is the best in a comprehensive matrix performance test conducted in an earlier study (Vogt et al. 1995). HSDM (Prlic et al. 2000) was chosen because it is also a structure-derived matrix from different protein classes. J. Biosci. 35(2), June 2010
BAliBASE is a benchmark alignment database. BAliBASE 2.0 contains eight references. Each reference is designed to test a different aspect of the performance of the score matrix and alignment program. Considering that the class-specific matrices are designed to align sequences in the ‘twilight zone’, the test sets were extracted from reference 1 in BAliBASE, which was divided into smaller groups according to sequence length and percentage identity. Since different class-specific matrices were constructed to score specific protein classes, the test sets mainly contained class-specific proteins. In order to conduct a reliable significance test, some proteins with alpha helix as the main part and with few beta strands as the minor part were also included in the BAliBASE alpha test set. Some sequences with >25% identity occurred in the test sets for the same reason. The test sets extracted from BAliBASE 2.0 are shown in tables 1–3 of Supplementary I. CLUSTALW (Thompson et al. 1994) was used to generate global alignments. The accuracy of the alignments was evaluated using Qdeveloper and Qmodeler (Dunbrack 2006), where Qdeveloper refers to correctly aligned amino acid pairs divided by the number of pairs in structure alignment, and is effectively a measure of sensitivity. Qmodeler refers to correctly aligned amino acid pairs divided by aligned pairs and is effectively a measure of specificity. Only aligned amino acid pairs were used in the calculation, and ‘deletion amino acid’ pairs were not included. Qdeveloper and Qmodeler were also used to evaluate optimal gap penalties of the matrices. By evaluating all combinations of gap openings between 2 and 30 in step length of 1 and extensions between 0.5 and 5 in step length of 0.5, the gap penalties with which the matrix had maximal average Qdeveloper and Qmodeler on the training dataset were regarded as the optimal gap penalties. If Qdeveloper and Qmodeler could not be optimized simultaneously, only Qdeveloper was optimized (it is reasonable for global alignment). The significance of the performances of different matrices was evaluated using two statistical methods: paired samples mean test and Wilcoxon signed ranks test. Paired samples mean test is a type of parametric test. Compared with the non-parametric test, it can provide some meaningful parameters (e.g. mean value) and evaluate their significance. However, the paired samples mean test requires that the sample has a normal distribution, which cannot be totally fulfilled by the Qdeveloper and Qmodeler of different matrices. Therefore, the Qdeveloper and Qmodeler of different matrices on different sequence pairs of the test set were also tested using Wilcoxon signed ranks test. All the significance tests were conducted using SPSS 12.0 (statistical software). 2.4 Amino acid clustering according to entry scores of matrices In order to analyse the difference in amino acid substitution patterns between different protein classes, amino acids were
Aligning protein sequence and analysing substitution pattern using a class-specific matrix clustered according to different matrices. The clustering method we used was the hierarchical cluster function in SPSS 12.0 (statistical software). The clustering was conducted using the between-groups linkage method and measured by Pearson correlation. The cluster method we adapted here, in fact, clustered the amino acids according to their substitution behaviour to all the other amino acids and not merely according to their substitution behaviour to one another. The common substitution patterns of different matrices extracted by this method are more reliable and even some subtle differences of substitution patterns can be extracted. To analyse the substitution differences of different protein classes in detail, several secondary structure-specific matrices were constructed based on the training datasets of the class-specific matrices. The secondary structures were classified into alpha helix, beta strands and coils according to the Database of Secondary Structure assignment Program (DSSP)’s definition (Karsch and Sander 1983). Amino acid substitution frequencies within each kind of secondary structure in the training datasets were counted and the corresponding secondary structure-specific matrices were calculated based on these counts. The amino acid cluster trees were also built based on these matrices. 2.5
Comparison of sign matrices of different secondary structure-specific matrices
For any matrix similar to the BLOSUM series, a positive entry value means that the substitution is favourable and a negative one means that it is unfavourable, while zero means that it is neutral. Therefore, the derived sign matrix of a matrix roughly catches the substitution pattern characteristic of that matrix. The sign matrix of a matrix is derived by setting corresponding entry values of ‘1’ for positive values of the matrix, ‘–1’ for negative values, and ‘0’ for zero values. In order to show substitution pattern differences of different/same secondary structures in the same/different protein classes explicitly, the sign matrices of different secondary structure-specific matrices were compared by multiplying corresponding entry values of the different sign matrices. The resulting ‘sign comparison matrix’ contained four different kinds of values: ‘1’, ‘–1’, ‘@’ and ‘0’. A value of ‘1’ indicates that the signs are the same for the corresponding entry values of the two sign matrices; ‘–1’ indicates that the signs are different; ‘@’ indicates that only one of the signs is zero; and ‘0’ indicates that both the signs are zero. For every ‘sign comparison matrix’, a numerical value (termed as ‘pattern distance’) was calculated to roughly evaluate the substitution pattern differences of different
299
matrices. The ‘pattern distance’ of two matrices equals the Euclidean distance of the corresponding sign matrices. For a 20x20 score matrix, the maximum value of ‘pattern distance’ is 40 and the minimum is 0. 3. 3.1
Results
Class-specific matrices
Based on the training datasets of class-specific proteins, three class-specific matrices were calculated using the method described above. Since they were derived from the training dataset of class-specific proteins, we named these matrices ALPHASUM (table 1, upper right diagonal), BETASUM (table 2, upper right diagonal), and AFBETASUM (table 2, lower left diagonal), respectively. A total of 146 867 amino acid pairs were counted from the aligned 175 ‘blocks’ for the all-alpha training dataset. The scores shown in tables 1 and 2 were rounded to an integer. On the other hand, there were a total of 379 453 pairs in 149 ‘blocks’ and 421 849 pairs in 176 ‘blocks’, respectively, for the all-beta and alpha/beta training datasets. 3.2
Test of the class-specific matrices
3.2.1 Matrices of different levels: To investigate whether matrices constructed at the protein class level are better than or equivalent to those matrices constructed at the ‘lower level’, a matrix based on the up–down helical bundle subset of the all-alpha training dataset was constructed and compared with ALPHASUM. We named this matrix ABUNDLESUM (see table 1, lower left diagonal). A total of 23 474 amino acid pairs were counted from the aligned 31 ‘blocks’ of the up–down helical bundle subset. The ‘representations’ (i.e. alignment quality of the training dataset of matrix) of ABUNDLESUM and ALPHASUM on the up–down helical bundle subset were measured using the Qdeveloper and Qmodeler (Dunbrack 2006). The results are shown in table 1.1 of Supplementary II. The gap penalties of ABUNDLESUM and ALPHASUM were optimized based on the up–down helical bundle subset and the allalpha training set, respectively. The Qdeveloper and Qmodeler of ABUNDLESUM are only marginally better than those of ALPHASUM. When the gap penalties of ALPHASUM are optimized based on the up–down helical bundle subset, the difference between ABUNDLESUM and gap-optimized ALPHASUM (shown as ALPHASUM_OP in table 1.1 of Supplementary II) is even smaller; the mean test of 95% confidence interval (2-tailed) was also not significant. All these findings indicate that the ‘representations’ of ABUNDLESUM and ALPHASUM on the up–down helical bundle subset have no statistical difference. J. Biosci. 35(2), June 2010
Hai Song Xu et al.
300 Table 1. A 15
Entry values of ALPHASUM (upper right diagonal) and ABUNDLESUM (lower left diagonal) C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y
1
–3
–1
–6
1
–4
–2
2
–5
–2
–3
–3
–2
–5
2
–1
2
–7
–4
A
38
–10
–13
–2
–10
–5
–1
–10
–2
–3
–7
–16
–9
–12
–3
–2
–1
–5
–4
C
6
–17
–2
0
–14
1
–17
–14
5
–3
0
–6
1
0
–12
–18
–13
D
15
–17
–5
–2
–11
4
–13
–9
1
–5
5
–2
–1
–3
–9
–14
–12
E
29
–13
–3
5
–13
10
6
–13
–14
–11
–9
–13
–9
3
13
14
F
20
–4
–15
–1
–16
–11
0
–4
–5
–8
0
–3
–9
–13
–10
G
32
–10
1
–7
–5
1
–7
5
0
–1
–4
–4
–6
5
H
17
–9
11
10
–10
–12
–8
–9
–9
–4
13
–4
–3
I
16
–11
–4
1
–2
7
7
0
0
–6
–16
–9
K
11
–13
–13
–8
–8
–11
–7
7
0
0
L
21
–8
–11
–4
–6
–6
–4
6
–5
–1
M
A
13
C
0
36
20
D
–3
–11
15
E
–3
–9
6
13
F
–7
2
–11
–14
23
G
0
–8
–3
–2
–4
22
H
–5
–3
–1
–2
–2
–2
25
I
–3
–7
–14
–10
2
–10
–8
15
K
–1
–10
3
4
–14
–2
0
–9
L
–3
–3
–11
–11
9
–9
–6
9
–8
16
M
–2
–2
–13
–7
4
–13
–4
9
–8
10
23
N
–4
–12
4
0
–10
0
1
–8
1
–7
–8
17
P
–2
–10
–5
–7
–11
–5
–11
–8
–5
–14
–11
–8
28
Q
–2
–11
1
5
–7
–2
5
–7
5
–5
–6
3
–9
R
–6
–5
–1
0
–10
–6
1
–6
7
–5
–5
3
–11
5
18
S
1
–4
0
–2
–7
0
1
–7
–2
–7
–6
0
–3
1
–2
T
–1
–4
–1
–1
–9
–2
–9
–5
1
–5
–6
3
0
–1
–1
5
12
V
3
1
–7
–7
1
–7
–9
9
–8
6
4
–4
–9
–4
–4
–6
–1
W
–11
–11
–6
–2
13
–1
–3
–11
–8
–5
–1
–7
–4
0
–6
–9
–5
0
47
Y
–7
2
–6
–7
15
–7
3
1
–4
0
–4
–3
–17
–1
–2
–7
–2
1
11
18 12
16
–7
1
–1
2
1
–8
–12
–7
N
28
–6
–9
–1
–3
–9
–16
–15
P
16
4
0
–3
–6
–10
–6
Q
–3
–6
–7
–8
–4
R
10
5
–6
–13
–10
S
–1
–11
–6
T
16
–1
0
V
44
13
W
24
Y
19 12
14 6
13
27
This table gives the entry values of the amino acid score matrices of ALPHASUM and ABUNDLESUM. Amino acids are represented by single-letter symbols in this table. The relative entropy and expected score of ALPHASUM are equal to 3.439 and –0.902, respectively (in a 0.1 bit unit). The relative entropy and expected score of ABUNDLESUM are equal to 2.409 and –0.535, respectively (in a 0.1 bit unit).
3.2.2 Self-consistence test of the class-specifi c matrices: To investigate whether class-specific matrices could perform better than ‘higher level’ generalized matrices, the performances of ALPHASUM, BETASUM and AFBETASUM on the corresponding training datasets were compared with four other published matrices, BLOSUM30, BLOSUM62 (Henikoff and Henikoff 1992), Gonnet250 (Gonnet et al. 1992) and HSDM (Prlic et al. 2000). We did not conduct a Jack-knife test partly because there were too many ‘blocks’ in the training datasets. Moreover, conducting a Jack-knife test in this condition may only be done to test whether different ‘blocks’ have consistent amino acid substitution patterns or not. The results of the comparison between ALPHASUM and ABUNDLESUM provide some evidence which suggests that the up–down helical bundle subset and the all-alpha training dataset have consistent substitution patterns. Moreover, if the performances of the class-specific matrices on the training datasets are statistically better than those of the other matrices, we would be able to prove that the substitution pattern of the training J. Biosci. 35(2), June 2010
datasets could be better represented by class-specific matrices. The statistical results of the alignment performances of the three class-specific matrices on the training datasets are shown in figure 1 (details are listed in tables 1.1–1.3 of Supplementary II). Gap penalties were optimized for each matrix based on the respective training datasets. The results indicate that the performances of class-specific matrices on corresponding training datasets are statistically better (significant in mean test of 95% confidence interval) than those of the four other matrices. 3.2.3 Performance test of the class-specific matrices: To further test the performance of the class-specific matrices, we compared the alignment performances of ALPHASUM, BETASUM and AFBETASUM with the four other matrices on three subsets of BAliBASE 2.0. The results are shown in figure 1 (for details, see tables 1.1–1.3 of Supplementary II). To emulate sequence alignment that occurs in a real-world situation, the gap penalties used were optimal values on the corres-
Aligning protein sequence and analysing substitution pattern using a class-specific matrix Table 2.
301
Entry values of BETASUM (upper right diagonal) and AFBETASUM (lower left diagonal)
A
C
D
E
F
8
–1
–5
–3
53
–12
–7
14
G
H
I
K
L
M
N
–3
–1
–3
–2
–4
–3
–1
–5
5
–10
–7
1
–9
0
–1
–4
2
–12
–2
0
–13
–2
–13
–9
4
10
–8
–4
1
–7
3
–9
–5
0
19
–12
–2
6
–8
6
3
19
–3
–14
–5
–12
–9
25
–5
2
–6
14
–7 8
A
12
C
5
27
D
–4
–13
20
E
–2
–11
5
13
F
–3
1
–12
–11
19
G
–1
–4
–4
–7
–11
22
H
–4
–2
–2
–2
0
–5
26
Q
R
–1
–3
–10
–10
–2 0
–9 –1
–4
1
9
4
–8
–4
16
P
S
T
V
W
Y
–6
1
–3
1
–6
–4
A
–6
–4
–4
4
–2
0
C
–1
–4
1
–3
–12
–12
–8
D
5
2
4
3
–6
–3
–4
E
–11
–7
–7
–5
–4
2
6
12
F
–6
–5
–7
–1
–7
–12
–12
–9
G
–4
3
3
4
3
–5
–2
4
H
–10
–8
–6
–5
–6
–3
10
0
1
I
0
–1
4
5
2
2
–6
–4
–5
K
5
–12
–6
–7
–6
–6
–3
6
4
1
L
11
–7
–5
–1
–4
–3
–2
2
1
–2
M
I
–1
2
–13
–13
6
–12
–6
17
K
–2
–10
0
3
–10
–6
0
–10
L
–1
3
–13
–10
7
–11
–5
9
–8
14
M
0
2
–10
–7
6
–8
–3
7
–6
8
19
N
–3
–4
4
–1
–8
0
4
–10
1
–9
–7
21
P
–5
–7
–2
–4
–7
–5
–6
–9
–4
–10
–9
–4
21
Q
–1
–7
2
6
–7
–4
3
–9
5
–6
–3
3
–6
13
R
–3
–9
–2
1
–8
–6
0
–8
7
–7
–5
0
–5
4
17
S
1
1
0
–2
–7
0
–1
–9
–2
–9
–5
3
–3
0
–2
15
14
–3
0
–1
3
1
–11
–11
–6
N
20
–2
–5
1
–2
–7
–6
–7
P
12
4
4
4
–5
–6
–2
Q
2
3
–5
–4
–1
R
12
6
–5
–3
–2
S
15
11 13
0
–3
–2
T
14
–1
–2
V
8
W
29
Y
T
0
2
–2
–3
–3
–5
–1
–2
–2
–4
–1
1
–4
0
–2
6
15
V
1
5
–13
–11
4
–10
–5
13
–10
7
5
–8
–6
–7
–8
–6
1
16
48
W
–2
–3
–8
–7
10
–7
3
1
–10
3
4
–3
–7
–3
–5
–4
–3
–1
35
Y
–3
–2
–7
–8
11
–8
5
1
–6
1
2
–3
–6
–3
–4
–5
–3
0
11
22
This table gives the entry values of the amino acid score matrices of BETASUM and AFBETASUM. Amino acids are represented by single-letter symbols in this table. The relative entropy and expected score of BETASUM are equal to 2.436 and –0.537, respectively (in a 0.1 bit unit). The relative entropy and expected score of AFBETASUM are equal to 2.502 and –0.577, respectively (in a 0.1 bit unit).
ponding training dataset and not those on the BAliBASE subsets. The alignment result of ALPHASUM was significantly better than that of BLOSUM30 and HSDM, but it was at the same level as BLOSUM62 and Gonnet250 (see table 1.1 of Supplementary II, BAliBASE alpha test set). When the BAliBASE alpha test set was reduced to a set that contained only all-alpha proteins (see table 1.1 of Supplementary II, short subgroup of BAliBASE alpha test set), the advance of ALPHASUM to BLOSUM62 and Gonnet250 increased, but still did not indicate any significance. We speculate that this is partly because of the small sample size (only 47 sequence pairs included). However, the Gonnet250 still stood at almost the same level as ALPHASUM (with average Qdeveloper of 0.692 to ALPHASUM’s 0.703, and average Qmodeler of 0.686 to ALPHASUM’s 0.696). Apart from the small sample size mentioned above, a bigger value range of the Gonnet250 matrix (from –52 to 142) relative to that of ALPHASUM (from –18 to 44) might have also provided Gonnet250 with
a more subtle difference for different amino acid pairs and enhanced its performance. The performances of BETASUM and AFBETASUM on the BAliBASE test subsets are shown in figure 1 (details are listed in tables 1.2 and 1.3 of Supplementary II, respectively). Similar to the result of ALPHASUM, the performance of BETASUM was significantly better than those of BLOSUM30 and HSDM, but not significantly better than those of BLOSUM62 and Gonnet250, partly because of the small sample size (only 76 sequence pairs were included). The performance of AFBETASUM was significantly better than those of BLOSUM30 and HSDM, but only marginally better than those of BLOSUM62 and Gonnet250. The alignment accuracies (i.e. Qdeveloper) of the matrices on the BAliBASE test sets were also classified and averaged according to the sequence pairs’ percentage identity. The sequence pairs of each test set were classified into four parts according to their percentage identity: 0–20%, 20–30%, J. Biosci. 35(2), June 2010
302
Hai Song Xu et al.
Figure 1. Performance test of class-specific matrices. In this figure, the average Qdeveloper (A) and average Qmodeler (B) results of the matrices performance test on the training datasets and the corresponding BAliBASE test sets are illustrated briefly (for details see tables 1.1–1.3 of Supplementary II). The gap penalties used are optimal values on the corresponding training datasets (see text). The ‘CLASSSUM’ in the legend is used to represent the three class-specific matrices according to the datasets (e.g. ‘all-alpha’ refers to the all-alpha training dataset; ‘alpha test set’ refers to the BAliBASE alpha test set, then CLASSSUM represents ALPHASUM). The values of average Qdeveloper and average Qmodeler of ‘CLASSSUM’ and the highest values of the other four matrices are given. A superscript ‘-’ means that the differences between the two average values are not significant.
30–40% and 40–100%. The result of ALPHASUM on the alpha test set is listed in table 3 as an example. Other results are listed in tables 4.1–4.3 of Supplementary II. The class-specific matrices have some advances in aligning sequence pairs with less than 30% sequence identity, except for AFBETASUM in the 20–30% identity range of the alpha/beta test set and ALPHASUM in the 0–20% identity range of the alpha test set_short. We did not conduct any further significance tests because of the small sample size. A similar test result using an environment-specific alignment J. Biosci. 35(2), June 2010
method, FUGUE (Chelliah et al. 2005), is given in table 4.4 of Supplementary II as an indirect comparison. We did not conduct a direct comparison between the class-specific matrices and FUGUE because the test set of FUGUE was heavily duplicated and generally disagreed with our training dataset. Moreover, we obtained a different percentage identity classification on the test set than Chelliah et al.’s result (Chelliah et al. 2005). An indirect comparison shows that the class-specific matrices and FUGUE were comparable with the alignment sequences at 20–30% identity. However,
Aligning protein sequence and analysing substitution pattern using a class-specific matrix Table 3.
303
Classified alignment accuracy of the matrices on alpha test set Average Qdeveloper a (0–20%, 29)b
Average Qdeveloper (20–30%, 57)
Average Qdeveloper (30–40%, 28)
Average Qdeveloper (40–100%, 27)
ALPHASUM
51.40%
81.30%
93.30%
96.10%
Gonnet250
50.90%
80.10%
92.80%
96.50%
Blosum62
49.70%
79.90%
93.40%
96.30%
Blosum30
41.50%
75.90%
92.90%
96.10%
HSDM
51.00%
75.60%
93.10%
95.50%
Matrix
a
For the definitions of Qdeveloper see Methods section. The values of average Qdeveloper are calculated by averaging all pairs within the identity range of the test set. b The numbers in brackets are the identity range and the number of sequence pairs of the test set in this range. Table 4.
Performance comparison of matrices using different gap penalties
Matrix
GOPa
GEPa
Average Qdeveloperb
Average Qmodelerb
Average relative Qdeveloper (P values)c
Average relative Qmodeler (P values)c
BLOSUM30_Vogt
10.0
1.50
0.571
0.526
--
--
BLOSUM30_OP
5
0.5
0.603
0.560
0.0306 (0.000, 0.000) 0.0303 (0.000, 0.000)
BLOSUM 62_Vogt
7.5
0.90
0.623
0.578
--
BLOSUM62_OP
4
0.5
0.658
0.614
0.0303 (0.000, 0.000) 0.0313 (0.000, 0.000)
Gonnet250_Vogt
14.0
0.20
0.617
0.570
--
Gonnet250_OP
5
0.5
0.655
0.611
0.0311 (0.000, 0.000) 0.0320 (0.000, 0.000)
---
In this table, the performances of three matrices (BLOSUM30, BLOSUM62 and Gonnet250) on the all-alpha training dataset (average sequence identity is 0.161) using different gap penalties are compared. a The gap opening penalty (GOP) and gap extension penalty (GEP) of matrices are optimized on the all-alpha training dataset for matrices having the suffix ‘OP’ (e.g. BLOSUM30_OP) or just extracted from Vogt et al.’s paper for those having the suffix ‘Vogt’ (e.g. BLOSUM30_Vogt). The entry values of the corresponding matrices are the same (e.g. BLOSUM30_OP and BLOSUM30_Vogt). b For the definitions of Qdeveloper and Qmodeler see Methods section; the values are equal to the total correctly aligned pairs in prediction divided by total aligned pairs in structure alignment and total aligned pairs in prediction, respectively. c The average relative values are calculated by averaging the relative values of two matrices over all pairs of the dataset. The P values indicate the significance of the difference between the performances of the same matrices using different gap penalties (i.e. optimal values on the all-alpha training dataset and from Vogt et al.’s paper). Two P values listed in brackets in this column have been obtained using different test methods; the first is derived from paired samples mean test of 95% confidence interval (2-tailed), and the second is derived from Wilcoxon signed ranks test.
they were not as good as FUGUE in aligning sequences with less than 20% identity. 3.3
Effect of gap penalties on performance
3.3.1 Optimal gap penalties on different datasets: It has been stated that gap penalties have a great influence on alignment results (Vogt et al. 1995). Some new protocols of gap penalties have also been proposed (Shi et al. 2001; Baussand et al. 2007) to improve alignment accuracy. In this paper, we wanted to investigate how much the alignment accuracy of class-specific proteins could be improved by using gap penalties optimized on class-specific proteins. We compared the performances of three matrices (BLOSUM30, BLOSUM62 and Gonnet250) on the all-alpha training dataset using different optimized gap penalties. The optimized gap penalties of the three matrices on
all proteins were extracted from table 7 of Vogt et al.’s paper (Vogt et al. 1995). The optimal gap penalties from Vogt et al. were optimized for proteins from different protein families and different percentage identities. The statistical results of the alignment performance are shown in table 4. The Qdeveloper and Qmodeler of the matrices using all-alpha protein-specific gap penalties outperformed those of the corresponding matrices using all protein-specific gap penalties by about 3%. 3.3.2 Advantage of using class-specifi c matrices and gap penalties: The results shown in figure 1 and table 4 suggest that the alignment of class-specific protein sequences using corresponding class-specific matrices and gap penalties would improve alignment significantly. In order to show the combined influence of matrices and gap penalties to alignment result, we compared the alignment performances of the method using the three class-specific matrices and J. Biosci. 35(2), June 2010
304
Hai Song Xu et al.
Figure 2. Advantage of using class-specific matrices and gap penalties. The performances of alignments using the three class-specific matrices and gap penalties are compared with those using the other three matrices (BLOSUM30, BLOSUM62 and Gonnet250) and optimized gap penalties from the paper by Vogt et al. (1995). Average Qdeveloper (A) and average Qmodeler (B) results are illustrated briefly (for details see tables 2.1–2.3 of Supplementary II). Legend is similar to figure 1.
corresponding class-specific gap penalties against those using the other three matrices (BLOSUM30, BLOSUM62 and Gonnet250) and their optimal gap penalties from Vogt et al. (1995). The comparisons were conducted over the training datasets and the BAliBASE test sets, and the results are shown in figure 2 (for details see tables 2.1–2.3 of Supplementary II). The results indicate that aligning protein sequences in the ‘twilight zone’ using class-specific matrices and gap penalties significantly outperformed those that used the other three generalized matrices and gap penalties optimized on proteins from different protein families and of different percentage identities. J. Biosci. 35(2), June 2010
3.4 Application of the class-specific matrices Based on the above results, we suggest that a new scheme of protein sequence alignment could be advanced. We recommend that proteins of different classes should be aligned using class-specific matrices and gap penalties to improve alignment accuracy. To implement this new scheme, the classes or families of protein sequences to be aligned should be predicted first. This can be done by using different methods. The family information of sequences being aligned can be obtained using different folding recognition technologies
Aligning protein sequence and analysing substitution pattern using a class-specific matrix
305
Figure 3. Results of the cross-alignment test. In cross-alignment test, ALPHASUM, BETASUM, and AFBETASUM were used to align protein sequences in the training datasets other than their own training sets (for details see text). The average Qdeveloper (A) and average Qmodeler (B) results of the ‘cross-alignment test’ are illustrated briefly (for details see tables 3.1–3.3 of Supplementary II). Legend is similar to figure 1.
(Tang et al. 2003; Zhou and Zhou 2005). In instances where none of the sequences have family information, an alternate way may be applied by predicting the class of the sequences. The high accuracies that have been achieved in recent years in predicting protein structural classes (Cai et al. 2001) (average prediction accuracies reached up to 79.4% and 93.2%, respectively, in two jack-knife tests) indicate that it is feasible. While the accuracy of predicting a protein’s class is rather high, there still remains the issue of instances when the protein sequences to be aligned are assigned to a wrong class. To examine the influence of wrong class assignment to alignment result, we conducted a ‘cross-alignment
test’ for the three class-specific matrices. ALPHASUM, BETASUM and AFBETASUM were used to align protein sequences in the training datasets other than their own training sets. For example, ALPHASUM was used to align protein sequences in the all-beta training dataset and the alpha/beta training dataset we constructed. The alignment results were compared with those of the four other matrices (BLOSUM30, BLOSUM62, Gonnet250 and HSDM). The gap penalties of the four other matrices were the optimal values on the training dataset of the corresponding classspecific matrices that were compared. The statistical results are shown in figure 3 (for details see tables 3.1–3.3 of J. Biosci. 35(2), June 2010
306
Hai Song Xu et al.
Supplementary II). The results of the ‘cross-alignment test’ indicate that the performance of the class-specific matrices was still better than or at same level as those of the four other generalized matrices despite the wrong class assignments, especially when compared with those using the gap penalties from Vogt et al. (1995) (see All-beta dataset in table 2.1 of Supplementary II). Searching aligned protein sequences against the protein sequence database using BLAST or FASTA is an alternative method that may predict the structural classes of proteins. After the structural class or family is assigned, the protein sequences could then be aligned using the class-specific matrices and gap penalties we present here. 3.5 Amino acid substitution pattern of different protein classes Some researchers have stated (Karlin and Altschul 1990) that any score matrix has implicit target frequencies and that these frequencies reflect the amino acid substitution patterns of the target sequences (i.e. the training dataset of the matrix). The performance difference between the class-specific matrices and the four other matrices indicate that different protein classes have different amino acid substitution patterns. To investigate the difference in substitution pattern of the different structures, the amino acids were clustered according to the different matrices and the resulting cluster trees of amino acids were compared. The resulting cluster
tree and corresponding cluster procedure for ALPHASUM are presented in figure 4 as an example (those of the others are given in Supplementary III). 3.5.1 Common patterns of different protein classes: It was easier to find common patterns than to find distinctly different patterns between these cluster trees. The clustering of [I, L, M, V] occurred with rather high correlation coefficients in every tree. Cluster [F, Y] also occurred in different trees, but the correlation coefficients were not so high and different trees had subtle differences (from 0.782 to 0.905, average 0.835). This indicates that the substitution behaviours of F and Y had some differences in the same or between different protein classes, although substitution between F and Y was favourable. For example, in ALPHASUM, substitution score values of F and Y with H were –3 and 5, respectively, meaning that F–H substitution was unfavourable while Y–H substitution was favourable in all-alpha proteins. W was also clustered to [F, Y] in different trees, but the correlation coefficients between W and F (from 0.555 to 0.779, average 0.665) and W and Y (from 0.532 to 0.818, average 0.683) were even smaller than that of F and Y. [K, Q] or [K, R] and [D, E] or [D, N] were the four other clusters that occurred in different trees, with the correlation coefficients at the same level of cluster [F, Y]. For example, the correlation coefficients of cluster [K, Q] were from 0.741 to 0.945 (average 0.849) in the six cluster trees. Another hydrophilic cluster [S, T] also occurred in every tree, but the correlation coefficients were smaller (from 0.543 to
Figure 4. Clustering of amino acids based on ALPHASUM. The left of the figure is the amino acid cluster tree. Amino acids are identified by one-letter symbols, and each amino acid is indexed by a number on the right. The scale on the top of the cluster tree is a measure of the distance of different clusters. The right of the figure is the cluster procedure table for the cluster tree on the left: The ‘Cluster combined’ column lists which two branches are clustered at each stage. The ‘Coefficients’ column gives the average Pearson coefficient of the two clustered branches at each stage. The ‘Stage cluster first appears’ column shows at which stages each of the two clustered branches are clustered. If the clustered branch is a single amino acid, the stage value is zero. J. Biosci. 35(2), June 2010
Aligning protein sequence and analysing substitution pattern using a class-specific matrix 0.880, average 0.717), indicating that although substitution between S and T was favourable, the substitutions of S and T using other amino acids were differently favourable/ unfavourable. Other amino acids (A, C, G, H, and P) did not cluster with the amino acids mentioned above by high correlation coefficients. They also did not cluster with one another by high correlation coefficients. All these indicate that their substitution behaviours were more independent than with the others and were also independent with each other. In the six cluster trees, [I, L, M, V] and [F, Y, W] clustered together and formed a hydrophobic amino acid group. [K, Q, R], [D, E, N] and [S, T] clustered together and formed a hydrophilic amino acid group. All the other amino acids (A, C, G, H and P) clustered independently with the two groups, and then merged into one group. 3.5.2 Different patterns of different protein classes: Although many common patterns were easily found from the cluster trees, some different patterns also existed between these trees. These different patterns were demonstrated through two different aspects, which we named ‘direct’ and ‘indirect’. The ‘direct difference’ of the amino acids’ substitution behaviour between different score matrices can be identified by the sign difference of the corresponding entries. The ‘indirect difference’ of amino acids’ substitution behaviour between different score matrices refers to amino acid pairs that have the same entry value signs in different matrices, whereas their corresponding entry signs to other amino acids in different matrices are not the same. Compared with ‘direct difference’, ‘indirect difference’ revealed some subtle differences between amino acids in different matrices. To identify the distinct substitution behaviours of the various protein classes, we compared the class-specific matrices with Gonnet250 (Gonnet et al. 1992), which was chosen as the background control. Gonnet250 was chosen because it was derived from an exhaustive matching of all possible subsequence pairs in a sequence database of different classes. This means that Gonnet250 contains the common substitution patterns of different protein classes. If class-specific proteins have distinct substitution patterns, these could be identified by comparing the corresponding class-specific matrix with Gonnet250. For the cluster trees of ALPHASUM and Gonnet250, the different clustering modes of A and P with other amino acids indicated that their substitution behaviours were both different in the training datasets of the corresponding matrices (see table 1 of Supplementary IV). The detailed substitution behaviour differences of P in the two training datasets can be analysed by comparing score values concerning P in the matrices. In ALPHASUM, score values concerning P were all negative except for P-P entry, which indicates that P was rather conserved in all-alpha proteins. However, in Gonnet250, several entries concerning P (P-
307
A, P-S and P-T) were found positive aside from P-P entry, indicating that the substitution of P with A, S and T was favourable in the training set of Gonnet250 (i.e. proteins of different classes). In fact, P is well known as a helix breaker, which may cause great change (e.g. prolonging the helix) in the structure of the alpha protein if replaced by other amino acids. Obviously, ALPHASUM caught this substitution character of all-alpha proteins better than Gonnet250. The substitution behaviour difference of A between ALPHASUM and Gonnet250 included A-T and AK substitution. Substitution A-T was unfavourable in ALPHASUM but favourable in Gonnet250. On the other hand, substitution A-K was favourable in ALPHASUM but unfavourable in Gonnet250. According to other researchers’ results (Kumar and Bansal 1998), T is preferred in the Ncap site of the helix while A is preferred in the mid of the helix, and A, K are preferred in the C1 and the Ccap site of the helix. Again, ALPHASUM caught these substitution characters of all-alpha proteins better than Gonnet250. For cluster trees of BETASUM and Gonnet250, amino acids A, T and W were clustered differently. Some ‘direct differences’ of these amino acids between two matrices were identified (see table 2 of Supplementary IV) and some of these differences reflected the character of all-beta proteins. For example, A-G and A-P substitutions were found to be weakly unfavourable in BETASUM, while they were weakly favourable in Gonnet250. The different position preferences between A and [G, P] in the beta turn (Hutchison and Thornton 1994) partly proved the weakly unfavourable substitution of A-G and A-P in beta protein, whereas these properties appeared to be concealed in Gonnet250. For the cluster trees of AFBETASUM and Gonnet250, the substitution behaviour differences between them were to some extent a combination of ALPHASUM’s and BETASUM’s differences with Gonnet250 (see table 3 of Supplementary IV). In some cases, the combination made AFBETASUM represent the common properties of ALPHASUM and BETASUM (e.g. P-A and P-T substitutions). In some other cases, the combination made AFBETASUM represent only properties of ALPHASUM and BETASUM alternately (e.g. W-I, W-L and W-M substitutions). However, the character of AFBETASUM was not a simple integration of ALPHASUM and BETASUM. For example, the signs of the scores of C-L, C-M, C-S and C-T in AFBETASUM were different from those of ALPHASUM and BETASUM (see table 4 of Supplementary IV), indicating that the substitution behaviour of C in alpha/ beta proteins was different from those of all-alpha and allbeta proteins. Aside from these ‘direct difference’ aspects of amino acids’ substitution behaviour between the various score matrices mentioned above, some subtle differences in amino acids were also demonstrated by ‘indirect difference’. For example, although substitution S-T was favourable in both J. Biosci. 35(2), June 2010
Hai Song Xu et al.
308
Figure 5. Sign comparison matrix of ALPHASUM_alpha and ALPHASUM_coil. The top line and left line of the matrix are the single letter representations of amino acids which are arranged according to the results of amino acid classification in this paper. For the definition and construction of the sign comparison matrix, see Methods. This figure roughly shows the substitution behaviour differences between the two represented matrices. ‘–1’ is marked as grey and ‘@’ as light grey to to illustrate the differences clearly. The pattern distance (PD) of ALPHASUM_alpha and ALPHASUM_coil is equal to 15 (for the definition of PD, refer to Methods).
ALPHASUM and BETASUM (with entry values equal to 5 and 6, respectively), the substitutions of S and T with P in ALPHASUM and BETASUM were different (–1 and –3 for S-P and T-P, respectively, in ALPHASUM, 1 and –2 for S-P and T-P, respectively, in BETASUM), which means that substitution S-T was not equally favourable in ALPHASUM and BETASUM to some extent. Table 5 of Supplementary IV lists some of the ‘indirect difference’ aspects between ALPHASUM and BETASUM. 3.6
Substitution pattern consistence of different protein classes
The amino acid substitution pattern analysed above indicates that different protein classes have some common and different substitution patterns. However, different protein classes still have different secondary structures. These influence the consistency of the substitution pattern in different protein classes. In this study, we are interested in two aspects of the consistency problem. The first aspect is whether different secondary structures in the same protein J. Biosci. 35(2), June 2010
class have consistent substitution patterns. The second aspect is whether the same secondary structures in different protein classes have consistent substitution patterns. Overington et al. (1990) had investigated the influence of different structural environments (including secondary structure) on amino acid conservation by directly comparing substitution probabilities of structural environment-specific matrices. We investigated the consistency problem by comparing sign matrices of the secondary structure-specific matrices (for the construction of secondary structure-specific matrices and the corresponding sign matrices, please refer to the Methods section). We constructed a total of seven secondary structure-specific matrices. For all-alpha proteins, two secondary structure-specific matrices were constructed (named ALPHASUM_alpha and ALPHASUM_coil) based on the alpha part and the coil part of the all-alpha training dataset of ALPHASUM. All-beta proteins have two matrices (named BETASUM_beta and BETASUM_coil) while alpha/beta proteins have three (named AFBETASUM_ alpha, AFBETASUM_beta and AFBETASUM_coil). The resulting secondary structure-specific matrices are given in tables 1.1–1.4 of Supplementary V. We compared the
Aligning protein sequence and analysing substitution pattern using a class-specific matrix sign matrices of the secondary structure-specific matrices (refer to Methods section) because they roughly catch the substitution pattern characteristics of those matrices, that is, those sequences they represent. In order to determine whether different secondary structures in the same protein class have consistent substitution patterns, the corresponding secondary structurespecific matrices of every protein class were used to conduct the comparison. For example, for all-alpha proteins, the sign matrices of ALPHASUM_alpha and ALPHASUM_coil were constructed and compared (refer to Methods section). The resulting ‘sign comparison matrix’ is shown in figure 5. A value of ‘–1’ in the ‘sign comparison matrix’ means that the corresponding substitutions in the different secondary structures (i.e. alpha and coil in all-alpha protein here) have totally different patterns: favourable to unfavourable (or reversed). A value of ‘@’ (refer to Methods section) means that the corresponding substitutions have some different patterns: neutral to favourable/unfavourable (or its reverse). A value of ‘1’ means that the corresponding substitutions have the same patterns: both favourable or both unfavourable. A value of ‘0’ means that the corresponding
309
substitutions have the same patterns: both neutral. In order to illustrate the differences clearly, we marked ‘–1’ as grey and ‘@’ as light grey. The differences shown in figure 5 indicate that the distinct substitution patterns of all-alpha protein are in fact the characteristics of one of the secondary structures in all-alpha protein. For example, the conservation of P in all-alpha protein (i.e. substitution of P with other amino acids is unfavourable; refer to the above) merely reflects the substitution behaviour of P in the coil part of all-alpha protein, whereas the substitution of P with some of the amino acids in the alpha part is favourable (refer to table 1 in Supplementary IV). We also compared the substitution pattern of the different secondary structures of the other two protein classes. The resulting ‘sign comparison matrices’ are given in figures 1.1–1.4 in Supplementary V. All these results indicate that different secondary structures in the same protein class have different substitution patterns. In order to determine whether the same secondary structures in different protein classes have consistent substitution patterns, we compared the same kind of secondary structure-specific matrices of different protein classes. The resulting ‘sign comparison matrix’ of
Figure 6. Sign comparison matrix of ALPHASUM_alpha and AFBETASUM_alpha. For the definition and construction of the sign comparison matrix, see Methods. This figure roughly shows the substitution behaviour differences between the two represented matrices. ‘–1’ is marked as grey and ‘@’ as light grey to to illustrate the differences clearly. The pattern distance (PD) of ALPHASUM_alpha and AFBETASUM_alpha is equal to 14 (for the definition of PD refer to Methods).
J. Biosci. 35(2), June 2010
310
Hai Song Xu et al.
Figure 7. Different supersecondary structural environments of proline in helix. Examples of proline in the helix of all-alpha protein (A1, A2) and alpha/beta protein (B1, B2) are given. The structures have been extracted from the training datasets. The SCOP (structural classification of proteins) IDs of these are: d1cuk_1 (A1), d1bu2a1 (A2), d1atia1 (B1) and d1a8y_3 (B2). The structure views (red wide cylinders, helix; cyan arrows, beta-sheet; grey tubes, coil and turn) have been generated using ViewerPro. All the prolines in the structures are labelled. The prolines in the N-terminal of the helix are shown in van der Waals’ sphere model, the others are shown in a ball and stick model. The percentage accessible surface areas of the prolines in the helix are: 35.0% (PRO174 in A1), 30.1% (PRO74 in A2), 31.6% (PRO411 in B1), 38.3% (PRO260 in B2), and 32.5% (PRO297 in B2). The local environments of these prolines are the same according to the prevailing environment definitions (Shi et al. 2001; Gelly et al. 2005). However, their supersecondary structure environments are different in all-alpha and alpha/beta proteins.
ALPHASUM_alpha and AFBETASUM_alpha is shown in figure 6 as an example. The other ‘sign comparison matrices’ of the same secondary structures are given in figures 2.1–2.4 of Supplementary V. The results indicate that even the same kind of secondary structures in different protein classes have different substitution patterns. The ‘pattern distances’ (for the definition, refer to Methods section) of these matrices and those of ‘sign comparison matrices’ of different J. Biosci. 35(2), June 2010
secondary structures in the same protein classes are at the same level, which indicates that their substitution pattern differences are also roughly comparable. It is expected that different secondary structures would have different substitution patterns. However, it is not so readily observable whether the same secondary structures in different protein classes would have comparable differences in substitution patterns. In order to interpret this, we inspected some
Aligning protein sequence and analysing substitution pattern using a class-specific matrix
311
Figure 8. The effect of substitution pattern difference on alignment. Alignments of two beta proteins (SCOP ID: d1iam_1, d1vcaa1; sequence identity: 17.2%) using different matrices (using optimized gap penalties on our training sets) are shown. The letters following MUSTANG are d1iam_1’s secondary structure ID according to the DSSP database. The grey letters mark the correctly aligned segments. Boldfaced E-T and T-R is the key for BETASUM’s last correctly aligned segment. Backbone structure alignment of these is given in the lower right corner.
structures of different protein classes. For example, figure 6 shows that for all-alpha proteins and alpha/beta proteins, the substitution behaviour of P (proline) in the alpha helix is different. However, the location of prolines in the alpha helix of all-alpha proteins and alpha/beta proteins are the same. It is always located at the N terminal of a helix or the first four sites of a short helix. The accessible surface areas of the prolines are comparative. The key difference is that these prolines have different secondary structure contexts within all-alpha proteins and alpha/beta proteins (see figure 7). In all-alpha proteins, the leading sequence of the proline is always a helix and a short coil. In alpha/beta proteins, the leading sequence of the proline is always a beta strand and a longer coil. Considering that the local environments of the proline in all-alpha proteins and alpha/beta proteins are almost the same, we suggest that it is the different secondary structure contexts that make the substitution behaviour of the proline different. Figure 6 also shows that cysteine has distinctly different substitution patterns. The different
secondary structure contexts (i.e. different supersecondary structure environment) may cause these differences as well. However, more detailed analyses need to be performed to conclude that the substitution pattern differences of the same secondary structures in different protein classes are caused by the different secondary structure contexts. 4.
Discussion
In this paper, we constructed three matrices based on different protein classes (i.e. all-alpha, all-beta and alpha/ beta proteins) to confirm whether the amino acid substitution pattern of class-specific proteins is different from the pattern of all proteins, and whether an optimal alignment of sequences in the ‘twilight zone’ could be obtained by using class-specific matrices. The comparison between ALPHASUM and ABUNDLESUN indicates that the amino acid substitution J. Biosci. 35(2), June 2010
312
Hai Song Xu et al.
pattern of proteins below the alpha-fold class (referred to here as up–down helical bundle proteins) could be well represented by ALPHASUM. Constructing a matrix at protein class level is reasonable compared to constructing a matrix at a lower level if the generalization of the matrix is also considered. The clustering of amino acids based on the matrix also shows that the difference between ALPHASUM and ABUNDLESUN is small. Most clusters of amino acids are the same, including hydrophobic clusters ([I, L, M, V], [F, Y, W]) and hydrophilic clusters ([S, T], [D, E] and [K, Q]), with C clustered to hydrophobic clusters in the same way, while N, R clustered to hydrophilic clusters differently. G, P, A, H are all distant members of hydrophilic clusters. The alignment performances in the ‘twilight zone’ of methods using the class-specific matrices and gap penalties are significantly better than those using the four other generalized matrices (BLOSUM30, BLOSUM62, Gonnet250 and HSDM) and their optimal gap penalties as suggested by Vogt et al. (1995) (see tables 2.1–2.3, Supplementary II). When the gap penalties of the four other matrices are optimized on the training datasets of the classspecific matrices, the performance differences between the class-specific matrices and the four other matrices (especially BLOSUM62 and Gonnet250) are reduced (see table 1.1–1.3, Supplementary II). The reduced differences between them indicate that the optimized gap penalties we issued play an important role in improving the alignment performances of class-specific matrices. However, the remaining differences also indicate that some differences in substitution pattern exist between class-specific proteins and all proteins, and these differences in substitution pattern make the classspecific matrices outperform the four other matrices. An alignment example in figure 8 illustrates this kind of effect. E-T and T-R substitutions are only favourable in BETASUM (refer to table 2, Supplementary IV). This makes BETASUM outperform others except HSDM. Here, it seems that HSDM (with E-T and T-R substitution score –1 and 0) performs equally well as BETASUM for some other reasons. The gap penalties optimized on the low identity (average identity <17 %, see tables 1.1–1.3, Supplementary II) class-specific training datasets are different from the gap penalties optimized by Vogt et al. (1995) on the dataset of different families and different identities (from 3% to near 100%). Concluding from the optimized gap penalties we have presented (see tables 1.1–1.3, Supplementary II), it appears that optimal gap penalties do not change greatly with different protein classes and different matrices (note the entry values of these matrices are of rather great difference), at least when the average identity of the dataset on which the gap penalties are optimized is low. The average identity of the training dataset has influenced the gap penalties greatly because the only commonality between the three training datasets is the low average sequence identity. J. Biosci. 35(2), June 2010
The good performances of the class-specific matrices in the ‘cross-alignment test’ indicate that the new sequence alignment scheme we suggest is robust. Another implicit result of the test is that different protein classes have many common amino acid substitution patterns although they do have some different ones as well. The further investigation of amino acid substitution patterns in different protein classes by clustering amino acids based on different matrices reveals some common and different amino acid substitution patterns between different protein classes. The common substitution patterns are almost the same as the results from earlier studies (Overington et al. 1990; Murphy et al. 2000). However, the amino acid cluster trees of different protein classes have some differences because of their distinct substitution patterns, which are in accordance with the results from previous studies (Hutchison and Thornton 1994; Kumar and Bansal 1998). The results showing that the class-specific matrices have small differences when compared with Gonnet250 are rather confusing, given that different protein classes have different substitution patterns according to our analysis and that of others (Hutchison and Thornton 1994; Kumar and Bansal 1998). The alignmental performance advantage of the classspecific matrices to HSDM (a matrix based on structure alignments of protein families in different protein classes) also implies that substitution pattern differences exist and influence performances. Considering that even the same protein class still has different secondary structures, we further constructed secondary structure-specific matrices based on specific protein classes. We then constructed the corresponding amino acid cluster trees (see Supplementary III). Comparison between class-specific matrices and corresponding secondary structure-specific matrices reveals that different secondary structures in the same protein class have different substitution patterns. However, the classspecific matrices can catch just one of the patterns of the corresponding secondary structure-specific matrices when the patterns of different secondary structures are contradictory. That is part of the reason why performance differences between the class-specific matrices and Gonnet250 are not very great. Moreover, Gonnet250 catches the common substitution patterns better than class-specific matrices. The Pearson correlation coefficients of different matrices (see table 2 in Supplementary V) also indicate that the difference between the secondary structure-specific matrices is bigger than the difference between the class-specific matrices and other general matrices. Moreover, we anticipate that using the secondary structure-specific matrices (based on specific protein class) with a modified Needleman–Wunsch algorithm (a global alignment algorithm based on dynamic programming) could further improve sequence alignment. We also found that even substitution patterns of the same secondary structures in different protein classes are different.
Aligning protein sequence and analysing substitution pattern using a class-specific matrix Moreover, the differences are comparable with those of the different secondary structures in the same protein classes, meaning that it cannot be neglected. However, the environment-specific score methods (Shi et al. 2001; Gelly et al. 2005) do not consider this kind of difference because the local environment definitions they use cannot distinguish the same local environment of different protein classes (refer to figure 7). By inspecting the structures of proteins of different protein classes, we are able to suggest that it is the different secondary structure contexts (i.e. different supersecondary structural environments) of the different protein classes that make the same kind of secondary structures have different substitution patterns. The underlying mechanism may relate to the non-local substitution restrictions imposed by the folding of different supersecondary structures. It can be seen as a complement of the local environment restriction theory primarily suggested by Overington et al.( 1990). We will investigate this phenomenon further in our future studies. 5.
Conclusion
Three class-specific matrices are constructed, based on which a new scheme of sequence alignment can be readily established. Performance of the new scheme is significantly better than some generalized matrices and gap penalty combinations currently available. However, the optimized gap penalties presented here play an important role, and the effect of class-specific substitution patterns is weakened by contradictory substitution patterns in the same protein class. Common substitution patterns of different protein classes are accompanied by some subtle differences in substitution patterns in different protein classes. The distinct substitution patterns of different protein classes are in fact the characteristics of one kind of secondary structure in the protein classes. Moreover, even the same kind of secondary structures in different protein classes have different substitution patterns, which implies that supersecondary structures somehow influence substitution patterns. Thus, we suggest that environment-specific matrices be constructed based on different protein classes. Generalization of the class-specific matrices makes it convenient to use the matrices as a plug-in for alignment programs (e.g. ClustalW) to improve multi-sequence alignment in the ‘twilight zone’. The matrices and the analysis method presented here would have potential value in protein sequence design. These matrices could also be used to construct sequence profiles of different protein classes. Acknowledgements Thanks to Dr V Chelliah for kindly giving us the list of the test set used in their paper (Proteins 2005, 61:722–731). Thanks
313
to Xiao Zeng Yang for helpful comments on the manuscript. This work was supported by grant number 30570427 from the Chinese Natural Science Foundation and grant number 4092008 from the Natural Science Foundation of Beijing. References Baussand J, Deremble C and Carbone A 2007 Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins; Proteins 67 695–708 Brenner S E, Chothia C, Hubbard T J P and Murzin A 1996 Understanding protein structure: using SCOP for fold interpretation; Methods Enzymol. 266 635–643 Brenner S E, Koehl P and Levitt M 2000 The ASTRAL compendium for sequence and structure analysis; Nucleic Acids Res. 28 254–256 Bystroff C, Thorsson V and Baker D 2000 HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins; J. Mol. Biol. 301 173–190 Cai Y D, Liu X J, Xu X B and Zhou G P 2001 Support Vector Machines for predicting protein structural class; BMC Bioinformatics 2 3 Chelliah V, Blundell T and Mizuguchi K 2005 Functional restraints on the patterns of amino acid substitutions: Application to sequence-structure homology recognition; Proteins 61 722– 731 Doolittle R F 1981 Similar amino acid sequences: chance or common ancestry?; Science 214 149–159 Dunbrack J R L 2006 Sequence comparison and protein structure prediction; Curr. Opin. Struct. Biol. 16 374–384 Fan Y P 2002 Family specific protein sequence scoring matrices and applications; Dissertation Abstracts International, DAI-B 62 5826 Gelly J C, Chiche L and Gracy J 2005 EvDTree: structure-dependent substitution profiles based on decision tree classification of 3D environments; BMC Bioinformatics 6 4 Gonnet G H, Cohen M A and Benner S A 1992 Exhaustive matching of the entire protein sequence database; Science 256 1443–1445 Gribskov M, McLachlan A D and Eisenberg D 1987 Profile analysis: detection of distantly related proteins; Proc. Natl. Acad. Sci. USA 84 4355–4358 Henikoff S and Henikoff J G 1992 Amino acid substitution matrices from protein blocks; Proc. Natl. Acad. Sci. USA 89 10915–10919 Huang Y M and Bystroff C 2006 Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions; Bioinformatics 22 413–422 Hutchison E G and Thornton J M 1994 A revised set of potentials for {beta}-turn formation in proteins; Protein Sci. 3 2207–2216 Johnson M S and Overington J P 1993 A structural basis for sequence comparisons: an evaluation of scoring methodologies; J. Mol. Biol. 233 716–738 Jones D T, Taylor W R and Thornton J M 1992 The rapid generation of mutation data matrices from protein sequences; Bioinformatics 8 275–282 J. Biosci. 35(2), June 2010
Hai Song Xu et al.
314
Karlin S and Altschul S F 1990 Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes; Proc. Natl. Acad. Sci. USA 87 2264–2268 Karsch W and Sander C 1983 Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features; Biopolymers 22 2577–2637 Konagurthu A S, Whisstock J C, Stuckey P J and Lesk A M 2006 MUSTANG: a multiple structural alignment algorithm; Proteins 64 559–574 Krogh A, Brown M, Mian I S, Sjölander K and Haussler D 1994 Hidden markov models in computational biology: applications to protein modeling; J. Mol. Biol. 235 1301–1331 Kumar S and Bansal M 1998 Dissecting α-helices: position-specific analysis of α-helices in globular proteins; Proteins 31 460–476 Lüthy R, McLachlan A D and Eisenberg D 1991 Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities; Proteins 10 229–239 Murphy L R, Wallqvist A and Levy R M 2000 Simplified amino acid alphabets for protein fold recognition and implication for folding; Protein Eng. 13 149–152 Ng P C, Henikoff J G and Henikoff S 2000 PHAT: a transmembranespecific substitution matrix; Bioinformatics 16 760–766 Overington J P, Donnelly D, Sali A, Johnson M S and Blundell T L 1992 Environmental-specific amino acid substitution tables: tertiary templates and prediction of protein folds; Protein Sci. 1 216–226 Overington J P, Johnson M S, Sali A and Blundell T L 1990 Tertiary structural constraints on protein evolutionary diversity; Proc. R. Soc. London B. Biol. Sci. 241 132–145 Prlic A, Domingues F S and Sippl M J 2000 Structure-derived substitution matrices for alignment of distantly related sequences; Protein Eng. 13 545–550 Rice D W and Eisenberg D 1997 A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence; J. Mol. Biol. 267 1026–1038
Risler J L, Delorme M O, Delacroix H and Henaut A 1988 Amino acid substitutions in structurally related proteins. A pattern recognition approach: determination of a new and efficient scoring matrix; J. Mol. Biol. 204 1019–1029 Schwartz R M and Dayhoff M O 1978 Atlas of protein sequence and structure; Nat. Biomed. Res. Found. 5 353–358 Shi J, Blundell T L and Mizuguchi K 2001 FUGUE: sequencestructure homology recognition using environment-specific substitution tables and structure-dependent gap penalties; J. Mol. Biol. 310 243–257 Tang C L, Xie L, Koh I Y, Posy S, Alexov E and Honig B 2003 On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles; J. Mol. Biol. 334 1043–1062 Tatusov R L, Altschul S F and Koonin E V 1994 Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks; Proc. Natl. Acad. Sci. USA 91 12091–12095 Thompson J D, Higgins D G and Gibson T J 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice; Nucleic Acids Res. 22 4673–4680 Thompson J D, Plewniak F and Poch O 1999 BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs; Bioinformatics 15 87–88 Vilim R B, Cunningham R M, Lu B, Kheradpour P and Stevens F J 2004 Fold-specific substitution matrices for protein classification; Bioinformatics 20 847–853 Vogt G, Etzold T and Argos P 1995 An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited; J. Mol. Biol. 249 816–831 Zhou H Y and Zhou Y Q 2005 Fold recognition by combining sequence profiles derived from evolution and from depthdependent structural alignment of fragments; Proteins 58 321–328
MS received 23 September 2009; accepted 1 April 2010 ePublication: 4 May 2010 Corresponding editor: VIDYANAND NANJUNDIAH
J. Biosci. 35(2), June 2010