Computing DOI 10.1007/s00607-014-0436-3
Adaptation of the method of musical composition for solving the multiple sequence alignment problem Roman Anselmo Mora-Gutiérrez · María E. Lárraga-Ramírez · Eric A. Rincón-García · Antonin Ponsich · Javier Ramírez-Rodríguez
Received: 10 May 2014 / Accepted: 2 December 2014 © Springer-Verlag Wien 2014
Abstract The multiple sequence alignment (MSA) problem has become relevant to several areas in bioinformatics from finding sequences family, detecting structural homologies of protein/DNA sequences, determining functions of protein/DNA sequences to predict patients diseases by comparing DNAs of patients in disease discovery, etc. The MSA is a NP-hard problem. In this paper, two new methods based on a cultural algorithm, namely the method of musical composition, for the solution of the MSA problem are introduced. The performance of the first and second versions were evaluated and analyzed on 26 and 12 different benchmark alignments, respectively. Test instances were taken from BAliBASE 3.0. Alignment accuracies are computed using the QSCORE program, which is a quality scoring program that compares two multiple sequence alignments. Numerical results on the tackled instances indicate that the performance levels of the proposed versions of the MMC are promising. In particular, the experimental results show that the second version found the best alignment reported in the specialized literature in 25 % of the tested instances. Besides, for 50 % of
R. A. Mora-Gutiérrez (B) · E. A. Rincón-García · A. Ponsich · J. Ramírez-Rodríguez Departamento de Sistemas, Universidad Autónoma Metropolitana Azcapotzalco, Av. San Pablo 180, Col Reynosa Tamaulipas, C.P. 02200 Mexico, DF, Mexico e-mail:
[email protected] E. A. Rincón-García e-mail:
[email protected] A. Ponsich e-mail:
[email protected] J. Ramírez-Rodríguez e-mail:
[email protected] M. E. Lárraga-Ramírez Instituto de Ingeniería, Universidad Nacional Autónoma de México, C.P. 04360 Mexico, DF, Mexico e-mail:
[email protected]
123
R. A. Mora-Gutiérrez et al.
the tested instances, the second version achieved the second best alignment. Finally, the significance of the numerical results were analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to some state-of-the-art techniques for the MSA problem. Keywords
Social algorithms · Cultural algorithms · Socio-cultural creativity
Mathematics Subject Classification
90C59
1 Introduction Multiple sequence alignment (MSA) is the most natural way to see the similarity of several sequences by making an alignment between the primary sequences so that identical or similar residues will be aligned in columns. The idea behind MSA is to compare three or more sequences and come up with some alignment which would get the highest score under the given conditions (i.e. the scoring scheme used). Similarity between sequences, can be inferred from their scores and alignments. This information can in turn be used to predict properties of some given biological sequence. Thus, a multiple sequence alignment arranges protein sequences into a rectangular array with the goal that residuals in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role [11,22]. Although these three criteria are essentially equivalent for closely related proteins, sequence, structure and function diverge over evolutionary time and different criteria may result in different alignments. The formal definition of MSA is presented as below. Let Σ be an alphabet and Σ = Σ ∪ {−}, where − ∈ Σ denotes the gap. Let S be a sequence of length l over Σ, then S = (s1 , s2 , . . . , sl ) with sk ∈ Σ. Let S = {S1 , S2 , . . . S N } be a multiset of sequences. An alignment of S is a multiset A = { S¯1 , S¯2 , . . . S¯ N } of sequences on Σ, such that all S¯i are of equal length l A and S¯i is obtained from Si only by insertion of gaps [20]. Let d : Σ × Σ → R be a scoring function. Without loss of generality, the MSA optimization problem consists in finding an alignment A such that d( S¯i , S¯ j ). Generally, d( S¯i , S¯ j ) determines the degree of difference or min (i, j)∈A 1≤i< j≤N
similarity between i-th and j-th sequences in A, however there are a great variety of score functions reported in the literature (see [20]). Manually refined alignments continue to be superior to purely automated methods. Computing exact MSAs is computationally impossible because of the NP-hard nature of the problem, in practice heuristics are used to align sequences, by maximizing their similarity. Consequently, the number of available MSA methods has steadily increased over the last 20 years from multi-dimensional dynamic programming, iterative methods, via progressive, tree-based programs to the more recent methods combining several complementary algorithms and/or 3D structural information [37]. Some of these methods are reviewed in [7,9,27,29]. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods already managed to deliver biologically perfect MSAs. Thus, the development of new methods aiming at improving the speed and memory usage to accommodate the rapid increase in available sequence data has become a continuous necessity.
123
Adaptation of the method of musical composition
This work presents two adaptations of the MMC, which mimics a sociocultural creativity system involved in the musical composition process, for solving of the MSA. Cultural algorithms are guided by changes in the social environment, where agents interact among themselves and the environment. These interactions induce a learning process that leads individuals to adapt to their current environment faster than biological evolution. In particular, the MMC is based on the working mode of artificial societies that use a dynamic and creative system to compose music, and mimics the creative process of musical composition by agents that can create and exchange artworks. The main difference between the two versions of the MMC proposed in this paper is the objective function. The performance of the first version was evaluated on 26 different benchmark alignments, while, the second version was evaluated on 12 benchmark alignments. The test instances were drawn from BAliBASE 3.0 [3]. Alignment accuracies are computed using the QSCORE program, which is a quality scoring program that compares two multiple sequence alignments: an alignment to be evaluated (the “test” alignment) and a second alignment that is believed to be correct (the “reference” alignment). The QSCORE program outputs four scores: the PREFAB Q score (the Balibase SPS score or the Developer score), the modeler score, the Cline et al,. shift score, and the Balibase TC (total column) score. The obtained numerical results of our methods were compared to those resulting from other recent metaheuristics. Numerical results on the tackled instances indicate that the performance of both versions of the modified MMC are promising, though their behavior is different. In particular, the experimental results show than the second version has a good behavior, since for 25 % of the tested instances, the algorithm found the best alignment reported in specialized literature. Besides, for 50 % of the tested instances the second version of the modified MMC achieved the second best alignment. Finally, the significance of the numerical results was analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to ALIGNER, ALIGNER∗ , MUSCLE, TCOFFEE, MAFFT, PRRP, DIALIGN, DIALIGN-T, DIALIGN-TX, PROBCONS, PSALIGN1, PSALIGN2, KALIGN, PROBALIGN, PCMA, ALIGNM, MUMMALS, and AMAP. The remainder of the paper is organized as follows. In Sect. 2 the two versions of the MMC are defined. A description of the methodology used for the computational experiments is presented in Sect. 3. The analysis and discussion of the associated results are provided in Sect. 4. Finally, some conclusions and possible paths for future research are presented in Sect. 5.
2 Adapting the MMC algorithm to solve the MSA problem 2.1 Description of the MMC The MMC is an algorithm that mimics a creativity system involved in the musical composition process. The design of the model takes into account that musical composition can be viewed as an algorithm that is developed within a creative system, where a composer learns from his own experience and from others composers. Inter-
123
R. A. Mora-Gutiérrez et al.
actions between composers then involves the formation of social networks. The MMC is grounded on an agent-based model of a specific social network, which is made up of a set of Nc vertices or agents (hereafter called composers) and a set E of edges or links (which represents relationships between composers). Each composer has foreknowledge and a set of mechanisms and policies for interaction, whereby can communicate and exchange information with other composers. The foreknowledge is defined as a set of solutions or “tunes”, and each tune is a solution. A brief description of the MMC is given in the Algorithm 1. Algorithm 1: The basic MMC algorithm 1 2
3 4
5
6 7 8
Input: The operating parameters of the MMC and data about the instance to solve. Output: The best solutions generated by the composers society. Generate a social network with Nc agents (composers) and a set of links (or edges), which is generated in a random way. For each of the Nc agents, a set of solutions (artworks) is randomly generated. Each generated set, will be referred as to “scoring matrix”, representing the foreknowledge of an agent, and it will be denoted by P∗,∗,i while the stopping criterion is not fulfilled, the following process is carried out for each agent i, where i = 1, 2, . . . Nc do Exchange information between agents. An agent i will exchange information with any agent j subject to i = j and there is an edge joining them, and the worst solution in P∗,∗, j is better than the worst solution in P∗,∗,i , that is, tune j,wor st > tunei,wor st . Generate a new solution (tune) for the i − th agent. Based on the selected information in the previous step, the i − th agent builds his/her acquired knowledge as a matrix I SC∗,∗,i . Then, the i − th agent creates his/her background knowledge, K M∗,∗,i , which is formed from a concatenation of P∗,∗,i and I SC∗,∗,i , from which a new solution (tune), xi,new will be generated. Update the solution of the i − th agent. Then, the i − th agent (composer) will decide if tunei,wor st should be replaced by xi,new based on a quality criterion. end The final best solution of each composer is included in the set of solutions.
For more detailed information about the MMC, we refer the reader to [22–24]. 2.2 Adapting the MMC to solve the MSA problem In order to adapt the MMC to solve the MSA problem, the following aspects were considered: (a) a feasible alignment is easily generated through a binary representation of the MSA; (b) in order to solve the MSA problem, dynamic programming and a divide and conquer strategy are combined; (c) the overall approach relies on the definition of an objective function (OF) describing the quality of the MSA. In the first version of our method, the objective function is defined as Euclidean distance, which relations the coffee function, the SP-score function and the gap penalty function. On the other hand, the OF defined in the second version is based on a multiple linear regression, which combines the coffee function, the SP-score, the gap penalty function, the number of aligned sequences, the average of the lengths of the N aligned sequences, their length variance and Euclidean distance. The general structure of both versions is presented in Algorithm 2. N lk is the sum of the length The lmax is the length of the longer sequence, while k=1 of all the sequences, L is the length of the j-th alignment generated by the i-th com-
123
Adaptation of the method of musical composition
Algorithm 2: MMC adaptation to solve MSA
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Input: MMC parameters and data about the instance to solve and substitution matrix (in this paper the Blosum 62 matrix is used). Output: The best alignment found by the society of composers. Generate a library of the alignments, in which an alignment is included for each pair of sequences (see Algorithm 3 ) Determine the score for each pair of sequences (scor e(A)) based on substitution matrix (see Algorithm 3 ) Change all negative elements in scor e(A) to positive (this step was only used in the second version, see Algorithm 4) Create an artificial society with specific interaction rules among agents. for every agent-composer in the society do for j = 1 : Ns do N L i, j ← lmax + rand ∗ ( k=1 lk − lmax ) Randomly generate a feasible alignment, denoted as tunei, j, (see Algorithm 5) Evaluate the objective function of tune tunei, j, (see Algorithm 8 for the first version and Algorithm 9 for the second version ) end Register Ns created tunes in the P,,i of the composer i end repeat Update links in the artificial society. Exchange information among agents. for each agent-composer in the society do Generate the knowledge matrix (K Mi, j ) of the i − th composer Evaluate the fitness of all tunes in the K Mi, j (see Algorithm 10) Generate and evaluate a new tune tunei,new for the i − th composer (see Algorithm 12) Determine the tune with the worst value of objective function in P,,i (tunei,wor st ) Update P,,i end Build the new set of solutions with the best solution found for each composer . until The termination criterion is satisfied; Select the best solution found in the society.
poser; the objective function and coffee function are calculated through Algorithms 8 and 9 for the first and the second version of the MMC, respectively. The steps of Algorithm 2 can be divided into nine phases (a) initialization of the algorithm (from input to step 3), (b) generation of an artificial society (step 4), (c) generation and evaluation of a set of Ns initial solutions for each composer in the society (from step 5 to step 12), (d) information exchange between composers (from step 14 to 15), (e) generation and evaluation of a new tune for each composer in the society (steps 17–19) (f) updating the foreknowledge matrix of each composer, P,,i , (steps 20–21), (g) construction of the set of solutions found by composers (step 23), (h) stopping criterion test (steps 13–24); (i) selection of the best solution based on the objective function (step 25). These phases are presented in more detail in the following subsections. 2.2.1 Initialization of the algorithm This phase includes the introduction and determination of the information, used in the subsequent phases of the algorithm. Initially, the problem characteristics and algorithm
123
R. A. Mora-Gutiérrez et al.
operating parameters are set as an input of the MMC algorithm. Then, the library of the alignments and the corresponding scor e(A) are determined by Algorithm 3. Algorithm 3: Generating an alignment library
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Input: N sequences to align and Scoring Matrix (Blosum 62) Output: Library of the alignments (Librar y), cost of pairwise sequences aligned scor e(A), index matrices, (index) and (index − f ), indicating the starting and the ending points for each sequence of the alignment, respectively Librar y ← ∅ index ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do pair wisealigned ← ∅ s1 ← a − th sequence s2 ← b − th sequence Generate a local alignment between s1 and s2 Let M be the most similar region(s) between s1 and s2 sequences, determined in the previous step indexa,b ← starting of the local alignment in the a − th sequence indexb,a ← starting of the local alignment in the b − th sequence index − f a,b ← ending of the local alignment in the a − th sequence indexb,a ← ending point of the local alignment in the b − th sequence costlocal the cost associated to the local alignment sl−1 is the left subsequence in s1 not included in M sl−2 is the left subsequence in s2 not included in M sr −1 is the right subsequence in s1 not included in M slr −2 is the right subsequence in s2 not included in M Generate a global alignment between sl−1 and sl−2 Ml is the global alignment between sl−1 and sl−2 sequences, which was determined in the previous step. costglobal−l is the cost associated to the global alignment between sl−1 and sl−1 . Generate a global alignment between sr −1 and sr −2 Mr is the global alignment between sr −1 and sr −2 sequences, which was determined in the previous step. costglobal−r is the cost associated to the global alignment between sr −1 and sr −1 pair wisealigned ← Ml ∪ M ∪ Mr scor e(A)i, j ← costlocal + costglobal−l + costglobal−r end Librar y ← Librar y ∪ pair wisealigned end
For the second version of the MMC, all elements of scor e(A) are converted to positive numbers, through Algorithm 4. 2.2.2 Initial solutions generation In this phase, the set of initial solutions (P) is build in a random way through Algorithm 5. The j-th solution of the i-th composer is denoted by tunei, j, . In order to facilitate the management of information, tunei, j, is associated with two representasymbolic binar y symbolic , and the binary tune, tunei, j, . tunei, j, tions: the symbolic tune, tunei, j, is a feasible alignment of the set of sequences s1 = [c(1,1) , c(1,2) , . . . , c(1,|s1 |) ], s2 = [c(2,1) , c(2,2) , . . . , c(2,|s2 |) ], . . ., s N = [c(N ,1) , c(N ,2) , . . . , c(N ,|s N |) ]. The gensymbolic eral structure of tunei, j, is shown in Eq. (1).
123
Adaptation of the method of musical composition
Algorithm 4: Transformation scor e(A) 1 2 3 4 5 6 7
Input: scor e(A) Output: scor e(A) updates a is the minimum element in scor e(A) a ← abs(a) for a = 1 : N − 1 do for b = a + 1 : N do scor e(A)i, j ← scor e(A)i, j + a + 1 end end
⎡
symbolic
tunei, j,
s1,1 s1,2 ⎢ s ⎢ 2,1 s2,2 =⎢ .. ⎢ .. ⎣ . . s N ,1 s N ,2
. . . s1,L i, j . . . s2,L i, j .. ... . . . . s N ,L i, j
⎤ ⎥ ⎥ ⎥ ⎥ ⎦
(1)
where sa is generated by insertion of gaps in sa , L i, j = |sa | for all a = 1, . . . , N binar y On the other hand, tunei, j, is a binary matrix with N rows and L i, j columns. , of tune element of the tune If the sa,l is a character, then the ψa,l i, j, i, j, is 1, else ψa,l is 0. Since, each tunei, j, is generated from a feasible alignment, then it satisfies both the set of Eq. (2) and the set of inequalities (3). The set of Eq. (2) involves the a-th row of tunei, j, , which contains all elements of the l-th sequence; while the set of inequalities (3) implies that the l-th column in the alignment must contain at least one element with value 1. symbolic
binar y
L i, j
ψa,l = |sa | ∀a = 1, . . . , N
(2)
ψa,l ≥ 1 ∀l = 1, . . . , L i, j
(3)
l=1 N a=1
Initially, each tunei, j, is randomly generated based on Algorithm 5. In row 40 of Algorithm 7, the sequence sb is realigned based on a strategy that combines dynamic programming and divide and conquer methods. The realigning strategy is shown in Algorithm 6. tunei, j, is made possible with Algorithm 7, where Eq. (2) and inequalities (3) are implemented. Besides, the value of L is also computed again in this algorithm, in such symbolic a way that it is equal to the length of the feasible tune tunei, j, . Then, tunei, j, is evaluated through Algorithms 8 and 9, for the first and second versions of the MMC method, respectively. A combination of objective functions is used in each version of the adapted MMC. The objective functions adopted in this work have been used in several previous papers (see [1,17,26]).
123
R. A. Mora-Gutiérrez et al.
Algorithm 5: Generate the set of initial solutions Input: Set of N alignment sequences , Nc , Ns , index,index − f , L, i and j symbolic symbolic binar y Output: P,,i (which is a set integrate to tunei, j, ) and P,,i (which is a set integrate to binar y
tunei, j, ) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
P symbolic ← ∅ P binar y ← ∅ binar y
tunei, j,
for a = 1 : N do a3 ← randomly selected value within the index in the a − th column a4 ← |sa3 | − a3 a5 ← L − a2 a6 ← a2 a7 ← 1 for l = 1 : L do if l < a2 then a p1 ← a 3 6 if rand < p1 then symbolic tunei, j,a,l is the a7 − th character in the a − th sequence binar y
tunei, j,a,l ← 1 a7 ← a8 + 1 a3 ← a3 − 1 else symbolic tunei, j,a,l ← −
18 19 20 21 22
binar y
tunei, j,a,l ← 0 end a6 ← a6 − 1 else a p2 ← a 4 5 if rand < p2 then symbolic tunei, j,a,l is the a7 − th character in the a − th sequence
23 24 25 26 27 28 29
binar y
tunei, j,a,l ← 1 a7 ← a8 + 1 a4 ← a4 − 1 else symbolic tunei, j,a,l ← −
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
←∅
symbolic tunei, j, ←∅ indexa ← maximum value in index indexb ← maximum value in index − f a1 ← L − (indexa + indexb ) a2 ← indexa + r ound(rand ∗ (a1 ))
binar y
tunei, j,a,l ← 0 end a5 ← a5 − 1 end end Use dynamic programming for realigning the sb sequence in tunei, j, (see Algorithm 6) symbolic
Make possible tunei, j,
binar y
and tunei, j,
symbolic Include tunei, j, in Pi, j symbolic binar y binar y Include tunei, j, in Pi, j
end
123
through Algorithm 7
Adaptation of the method of musical composition
Algorithm 6: Strategy for realigning the sb − th sequence symbolic
binar y , tunei, j, and the set of N aligning sequences symbolic binar y Output: tunei, j, realigned and tunei, j, realigned symbolic 1 Randomly select a sequence of tune i, j, 2 sa ← the sequence selected in the previous step
Input: tunei, j,
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Randomly select a sequence of the set of N aligning sequences, different from the sa − th sequence sb ← the sequence selected in the previous step Generate a local alignment between sa and sb M ← the most similar region(s) between sa and sb sequences, determined in the previous step Mb ← the local alignment of the sb − th sequence sl−a ← the left subsequence in sa not include in M sl−b ← the left subsequence in sb not include in M sr−a ← the right subsequence in sa not include in M sr −b ← the right subsequence in sb not include in M and s Generate a global alignment between sl−a l−b Ml ← the global alignment between sl−a and sl−b sequences, determined in the previous step Mlb ← the global alignment of the sl−b − th sequence Generate a global alignment between sr−a and sr −b sequences, determined in the previous step Mr global alignment between sr−a and sr −b sequences, determined in the previous step. Mrb ← the global alignment of the sr −b − th sequence sb ← Mlb ∪ Mb ∪ Mrb for l = 1 : L do is a character then if sb,l −binar y
sb,l ←1 else −binar y sb,l ←0 end end
by sb binar y −binar y 27 Replace b − th row of the tune i, j, by sb 26
symbolic
Replace b − th row of the tunei, j,
The objective function applied in the first version is based on the Euclidean distance that combines three functions: the coffee function, the SP-score based on exact similitude, and a modification of the gap penalty function. On the other hand, the objective function applied in the second version is based on a multiple linear regression model. In order to generate the linear regression model, the coffee function, the SP-score based in exact similitude, the gap penalty function, the Euclidean distance, the number of aligned sequences, the average of the lengths of the N aligned sequences, and the variance of the lengths of the N aligned sequences, are used as independent variables. In this model βg (∀g = 0, . . . , 7) are the coefficients computed. 2.2.3 Communications among composers In this phase, composers exchange information using a specific interaction policy, defined in this work as: “composer i learns from composer k, if there is a link between them and if the artwork of composer k has more desirable characteristics than the
123
R. A. Mora-Gutiérrez et al.
symbolic
Algorithm 7: Make possible tunei, j, symbolic
Input: tunei, j,
binar y
,tunei, j, , N and L i, j
symbolic
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
binar y
and tunei, j,
binar y
Output: tunei, j, feasible and tunei, j, feasible for a = 1 : N do L a1 ← l=1 ψa,l if a1 = lsa then binar y Include or delete ones tunei, j, , while a1 = lsa end end binar y
Delete columns of tunei, j, , which do not satisfy inequality (3) L i, j is the length of the feasible tunei, j, symbolic
tunei, j, ←∅ for a = 1 : N do a2 ← 1 for l = 1 : L do binar y if tunei, j,a,l = 1 then
symbolic
In tune tunei, j,a,l put the a2 − th character of the a − th sequence a2 ← a2 + 1 else symbolic tunei, j,a,l ← − end end end
artwork of composer i”. These desirable characteristics are simply evaluated in terms of the objective function. This operating mode involves the three following steps: (a) the i-th composer decides if changing his links with other composer in the society (links between composers were updated trough the method presented in [24]), (b) composer i determines which of his tunes is the least desirable, this tune is denoted as tunei,wor st , (c) the i-th composer compares the objective function of his tunei,wor st versus the objective function of the tunek,wor st of the k-th composer, if there is a link between both composers, (d) if tunek,wor st is better than tunei,wor st then the i-th composer randomly selects a tune of the k-th composer and incorporates the selected tune to his acquired knowledge (I SC,,i ).
2.2.4 Generating and evaluating a new tune In this phase, each composer creates a new tune based on his knowledge (K Mi ). This phase is integrated by two sub-phases, which determine useful information and generate a new solution. The first sub-phase starts with the building of the K Mi , in such a way that the i-th composer combines his/her P,,i and his/her I SC,,i . Then, the i-th composer determines the fitness of the tunes in his K Mi (denoted by f itness(K Mi )) through
123
Adaptation of the method of musical composition
Algorithm 8: Evaluating solutions for the first version of the MMC symbolic
1 2 3 4 5 6 7
Take the sa − th sequence of tunei, j,
symbolic
8
symbolic Take the sb − th sequence of tunei, j, Librar ya is the a − th sequence aligned with the b − th sequence, which is included in
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
binar y
Input: tunei, j, feasible, tunei, j, feasible, L i, j , scor e(A), N , L i, j and Librar y Output: value of objective function i, j , value of coffee function i, j Objective function i, j ← ∅ Coffee function i, j ← ∅ Exact similarity function i, j ← ∅ density function i, j ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do Υ ←∅
Librar y Librar yb is the b − th sequence aligned with the a − th sequence, which is included in Librar y for l = 1 : L i, j do = − ) or (s = − ) then if (sa,l b,l Υ1,l = sa,l Υ2,l = sb,l end end is the number of columns of Υ L a,b ιa,b ← scor e(A)a,b ∗ L a,b Wa,b ← 0 ωa,b ← 0 for l = 1 : L a,b do if (Υ1,l = Librar ya,l ) or (Υ2,l = Librar yb,l ) then Wa,b = Wa,b + 1 end if (Υ1,l = Υ2,l ) then ωa,b = ωa,b + 1 end end end end N −1 N
b=a+1 Wa,b ∗scor e(A)a,b b=a+1 L a,b∗scor e(a,b)∗scor e(A)a,b N −1 N b=a+1 ωa,b a=1 31 Exact similarity function i, j ← N −1 N L a,b b=a+1 ⎥ ⎢ a=1 ⎢ N binar y ⎥ ⎢ l=a tunei, j,a,l ⎥ ⎦ ⎣ N L i, j 32 Density function i, j ← l=1 L i. j
30
Coffee function i, j ← N −1 a=1 N a=1
33
ιi, j ←
(co f f ee f unction i, j )2 + (exact similarit y f unction i, j )2 + (densit y f unction i, j )2
34
Objective function i, j ← ιi, j
Algorithm 10. Subsequently, the i-th composer builds the harmonic matrix (which is denoted by M Hi ) with Algorithm 11. The second sub-phase generates and evaluates a new tune for each composer in the society. The corresponding procedure is based on each composer and the set of constraints in Table 12. Its process is described in Algorithm 12.
123
R. A. Mora-Gutiérrez et al.
Algorithm 9: Evaluating solutions for the second version of the MMC symbolic
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
binar y
Input: tunei, j, feasible, tunei, j, feasible, L i, j , scor e(A), N , L i, j and Librar y Output: value of objective function i, j , value of coffee function i, j Objective function i, j ← ∅ Coffee function i, j ← ∅ Exact similarity function i, j ← ∅ Density function i, j ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do Υ ←∅ Take the sa − th sequence of tunei, j,
symbolic
symbolic Take the sb − th sequence of tunei, j, Librar ya is the a − th sequence aligned with the b − th sequence, which is included in
Librar y Librar yb is the b − th sequence aligned with the a − th sequence, which is included in Librar y for l = 1 : L i, j do = − ) or (s = − ) then if (sa,l b,l Υ1,l = sa,l Υ2,l = sb,l end end is the number of columns of Υ L a,b ιa,b ← scor e(A)a,b ∗ L a,b Wa,b ← 0 ωa,b ← 0 for l = 1 : L a,b do if (Υ1,l = Librar ya,l ) or (Υ2,l = Librar yb,l ) then Wa,b = Wa,b + 1 end if (Υ1,l = Υ2,l ) then ωa,b = ωa,b + 1 end end end end N −1 N
b=a+1 Wa,b ∗scor e(A)a,b b=a+1 L a,b∗scor e(a,b)∗scor e(A)a,b N −1 N b=a+1 ωa,b a=1 31 Exact similarity function i, j ← N −1 N L a,b b=a+1 ⎥ ⎢ a=1 ⎢ N binar y ⎥ ⎢ l=a tunei, j,a,l ⎥ ⎦ ⎣ N L i, j 32 Density function i, j ← l=1 L i. j
30
Coffee function i, j ← N −1 a=1 N a=1
33
34 35
ιi, j ← (co f f ee f unction i, j )2 + (exact similarit y f unction i, j )2 + (densit y f unction i, j )2 μl is the average of the lengths of the N aligned sequences σl2 is the variance of the lengths of the N aligned sequences Objective f unction i, j ← β0 + (β1 ∗ μl ) + (β2 ∗ σl2 ) + (β3 ∗ N ) + (β4 ∗ co f f ee f unction i, j ) +(β5 ∗ exact similarit y f unction i, j ) + (β6 ∗ densit y f unction i, j )+ (β7 ∗ ιi, j )
123
Adaptation of the method of musical composition
Algorithm 10: Determining the fitness of K Mi
1 2 3 4 5 6 7 8 9 10 11 12 13
Input: Set of values obtained of the coffee functioni,: , the SP-scorei,: and the gap penalty function i,: for the i − th composer,K Mi Output: f itness(K Mi ) a1 is the number of tunes in K Mi α1,a ← ∅ for a = 1 : a1 do a2 ← (1 − co f f ee f unction i,a )2 a3 ← (1 − S P − scor e f unction i,a )2 a4 ← (1 − gap penalt y f unction i,a )2 √ α1,a ← a2 + a3 + a4 end αmax ← is the maximum value contained in α1,a for a = 1 : a1 do 1 ← (α α1,a max − α1,a ) + a1 end a1 γ1 ← a=1 for a = 1 : a1 do α
f itness(K Mi,a ) ← γ1,a 1
14 15
end
Algorithm 11: Building M Hi 1 2 3 4 5
Input: K Mi ,L i, j ,N Output: M Hi , φ φ ← max(L i, j ) binar y
Add column vectors of zeros in each tunei, j, of K Mi until each tunei, j, has a length equal to φ for a = 1 : N do for l = 1 : φ do M Hi,a,l is the average of values contained in the a − th row and l − th column of all binar y
6 7 8
tunei, j, included in K Mi M Hi,a,l ← r ound(M Hi,a,l ) end end
2.2.5 Updating P,,i In this step, each composer makes a decision on either replacing or not the worst tune in P,,i with tunei,new . The decision is based on the value of the objective function and the coffee function. Algorithm 13 illustrates the procedure used for this purpose.
2.2.6 Building the set of solutions In this phase, the MMC algorithm selects the melody contained in the artwork of every composer that achieves the best objective function value. Then, the solution with the best objective function of each composer in the society is included in S, .
123
R. A. Mora-Gutiérrez et al.
Algorithm 12: Generating a new tune Input: K Mi , M Hi , i f g,c f g Output: tunei,new 1
2 3
K M ← ∅ Take the tunei, j, content in K Mi,: that is associated to the best value of the f itness(K Mi ). the tunebinar y taken in the previous step. Assign to K M1,: i, j, binar y
binar y
Recalculate f itness(K Mi ) without considering the tunei, j,
appropriated in the step 1, this
recomputed fitness is denoted as f itness(K Mi )
4 5 6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
content in K Mi,: based on f itness(K Mi ) binar y Assign to K M2,: tunei, j, taken in the previous step. binar y Randomly choose a tunei, j, content in K Mi,: binar y Assign to K M3,: the tunei, j, taken in the previous step. binar y L K Mi contained the numerical values of lengths associate with tunei, j, in steps 2, 5 and 7 Randomly choose an element of L K Mi , denoted L new binar y
Probabilistically choose a tunei, j,
block1 ← ∅ block2 ← ∅ block3 ← ∅ Determine the number of columns ([L 1 L 2 L 3 ]) that are contained in each blockk ∀ k = 1, . . . , 3. The elements L k ∀ k = 1, 2, 3 must satisfy the following constraint L new = 3k=1 L k , L k ≥ 1, L k ≤ L new − 2 and L k ∈ Z binar y tunei,new ← ∅ a1 = 0 if rand ≤ (1 − i f g) then for k = 1 : 3 do a2 ← r ound(1 + rand ∗ (2)) a3 = a1 + L k a1 = a1 + 1 if rand ≤ (1 − c f g) then blockk ← K Ma 2 ,a1 :a3 else blockk ← M H1,a 1 :a3 end if rand ≤ c f g then Randomly select an element in blockk and convert 0 to 1 and 1 to 0 end binar y
binar y
tunei,new ← blockk ∪ tunei,new end binar y
Make possible tunei,new with Algorithm 7 simboly
binar y
Generate tunei,new based on tunei,new else Randomly generate a tunei,new based on Algorithm 5 end Evaluate tunei,new through Algorithm 8 or 9 for both MMC versions
2.3 Satisfying the termination criterion and selecting the best solution Steps from 13 to 24 of Algorithm 2 are repeated max _arrangement times. Subsequently, the tune in S with the best objective function value is selected.
123
Adaptation of the method of musical composition
Algorithm 13: Updating P,,i 1 2 3 4 5 6 7 8 9 10 11 12 13
Input: objective function of P,,i , coffee function of P,,i , P,,i and tunei,new , Output: P,,i updated for i = 1 : N do for j = 1 : L do if the objective function of tunei,new is better than the objective function of tunei,wor st then Replace tunei,wor st by tunei,new in P,,i . else if the objective function of tunei,new is equal to that of tunei,wor st then if the coffee function of tunei,new is better than the coffee function of tunei,wor st then Replace tunei,wor st by tunei,new in P,,i . end end end end end
Computational experiments performed in order to test both versions of the proposed algorithm for several instances of multiple sequence alignment problem are presented in the following section. 3 Experimental methodology and test problems In this section, the performance of the proposed method is investigated through of numerical simulation. 3.1 Methodology The performance of the proposed methods was evaluated and analyzed for different benchmark alignments, taken from BAliBASE 3 [3,38,39]. In particular, we used 26 and 12 instances, which were randomly selected from BaliBase 3.0, for the first and second version of the proposed algorithm, respectively. These problems are included in the references from 1 to 5. Table 1 shows the instances considered for evaluation of the first version of our algorithm, where the name of the reference, the type of problem and its characteristic are depicted. Due to the stochastic nature of the MMC, 20 independent runs on each problem instance were carried out for both methods. The alignments found by each instance were evaluated with QSCORE program [10], which was downloaded from the website http://www.drive5.com/qscore/. The QSCORE is a scoring program that compares two multiple sequence alignments. The results include the following scores: the (a) Q score, (b) the modeler score, (c) the Cline et al., shift score and (d) The TC score. In the following, these scores are briefly described: (a) The Q score, also called SP score, is defined as the number of correctly aligned residual pairs that are found in the test alignment divided by the total number of
123
R. A. Mora-Gutiérrez et al. Table 1 Instances considered to evaluate two versions of the proposed algorithm References
Instance name
Characteristics of the instance N
[1]
a
b
μj
σ 2j
BB11002
8
193
52
98
2,117.142857
BB11007
9
457
385
418.3333333
698.25
BB11020
9
236
201
219.1111111
191.8611111
BB11030
14
392
236
291.7857143
2,189.873626
BB11035
5
138
71
97.4
778.3
BBS11016
8
345
305
324.125
184.9821429
BBS11030
14
220
169
184.2142857
206.9505495
BBS11034
8
355
311
336.75
247.0714286
BB12003
8
85
58
71.25
138.2142857
BB12005
9
234
197
209
163.5
BB12010
7
415
295
348.2857143
1,606.904762
BB12023
5
738
681
704.2
674.7
BBS12002
6
190
165
182
79.2
BBS12008
13
381
351
367.2307692
58.69230769
BBS12013
8
714
665
685.75
281.0714286
BBS12020
4
126
116
123
22
BBS12031
10
1,006
827
917.5
3,770.722222
BBS12037
13
509
451
480.9230769
190.9102564
BB20002
20
1,331
52
592.3
144,274.1158
BB20015
37
274
54
193.9189189
2,899.465465
BB20023
31
244
202
217.6774194
114.8258065
BB20030
47
155
76
108.8723404
538.2876966
BBS20027
29
271
237
250.7586207
90.6182266
[3]
BB30020
56
631
76
151.3392857
12,546.95552
[4]
BB40014
9
609
298
445.2222222
15,996.69444
[5]
BBS50016
18
523
303
361.6111111
4,192.486928
[2]
Where: N is the number of sequences, a is the longest sequence, b is the shortest sequence, μ j is the average length of sequences and σ 2j is the variance of the lengths of the sequences
aligned residual pairs of the reference alignment, increasing with the number of sequences correctly aligned. The Q score ranges from 0 to 1, values near to 1 indicate a better alignment [3,10,28]. (b) The modeler score indicates the fraction of residual pairs in the test alignment that are properly aligned compared to the reference. The modeler score ranges from 0 to 1. (c) The Cline et al. shift score is based on a measure of disagreement between two alignments, which includes penalties for both over and under alignment, and gives positive but reduced scores for slight misalignment. The Cline et al. shift score range from −∞ to 1.
123
Adaptation of the method of musical composition
(d) The TC score checks for identical columns in each set of aligned sequences. The TC score ranges from 0 to 1. We decided not to use the scoring program Bali-score, which was proposed by Thomson in [3], based on those reported in [4,5,41] The model of multiple linear regression is used as the objective function in the second version of the MMC, see Eq. (4). In this model, some characteristics of each instance (μ j , σ j and N ) are combined to the numerical values of the function ι, the coffee function (co f f ee), the SP-score function (S P), the gap penalty function (gap) determined by the first version of the MMC in each run of all test instances, to produce the numerical values of the corresponding QSCORE. Q scor e j = β0 + β1 ∗ μ j + β2 ∗ σ j2 + β3 ∗ N + β4 ∗ co f f ee + β5 ∗ S P + β6 ∗ gap + β7 ∗ ι
(4)
The value of parameters β obtained through regression are shown in Eq. (5). Q scor e j = 0.061727889 − 0.00015888 ∗ μ j + 1.13349E − 06 ∗ σ j2 − 0.003121622 ∗ N − 4.3176E − 07 ∗ co f f ee + 2.532070999 ∗ S P + 0.054386105 ∗ gap + 8.22621E − 07 ∗ ι j (5) As previously mentioned, Eq. (5) was used as the objective function in the second version of our algorithm, so that objective f unction j = Q scor e j . 3.1.1 Comparison with other methods The numerical results obtained from both versions of the adapted MMC are compared with those achieved from 23 algorithms reported in the literature (see [21]). The methods used by the comparison are ALIGNER [14], ALIGNER [14], MUSCLE [10], TCOFFEE [26], MAFFT [13], CLUSTALW [12,37], DIALIGN [25], DIALIGNT [33], DIALIGN-TX [34], MULTALIN [6], PILEUP8 [40], PROBCONS [8], POAV [18], PSALIGN1 [36], PSALIGN2 [36], KALIGN [16], PROBALIGN [32], PCMA [30], ALIGN-M [42], PRANK [19], MUMMALS [31], SAM [15] and AMAP [35]. With the aim of comparing the results obtained by different metaheuristics on each case, all the obtained solutions were normalized through the following equation: f (x nor mali zed-α ) =
f (x method-α ) − f (x ) f (x wor st in β ) − f (x )
(6)
where: f (x ) is the value of the objective function at the global optimal point, f (x method-α ) is the average value of the objective function found by the metaheuristic α, f (x wor st in β ) is the worst average of the objective function found by metaheuristics on case β, and f (x nor mali zed-α ) is the normalized value of the objective function found by metaheuristic α.
123
R. A. Mora-Gutiérrez et al. Table 2 Results of the Q score obtained by the first version
Instance
Best
Worst
Average
Variance
BB11002
0.296
0.0987
0.192285
0.001815172
BB11007
0.458
0.0698
0.184375
0.010373354
BB11020
0.429
0.0363
0.216085
0.013279331
BB11030
0.0436
0.0051
0.020551
0.000131852
BB11035
0.522
0.373
0.4896
0.002148042
BBS11016
0.135
0.0571
0.10568
0.000440703
BBS11030
0.207
0.00557
0.0329835
0.002070051
BBS11034
0.14
0.0211
0.05745
0.000735445
BB12003
0.831
0.326
0.56025
0.016684303
BB12005
0.856
0.15
0.7034
0.039586147
BB12010
0.442
0.0522
0.1553
0.010551545
BB12023
0.197
0.0772
0.109095
0.000867295
BBS12002
0.851
0.569
0.7646
0.008418568
BBS12008
0.362
0.12
0.2174
0.004101937
BBS12013
0.853
0.278
0.7516
0.025677516
BBS12020
0.326
0.0556
0.115085
0.005312096
BBS12031
0.601
0.0457
0.36283
0.032786162
BBS12037
0.0847
0.0451
0.05837
0.000117206
BB20002
0.285
0.024
0.0836
0.004943591
BB20015
0.0229
0.00258
0.007826
1.93744E−05
BB20023
0.436
0.0352
0.151265
0.012279707
BB20030
0.791
0.606
0.71185
0.002845818
BBS20027
0.139
0.0643
0.094345
0.000396886
BB30020
0.03
0.00662
0.0139905
4.65263E−05
BB40014
0.68
0.0793
0.235095
0.036367384
BBS50016
0.355
0.0199
0.06511
0.005101065
The value of f (x nor mali zed-α ) ranges from 0 to 1. If f (x nor mali zed-α ) is close to 0, the value of f (x method-α ) is close to f (x ). If f (x nor mali zed-α ) is close to 1, the value of f (x method-α ) is far from f (x ).
3.1.2 Statistical validation A non-parametric Wilcoxon rank sum test was applied to the results obtained both by the MMC and by other methods. The null hypothesis is that data in two solution sets are independent: a test value h = 1 indicates a rejection of the null hypothesis at the 5 % significance level, while h = 0 indicates a failure to reject the null hypothesis at the 5 % significance level. Parameters p (standing for the symmetry and mean of the distribution) and h (which is the result of the test hypothesis) are also computed using this statistical test.
123
Adaptation of the method of musical composition Table 3 Results of the modeler score obtained by the first version
Instance
Best
Worst
Average
Variance
BB11002
0.0841
0.0319
0.06324
0.000200327
BB11007
0.136
0.0157
0.05987
0.000827287
BB11020
0.18
0.0136
0.08951
0.002362276
BB11030
0.0377
0.00156
0.0086915
6.65262E−05
BB11035
0.286
0.206
0.26395
0.000808787
BBS11016
0.0563
0.0218
0.04192
8.47248E−05
BBS11030
0.0955
0.00275
0.015929
0.000426907
BBS11034
0.0799
0.00798
0.024149
0.000254537
BB12003
0.518
0.192
0.34795
0.006949734
BB12005
0.61
0.106
0.50075
0.02046725
BB12010
0.281
0.00626
0.097268
0.004620921
BB12023
0.145
0.0577
0.080135
0.000485002
BBS12002
0.672
0.451
0.6044
0.005244884
BBS12008
0.285
0.0919
0.169495
0.002532764
BBS12013
0.636
0.207
0.56135
0.014337818
BBS12020
0.422
0.0517
0.095895
0.007612065
BBS12031
0.486
0.0373
0.294206364
0.021709772
BBS12037
0.0496
0.0263
0.03371
3.97567E−05
BB20002
0.0245
0.00167
0.0075625
3.50129E−05
BB20015
0.0134
0.000674
0.00226615
7.49171E−06
BB20023
0.173
0.0151
0.059065
0.001819874
BB20030
0.342
0.2
0.2413
0.000859589
BBS20027
0.526
0.0353
0.07291
0.01146213
BB30020
0.00836
0.00295
0.0044255
1.96855E−06
BB40014
0.343
0.0411
0.12233
0.009134874
BBS50016
0.22
0.0121
0.04101
0.001959607
3.2 Parameter setting for the MMC algorithm In order to carry out simulations, the set of parameters was tuned through a brute-force approach [2]. The set used by both MMC versions was maxa = 1500, i f g = 0.05, c f g = 0.05 f cla = 0.05, N = 10 and N s = 5.
4 Numerical results In this section, numerical results obtained by both proposed methods are presented. The MMC was implemented in Matlab R2007a with a MacBook Air with 1.8 GHz and intel core i7. As previously mentioned, all alignments generated by our metaheuristic were evaluated through the QSCORE program. For each tested instance, we calculated the best, the worst, and the average value for each one of scoring functions generated by Q score
123
R. A. Mora-Gutiérrez et al. Table 4 Results of the Cline et al., shift score obtained by the first version
Instance
Best
Worst
Average
Variance
BB11002
0.788
0.0302
0.11127
0.025822201
BB11007
0.529
0.00617
0.0991915
0.012553441
BB11020
0.26
0.0126
0.13654
0.004624844
BB11030
0.0762
0.00453
0.0206535
0.000215509
BB11035
0.382
0.306
0.3614
0.000614779
BBS11016
0.223
0.0129
0.04737
0.002192166
BBS11030
0.0683
0.00695
0.0246325
0.000183906
BBS11034
0.066
0.00832
0.033943
0.00027966
BB12003
0.658
0.277
0.4869
0.011650095
BB12005
0.736
0.179
0.6366
0.022634884
BB12010
0.321
0.000426
0.1129813
0.006865309
BB12023
0.109
0.0007744
0.02134672
0.000998591
BBS12002
0.79
0.568
0.73015
0.003774661
BBS12008
0.374
0.207
0.2893
0.002194432
BBS12013
0.751
0.334
0.6864
0.011698042
BBS12020
0.236
0.006268
0.0473684
0.004239938
BBS12031
0.541
0.00757
0.337879364
0.026329976
BBS12037
0.0953
0.0237
0.055235
0.000303137
BB20002
0.0365
0.000389
0.00859845
8.16546E−05
BB20015
0.0209
0.0121
0.01642
8.09116E−06
BB20023
0.289
0.0153
0.093565
0.004743081
BB20030
0.408
0.35
0.38085
0.000296029
BBS20027
0.129
0.0871
0.10718
0.000176026
BB30020
0.0102
0.000557
0.00539785
6.03102E−06
BB40014
0.493
0.00608
0.138114
0.021769693
BBS50016
0.27
0.00168
0.030181
0.003300878
from the different runs; as well as the corresponding variance. The results of the Q score, the modeler score, the Cline et al., shift score and TC score obtained by the first version on all test instances are shown in Tables 2, 3, 4 and 5, respectively. Tables 2, 3 and 4 highlight that the maximum average value is found for the instance BBS12002. On the other hand, Table 5 shows that his maximum average value is obtained for instance BBS12013. The runtime required for the first version for each of the tested instances is exhibited in Table 6. As can be observed, the runtime is directly related with the number of the aligned sequences. In Tables 7, 8, 9 and 10, the corresponding results of the Q score, the modeler score, the Cline et al., shift score and TC score, respectively, obtained by the second version of the proposed algorithm for all test instances are shown. These table prove that the maximum average value is produced for the instance BB12005. In addition, the maximum average value for the TC score is achieved for instance BB12003. On the other hand, Tables 2, 5, 7 and 10 show, the corresponding results of the Q score, the
123
Adaptation of the method of musical composition Table 5 Results of the TC score obtained by the first version
Instance
Best
Worst
Average
Variance
BB11002
0
0
0
0
BB11007
0
0
0
0
BB11020
0
0
0
0
BB11030
0
0
0
0
BB11035
0.351
0.135
0.31185
0.005408029
BBS11016
0
0
0
0
BBS11030
0
0
0
0
BBS11034
0
0
0
0
BB12003
0.486
0
0.141765
0.026462417
BB12005
0.606
0
0.344515
0.050928336
BB12010
0
0
0
0
BB12023
0
0
0
0
BBS12002
0.65
0.234
0.49645
0.015809629
BBS12008
0.332
0
0.058244
0.005763812
BBS12013
0.677
0.0932
0.53401
0.02978956
BBS12020
0
0
0
0
BBS12031
0
0
0
0
BBS12037
0
0
0
0
BB20002
0
0
0
0
BB20015
0
0
0
0
BB20023
0
0
0
0
BB20030
0
0
0
0
BBS20027
0
0
0
0
BB30020
0
0
0
0
BB40014
0.329
0
0.03155
0.008590366
BBS50016
0
0
0
0
modeler score, the Cline et al., shift score and TC score, respectively, obtained with the second version of the proposed algorithm on all the test instances. From these tables, it can be observed that higher numerical values for all score functions are obtained by the second version, indicating that its performance is better than the obtained from the first version of the MMC to solve the MSA. The runtime of the second version for each of the tested instances is exhibited in Table 11. It can be seen from the Tables 6 and 11 that the results obtained from both version of the modified MMC do not differ significantly. The average results of Q score, obtained by two versions of MMC, were used as a basis for the comparative study. Normalized results generated in the comparison of the two versions of the modified MMC versus others methods are displayed in Tables 12 and 13. It can be observed that for the instances BB11035 and BB20030, the first version of the proposed algorithm achieves the best results. In addition, the second best result is also obtained for the instance BBS12002. In contrast, the first version is not suitable
123
R. A. Mora-Gutiérrez et al. Table 6 Runtime utilized by the first version in seconds
Table 7 Results of the Q score generated by the second version
123
Instance
Best
Worst
Average
Variance
BB11002
313.9
484.06
360.5365
1,851.797603
BB11007
2,499.4
4,238.6
3,366.73
254,734.2127
BB11020
456.53
512.06
493.259
179.1435042
BB11030
1,416.8
5,013.7
3,183.195
1,675,495.845
BB11035
382.51
396.02
389.6675
19.95018816
BBS11016
1,567.1
1,791.5
1,684.025
4,907.1925
BBS11030
1,218.9
2,416.9
1,876.025
111,694.3325
BBS11034
594.52
1,304.9
699.7695
35,016.67854
BB12003
162.26
176.97
170.69
10.83497895
BB12005
454
1,362.3
1,070.975
162,093.0372
BB12010
805.23
1,114.8
960.355
7,457.000279
BB12023
1,347.6
1,901.8
1,580.46
21,266.38568
BBS12002
267.13
557.1
311.1945
5,974.470205
BBS12008
2,089.7
2,863.7
2,246.515
31,406.94239
BBS12013
1,100.2
1,223.1
1,165.505
954.9215526
BBS12020
144.4
251.26
181.838
796.0189326
BBS12031
5,697.1
7,970.8
6,678.09
456,173.2546
BBS12037
1,741.3
1,914.4
1,842.255
2,284.295237
BB20002
5,748.6
7,747.1
6,508.395
282,848.7679
BB20015
10,699
17,723
13,475.4
3,499,357.832
BB20023
3,604.4
3,982.3
3,771.375
11,710.97461
BB20030
4,721.2
9,648.6
5,792.68
1,755,347.554
BBS20027
3,421.2
3,684.5
3,553.41
5,753.997789
BB30020
10,516
12,600
11,203.95
359,667.7342
BB40014
897.21
1,973.9
1,213.533
135,141.8266
BBS50016
2,171.3
2,586.2
2,409.545
8,041.012079
Instance
Best
Worst
Average
Variance
BB11002
0.824
0.208
0.3322
0.020416379
BB11007
0.513
0.324
0.43
0.001981368
BB11020
0.564
0.401
0.4936
0.002240674
BB11030
0.269
0.12
0.2009
0.001997989
BB11035
0.497
0.316
0.428
0.002609368
BBS11016
0.557
0.419
0.47175
0.001737987
BBS11030
0.332
0.178
0.26335
0.001908661
BBS11034
0.497
0.393
0.45435
0.000977818
BB12003
0.927
0.823
0.873
0.000863158
BB12005
0.899
0.802
0.84655
0.000689734
BB12010
0.866
0.736
0.7996
0.000926358
BB12023
0.812
0.733
0.7724
0.000480779
Adaptation of the method of musical composition Table 8 Results of the modeler score generated by the second version
Table 9 Results of the Cline et al., shift score generated by the second version
Table 10 Results of the TC score generated by the second version
Instance
Best
Worst
Average
Variance
BB11002
0.225
0.0532
0.09659
0.00150525
BB11007
0.204
0.131
0.1732
0.000308905
BB11020
0.229
0.164
0.2032
0.000320695
BB11030
0.0822
0.0379
0.0617555
0.000196687
BB11035
0.251
0.156
0.2117
0.000702326
BBS11016
0.219
0.165
0.18735
0.000260555
BBS11030
0.149
0.0828
0.120615
0.00038835
BBS11034
0.175
0.135
0.1581
0.000123568
BB12003
0.572
0.465
0.53735
0.000591503
BB12005
0.645
0.06
0.5827
0.015476537
BB12010
0.585
0.497
0.5389
0.000421884
BB12023
0.629
0.574
0.60175
0.000224724
Instance
Best
Worst
Average
Variance
BB11002
0.345
0.0619
0.13246
0.004150163
BB11007
0.33
0.214
0.28075
0.000848724
BB11020
0.342
0.26
0.3101
0.000601042
BB11030
0.135
0.0275
0.09407
0.00094117
BB11035
0.362
0.204
0.3095
0.001707211
BBS11016
0.342
0.259
0.294
0.000614947
BBS11030
0.256
0.1
0.1877
0.001975379
BBS11034
0.278
0.215
0.2483
0.000309274
BB12003
0.7171
0.659
0.687355
0.000289078
BB12005
0.774
0.705
0.73815
0.000352239
BB12010
0.721
0.622
0.67115
0.000472661
BB12023
0.728
0.663
0.7013
0.000317589
Instance
Best
Worst
Average
Variance
BB11002
0.294
0
0.015473684
0.004549263
BB11007
0.239
0.0141
0.12573
0.004084839
BB11020
0.347
0
0.23603
0.010619503
BB11030
0
0
0
0
BB11035
0.378
0.135
0.2511
0.003844516
BBS11016
0.218
0
0.12221
0.003890402
BBS11030
0.0435
0
0.00435
0.000179266
BBS11034
0.19
0.02
0.1165
0.002213421
BB12003
0.811
0.568
0.69355
0.004318366
BB12005
0.723
0.467
0.57545
0.004254787
BB12010
0.689
0.471
0.5431
0.003552726
BB12023
0.675
0.579
0.6193
0.000868747
123
R. A. Mora-Gutiérrez et al. Table 11 Runtime utilized by the second version in seconds
Instance
Best
Worst
Average
Variance
BB11002
321.24
370.96
346.0775
155.6638303
BB11007
769.58
820.27
796.3
199.8484526
BB11020
686.28
876.13
760.7175
3,068.49122
BB11030
1,993.2
2,186.8
2,073.715
2,529.141342
BB11035
130.43
146.93
139.218
28.25126947
BBS11016
925.96
1,045.5
964.105
1,537.081405
BBS11030
759.49
1,704.5
862.6725
80,630.9285
BBS11034
946.64
1,165.4
1,006.5235
3,161.136803
BB12003
154.26
162.59
158.262
5.953690526
BB12005
418.24
440.19
425.6415
34.42149763
BB12010
453.55
474.09
462.7695
30.39158395
BB12023
1,427.2
1,878.4
1,600.485
15,196.08555
for instances BBS12008, BBS12037, BB20015 and BB30020, where the worst results are obtained. However, it is important to stress that the average performance of the first version is better than those corresponding to the MULTIALIN, POAV and SAM. On the other hand, the second version generates the best solution for BBS11016, BBS11034 and BB12005, as well as the second best result on BB11002, BB11007, BB1120, BB11035, BB12003, BB12005, BB12010, BB12023, in comparison with the rest of the considered methods. Moreover, the average behavior of the best than the average behavior of the other methods. In addition, the results of Wilcoxon rank-sum test for both versions are shown in Tables 14 and 15, respectively. It can been seen in Table 14, that the first version of the modified MMC has a behaviour similar to CLUSTAL W, DIALIGN-T, DIALIGN-TX, ALIGN, PRANK, SAM and AMAP. Based on results of Table 15, it can be deduced that the second version of the MMC has a behavior different from CLUSTAL W, MULTAIN, POAV, PRANK, SAM and first version of MMC. 5 Conclusion In this paper, a new cultural method for the solution of the multiple sequence alignment problem was proposed. The method is an adaptation of a recently proposed metaheuristic referred as the Method of Musical Composition, which uses an artificial society where interactions between agents cause changes of the features of their individual artwork. In this work, two versions of the MMC are presented. The main difference between both versions is the objective function used to guide the search, however the numerical results showed that both versions are suitable to solve the MSA problem, although their behaviour is significantly different. The complexity of both versions of our algorithm is the order of O(N 2 ∗ L 2 ), which is similar to the complexity of other methods used in the literature.
123
Adaptation of the method of musical composition Table 12 Normalized results generated by the comparison of several heuristics on test instances
1
BB11002
BB11007
BB11020
1, 3, 4 to 23
11, 13, 22
11, 13, 22
2, 24
9, 24
10, 15, 24
0.9 0.8 0.7 0.6
20, 23
0.5
7, 19, 20
9, 14, 16
0.4
2, 23
4, 5, 7, 8, 12, 18
0.3
25
0.2 0.1
3, 10, 14, 15, 16, 18
2, 17, 19, 21
1, 25
3, 6
4, 5, 8.12, 17, 21
25
3
6
1
BB11030
BB11035
BBS11016
1
11, 13, 22
11, 13, 22
22
0.9
6, 20, 23, 24
20
9, 10
0.8
2
5, 8, 10
24
0.7
3, 9, 10, 16, 19
1, 2
11, 20
0.6
7
0.5
1, 15
15, 16
13, 16, 19
0.4
17, 18
4, 6, 7, 14, 23
3, 6, 18
0.3
4, 12, 14, 25
9, 12, 17, 18, 19
1, 5, 8, 12, 14, 15, 17
0.2
21
0
7, 23
21
4, 21
3, 25
2
5, 8
24
25
BBS11030
BBS11034
BB12003
1
22
22
13
0.9
24
24
11
0.8
7, 16, 20
10
0.7
11, 19
7, 9, 11, 15, 19, 20, 23
0.6
2, 23
3, 16
7, 20
0.5
6, 9, 13, 17, 18
1, 13
2, 24
0.4
3, 25
4, 6, 14, 17, 18, 21
0.3
10
2, 5, 8, 12
0.2
4, 5, 8, 12, 14, 21
3, 6, 9, 10, 16, 18, 22
0.1 0
0.1
15
0
1
1, 4, 5, 8, 14, 15, 19, 25 25
12, 17, 21, 23
123
R. A. Mora-Gutiérrez et al. Table 12 continued BB12005
BB12010
BB12023
13
13
13
0.9
11
11, 24
0.8
22, 24
1
0.7
11
0.6 0.5
2, 20, 22
0.4
9, 10, 16, 18
2
0.3
3, 4, 6, 7, 15, 19, 23
7, 19, 20, 23
0.2
5, 8, 12, 14, 17, 21, 24
4, 5, 6, 7, 8, 9, 10, 12, 15, 16, 19, 20, 23
1, 25
1, 21
1
BBS12002
BBS12008
BBS12013
1
13
11, 13, 24
13
0.9
22
0.8
11
0.1 0
2, 9, 10, 22 3, 4, 5, 6, 8, 12, 14, 15, 16, 17, 18, 21, 25
3, 14, 17, 18, 25
11
0.7 0.6
2
0.5
1
0.4
23
0.3
4, 5, 8, 9, 12, 17, 19, 20, 21
0.2
3, 6, 7, 10, 14, 15, 16, 18, 24
2, 9, 10 3, 7, 16, 20, 23
20, 24
4, 5, 6, 8, 14, 15, 17, 18, 19, 22
3, 5, 6, 7, 8, 9, 10, 14, 15, 16, 19
1
12, 21
1, 4,12, 17, 18, 21, 23
BBS12020
BBS12031
BBS12037
1
19, 22
11, 13
13, 24
0.9
24
0.1 0
11
0.8 0.7
24
0.6
13
0.5
11
0.4
2
0.3
5–10, 15, 16, 18, 20
2
9, 20
0.2
3, 4, 12, 14, 17, 21, 23
10, 20
3, 7, 23
3–9, 12, 14–19, 21–23
2, 5, 8, 15, 17, 18, 19
1
1, 4, 6, 12, 14, 21
0.1 0
1
22 10, 16
Where: 1 is ALIGNER, 2 is ALIGNER*, 3 is MUSCLE, 4 is TCOFFEE, 5 is MAFFT, 6 is PRRP, 7 is CLUSTALW, 8 is DIALIGN, 9 is DIALIGN-T, 10 is DIALIGN-TX, 11 is MULTALIN, 12 is PROBCONS, 13 is POAV, 14 is PSALIGN1, 15 is PSALIGN2, 16 is KALIGN, 17 is PROBALIGN, 18 is PCMA, 19 is ALIGN-M, 20 is PRANK, 21 is MUMMALS, 22 is SAM, 23 is AMAP, 24 is first-MMC, 25 is second-MMC
123
Adaptation of the method of musical composition Table 13 Normalized results generated by the comparison of several heuristics on test instances (continuation)
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
BB20002
BB20015
BB20023
3, 6, 7, 9, 10, 11, 13, 14, 16, 20, 21, 22, 23 1
3, 5, 7, 8, 12, 17, 19, 20, 22, 23, 24 4, 6, 9, 10, 11, 13, 14, 15, 16, 18, 21
11, 13
4, 5, 8 12, 15, 17, 18 19
9, 22 10, 15, 19, 24 14, 16, 17, 18, 20, 23 3, 5, 6, 8, 12 4, 7 21
24
2
1 2
2 1
BB20030
BBS20027
BB30020
1
11, 13
13
3, 6–9, 11, 13, 14, 19, 20–14
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
7, 9, 18 4, 6, 10, 14, 15, 16, 19, 13 3, 5, 8, 12, 17, 21, 22 20
11, 24 4, 10
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
1, 24
9, 19, 20 3, 10, 16, 18 7, 8, 22, 23 4, 5, 6, 14, 15, 21 12, 17 1, 2
BB40014
BBS50016
11, 13, 22
11, 13, 22 24
2
16, 17 12, 18 2, 5
1 15
24 6, 8, 16, 18, 20 3, 4, 5, 7, 9, 10, 12, 14, 15, 17, 19, 21, 23 2 1
7 8, 9, 10, 16 20, 23 6 2, 3, 5, 17, 18, 19 14, 15, 21 1, 4, 12
Where: 1 is ALIGNER, 2 is ALIGNER*, 3 is MUSCLE, 4 is TCOFFEE, 5 is MAFFT, 6 is PRRP, 7 is CLUSTALW, 8 is DIALIGN, 9 is DIALIGN-T, 10 is DIALIGN-TX, 11 is MULTALIN, 12 is PROBCONS, 13 is POAV, 14 is PSALIGN1, 15 is PSALIGN2, 16 is KALIGN, 17 is PROBALIGN, 18 is PCMA, 19 is ALIGN-M, 20 is PRANK, 21 is MUMMALS, 22 is SAM, 23 is AMAP, 24 is first-MMC, 25 is second-MMC
123
R. A. Mora-Gutiérrez et al. Table 14 Wilcoxon rank-sum test for the first version Mehtod
ALIGNER
ALIGNER*
MUSCLE
TCOFFEE
MAFFT
h
TRUE
TRUE
TRUE
TRUE
TRUE
Mehtod
PRRP
CLUSTALW
DIALIGN
DIALIGN-T
DIALIGN-TX
h
TRUE
FALSE
TRUE
FALSE
FALSE
Mehtod
MULTALIN
PROBCONS
POAV
PSALIGN1
PSALIGN2
h
TRUE
TRUE
TRUE
TRUE
TRUE
Mehtod
KALIGN
PROBALIGN
PCMA
ALIGN-M
PRANK
h
TRUE
TRUE
TRUE
FALSE
FALSE
Mehtod
MUMMALS
SAM
AMAP
h
TRUE
FALSE
FALSE
Table 15 Wilcoxon rank-sum test for the second version Mehtod
ALIGNER
ALIGNER*
MUSCLE
TCOFFEE
MAFFT
h
FALSE
FALSE
FALSE
FALSE
FALSE
Mehtod
PRRP
CLUSTALW
DIALIGN
DIALIGN-T
DIALIGN-TX
h
FALSE
TRUE
FALSE
FALSE
FALSE
Mehtod
MULTALIN
PROBCONS
POAV
PSALIGN1
PSALIGN2
h
TRUE
FALSE
TRUE
FALSE
FALSE
Mehtod
KALIGN
PROBALIGN
PCMA
ALIGN-M
PRANK
h
FALSE
FALSE
FALSE
FALSE
TRUE
Mehtod
MUMMALS
SAM
AMAP
first-MMC
h
FALSE
TRUE
FALSE
TRUE
The performance of the first version of the modified MMC was tested on twenty six instances taken from the BAliBASE 3 collection. The experimental results indicate that its behavior is better or similar than those obtained from other methods based on the Hidden Markov Model. In particular, according to the Wilcoxon ranksum test, this version has a behavior similar to that of CLUSTAL W, DIALIGNT, DIALIGN-TX, ALIGN, PRANK, SAM, and AMAP. On the other hand a second MMC adaptation, was evaluated on twelve instances taken from the BAliBASE 3 collection. The experimental results show than the second version has a good behavior, since for 25 % of the tested instances, the algorithm found the best alignment with respect to data of other state-of-the-art algorithms. Besides, for 50 % of the tested instances the second version of the modified MMC achieved the second best alignment. Finally, the significance of the numerical results was analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to ALIGNER, ALIGNER∗ , MUSCLE, TCOFFEE, MAFFT, PRRP, DIALIGN, DIALIGN-T, DIALIGN-TX, PROBCONS, PSALIGN1, PSALIGN2, KALIGN, PROBALIGN, PCMA, ALIGN-M, MUMMALS, and AMAP. The differences in the performance of both versions are attributed to the changes in the definition of the objective function used by each version. This aspect will be considered for the development of future works.
123
Adaptation of the method of musical composition
References 1. Altschul SF, Erickson BW (1986) Optimal sequence alignment using affine gap costs. Bull Math Biol 48(5–6):603–616 2. Birattari M (2009) Tuning metaheuristics: a machine learning perspective. Studies in computational intelligence, vol 197. Springer, Berlin 3. Bahr A, Thompson JD, Thierry JC, Poch O (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res 29(1):323–326 4. Baewicz J, Formanowicz P, Wojciechowski P (2009) Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark. Int J Appl Math Comput Sci 19(4):675– 678 5. Blazewicz J, Frohmberg W, Kierzynka M, Wojciechowski P (2013) G-MSAA GPU-based, fast and accurate algorithm for multiple sequence alignment. J Parallel Distrib Comput 73(1):32–41 6. Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16(22):10881–10890 7. Daugelait J, O’ Driscoll A, Sleator R (2013) An overview of multiple sequence alignments and cloud computing in bioinformatics. ISRN Biomath 2013:Article ID 615630 8. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S (2005) PROBCONS: probabilistic consistencybased multiple sequence alignment. Genome Res 15:330–340 9. Duret L, Abdeddaim S (2000) Multiple alignments for structural, functional or phylogenetic analyses of homologous sequences, Bioinformatics Sequence structure and databanks. Oxford University Press, Oxford 10. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797 11. Edgar RC, Serafim B (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–373 12. Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244 13. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):305966. doi:10.1093/nar/gkf436 14. Kelil A (2011) Contribution a` l’analyse des séquences de protéines similarité, clustering et alignement. PhD thesis. Université de Sherbrooke Faculté des sciences 15. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (1994) Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 235:1501–1531 16. Lassmann T, Sonnhammer ELL (2005) Kalignan accurate and fast multiple sequence alignment algorithm. BMC Bioinf 6:298 17. Lee ZL, Su SF, Chuang CC, Liu KH (2008) Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Appl Soft Comput 8(1):55–78. ISSN 1568–4946 18. Lee C, Grasso C, Sharlow MF (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18(3):452–464 19. Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA 102(30):10557–10562 20. Manthey B (2005) Non-approximability of weighted multiple sequence alignment for arbitrary metrics. Inf Process Lett 95(3):389–395 21. Mayers A, Monga E, Wang S (2014) ALIGNER detecting and aligning related protein sequences. ProspectUS. 16 February 2010. Check in 25 May 2014. Website http://prospectus.usherbrooke.ca/ aligner/Results/BALIBASE3.htm 22. Mora-Gutiérrez RA, Ramírez-Rodríguez J, ElizondoO-Cortes M (2011) Heurística para solucionar el problema de alineamiento múltiple de secuencias. Rev Mat [online] 18(1):121–136 23. Mora-Gutiérrez RA, Ramírez-Rodríguez J, Rincón-García EA (2012) An optimization algorithm inspired by musical composition. Artif Intell Rev 41(3):301–315 24. Mora-Gutiérrez RA, Ramírez-Rodríguez J, Rincón-García, Ponsich A, Herrera O (2012) An optimization algorithm inspired by social creativity systems. Computing 94(11):887–914 25. Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14:290–294 26. Notredame C, Higgins D, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217
123
R. A. Mora-Gutiérrez et al. 27. Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3(1):131–144 28. Nuin PAS, Wang ZZ, Elisabeth RM (2006) The accuracy of several multiple sequence alignment programs for proteins. Bioinformatics 7:471–489 29. Prakash Lingam KM, Chandrakala S (2011) A survey on recent developments in multiple sequence alignment methods. J Nat Sci Biol Med 2:96–97 30. Pei J, Sadreyev R, Grishin NV (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19(3):427–428 31. Pei J, Grishin NV (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34(16):4364–4374 32. Roshan U, Livesay DR (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22(22):2715–2721 33. Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B (2005) Dialign-t: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinf 6:66. doi:10.1186/ 1471-2105-6-66 34. Subramanian AR, Kaufmann M, Morgenstern B (2008) DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol 3:6 35. Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23(2):e24– e29 36. Sze S-H, Lu Y, Yang Q (2006) A polynomial time solvable formulation of multiple sequence alignment. J Comput Biol 13:309–319 [Also appear in Proceedings of the 9th annual international conference on research in computational molecular biology (RECOMB’2005). Lecture notes in bioinformatics, vol 3500, pp 204–216] 37. Thompson J, Higgins D, Gibson T (1994) ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4690 38. Thompson JD, Ripp O (2014) BAliBASE 3 website of the LBGI Bioinformatique et Génomique Intégratives. Web 15 April 2014. http://lbgi.fr/balibase/ 39. Thompson JD, Koehl P, Ripp R, Poch O (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins Struct Funct Bioinf 61(1):127–136 40. Wisconsin Package v. 8, Genetics Computer Group, Madison, WI. http://www.gcg.com. Accessed 7 Aug 2014 41. Wojciechowski P, Formanowicz P, Blazewicz J (2014) Reference alignment based methods for quality evaluation of multiple sequence alignment—a survey. Curr Bioinf 9(1):44–56 42. Van Walle I, Lasters I, Wyns L (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 20(9):1428–1435
123