Adaptation of the method of musical composition for solving the multiple sequence alignment problem

The multiple sequence alignment (MSA) problem has become relevant to several areas in bioinformatics from finding sequences family, detecting structur...

0 downloads 13 Views 336KB Size

Download PDF

Computing DOI 10.1007/s00607-014-0436-3

Adaptation of the method of musical composition for solving the multiple sequence alignment problem Roman Anselmo Mora-Gutiérrez · María E. Lárraga-Ramírez · Eric A. Rincón-García · Antonin Ponsich · Javier Ramírez-Rodríguez

Received: 10 May 2014 / Accepted: 2 December 2014 © Springer-Verlag Wien 2014

Abstract The multiple sequence alignment (MSA) problem has become relevant to several areas in bioinformatics from finding sequences family, detecting structural homologies of protein/DNA sequences, determining functions of protein/DNA sequences to predict patients diseases by comparing DNAs of patients in disease discovery, etc. The MSA is a NP-hard problem. In this paper, two new methods based on a cultural algorithm, namely the method of musical composition, for the solution of the MSA problem are introduced. The performance of the first and second versions were evaluated and analyzed on 26 and 12 different benchmark alignments, respectively. Test instances were taken from BAliBASE 3.0. Alignment accuracies are computed using the QSCORE program, which is a quality scoring program that compares two multiple sequence alignments. Numerical results on the tackled instances indicate that the performance levels of the proposed versions of the MMC are promising. In particular, the experimental results show that the second version found the best alignment reported in the specialized literature in 25 % of the tested instances. Besides, for 50 % of

R. A. Mora-Gutiérrez (B) · E. A. Rincón-García · A. Ponsich · J. Ramírez-Rodríguez Departamento de Sistemas, Universidad Autónoma Metropolitana Azcapotzalco, Av. San Pablo 180, Col Reynosa Tamaulipas, C.P. 02200 Mexico, DF, Mexico e-mail: [email protected] E. A. Rincón-García e-mail: [email protected] A. Ponsich e-mail: [email protected] J. Ramírez-Rodríguez e-mail: [email protected] M. E. Lárraga-Ramírez Instituto de Ingeniería, Universidad Nacional Autónoma de México, C.P. 04360 Mexico, DF, Mexico e-mail: [email protected]

123

R. A. Mora-Gutiérrez et al.

the tested instances, the second version achieved the second best alignment. Finally, the significance of the numerical results were analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to some state-of-the-art techniques for the MSA problem. Keywords

Social algorithms · Cultural algorithms · Socio-cultural creativity

Mathematics Subject Classification

90C59

1 Introduction Multiple sequence alignment (MSA) is the most natural way to see the similarity of several sequences by making an alignment between the primary sequences so that identical or similar residues will be aligned in columns. The idea behind MSA is to compare three or more sequences and come up with some alignment which would get the highest score under the given conditions (i.e. the scoring scheme used). Similarity between sequences, can be inferred from their scores and alignments. This information can in turn be used to predict properties of some given biological sequence. Thus, a multiple sequence alignment arranges protein sequences into a rectangular array with the goal that residuals in a given column are homologous (derived from a single position in an ancestral sequence), superposable (in a rigid local structural alignment) or play a common functional role [11,22]. Although these three criteria are essentially equivalent for closely related proteins, sequence, structure and function diverge over evolutionary time and different criteria may result in different alignments. The formal definition of MSA is presented as below. Let Σ be an alphabet and Σ = Σ ∪ {−}, where − ∈ Σ denotes the gap. Let S be a sequence of length l over Σ, then S = (s1 , s2 , . . . , sl ) with sk ∈ Σ. Let S = {S1 , S2 , . . . S N } be a multiset of sequences. An alignment of S is a multiset A = { S¯1 , S¯2 , . . . S¯ N } of sequences on Σ, such that all S¯i are of equal length l A and S¯i is obtained from Si only by insertion of gaps [20]. Let d : Σ × Σ → R be a scoring function. Without loss of generality, the MSA optimization problem consists in finding an alignment A such that d( S¯i , S¯ j ). Generally, d( S¯i , S¯ j ) determines the degree of difference or min (i, j)∈A 1≤i< j≤N

similarity between i-th and j-th sequences in A, however there are a great variety of score functions reported in the literature (see [20]). Manually refined alignments continue to be superior to purely automated methods. Computing exact MSAs is computationally impossible because of the NP-hard nature of the problem, in practice heuristics are used to align sequences, by maximizing their similarity. Consequently, the number of available MSA methods has steadily increased over the last 20 years from multi-dimensional dynamic programming, iterative methods, via progressive, tree-based programs to the more recent methods combining several complementary algorithms and/or 3D structural information [37]. Some of these methods are reviewed in [7,9,27,29]. Assembling a suitable MSA is not, however, a trivial task, and none of the existing methods already managed to deliver biologically perfect MSAs. Thus, the development of new methods aiming at improving the speed and memory usage to accommodate the rapid increase in available sequence data has become a continuous necessity.

123

Adaptation of the method of musical composition

This work presents two adaptations of the MMC, which mimics a sociocultural creativity system involved in the musical composition process, for solving of the MSA. Cultural algorithms are guided by changes in the social environment, where agents interact among themselves and the environment. These interactions induce a learning process that leads individuals to adapt to their current environment faster than biological evolution. In particular, the MMC is based on the working mode of artificial societies that use a dynamic and creative system to compose music, and mimics the creative process of musical composition by agents that can create and exchange artworks. The main difference between the two versions of the MMC proposed in this paper is the objective function. The performance of the first version was evaluated on 26 different benchmark alignments, while, the second version was evaluated on 12 benchmark alignments. The test instances were drawn from BAliBASE 3.0 [3]. Alignment accuracies are computed using the QSCORE program, which is a quality scoring program that compares two multiple sequence alignments: an alignment to be evaluated (the “test” alignment) and a second alignment that is believed to be correct (the “reference” alignment). The QSCORE program outputs four scores: the PREFAB Q score (the Balibase SPS score or the Developer score), the modeler score, the Cline et al,. shift score, and the Balibase TC (total column) score. The obtained numerical results of our methods were compared to those resulting from other recent metaheuristics. Numerical results on the tackled instances indicate that the performance of both versions of the modified MMC are promising, though their behavior is different. In particular, the experimental results show than the second version has a good behavior, since for 25 % of the tested instances, the algorithm found the best alignment reported in specialized literature. Besides, for 50 % of the tested instances the second version of the modified MMC achieved the second best alignment. Finally, the significance of the numerical results was analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to ALIGNER, ALIGNER∗ , MUSCLE, TCOFFEE, MAFFT, PRRP, DIALIGN, DIALIGN-T, DIALIGN-TX, PROBCONS, PSALIGN1, PSALIGN2, KALIGN, PROBALIGN, PCMA, ALIGNM, MUMMALS, and AMAP. The remainder of the paper is organized as follows. In Sect. 2 the two versions of the MMC are defined. A description of the methodology used for the computational experiments is presented in Sect. 3. The analysis and discussion of the associated results are provided in Sect. 4. Finally, some conclusions and possible paths for future research are presented in Sect. 5.

2 Adapting the MMC algorithm to solve the MSA problem 2.1 Description of the MMC The MMC is an algorithm that mimics a creativity system involved in the musical composition process. The design of the model takes into account that musical composition can be viewed as an algorithm that is developed within a creative system, where a composer learns from his own experience and from others composers. Inter-

123

R. A. Mora-Gutiérrez et al.

actions between composers then involves the formation of social networks. The MMC is grounded on an agent-based model of a specific social network, which is made up of a set of Nc vertices or agents (hereafter called composers) and a set E of edges or links (which represents relationships between composers). Each composer has foreknowledge and a set of mechanisms and policies for interaction, whereby can communicate and exchange information with other composers. The foreknowledge is defined as a set of solutions or “tunes”, and each tune is a solution. A brief description of the MMC is given in the Algorithm 1. Algorithm 1: The basic MMC algorithm 1 2

3 4

5

6 7 8

Input: The operating parameters of the MMC and data about the instance to solve. Output: The best solutions generated by the composers society. Generate a social network with Nc agents (composers) and a set of links (or edges), which is generated in a random way. For each of the Nc agents, a set of solutions (artworks) is randomly generated. Each generated set, will be referred as to “scoring matrix”, representing the foreknowledge of an agent, and it will be denoted by P∗,∗,i while the stopping criterion is not fulfilled, the following process is carried out for each agent i, where i = 1, 2, . . . Nc do Exchange information between agents. An agent i will exchange information with any agent j subject to i = j and there is an edge joining them, and the worst solution in P∗,∗, j is better than the worst solution in P∗,∗,i , that is, tune j,wor st > tunei,wor st . Generate a new solution (tune) for the i − th agent. Based on the selected information in the previous step, the i − th agent builds his/her acquired knowledge as a matrix I SC∗,∗,i . Then, the i − th agent creates his/her background knowledge, K M∗,∗,i , which is formed from a concatenation of P∗,∗,i and I SC∗,∗,i , from which a new solution (tune), xi,new will be generated. Update the solution of the i − th agent. Then, the i − th agent (composer) will decide if tunei,wor st should be replaced by xi,new based on a quality criterion. end The final best solution of each composer is included in the set of solutions.

For more detailed information about the MMC, we refer the reader to [22–24]. 2.2 Adapting the MMC to solve the MSA problem In order to adapt the MMC to solve the MSA problem, the following aspects were considered: (a) a feasible alignment is easily generated through a binary representation of the MSA; (b) in order to solve the MSA problem, dynamic programming and a divide and conquer strategy are combined; (c) the overall approach relies on the definition of an objective function (OF) describing the quality of the MSA. In the first version of our method, the objective function is defined as Euclidean distance, which relations the coffee function, the SP-score function and the gap penalty function. On the other hand, the OF defined in the second version is based on a multiple linear regression, which combines the coffee function, the SP-score, the gap penalty function, the number of aligned sequences, the average of the lengths of the N aligned sequences, their length variance and Euclidean distance. The general structure of both versions is presented in Algorithm 2. N lk is the sum of the length The lmax is the length of the longer sequence, while k=1 of all the sequences, L is the length of the j-th alignment generated by the i-th com-

123

Adaptation of the method of musical composition

Algorithm 2: MMC adaptation to solve MSA

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

Input: MMC parameters and data about the instance to solve and substitution matrix (in this paper the Blosum 62 matrix is used). Output: The best alignment found by the society of composers. Generate a library of the alignments, in which an alignment is included for each pair of sequences (see Algorithm 3 ) Determine the score for each pair of sequences (scor e(A)) based on substitution matrix (see Algorithm 3 ) Change all negative elements in scor e(A) to positive (this step was only used in the second version, see Algorithm 4) Create an artificial society with specific interaction rules among agents. for every agent-composer in the society do for j = 1 : Ns do N L i, j ← lmax + rand ∗ ( k=1 lk − lmax ) Randomly generate a feasible alignment, denoted as tunei, j, (see Algorithm 5) Evaluate the objective function of tune tunei, j, (see Algorithm 8 for the first version and Algorithm 9 for the second version ) end Register Ns created tunes in the P,,i of the composer i end repeat Update links in the artificial society. Exchange information among agents. for each agent-composer in the society do Generate the knowledge matrix (K Mi, j ) of the i − th composer Evaluate the fitness of all tunes in the K Mi, j (see Algorithm 10) Generate and evaluate a new tune tunei,new for the i − th composer (see Algorithm 12) Determine the tune with the worst value of objective function in P,,i (tunei,wor st ) Update P,,i end Build the new set of solutions with the best solution found for each composer . until The termination criterion is satisfied; Select the best solution found in the society.

poser; the objective function and coffee function are calculated through Algorithms 8 and 9 for the first and the second version of the MMC, respectively. The steps of Algorithm 2 can be divided into nine phases (a) initialization of the algorithm (from input to step 3), (b) generation of an artificial society (step 4), (c) generation and evaluation of a set of Ns initial solutions for each composer in the society (from step 5 to step 12), (d) information exchange between composers (from step 14 to 15), (e) generation and evaluation of a new tune for each composer in the society (steps 17–19) (f) updating the foreknowledge matrix of each composer, P,,i , (steps 20–21), (g) construction of the set of solutions found by composers (step 23), (h) stopping criterion test (steps 13–24); (i) selection of the best solution based on the objective function (step 25). These phases are presented in more detail in the following subsections. 2.2.1 Initialization of the algorithm This phase includes the introduction and determination of the information, used in the subsequent phases of the algorithm. Initially, the problem characteristics and algorithm

123

R. A. Mora-Gutiérrez et al.

operating parameters are set as an input of the MMC algorithm. Then, the library of the alignments and the corresponding scor e(A) are determined by Algorithm 3. Algorithm 3: Generating an alignment library

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

Input: N sequences to align and Scoring Matrix (Blosum 62) Output: Library of the alignments (Librar y), cost of pairwise sequences aligned scor e(A), index matrices, (index) and (index − f ), indicating the starting and the ending points for each sequence of the alignment, respectively Librar y ← ∅ index ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do pair wisealigned ← ∅ s1 ← a − th sequence s2 ← b − th sequence Generate a local alignment between s1 and s2 Let M be the most similar region(s) between s1 and s2 sequences, determined in the previous step indexa,b ← starting of the local alignment in the a − th sequence indexb,a ← starting of the local alignment in the b − th sequence index − f a,b ← ending of the local alignment in the a − th sequence indexb,a ← ending point of the local alignment in the b − th sequence costlocal the cost associated to the local alignment sl−1 is the left subsequence in s1 not included in M sl−2 is the left subsequence in s2 not included in M sr −1 is the right subsequence in s1 not included in M slr −2 is the right subsequence in s2 not included in M Generate a global alignment between sl−1 and sl−2 Ml is the global alignment between sl−1 and sl−2 sequences, which was determined in the previous step. costglobal−l is the cost associated to the global alignment between sl−1 and sl−1 . Generate a global alignment between sr −1 and sr −2 Mr is the global alignment between sr −1 and sr −2 sequences, which was determined in the previous step. costglobal−r is the cost associated to the global alignment between sr −1 and sr −1 pair wisealigned ← Ml ∪ M ∪ Mr scor e(A)i, j ← costlocal + costglobal−l + costglobal−r end Librar y ← Librar y ∪ pair wisealigned end

For the second version of the MMC, all elements of scor e(A) are converted to positive numbers, through Algorithm 4. 2.2.2 Initial solutions generation In this phase, the set of initial solutions (P) is build in a random way through Algorithm 5. The j-th solution of the i-th composer is denoted by tunei, j, . In order to facilitate the management of information, tunei, j, is associated with two representasymbolic binar y symbolic , and the binary tune, tunei, j, . tunei, j, tions: the symbolic tune, tunei, j, is a feasible alignment of the set of sequences s1 = [c(1,1) , c(1,2) , . . . , c(1,|s1 |) ], s2 = [c(2,1) , c(2,2) , . . . , c(2,|s2 |) ], . . ., s N = [c(N ,1) , c(N ,2) , . . . , c(N ,|s N |) ]. The gensymbolic eral structure of tunei, j, is shown in Eq. (1).

123

Adaptation of the method of musical composition

Algorithm 4: Transformation scor e(A) 1 2 3 4 5 6 7

Input: scor e(A) Output: scor e(A) updates a is the minimum element in scor e(A) a ← abs(a) for a = 1 : N − 1 do for b = a + 1 : N do scor e(A)i, j ← scor e(A)i, j + a + 1 end end

⎡

symbolic

tunei, j,

s1,1 s1,2 ⎢ s ⎢ 2,1 s2,2 =⎢ .. ⎢ .. ⎣ . . s N ,1 s N ,2

. . . s1,L i, j . . . s2,L i, j .. ... . . . . s N ,L i, j

⎤ ⎥ ⎥ ⎥ ⎥ ⎦

(1)

where sa is generated by insertion of gaps in sa , L i, j = |sa | for all a = 1, . . . , N binar y On the other hand, tunei, j, is a binary matrix with N rows and L i, j columns. , of tune element of the tune If the sa,l is a character, then the ψa,l i, j, i, j, is 1, else ψa,l is 0. Since, each tunei, j, is generated from a feasible alignment, then it satisfies both the set of Eq. (2) and the set of inequalities (3). The set of Eq. (2) involves the a-th row of tunei, j, , which contains all elements of the l-th sequence; while the set of inequalities (3) implies that the l-th column in the alignment must contain at least one element with value 1. symbolic

binar y

L i, j

ψa,l = |sa | ∀a = 1, . . . , N

(2)

ψa,l ≥ 1 ∀l = 1, . . . , L i, j

(3)

l=1 N a=1

Initially, each tunei, j, is randomly generated based on Algorithm 5. In row 40 of Algorithm 7, the sequence sb is realigned based on a strategy that combines dynamic programming and divide and conquer methods. The realigning strategy is shown in Algorithm 6. tunei, j, is made possible with Algorithm 7, where Eq. (2) and inequalities (3) are implemented. Besides, the value of L is also computed again in this algorithm, in such symbolic a way that it is equal to the length of the feasible tune tunei, j, . Then, tunei, j, is evaluated through Algorithms 8 and 9, for the first and second versions of the MMC method, respectively. A combination of objective functions is used in each version of the adapted MMC. The objective functions adopted in this work have been used in several previous papers (see [1,17,26]).

123

R. A. Mora-Gutiérrez et al.

Algorithm 5: Generate the set of initial solutions Input: Set of N alignment sequences , Nc , Ns , index,index − f , L, i and j symbolic symbolic binar y Output: P,,i (which is a set integrate to tunei, j, ) and P,,i (which is a set integrate to binar y

tunei, j, ) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

P symbolic ← ∅ P binar y ← ∅ binar y

tunei, j,

for a = 1 : N do a3 ← randomly selected value within the index in the a − th column a4 ← |sa3 | − a3 a5 ← L − a2 a6 ← a2 a7 ← 1 for l = 1 : L do if l < a2 then a p1 ← a 3 6 if rand < p1 then symbolic tunei, j,a,l is the a7 − th character in the a − th sequence binar y

tunei, j,a,l ← 1 a7 ← a8 + 1 a3 ← a3 − 1 else symbolic tunei, j,a,l ← −

18 19 20 21 22

binar y

tunei, j,a,l ← 0 end a6 ← a6 − 1 else a p2 ← a 4 5 if rand < p2 then symbolic tunei, j,a,l is the a7 − th character in the a − th sequence

23 24 25 26 27 28 29

binar y

tunei, j,a,l ← 1 a7 ← a8 + 1 a4 ← a4 − 1 else symbolic tunei, j,a,l ← −

30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

←∅

symbolic tunei, j, ←∅ indexa ← maximum value in index indexb ← maximum value in index − f a1 ← L − (indexa + indexb ) a2 ← indexa + r ound(rand ∗ (a1 ))

binar y

tunei, j,a,l ← 0 end a5 ← a5 − 1 end end Use dynamic programming for realigning the sb sequence in tunei, j, (see Algorithm 6) symbolic

Make possible tunei, j,

binar y

and tunei, j,

symbolic Include tunei, j, in Pi, j symbolic binar y binar y Include tunei, j, in Pi, j

end

123

through Algorithm 7

Adaptation of the method of musical composition

Algorithm 6: Strategy for realigning the sb − th sequence symbolic

binar y , tunei, j, and the set of N aligning sequences symbolic binar y Output: tunei, j, realigned and tunei, j, realigned symbolic 1 Randomly select a sequence of tune i, j, 2 sa ← the sequence selected in the previous step

Input: tunei, j,

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

Randomly select a sequence of the set of N aligning sequences, different from the sa − th sequence sb ← the sequence selected in the previous step Generate a local alignment between sa and sb M ← the most similar region(s) between sa and sb sequences, determined in the previous step Mb ← the local alignment of the sb − th sequence sl−a ← the left subsequence in sa not include in M sl−b ← the left subsequence in sb not include in M sr−a ← the right subsequence in sa not include in M sr −b ← the right subsequence in sb not include in M and s Generate a global alignment between sl−a l−b Ml ← the global alignment between sl−a and sl−b sequences, determined in the previous step Mlb ← the global alignment of the sl−b − th sequence Generate a global alignment between sr−a and sr −b sequences, determined in the previous step Mr global alignment between sr−a and sr −b sequences, determined in the previous step. Mrb ← the global alignment of the sr −b − th sequence sb ← Mlb ∪ Mb ∪ Mrb for l = 1 : L do is a character then if sb,l −binar y

sb,l ←1 else −binar y sb,l ←0 end end

by sb binar y −binar y 27 Replace b − th row of the tune i, j, by sb 26

symbolic

Replace b − th row of the tunei, j,

The objective function applied in the first version is based on the Euclidean distance that combines three functions: the coffee function, the SP-score based on exact similitude, and a modification of the gap penalty function. On the other hand, the objective function applied in the second version is based on a multiple linear regression model. In order to generate the linear regression model, the coffee function, the SP-score based in exact similitude, the gap penalty function, the Euclidean distance, the number of aligned sequences, the average of the lengths of the N aligned sequences, and the variance of the lengths of the N aligned sequences, are used as independent variables. In this model βg (∀g = 0, . . . , 7) are the coefficients computed. 2.2.3 Communications among composers In this phase, composers exchange information using a specific interaction policy, defined in this work as: “composer i learns from composer k, if there is a link between them and if the artwork of composer k has more desirable characteristics than the

123

R. A. Mora-Gutiérrez et al.

symbolic

Algorithm 7: Make possible tunei, j, symbolic

Input: tunei, j,

binar y

,tunei, j, , N and L i, j

symbolic

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

binar y

and tunei, j,

binar y

Output: tunei, j, feasible and tunei, j, feasible for a = 1 : N do L a1 ← l=1 ψa,l if a1 = lsa then binar y Include or delete ones tunei, j, , while a1 = lsa end end binar y

Delete columns of tunei, j, , which do not satisfy inequality (3) L i, j is the length of the feasible tunei, j, symbolic

tunei, j, ←∅ for a = 1 : N do a2 ← 1 for l = 1 : L do binar y if tunei, j,a,l = 1 then

symbolic

In tune tunei, j,a,l put the a2 − th character of the a − th sequence a2 ← a2 + 1 else symbolic tunei, j,a,l ← − end end end

artwork of composer i”. These desirable characteristics are simply evaluated in terms of the objective function. This operating mode involves the three following steps: (a) the i-th composer decides if changing his links with other composer in the society (links between composers were updated trough the method presented in [24]), (b) composer i determines which of his tunes is the least desirable, this tune is denoted as tunei,wor st , (c) the i-th composer compares the objective function of his tunei,wor st versus the objective function of the tunek,wor st of the k-th composer, if there is a link between both composers, (d) if tunek,wor st is better than tunei,wor st then the i-th composer randomly selects a tune of the k-th composer and incorporates the selected tune to his acquired knowledge (I SC,,i ).

2.2.4 Generating and evaluating a new tune In this phase, each composer creates a new tune based on his knowledge (K Mi ). This phase is integrated by two sub-phases, which determine useful information and generate a new solution. The first sub-phase starts with the building of the K Mi , in such a way that the i-th composer combines his/her P,,i and his/her I SC,,i . Then, the i-th composer determines the fitness of the tunes in his K Mi (denoted by f itness(K Mi )) through

123

Adaptation of the method of musical composition

Algorithm 8: Evaluating solutions for the first version of the MMC symbolic

1 2 3 4 5 6 7

Take the sa − th sequence of tunei, j,

symbolic

8

symbolic Take the sb − th sequence of tunei, j, Librar ya is the a − th sequence aligned with the b − th sequence, which is included in

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

binar y

Input: tunei, j, feasible, tunei, j, feasible, L i, j , scor e(A), N , L i, j and Librar y Output: value of objective function i, j , value of coffee function i, j Objective function i, j ← ∅ Coffee function i, j ← ∅ Exact similarity function i, j ← ∅ density function i, j ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do Υ ←∅

Librar y Librar yb is the b − th sequence aligned with the a − th sequence, which is included in Librar y for l = 1 : L i, j do = − ) or (s = − ) then if (sa,l b,l Υ1,l = sa,l Υ2,l = sb,l end end is the number of columns of Υ L a,b ιa,b ← scor e(A)a,b ∗ L a,b Wa,b ← 0 ωa,b ← 0 for l = 1 : L a,b do if (Υ1,l = Librar ya,l ) or (Υ2,l = Librar yb,l ) then Wa,b = Wa,b + 1 end if (Υ1,l = Υ2,l ) then ωa,b = ωa,b + 1 end end end end N −1 N

b=a+1 Wa,b ∗scor e(A)a,b b=a+1 L a,b∗scor e(a,b)∗scor e(A)a,b N −1 N b=a+1 ωa,b a=1 31 Exact similarity function i, j ← N −1 N L a,b b=a+1 ⎥ ⎢ a=1 ⎢ N binar y ⎥ ⎢ l=a tunei, j,a,l ⎥ ⎦ ⎣ N L i, j 32 Density function i, j ← l=1 L i. j

30

Coffee function i, j ← N −1 a=1 N a=1

33

ιi, j ←

(co f f ee f unction i, j )2 + (exact similarit y f unction i, j )2 + (densit y f unction i, j )2

34

Objective function i, j ← ιi, j

Algorithm 10. Subsequently, the i-th composer builds the harmonic matrix (which is denoted by M Hi ) with Algorithm 11. The second sub-phase generates and evaluates a new tune for each composer in the society. The corresponding procedure is based on each composer and the set of constraints in Table 12. Its process is described in Algorithm 12.

123

R. A. Mora-Gutiérrez et al.

Algorithm 9: Evaluating solutions for the second version of the MMC symbolic

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

binar y

Input: tunei, j, feasible, tunei, j, feasible, L i, j , scor e(A), N , L i, j and Librar y Output: value of objective function i, j , value of coffee function i, j Objective function i, j ← ∅ Coffee function i, j ← ∅ Exact similarity function i, j ← ∅ Density function i, j ← ∅ for a = 1 : N − 1 do for b = a + 1 : N do Υ ←∅ Take the sa − th sequence of tunei, j,

symbolic

symbolic Take the sb − th sequence of tunei, j, Librar ya is the a − th sequence aligned with the b − th sequence, which is included in

Librar y Librar yb is the b − th sequence aligned with the a − th sequence, which is included in Librar y for l = 1 : L i, j do = − ) or (s = − ) then if (sa,l b,l Υ1,l = sa,l Υ2,l = sb,l end end is the number of columns of Υ L a,b ιa,b ← scor e(A)a,b ∗ L a,b Wa,b ← 0 ωa,b ← 0 for l = 1 : L a,b do if (Υ1,l = Librar ya,l ) or (Υ2,l = Librar yb,l ) then Wa,b = Wa,b + 1 end if (Υ1,l = Υ2,l ) then ωa,b = ωa,b + 1 end end end end N −1 N

b=a+1 Wa,b ∗scor e(A)a,b b=a+1 L a,b∗scor e(a,b)∗scor e(A)a,b N −1 N b=a+1 ωa,b a=1 31 Exact similarity function i, j ← N −1 N L a,b b=a+1 ⎥ ⎢ a=1 ⎢ N binar y ⎥ ⎢ l=a tunei, j,a,l ⎥ ⎦ ⎣ N L i, j 32 Density function i, j ← l=1 L i. j

30

Coffee function i, j ← N −1 a=1 N a=1

33

34 35

ιi, j ← (co f f ee f unction i, j )2 + (exact similarit y f unction i, j )2 + (densit y f unction i, j )2 μl is the average of the lengths of the N aligned sequences σl2 is the variance of the lengths of the N aligned sequences Objective f unction i, j ← β0 + (β1 ∗ μl ) + (β2 ∗ σl2 ) + (β3 ∗ N ) + (β4 ∗ co f f ee f unction i, j ) +(β5 ∗ exact similarit y f unction i, j ) + (β6 ∗ densit y f unction i, j )+ (β7 ∗ ιi, j )

123

Adaptation of the method of musical composition

Algorithm 10: Determining the fitness of K Mi

1 2 3 4 5 6 7 8 9 10 11 12 13

Input: Set of values obtained of the coffee functioni,: , the SP-scorei,: and the gap penalty function i,: for the i − th composer,K Mi Output: f itness(K Mi ) a1 is the number of tunes in K Mi α1,a ← ∅ for a = 1 : a1 do a2 ← (1 − co f f ee f unction i,a )2 a3 ← (1 − S P − scor e f unction i,a )2 a4 ← (1 − gap penalt y f unction i,a )2 √ α1,a ← a2 + a3 + a4 end αmax ← is the maximum value contained in α1,a for a = 1 : a1 do 1 ← (α α1,a max − α1,a ) + a1 end a1 γ1 ← a=1 for a = 1 : a1 do α

f itness(K Mi,a ) ← γ1,a 1

14 15

end

Algorithm 11: Building M Hi 1 2 3 4 5

Input: K Mi ,L i, j ,N Output: M Hi , φ φ ← max(L i, j ) binar y

Add column vectors of zeros in each tunei, j, of K Mi until each tunei, j, has a length equal to φ for a = 1 : N do for l = 1 : φ do M Hi,a,l is the average of values contained in the a − th row and l − th column of all binar y

6 7 8

tunei, j, included in K Mi M Hi,a,l ← r ound(M Hi,a,l ) end end

2.2.5 Updating P,,i In this step, each composer makes a decision on either replacing or not the worst tune in P,,i with tunei,new . The decision is based on the value of the objective function and the coffee function. Algorithm 13 illustrates the procedure used for this purpose.

2.2.6 Building the set of solutions In this phase, the MMC algorithm selects the melody contained in the artwork of every composer that achieves the best objective function value. Then, the solution with the best objective function of each composer in the society is included in S, .

123

R. A. Mora-Gutiérrez et al.

Algorithm 12: Generating a new tune Input: K Mi , M Hi , i f g,c f g Output: tunei,new 1

2 3

K M ← ∅ Take the tunei, j, content in K Mi,: that is associated to the best value of the f itness(K Mi ). the tunebinar y taken in the previous step. Assign to K M1,: i, j, binar y

binar y

Recalculate f itness(K Mi ) without considering the tunei, j,

appropriated in the step 1, this

recomputed fitness is denoted as f itness(K Mi )

4 5 6 7 8 9 10 11 12 13

14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

content in K Mi,: based on f itness(K Mi ) binar y Assign to K M2,: tunei, j, taken in the previous step. binar y Randomly choose a tunei, j, content in K Mi,: binar y Assign to K M3,: the tunei, j, taken in the previous step. binar y L K Mi contained the numerical values of lengths associate with tunei, j, in steps 2, 5 and 7 Randomly choose an element of L K Mi , denoted L new binar y

Probabilistically choose a tunei, j,

block1 ← ∅ block2 ← ∅ block3 ← ∅ Determine the number of columns ([L 1 L 2 L 3 ]) that are contained in each blockk ∀ k = 1, . . . , 3. The elements L k ∀ k = 1, 2, 3 must satisfy the following constraint L new = 3k=1 L k , L k ≥ 1, L k ≤ L new − 2 and L k ∈ Z binar y tunei,new ← ∅ a1 = 0 if rand ≤ (1 − i f g) then for k = 1 : 3 do a2 ← r ound(1 + rand ∗ (2)) a3 = a1 + L k a1 = a1 + 1 if rand ≤ (1 − c f g) then blockk ← K Ma 2 ,a1 :a3 else blockk ← M H1,a 1 :a3 end if rand ≤ c f g then Randomly select an element in blockk and convert 0 to 1 and 1 to 0 end binar y

binar y

tunei,new ← blockk ∪ tunei,new end binar y

Make possible tunei,new with Algorithm 7 simboly

binar y

Generate tunei,new based on tunei,new else Randomly generate a tunei,new based on Algorithm 5 end Evaluate tunei,new through Algorithm 8 or 9 for both MMC versions

2.3 Satisfying the termination criterion and selecting the best solution Steps from 13 to 24 of Algorithm 2 are repeated max _arrangement times. Subsequently, the tune in S with the best objective function value is selected.

123

Adaptation of the method of musical composition

Algorithm 13: Updating P,,i 1 2 3 4 5 6 7 8 9 10 11 12 13

Input: objective function of P,,i , coffee function of P,,i , P,,i and tunei,new , Output: P,,i updated for i = 1 : N do for j = 1 : L do if the objective function of tunei,new is better than the objective function of tunei,wor st then Replace tunei,wor st by tunei,new in P,,i . else if the objective function of tunei,new is equal to that of tunei,wor st then if the coffee function of tunei,new is better than the coffee function of tunei,wor st then Replace tunei,wor st by tunei,new in P,,i . end end end end end

Computational experiments performed in order to test both versions of the proposed algorithm for several instances of multiple sequence alignment problem are presented in the following section. 3 Experimental methodology and test problems In this section, the performance of the proposed method is investigated through of numerical simulation. 3.1 Methodology The performance of the proposed methods was evaluated and analyzed for different benchmark alignments, taken from BAliBASE 3 [3,38,39]. In particular, we used 26 and 12 instances, which were randomly selected from BaliBase 3.0, for the first and second version of the proposed algorithm, respectively. These problems are included in the references from 1 to 5. Table 1 shows the instances considered for evaluation of the first version of our algorithm, where the name of the reference, the type of problem and its characteristic are depicted. Due to the stochastic nature of the MMC, 20 independent runs on each problem instance were carried out for both methods. The alignments found by each instance were evaluated with QSCORE program [10], which was downloaded from the website http://www.drive5.com/qscore/. The QSCORE is a scoring program that compares two multiple sequence alignments. The results include the following scores: the (a) Q score, (b) the modeler score, (c) the Cline et al., shift score and (d) The TC score. In the following, these scores are briefly described: (a) The Q score, also called SP score, is defined as the number of correctly aligned residual pairs that are found in the test alignment divided by the total number of

123

R. A. Mora-Gutiérrez et al. Table 1 Instances considered to evaluate two versions of the proposed algorithm References

Instance name

Characteristics of the instance N

[1]

a

b

μj

σ 2j

BB11002

8

193

52

98

2,117.142857

BB11007

9

457

385

418.3333333

698.25

BB11020

9

236

201

219.1111111

191.8611111

BB11030

14

392

236

291.7857143

2,189.873626

BB11035

5

138

71

97.4

778.3

BBS11016

8

345

305

324.125

184.9821429

BBS11030

14

220

169

184.2142857

206.9505495

BBS11034

8

355

311

336.75

247.0714286

BB12003

8

85

58

71.25

138.2142857

BB12005

9

234

197

209

163.5

BB12010

7

415

295

348.2857143

1,606.904762

BB12023

5

738

681

704.2

674.7

BBS12002

6

190

165

182

79.2

BBS12008

13

381

351

367.2307692

58.69230769

BBS12013

8

714

665

685.75

281.0714286

BBS12020

4

126

116

123

22

BBS12031

10

1,006

827

917.5

3,770.722222

BBS12037

13

509

451

480.9230769

190.9102564

BB20002

20

1,331

52

592.3

144,274.1158

BB20015

37

274

54

193.9189189

2,899.465465

BB20023

31

244

202

217.6774194

114.8258065

BB20030

47

155

76

108.8723404

538.2876966

BBS20027

29

271

237

250.7586207

90.6182266

[3]

BB30020

56

631

76

151.3392857

12,546.95552

[4]

BB40014

9

609

298

445.2222222

15,996.69444

[5]

BBS50016

18

523

303

361.6111111

4,192.486928

[2]

Where: N is the number of sequences, a is the longest sequence, b is the shortest sequence, μ j is the average length of sequences and σ 2j is the variance of the lengths of the sequences

aligned residual pairs of the reference alignment, increasing with the number of sequences correctly aligned. The Q score ranges from 0 to 1, values near to 1 indicate a better alignment [3,10,28]. (b) The modeler score indicates the fraction of residual pairs in the test alignment that are properly aligned compared to the reference. The modeler score ranges from 0 to 1. (c) The Cline et al. shift score is based on a measure of disagreement between two alignments, which includes penalties for both over and under alignment, and gives positive but reduced scores for slight misalignment. The Cline et al. shift score range from −∞ to 1.

123

Adaptation of the method of musical composition

(d) The TC score checks for identical columns in each set of aligned sequences. The TC score ranges from 0 to 1. We decided not to use the scoring program Bali-score, which was proposed by Thomson in [3], based on those reported in [4,5,41] The model of multiple linear regression is used as the objective function in the second version of the MMC, see Eq. (4). In this model, some characteristics of each instance (μ j , σ j and N ) are combined to the numerical values of the function ι, the coffee function (co f f ee), the SP-score function (S P), the gap penalty function (gap) determined by the first version of the MMC in each run of all test instances, to produce the numerical values of the corresponding QSCORE. Q scor e j = β0 + β1 ∗ μ j + β2 ∗ σ j2 + β3 ∗ N + β4 ∗ co f f ee + β5 ∗ S P + β6 ∗ gap + β7 ∗ ι

(4)

The value of parameters β obtained through regression are shown in Eq. (5). Q scor e j = 0.061727889 − 0.00015888 ∗ μ j + 1.13349E − 06 ∗ σ j2 − 0.003121622 ∗ N − 4.3176E − 07 ∗ co f f ee + 2.532070999 ∗ S P + 0.054386105 ∗ gap + 8.22621E − 07 ∗ ι j (5) As previously mentioned, Eq. (5) was used as the objective function in the second version of our algorithm, so that objective f unction j = Q scor e j . 3.1.1 Comparison with other methods The numerical results obtained from both versions of the adapted MMC are compared with those achieved from 23 algorithms reported in the literature (see [21]). The methods used by the comparison are ALIGNER [14], ALIGNER [14], MUSCLE [10], TCOFFEE [26], MAFFT [13], CLUSTALW [12,37], DIALIGN [25], DIALIGNT [33], DIALIGN-TX [34], MULTALIN [6], PILEUP8 [40], PROBCONS [8], POAV [18], PSALIGN1 [36], PSALIGN2 [36], KALIGN [16], PROBALIGN [32], PCMA [30], ALIGN-M [42], PRANK [19], MUMMALS [31], SAM [15] and AMAP [35]. With the aim of comparing the results obtained by different metaheuristics on each case, all the obtained solutions were normalized through the following equation: f (x nor mali zed-α ) =

f (x method-α ) − f (x ) f (x wor st in β ) − f (x )

(6)

where: f (x ) is the value of the objective function at the global optimal point, f (x method-α ) is the average value of the objective function found by the metaheuristic α, f (x wor st in β ) is the worst average of the objective function found by metaheuristics on case β, and f (x nor mali zed-α ) is the normalized value of the objective function found by metaheuristic α.

123

R. A. Mora-Gutiérrez et al. Table 2 Results of the Q score obtained by the first version

Instance

Best

Worst

Average

Variance

BB11002

0.296

0.0987

0.192285

0.001815172

BB11007

0.458

0.0698

0.184375

0.010373354

BB11020

0.429

0.0363

0.216085

0.013279331

BB11030

0.0436

0.0051

0.020551

0.000131852

BB11035

0.522

0.373

0.4896

0.002148042

BBS11016

0.135

0.0571

0.10568

0.000440703

BBS11030

0.207

0.00557

0.0329835

0.002070051

BBS11034

0.14

0.0211

0.05745

0.000735445

BB12003

0.831

0.326

0.56025

0.016684303

BB12005

0.856

0.15

0.7034

0.039586147

BB12010

0.442

0.0522

0.1553

0.010551545

BB12023

0.197

0.0772

0.109095

0.000867295

BBS12002

0.851

0.569

0.7646

0.008418568

BBS12008

0.362

0.12

0.2174

0.004101937

BBS12013

0.853

0.278

0.7516

0.025677516

BBS12020

0.326

0.0556

0.115085

0.005312096

BBS12031

0.601

0.0457

0.36283

0.032786162

BBS12037

0.0847

0.0451

0.05837

0.000117206

BB20002

0.285

0.024

0.0836

0.004943591

BB20015

0.0229

0.00258

0.007826

1.93744E−05

BB20023

0.436

0.0352

0.151265

0.012279707

BB20030

0.791

0.606

0.71185

0.002845818

BBS20027

0.139

0.0643

0.094345

0.000396886

BB30020

0.03

0.00662

0.0139905

4.65263E−05

BB40014

0.68

0.0793

0.235095

0.036367384

BBS50016

0.355

0.0199

0.06511

0.005101065

The value of f (x nor mali zed-α ) ranges from 0 to 1. If f (x nor mali zed-α ) is close to 0, the value of f (x method-α ) is close to f (x ). If f (x nor mali zed-α ) is close to 1, the value of f (x method-α ) is far from f (x ).

3.1.2 Statistical validation A non-parametric Wilcoxon rank sum test was applied to the results obtained both by the MMC and by other methods. The null hypothesis is that data in two solution sets are independent: a test value h = 1 indicates a rejection of the null hypothesis at the 5 % significance level, while h = 0 indicates a failure to reject the null hypothesis at the 5 % significance level. Parameters p (standing for the symmetry and mean of the distribution) and h (which is the result of the test hypothesis) are also computed using this statistical test.

123

Adaptation of the method of musical composition Table 3 Results of the modeler score obtained by the first version

Instance

Best

Worst

Average

Variance

BB11002

0.0841

0.0319

0.06324

0.000200327

BB11007

0.136

0.0157

0.05987

0.000827287

BB11020

0.18

0.0136

0.08951

0.002362276

BB11030

0.0377

0.00156

0.0086915

6.65262E−05

BB11035

0.286

0.206

0.26395

0.000808787

BBS11016

0.0563

0.0218

0.04192

8.47248E−05

BBS11030

0.0955

0.00275

0.015929

0.000426907

BBS11034

0.0799

0.00798

0.024149

0.000254537

BB12003

0.518

0.192

0.34795

0.006949734

BB12005

0.61

0.106

0.50075

0.02046725

BB12010

0.281

0.00626

0.097268

0.004620921

BB12023

0.145

0.0577

0.080135

0.000485002

BBS12002

0.672

0.451

0.6044

0.005244884

BBS12008

0.285

0.0919

0.169495

0.002532764

BBS12013

0.636

0.207

0.56135

0.014337818

BBS12020

0.422

0.0517

0.095895

0.007612065

BBS12031

0.486

0.0373

0.294206364

0.021709772

BBS12037

0.0496

0.0263

0.03371

3.97567E−05

BB20002

0.0245

0.00167

0.0075625

3.50129E−05

BB20015

0.0134

0.000674

0.00226615

7.49171E−06

BB20023

0.173

0.0151

0.059065

0.001819874

BB20030

0.342

0.2

0.2413

0.000859589

BBS20027

0.526

0.0353

0.07291

0.01146213

BB30020

0.00836

0.00295

0.0044255

1.96855E−06

BB40014

0.343

0.0411

0.12233

0.009134874

BBS50016

0.22

0.0121

0.04101

0.001959607

3.2 Parameter setting for the MMC algorithm In order to carry out simulations, the set of parameters was tuned through a brute-force approach [2]. The set used by both MMC versions was maxa = 1500, i f g = 0.05, c f g = 0.05 f cla = 0.05, N = 10 and N s = 5.

4 Numerical results In this section, numerical results obtained by both proposed methods are presented. The MMC was implemented in Matlab R2007a with a MacBook Air with 1.8 GHz and intel core i7. As previously mentioned, all alignments generated by our metaheuristic were evaluated through the QSCORE program. For each tested instance, we calculated the best, the worst, and the average value for each one of scoring functions generated by Q score

123

R. A. Mora-Gutiérrez et al. Table 4 Results of the Cline et al., shift score obtained by the first version

Instance

Best

Worst

Average

Variance

BB11002

0.788

0.0302

0.11127

0.025822201

BB11007

0.529

0.00617

0.0991915

0.012553441

BB11020

0.26

0.0126

0.13654

0.004624844

BB11030

0.0762

0.00453

0.0206535

0.000215509

BB11035

0.382

0.306

0.3614

0.000614779

BBS11016

0.223

0.0129

0.04737

0.002192166

BBS11030

0.0683

0.00695

0.0246325

0.000183906

BBS11034

0.066

0.00832

0.033943

0.00027966

BB12003

0.658

0.277

0.4869

0.011650095

BB12005

0.736

0.179

0.6366

0.022634884

BB12010

0.321

0.000426

0.1129813

0.006865309

BB12023

0.109

0.0007744

0.02134672

0.000998591

BBS12002

0.79

0.568

0.73015

0.003774661

BBS12008

0.374

0.207

0.2893

0.002194432

BBS12013

0.751

0.334

0.6864

0.011698042

BBS12020

0.236

0.006268

0.0473684

0.004239938

BBS12031

0.541

0.00757

0.337879364

0.026329976

BBS12037

0.0953

0.0237

0.055235

0.000303137

BB20002

0.0365

0.000389

0.00859845

8.16546E−05

BB20015

0.0209

0.0121

0.01642

8.09116E−06

BB20023

0.289

0.0153

0.093565

0.004743081

BB20030

0.408

0.35

0.38085

0.000296029

BBS20027

0.129

0.0871

0.10718

0.000176026

BB30020

0.0102

0.000557

0.00539785

6.03102E−06

BB40014

0.493

0.00608

0.138114

0.021769693

BBS50016

0.27

0.00168

0.030181

0.003300878

from the different runs; as well as the corresponding variance. The results of the Q score, the modeler score, the Cline et al., shift score and TC score obtained by the first version on all test instances are shown in Tables 2, 3, 4 and 5, respectively. Tables 2, 3 and 4 highlight that the maximum average value is found for the instance BBS12002. On the other hand, Table 5 shows that his maximum average value is obtained for instance BBS12013. The runtime required for the first version for each of the tested instances is exhibited in Table 6. As can be observed, the runtime is directly related with the number of the aligned sequences. In Tables 7, 8, 9 and 10, the corresponding results of the Q score, the modeler score, the Cline et al., shift score and TC score, respectively, obtained by the second version of the proposed algorithm for all test instances are shown. These table prove that the maximum average value is produced for the instance BB12005. In addition, the maximum average value for the TC score is achieved for instance BB12003. On the other hand, Tables 2, 5, 7 and 10 show, the corresponding results of the Q score, the

123

Adaptation of the method of musical composition Table 5 Results of the TC score obtained by the first version

Instance

Best

Worst

Average

Variance

BB11002

0

0

0

0

BB11007

0

0

0

0

BB11020

0

0

0

0

BB11030

0

0

0

0

BB11035

0.351

0.135

0.31185

0.005408029

BBS11016

0

0

0

0

BBS11030

0

0

0

0

BBS11034

0

0

0

0

BB12003

0.486

0

0.141765

0.026462417

BB12005

0.606

0

0.344515

0.050928336

BB12010

0

0

0

0

BB12023

0

0

0

0

BBS12002

0.65

0.234

0.49645

0.015809629

BBS12008

0.332

0

0.058244

0.005763812

BBS12013

0.677

0.0932

0.53401

0.02978956

BBS12020

0

0

0

0

BBS12031

0

0

0

0

BBS12037

0

0

0

0

BB20002

0

0

0

0

BB20015

0

0

0

0

BB20023

0

0

0

0

BB20030

0

0

0

0

BBS20027

0

0

0

0

BB30020

0

0

0

0

BB40014

0.329

0

0.03155

0.008590366

BBS50016

0

0

0

0

modeler score, the Cline et al., shift score and TC score, respectively, obtained with the second version of the proposed algorithm on all the test instances. From these tables, it can be observed that higher numerical values for all score functions are obtained by the second version, indicating that its performance is better than the obtained from the first version of the MMC to solve the MSA. The runtime of the second version for each of the tested instances is exhibited in Table 11. It can be seen from the Tables 6 and 11 that the results obtained from both version of the modified MMC do not differ significantly. The average results of Q score, obtained by two versions of MMC, were used as a basis for the comparative study. Normalized results generated in the comparison of the two versions of the modified MMC versus others methods are displayed in Tables 12 and 13. It can be observed that for the instances BB11035 and BB20030, the first version of the proposed algorithm achieves the best results. In addition, the second best result is also obtained for the instance BBS12002. In contrast, the first version is not suitable

123

R. A. Mora-Gutiérrez et al. Table 6 Runtime utilized by the first version in seconds

Table 7 Results of the Q score generated by the second version

123

Instance

Best

Worst

Average

Variance

BB11002

313.9

484.06

360.5365

1,851.797603

BB11007

2,499.4

4,238.6

3,366.73

254,734.2127

BB11020

456.53

512.06

493.259

179.1435042

BB11030

1,416.8

5,013.7

3,183.195

1,675,495.845

BB11035

382.51

396.02

389.6675

19.95018816

BBS11016

1,567.1

1,791.5

1,684.025

4,907.1925

BBS11030

1,218.9

2,416.9

1,876.025

111,694.3325

BBS11034

594.52

1,304.9

699.7695

35,016.67854

BB12003

162.26

176.97

170.69

10.83497895

BB12005

454

1,362.3

1,070.975

162,093.0372

BB12010

805.23

1,114.8

960.355

7,457.000279

BB12023

1,347.6

1,901.8

1,580.46

21,266.38568

BBS12002

267.13

557.1

311.1945

5,974.470205

BBS12008

2,089.7

2,863.7

2,246.515

31,406.94239

BBS12013

1,100.2

1,223.1

1,165.505

954.9215526

BBS12020

144.4

251.26

181.838

796.0189326

BBS12031

5,697.1

7,970.8

6,678.09

456,173.2546

BBS12037

1,741.3

1,914.4

1,842.255

2,284.295237

BB20002

5,748.6

7,747.1

6,508.395

282,848.7679

BB20015

10,699

17,723

13,475.4

3,499,357.832

BB20023

3,604.4

3,982.3

3,771.375

11,710.97461

BB20030

4,721.2

9,648.6

5,792.68

1,755,347.554

BBS20027

3,421.2

3,684.5

3,553.41

5,753.997789

BB30020

10,516

12,600

11,203.95

359,667.7342

BB40014

897.21

1,973.9

1,213.533

135,141.8266

BBS50016

2,171.3

2,586.2

2,409.545

8,041.012079

Instance

Best

Worst

Average

Variance

BB11002

0.824

0.208

0.3322

0.020416379

BB11007

0.513

0.324

0.43

0.001981368

BB11020

0.564

0.401

0.4936

0.002240674

BB11030

0.269

0.12

0.2009

0.001997989

BB11035

0.497

0.316

0.428

0.002609368

BBS11016

0.557

0.419

0.47175

0.001737987

BBS11030

0.332

0.178

0.26335

0.001908661

BBS11034

0.497

0.393

0.45435

0.000977818

BB12003

0.927

0.823

0.873

0.000863158

BB12005

0.899

0.802

0.84655

0.000689734

BB12010

0.866

0.736

0.7996

0.000926358

BB12023

0.812

0.733

0.7724

0.000480779

Adaptation of the method of musical composition Table 8 Results of the modeler score generated by the second version

Table 9 Results of the Cline et al., shift score generated by the second version

Table 10 Results of the TC score generated by the second version

Instance

Best

Worst

Average

Variance

BB11002

0.225

0.0532

0.09659

0.00150525

BB11007

0.204

0.131

0.1732

0.000308905

BB11020

0.229

0.164

0.2032

0.000320695

BB11030

0.0822

0.0379

0.0617555

0.000196687

BB11035

0.251

0.156

0.2117

0.000702326

BBS11016

0.219

0.165

0.18735

0.000260555

BBS11030

0.149

0.0828

0.120615

0.00038835

BBS11034

0.175

0.135

0.1581

0.000123568

BB12003

0.572

0.465

0.53735

0.000591503

BB12005

0.645

0.06

0.5827

0.015476537

BB12010

0.585

0.497

0.5389

0.000421884

BB12023

0.629

0.574

0.60175

0.000224724

Instance

Best

Worst

Average

Variance

BB11002

0.345

0.0619

0.13246

0.004150163

BB11007

0.33

0.214

0.28075

0.000848724

BB11020

0.342

0.26

0.3101

0.000601042

BB11030

0.135

0.0275

0.09407

0.00094117

BB11035

0.362

0.204

0.3095

0.001707211

BBS11016

0.342

0.259

0.294

0.000614947

BBS11030

0.256

0.1

0.1877

0.001975379

BBS11034

0.278

0.215

0.2483

0.000309274

BB12003

0.7171

0.659

0.687355

0.000289078

BB12005

0.774

0.705

0.73815

0.000352239

BB12010

0.721

0.622

0.67115

0.000472661

BB12023

0.728

0.663

0.7013

0.000317589

Instance

Best

Worst

Average

Variance

BB11002

0.294

0

0.015473684

0.004549263

BB11007

0.239

0.0141

0.12573

0.004084839

BB11020

0.347

0

0.23603

0.010619503

BB11030

0

0

0

0

BB11035

0.378

0.135

0.2511

0.003844516

BBS11016

0.218

0

0.12221

0.003890402

BBS11030

0.0435

0

0.00435

0.000179266

BBS11034

0.19

0.02

0.1165

0.002213421

BB12003

0.811

0.568

0.69355

0.004318366

BB12005

0.723

0.467

0.57545

0.004254787

BB12010

0.689

0.471

0.5431

0.003552726

BB12023

0.675

0.579

0.6193

0.000868747

123

R. A. Mora-Gutiérrez et al. Table 11 Runtime utilized by the second version in seconds

Instance

Best

Worst

Average

Variance

BB11002

321.24

370.96

346.0775

155.6638303

BB11007

769.58

820.27

796.3

199.8484526

BB11020

686.28

876.13

760.7175

3,068.49122

BB11030

1,993.2

2,186.8

2,073.715

2,529.141342

BB11035

130.43

146.93

139.218

28.25126947

BBS11016

925.96

1,045.5

964.105

1,537.081405

BBS11030

759.49

1,704.5

862.6725

80,630.9285

BBS11034

946.64

1,165.4

1,006.5235

3,161.136803

BB12003

154.26

162.59

158.262

5.953690526

BB12005

418.24

440.19

425.6415

34.42149763

BB12010

453.55

474.09

462.7695

30.39158395

BB12023

1,427.2

1,878.4

1,600.485

15,196.08555

for instances BBS12008, BBS12037, BB20015 and BB30020, where the worst results are obtained. However, it is important to stress that the average performance of the first version is better than those corresponding to the MULTIALIN, POAV and SAM. On the other hand, the second version generates the best solution for BBS11016, BBS11034 and BB12005, as well as the second best result on BB11002, BB11007, BB1120, BB11035, BB12003, BB12005, BB12010, BB12023, in comparison with the rest of the considered methods. Moreover, the average behavior of the best than the average behavior of the other methods. In addition, the results of Wilcoxon rank-sum test for both versions are shown in Tables 14 and 15, respectively. It can been seen in Table 14, that the first version of the modified MMC has a behaviour similar to CLUSTAL W, DIALIGN-T, DIALIGN-TX, ALIGN, PRANK, SAM and AMAP. Based on results of Table 15, it can be deduced that the second version of the MMC has a behavior different from CLUSTAL W, MULTAIN, POAV, PRANK, SAM and first version of MMC. 5 Conclusion In this paper, a new cultural method for the solution of the multiple sequence alignment problem was proposed. The method is an adaptation of a recently proposed metaheuristic referred as the Method of Musical Composition, which uses an artificial society where interactions between agents cause changes of the features of their individual artwork. In this work, two versions of the MMC are presented. The main difference between both versions is the objective function used to guide the search, however the numerical results showed that both versions are suitable to solve the MSA problem, although their behaviour is significantly different. The complexity of both versions of our algorithm is the order of O(N 2 ∗ L 2 ), which is similar to the complexity of other methods used in the literature.

123

Adaptation of the method of musical composition Table 12 Normalized results generated by the comparison of several heuristics on test instances

1

BB11002

BB11007

BB11020

1, 3, 4 to 23

11, 13, 22

11, 13, 22

2, 24

9, 24

10, 15, 24

0.9 0.8 0.7 0.6

20, 23

0.5

7, 19, 20

9, 14, 16

0.4

2, 23

4, 5, 7, 8, 12, 18

0.3

25

0.2 0.1

3, 10, 14, 15, 16, 18

2, 17, 19, 21

1, 25

3, 6

4, 5, 8.12, 17, 21

25

3

6

1

BB11030

BB11035

BBS11016

1

11, 13, 22

11, 13, 22

22

0.9

6, 20, 23, 24

20

9, 10

0.8

2

5, 8, 10

24

0.7

3, 9, 10, 16, 19

1, 2

11, 20

0.6

7

0.5

1, 15

15, 16

13, 16, 19

0.4

17, 18

4, 6, 7, 14, 23

3, 6, 18

0.3

4, 12, 14, 25

9, 12, 17, 18, 19

1, 5, 8, 12, 14, 15, 17

0.2

21

0

7, 23

21

4, 21

3, 25

2

5, 8

24

25

BBS11030

BBS11034

BB12003

1

22

22

13

0.9

24

24

11

0.8

7, 16, 20

10

0.7

11, 19

7, 9, 11, 15, 19, 20, 23

0.6

2, 23

3, 16

7, 20

0.5

6, 9, 13, 17, 18

1, 13

2, 24

0.4

3, 25

4, 6, 14, 17, 18, 21

0.3

10

2, 5, 8, 12

0.2

4, 5, 8, 12, 14, 21

3, 6, 9, 10, 16, 18, 22

0.1 0

0.1

15

0

1

1, 4, 5, 8, 14, 15, 19, 25 25

12, 17, 21, 23

123

R. A. Mora-Gutiérrez et al. Table 12 continued BB12005

BB12010

BB12023

13

13

13

0.9

11

11, 24

0.8

22, 24

1

0.7

11

0.6 0.5

2, 20, 22

0.4

9, 10, 16, 18

2

0.3

3, 4, 6, 7, 15, 19, 23

7, 19, 20, 23

0.2

5, 8, 12, 14, 17, 21, 24

4, 5, 6, 7, 8, 9, 10, 12, 15, 16, 19, 20, 23

1, 25

1, 21

1

BBS12002

BBS12008

BBS12013

1

13

11, 13, 24

13

0.9

22

0.8

11

0.1 0

2, 9, 10, 22 3, 4, 5, 6, 8, 12, 14, 15, 16, 17, 18, 21, 25

3, 14, 17, 18, 25

11

0.7 0.6

2

0.5

1

0.4

23

0.3

4, 5, 8, 9, 12, 17, 19, 20, 21

0.2

3, 6, 7, 10, 14, 15, 16, 18, 24

2, 9, 10 3, 7, 16, 20, 23

20, 24

4, 5, 6, 8, 14, 15, 17, 18, 19, 22

3, 5, 6, 7, 8, 9, 10, 14, 15, 16, 19

1

12, 21

1, 4,12, 17, 18, 21, 23

BBS12020

BBS12031

BBS12037

1

19, 22

11, 13

13, 24

0.9

24

0.1 0

11

0.8 0.7

24

0.6

13

0.5

11

0.4

2

0.3

5–10, 15, 16, 18, 20

2

9, 20

0.2

3, 4, 12, 14, 17, 21, 23

10, 20

3, 7, 23

3–9, 12, 14–19, 21–23

2, 5, 8, 15, 17, 18, 19

1

1, 4, 6, 12, 14, 21

0.1 0

1

22 10, 16

Where: 1 is ALIGNER, 2 is ALIGNER*, 3 is MUSCLE, 4 is TCOFFEE, 5 is MAFFT, 6 is PRRP, 7 is CLUSTALW, 8 is DIALIGN, 9 is DIALIGN-T, 10 is DIALIGN-TX, 11 is MULTALIN, 12 is PROBCONS, 13 is POAV, 14 is PSALIGN1, 15 is PSALIGN2, 16 is KALIGN, 17 is PROBALIGN, 18 is PCMA, 19 is ALIGN-M, 20 is PRANK, 21 is MUMMALS, 22 is SAM, 23 is AMAP, 24 is first-MMC, 25 is second-MMC

123

Adaptation of the method of musical composition Table 13 Normalized results generated by the comparison of several heuristics on test instances (continuation)

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

BB20002

BB20015

BB20023

3, 6, 7, 9, 10, 11, 13, 14, 16, 20, 21, 22, 23 1

3, 5, 7, 8, 12, 17, 19, 20, 22, 23, 24 4, 6, 9, 10, 11, 13, 14, 15, 16, 18, 21

11, 13

4, 5, 8 12, 15, 17, 18 19

9, 22 10, 15, 19, 24 14, 16, 17, 18, 20, 23 3, 5, 6, 8, 12 4, 7 21

24

2

1 2

2 1

BB20030

BBS20027

BB30020

1

11, 13

13

3, 6–9, 11, 13, 14, 19, 20–14

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

7, 9, 18 4, 6, 10, 14, 15, 16, 19, 13 3, 5, 8, 12, 17, 21, 22 20

11, 24 4, 10

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

1, 24

9, 19, 20 3, 10, 16, 18 7, 8, 22, 23 4, 5, 6, 14, 15, 21 12, 17 1, 2

BB40014

BBS50016

11, 13, 22

11, 13, 22 24

2

16, 17 12, 18 2, 5

1 15

24 6, 8, 16, 18, 20 3, 4, 5, 7, 9, 10, 12, 14, 15, 17, 19, 21, 23 2 1

7 8, 9, 10, 16 20, 23 6 2, 3, 5, 17, 18, 19 14, 15, 21 1, 4, 12

Where: 1 is ALIGNER, 2 is ALIGNER*, 3 is MUSCLE, 4 is TCOFFEE, 5 is MAFFT, 6 is PRRP, 7 is CLUSTALW, 8 is DIALIGN, 9 is DIALIGN-T, 10 is DIALIGN-TX, 11 is MULTALIN, 12 is PROBCONS, 13 is POAV, 14 is PSALIGN1, 15 is PSALIGN2, 16 is KALIGN, 17 is PROBALIGN, 18 is PCMA, 19 is ALIGN-M, 20 is PRANK, 21 is MUMMALS, 22 is SAM, 23 is AMAP, 24 is first-MMC, 25 is second-MMC

123

R. A. Mora-Gutiérrez et al. Table 14 Wilcoxon rank-sum test for the first version Mehtod

ALIGNER

ALIGNER*

MUSCLE

TCOFFEE

MAFFT

h

TRUE

TRUE

TRUE

TRUE

TRUE

Mehtod

PRRP

CLUSTALW

DIALIGN

DIALIGN-T

DIALIGN-TX

h

TRUE

FALSE

TRUE

FALSE

FALSE

Mehtod

MULTALIN

PROBCONS

POAV

PSALIGN1

PSALIGN2

h

TRUE

TRUE

TRUE

TRUE

TRUE

Mehtod

KALIGN

PROBALIGN

PCMA

ALIGN-M

PRANK

h

TRUE

TRUE

TRUE

FALSE

FALSE

Mehtod

MUMMALS

SAM

AMAP

h

TRUE

FALSE

FALSE

Table 15 Wilcoxon rank-sum test for the second version Mehtod

ALIGNER

ALIGNER*

MUSCLE

TCOFFEE

MAFFT

h

FALSE

FALSE

FALSE

FALSE

FALSE

Mehtod

PRRP

CLUSTALW

DIALIGN

DIALIGN-T

DIALIGN-TX

h

FALSE

TRUE

FALSE

FALSE

FALSE

Mehtod

MULTALIN

PROBCONS

POAV

PSALIGN1

PSALIGN2

h

TRUE

FALSE

TRUE

FALSE

FALSE

Mehtod

KALIGN

PROBALIGN

PCMA

ALIGN-M

PRANK

h

FALSE

FALSE

FALSE

FALSE

TRUE

Mehtod

MUMMALS

SAM

AMAP

first-MMC

h

FALSE

TRUE

FALSE

TRUE

The performance of the first version of the modified MMC was tested on twenty six instances taken from the BAliBASE 3 collection. The experimental results indicate that its behavior is better or similar than those obtained from other methods based on the Hidden Markov Model. In particular, according to the Wilcoxon ranksum test, this version has a behavior similar to that of CLUSTAL W, DIALIGNT, DIALIGN-TX, ALIGN, PRANK, SAM, and AMAP. On the other hand a second MMC adaptation, was evaluated on twelve instances taken from the BAliBASE 3 collection. The experimental results show than the second version has a good behavior, since for 25 % of the tested instances, the algorithm found the best alignment with respect to data of other state-of-the-art algorithms. Besides, for 50 % of the tested instances the second version of the modified MMC achieved the second best alignment. Finally, the significance of the numerical results was analyzed according to the Wilcoxon rank-sum test, which indicated that the second proposed version is statistically similar to ALIGNER, ALIGNER∗ , MUSCLE, TCOFFEE, MAFFT, PRRP, DIALIGN, DIALIGN-T, DIALIGN-TX, PROBCONS, PSALIGN1, PSALIGN2, KALIGN, PROBALIGN, PCMA, ALIGN-M, MUMMALS, and AMAP. The differences in the performance of both versions are attributed to the changes in the definition of the objective function used by each version. This aspect will be considered for the development of future works.

123

Adaptation of the method of musical composition

References 1. Altschul SF, Erickson BW (1986) Optimal sequence alignment using affine gap costs. Bull Math Biol 48(5–6):603–616 2. Birattari M (2009) Tuning metaheuristics: a machine learning perspective. Studies in computational intelligence, vol 197. Springer, Berlin 3. Bahr A, Thompson JD, Thierry JC, Poch O (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res 29(1):323–326 4. Baewicz J, Formanowicz P, Wojciechowski P (2009) Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark. Int J Appl Math Comput Sci 19(4):675– 678 5. Blazewicz J, Frohmberg W, Kierzynka M, Wojciechowski P (2013) G-MSAA GPU-based, fast and accurate algorithm for multiple sequence alignment. J Parallel Distrib Comput 73(1):32–41 6. Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16(22):10881–10890 7. Daugelait J, O’ Driscoll A, Sleator R (2013) An overview of multiple sequence alignments and cloud computing in bioinformatics. ISRN Biomath 2013:Article ID 615630 8. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S (2005) PROBCONS: probabilistic consistencybased multiple sequence alignment. Genome Res 15:330–340 9. Duret L, Abdeddaim S (2000) Multiple alignments for structural, functional or phylogenetic analyses of homologous sequences, Bioinformatics Sequence structure and databanks. Oxford University Press, Oxford 10. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797 11. Edgar RC, Serafim B (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–373 12. Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244 13. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):305966. doi:10.1093/nar/gkf436 14. Kelil A (2011) Contribution a` l’analyse des séquences de protéines similarité, clustering et alignement. PhD thesis. Université de Sherbrooke Faculté des sciences 15. Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (1994) Hidden Markov models in computational biology: applications to protein modeling. J Mol Biol 235:1501–1531 16. Lassmann T, Sonnhammer ELL (2005) Kalignan accurate and fast multiple sequence alignment algorithm. BMC Bioinf 6:298 17. Lee ZL, Su SF, Chuang CC, Liu KH (2008) Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Appl Soft Comput 8(1):55–78. ISSN 1568–4946 18. Lee C, Grasso C, Sharlow MF (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18(3):452–464 19. Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA 102(30):10557–10562 20. Manthey B (2005) Non-approximability of weighted multiple sequence alignment for arbitrary metrics. Inf Process Lett 95(3):389–395 21. Mayers A, Monga E, Wang S (2014) ALIGNER detecting and aligning related protein sequences. ProspectUS. 16 February 2010. Check in 25 May 2014. Website http://prospectus.usherbrooke.ca/ aligner/Results/BALIBASE3.htm 22. Mora-Gutiérrez RA, Ramírez-Rodríguez J, ElizondoO-Cortes M (2011) Heurística para solucionar el problema de alineamiento múltiple de secuencias. Rev Mat [online] 18(1):121–136 23. Mora-Gutiérrez RA, Ramírez-Rodríguez J, Rincón-García EA (2012) An optimization algorithm inspired by musical composition. Artif Intell Rev 41(3):301–315 24. Mora-Gutiérrez RA, Ramírez-Rodríguez J, Rincón-García, Ponsich A, Herrera O (2012) An optimization algorithm inspired by social creativity systems. Computing 94(11):887–914 25. Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14:290–294 26. Notredame C, Higgins D, Heringa J (2000) T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217

123

R. A. Mora-Gutiérrez et al. 27. Notredame C (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3(1):131–144 28. Nuin PAS, Wang ZZ, Elisabeth RM (2006) The accuracy of several multiple sequence alignment programs for proteins. Bioinformatics 7:471–489 29. Prakash Lingam KM, Chandrakala S (2011) A survey on recent developments in multiple sequence alignment methods. J Nat Sci Biol Med 2:96–97 30. Pei J, Sadreyev R, Grishin NV (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19(3):427–428 31. Pei J, Grishin NV (2006) MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 34(16):4364–4374 32. Roshan U, Livesay DR (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22(22):2715–2721 33. Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B (2005) Dialign-t: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinf 6:66. doi:10.1186/ 1471-2105-6-66 34. Subramanian AR, Kaufmann M, Morgenstern B (2008) DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol 3:6 35. Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23(2):e24– e29 36. Sze S-H, Lu Y, Yang Q (2006) A polynomial time solvable formulation of multiple sequence alignment. J Comput Biol 13:309–319 [Also appear in Proceedings of the 9th annual international conference on research in computational molecular biology (RECOMB’2005). Lecture notes in bioinformatics, vol 3500, pp 204–216] 37. Thompson J, Higgins D, Gibson T (1994) ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4690 38. Thompson JD, Ripp O (2014) BAliBASE 3 website of the LBGI Bioinformatique et Génomique Intégratives. Web 15 April 2014. http://lbgi.fr/balibase/ 39. Thompson JD, Koehl P, Ripp R, Poch O (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins Struct Funct Bioinf 61(1):127–136 40. Wisconsin Package v. 8, Genetics Computer Group, Madison, WI. http://www.gcg.com. Accessed 7 Aug 2014 41. Wojciechowski P, Formanowicz P, Blazewicz J (2014) Reference alignment based methods for quality evaluation of multiple sequence alignment—a survey. Curr Bioinf 9(1):44–56 42. Van Walle I, Lasters I, Wyns L (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 20(9):1428–1435

123

Adaptation of the method of musical composition for solving the multiple sequence alignment problem

Recommend Documents