Computational approaches, databases and tools for in silico motif discovery

Motifs are the biologically significant fragments of nucleotide or peptide sequences in a specific pattern. Motifs are categorized as structural motif...

0 downloads 45 Views 237KB Size

Download PDF

Interdiscip Sci Comput Life Sci (2012) 4: 239–255 DOI: 10.1007/s12539-012-0141-x

Computational Approaches, Databases and Tools for in silico Motif Discovery Tanmaya Kumar SAHU1 , A.R. RAO1∗ , Shuchi VASISHT2 , Nishtha SINGH1 , U.P. SINGH1 1

(Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi 110012, India) 2 (Delhi Institute of Advanced Studies, Guru Gobind Singh Indraprastha University, Sector-16 C, Dwarka, New Delhi 110075, India)

Received 17 October 2011 / Revised 12 April 2012 / Accepted 13 June 2012

Abstract: Motifs are the biologically signiﬁcant fragments of nucleotide or peptide sequences in a speciﬁc pattern. Motifs are categorized as structural motifs and sequence motifs. These are discovered by phylogenetic studies of similar genes across species. Structural motifs are formed by three dimensional arrangements of amino acids consisting of two or more α helices or β strands whereas sequence motifs are formed by the nucleotide fragments appearing in the exons of a gene. The arrangement of residues in structural motifs may not be continuous while it is continuous in sequence motifs. Sequence motifs may encode to the structural motifs. The algorithms used for motif discovery are important part of the bio-computational studies. The purpose of motif discovery is to identify patterns in biopolymer (nucleotide or protein) sequences to understand the structure and function of the molecules and their evolutionary aspects. The main aim of this paper is to provide systematic compilation of a review on diﬀerent approaches, databases and tools used in motif discovery. Key words: motif discovery, artiﬁcial intelligence, artiﬁcial neural networks, enumerative approaches, evolutionary algorithms, phylogeny, biopolymers.

1 Introduction Discovery of the regulatory motifs is an important topic in computational biology. Indeed, these motifs control the expression of genes. But direct experimental determination of these motifs is not practical or eﬃcient in many biological systems (Conlon et al., 2003). The purpose of motif discovery is to discover patterns in bio-polymeric (nucleotide or peptide) sequences (Bailey et al., 2007) and connectivity-patterns in gene or protein interaction networks to understand and unravel the structure and function of the bio-molecules represented by these motifs. Algorithms meant for motif discovery are important for computational biologists to identify phylogeny and function of the gene products. It is well known that, most of the functional elements are not coding for proteins and are expected to include regulatory signals, genes for RNA and other structural elements. Although the largest and most conserved elements are readily identiﬁed, the vast majority of non-coding functional elements remain unknown (Xie et al., 2005), in particular, recognition of short motifs, such as the DNA binding sites for transcription ∗

Corresponding author. E-mail: [email protected]

factors. Also, function prediction of unknown proteins is another challenging problem in functional genomics. Several computational approaches have been developed for detection of short sequences/motifs of amino acids that are conserved within a family of closely related protein sequences (Hudak and McClure, 1999). Conserved blocks/motifs in the sequences of a family can often highlight features responsible for structural similarity between proteins and can be used to predict the three dimensional structure of a protein as protein structures are vital from their functional view point. Motifs may also enclose powerful diagnostic features, generating rules for determining whether or not an unknown sequence belongs to a family and thus deﬁne a characteristic function for families (Blekas et al., 2003). Though the ultimate aim of motif discovery is to identify, classify and characterize an individual based on its bio-molecules or bio-molecular interaction networks, the motif structure varies based on the object considered for this purpose. DNA motifs generally function as regulatory elements in biological systems (e.g. transcription regulatory motifs) whereas protein motifs are generally the functional sites (e.g. catalytic and DNA binding sites). The DNA and protein motifs are the nucleotide or amino acid sequence based, while the network motifs are the species or family speciﬁc con-

240

nectivity patterns in the interaction (gene or protein) networks. Therefore, all the tools and algorithms for ﬁnding these motifs are diﬀerent. DNA and Protein motif algorithms use diﬀerent position based scoring systems or matrices whereas network motif algorithms are generally based on structural stability scoring systems. However exploration on the importance of these motifs, showed the necessity to provide a review on the approaches, tools and databases for identifying the sequence, structural and network motifs in the DNA, protein and their interaction networks respectively. Hence the aim of the paper is to provide an insight into the algorithms and approaches used for motif discovery and their applications in the development of online servers, oﬄine tools and software. It also provides a brief review on the online databases for motif ﬁnding with the appropriate references and URLs.

2 Classification of motifs Motifs are generally classiﬁed into three major types, i.e. Sequence, Structure and Network motifs. Besides, motifs are also categorized based on the nature of bio molecules such as DNA, RNA and Protein motifs where the DNA and RNA motifs are often referred as nucleotide motifs and protein motifs as peptide motifs. It is to be noted that the sequence and structure motifs are the conserved patterns of either nucleotides or peptides but network motifs are conserved sub graphs of gene or protein interaction networks. 2.1 Sequence motifs Sequence motifs are nucleotide or peptide sequence patterns, presumed to have a biological signiﬁcance. They are either DNA sequence motifs or protein sequence motifs. DNA motifs are short, recurring patterns often indicate sequence-speciﬁc binding sites for proteins like nucleases and transcription factors (TF). Few others are involved in many RNA level processes that include ribosome binding, mRNA processing and transcription termination. On the other hand, protein sequence motifs are signatures of protein families and often determine the function of proteins. These motifs are formed by the continuous amino acid sequences and often represents catalytic or active sites of metabolic enzymes and regulatory proteins. Protein sequence motifs can often be recognized by simple inspection of the amino-acid sequence of a protein, and when detected provide strong evidence for biochemical function. Sequence motifs explain and recognize the speciﬁc characteristics of DNA, RNA and protein sequences such as TF binding sites, splice junctions and protein-protein interaction sites (Martin et al., 2008). 2.2 Structure motifs In a protein or nucleic acid chain, a structural motif is recognised as a super-secondary structure. DNA structural motifs occur in unimolecular folded DNAs as

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

well as bimolecular hybridization. These structural motifs include a variety of loop motifs including hairpins, bulges, internal loops, and multi-branched loops (SantaLucia et al., 2004). In case of proteins, the structural motifs are formed by three dimensional arrangements of continuous or discontinuous amino acids and therefore distinguished from sequence motifs. These motifs often include loops of variable length and unspeciﬁed structure, which in eﬀect create the “slack” necessary to bring together in space two elements that are not encoded by immediately adjacent DNA sequences in a gene. Certain structural motifs having a particular functional signiﬁcance or having a contribution to an independently folded functional domain are referred to as functional motifs. For example, the helix-turn-helix motifs found in many DNA-binding proteins do not exist as a stably folded domain, if expressed separately from the functional domain. But when detected in a DNA binding domain of a protein, behaves like a candidate motif for the DNA binding proteins (Gregory et al., 2004). Beta hairpin, Greek key, Omega loop, Helixloop-helix, Zinc ﬁnger are the well known examples of such functional motifs. 2.3 Network motifs The network motifs are the connectivity-patterns (sub-graphs), which occur very commonly in speciﬁc types of gene and protein interaction networks. Each type of network seems to display its own set of characteristic motifs. Alon et al. (2006) ﬁrst presented the concept of network motifs when such motifs were discovered in the gene regulation (transcription) network of the bacteria E. coli (Shen-Orr et al., 2002). Understanding the network motifs in gene regulatory networks is very important as these networks control the gene expression in response to biological signals. In the gene regulatory networks the genes represent the nodes, and directed edges represent the control of one gene by a transcription factor (regulatory protein that binds DNA) encoded by another gene. Thus, network motifs are patterns of genes regulating each other’s transcription rate. It is worth noted that the sequence motifs may or may not always be associated with structural and vice versa as two biopolymers may share the same motif without any considerable primary structure similarity and the existence of a sequence motif does not necessarily imply a distinctive structure.

3 Approaches for motif discovery The application of artiﬁcial intelligence (AI) in the ﬁeld of bioinformatics is growing at an exponential rate. This is the area of computer science which especially deals with the problems through the use of heuristics and probabilistic approaches. AI approaches stand out when dealing with the problems having no require-

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

ment for a strong constraint but requirement of a weak constraint (Narayanan et al., 2003). Many problems in bioinformatics do not have strong constraints, thus there is plenty of scope for the application of AI techniques in Bioinformatics. In particular, these AI approaches can be divided into two categories: (i) enumerative approaches and (ii) statistical approaches. The former attempts to enumerate the search space of possible motifs and test each for signiﬁcance, while the latter attempts to optimize the parameters of a motif model by statistical analysis of the sequence data. Enumerative algorithms ﬁnd an optimal solution for the discrete representations of relatively short motifs, but do not scale to larger motifs and continuous models due to combinatorial growth in the search space (Lones et al., 2007). Statistical approach such as position frequency matrices is used for ﬁnding continuous motif models. The statistical approaches used in motif discovery tools are Expectation Maximization and Gibbs sampling algorithms. “Multiple Expectation Maximization for Motif Elicitation” (MEME) is perhaps a best known tool which uses the above approaches. 3.1 Enumerative approaches 3.1.1 Evolutionary computation approach/genetic algorithms Evolutionary Computation (EC) is a modern search technique which uses computational models of processes of evolution and selection. Concepts and mechanisms of Darwinian evolution and natural selection are encoded in evolutionary algorithms (EAs) and among computational paradigm, EC is now recognized as particularly appropriate for various traditional and novel computational applications in structural engineering (Kicinger et al., 2005). EAs are stochastic population based search algorithms that carry out evolutionary search process loosely modelled upon biological evolution. EAs require a user to deﬁne a cost function so that alternative solutions can be scored appropriately. Lones and Tyrell (2005) have highlighted one of the 3 characteristics of EA that make them suitable for motif discovery: global search that is neither exhaustive nor biased by speciﬁc heuristics, representational ﬂexibility, and no dependence between the way in which solutions are derived and the way in which they are scored. Lones and Tyrell (2007) have described a novel EA for regulatory motif discovery in DNA promoter sequences by using data clustering to logically distribute the evolving population across the search space. Similarly, Hubley and Roach (2003) have implemented a modiﬁed version of the Strength-Pareto EA (SPEA2) in java i.e. Multiobjective Analyzer for Genetic Maker Acquisition (MAGMA) for the selection of Single Nucleotide Polymorphisms (SNPs) at motifs. This set is very useful for the design of large scale studies, involving disease identiﬁcation, genetic mapping, population

241

studies and haplotype block elucidation. Broadly, EAs are sub-divided into: Evolution Strategies (ES), Evolutionary Programming (EP), and Genetic Algorithms (GAs). Evolution Strategies (ES) are probabilistic optimization techniques and EPs are meant for the evolution of a simple program to a very larger and highly reliable software system by making the code more robust. GAs are a family of computational models inspired by evolution which encode a potential solution to a speciﬁc problem on a simple chromosome like data structure and apply recombination operators to these structures so as to preserve critical information. These algorithms are often viewed as function optimizers being applied to a broad range of problems. However, in all the above EC methods, diﬀerent methods of ﬁtness calculation are used and only one motif per sequence is assumed. But, in a sequence, multiple similar motifs may exist, and identiﬁcation of those motifs is equally important for identiﬁcation of a single motif per sequence. Paul and Iba (2006) proposed a novel GA based method for identiﬁcation of multiple motifs in each of the given sequences. This method can handle longer motifs and can identify multiple positions of the motif instances of a consensus motif. For a broader review of EC and structural design, the reader may refer to Kicinger et al. (2005). Another major EA developed recently is Genetic Programming (GP). GP is an extension of GA in which an initial population of computer programs composed of the primitive functions is generated and then each program in the population is executed and its ﬁtness at solving the problem is evaluated. A new generation of programs is then created by applying reproduction and crossover operators. 3.1.2 Self Organizing Maps (SOM) SOM is a type of ANN, trained using unsupervised learning, to produce a low-dimensional and discretized representation of input space of the training samples. SOMs use a neighbour-hood function to conserve the topological properties of the input space. This makes SOM useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. Liu et al. (2005) used self-organizing neural networks for identifying motifs with insertions/deletions, by introducing a new deﬁnition for calculating the distance between a pair of subsequences. This algorithm can ﬁnd motifs with up to two insertions, deletions or their combinations. The results showed that the performance of this algorithm is more accurate and costs less computation than existing algorithms. Self organizing neural networks have been applied to genome signature analysis, endogenous retrovirus clustering, various classiﬁcation problems and eﬃcient design of DNA microarrays for SNP analysis. Mahony et al. (2005) proposed an approach based on SOM where the output layer neuron weight vectors are replaced by Position Weight Matrices (PWMs). This

242

approach is used as an aid in overrepresented motif discovery as it characterises the features present in a set of sequences. Further, Mahony et al. (2006) evaluated the performance of three Self-organizing neural networks to solve the motif ﬁnding problem. Here, they integrated a SOM-based motif-ﬁnder, SOMBRERO, with a SOMbased method for automatic construction of generalized models for structurally related motifs and a selforganizing tree method to display the relationships between various motifs. Kohonen et al. (2007) described SOM and hence it is also known as Kohonen map. Selforganizing neural networks have wide applicability to the study of DNA-binding motifs and familiar binding proﬁles. Neural networks pose two signiﬁcant problems when applied to motif discovery: (i) their relative inﬂexibility when mapping sequence data to inputs, (ii) the diﬃculty of interpreting neural models, particularly when hidden layers are present within the network. 3.1.3 Longest Common Subsequences (LCS) and their Shortest Common Supersequences (SCS) approach Under this approach, a set of related sequences is given as input for deriving patterns in a set of biological sequences and the goal is to ﬁnd a set of patterns that are common to all or most of the sequences in the set. Ning and Leong (2006) proposed a PAtterns by LCS & SCS, (PALS) algorithm which discovers patterns in a biological sequences proﬁle by ﬁrst generating the results for LCS & SCS of sequences by heuristic method. Consequently, it derives patterns from the obtained results. PALS have high sensitivity & speciﬁcity, and they are eﬀective in deriving patterns close to the known consensus patterns for real sequences. 3.2 Statistical approaches The successful applications of sophisticated statistical models in computational biology have been observed during the past decade. These statistical models have been used to ﬁnd short repetitive pattern in a set of DNA or protein sequences based on either Probabilistic models or Bayesian models. 3.2.1 Clustering Clustering deals with the assignment of a set of observations into subsets or clusters, so that observations in the same cluster are similar in some as compared to the observations in the other clusters. A detailed description on diﬀerent clustering methods and distance measures is given in Johnson and Wichern (2007). This method is based on unsupervised learning, and is a common technique used for the analysis of data in many ﬁelds like pattern recognition, image analysis and bioinformatics. As far as transcriptomics is concerned, clustering is used to build the groups of genes with related expression patterns which are called as co-expressed genes. Often such groups contain functionally related proteins, such as enzymes for a speciﬁc pathway, or genes that are co-regulated and expected

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

to have some common regulatory motifs. In sequence analysis, clustering is used to group homologous sequences into a particular gene family based on sequence similarity. In high-throughput genotyping platforms, clustering is used to automatically assign genotypes to distinct groups. Consensus of a particular family helps to identify a recognizable pattern as a functional motif. 3.2.2 Fuzzy Logic Approach Fuzzy set theory and fuzzy logic have been used in bioinformatics, but to much less extent than graph theory and machine learning. These are the ideal frameworks for describing some biological systems/objects and provide suitable computational methods for a broad spectrum of bioinformatics problems. Fuzzy models attempt to quantify imprecision and uncertainty that is not easily captured by standard mathematical models, particularly in pattern recognition problems. Arredondo et al. (2005) developed a fuzzy inference engine based on information theory for genomic sequences (DNA) to predict coding regions. Fuzzy approaches are also suitable to describe and compare molecular structures. Cundari et al. (2001) used fuzzy logic in the analysis of a database of small molecular structures. Sadegh-Zadeh et al. (2000) and Torres et al. (2003) used polynucleotides as fuzzy sets and introduced a means to measure dissimilitudes between them as points in a hypercube. They used this method as a tool to compare diﬀerent genomic sequences. Jacob et al. (2005) used fuzzy scoring functions based on diverse biological information to predict operon, an important structure in bacterial genomes. Xu and Bondugula (2006) discussed the fuzzy concepts and methods in studying biological problem. They have presented two applications of fuzzy logic: one is fuzzy measurement of ontological similarity and its application in bioinformatics and the other is the application of fuzzy ‘k’ nearest neighbor algorithm in protein secondary structure prediction. 3.2.3 Hidden Markov Models (HMM) HMM is a statistical model where the system being modeled is assumed to be a Markov process with unobserved state. Normally, an HMM is considered as the simplest dynamic Bayesian Network. In a regular Markov model, the observer can directly visualize the state and therefore the state transition probabilities are the only parameters. In an HMM, the state is not directly visible, but output dependent on the state is visible. Each state has a probability distribution over the probable output tokens. Therefore the sequence of tokens generated by a HMM gives information on the sequence of states. Chudova et al. (2002) have derived the Bayes error rate for the problem of motif discovery in DNA sequences under a Markov assumption. They have used the ﬁrst order Markov framework as their base model where there is a hidden layer for each posi-

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

tion in the pattern and a background state to model the background. They found that the detection of structured patterns in a Markov context presents a more diﬃcult problem over the detection of random patterns. The advantage of using HMMs is in their ability to learn parameters directly from a set of observations, freeing the user from having to set the parameters a priori. 3.2.4

Position Weight Matrices (PWM)

PWM contains log odds weights for computing a match score. To check whether an input sequence matches the motif or not is based on a threshold. A PWM is designed from Position frequency matrix (PFM), which records the position-dependent frequency of each residue. ‘Systematic Evolution of Ligands by Exponential Enrichment (SELEX)’ experiments or computationally discovered tools such as MEME using Hidden Markov Models are used to determine the PFMs experimentally. Schwartz and Gygi (2005) presented an iterative statistical approach to identify protein phosphorylation motifs from large data sets of naturally occurring phosphorylation sites. They took two parallel sequence data sets for background probability calculations. These two data sets are converted into position-weight matrices where each matrix contained information on the frequency of all residues at 6 positions upstream and downstream of the phosphorylation site. They created a 3rd matrix, the binomial probability matrix using the phosphorylation matrix and background matrix (created from the background dataset). A greedy recursive search on the sequence space is then applied to extract a list of signiﬁcant motifs. This algorithm has the versatility to extract motifs from a wide range of data sets and may lead to discovery of novel biologically relevant protein motifs directly from a raw protein. 3.2.5

Bayesian models

Bayesian models are based on the principles of Bayes theorem and used to capture signiﬁcant patterns in genomic sequences and lead to more accurate motif prediction. The framework provides a basic infrastructure for hierarchical model dependence and for dealing with nuisance parameters without leading to over-whelming analytical complexity. The Bayesian concept allows a “learning” capability from “historical” data that can be used in modelling a new, but similar problem. Here, the priors can be obtained from partial information on known motifs thereby improving estimation of information on novel motifs. Gibbs sampling is commonly used as a means of Bayesian inference. This algorithm is used to generate a sequence of samples from the joint probability distribution of two or more random variables. It is applicable when the joint distribution is not known explicitly or is diﬃcult to sample from directly, but the conditional distribution of each variable is known and is easy to sample from (Liu et al., 1994).

243

4 Databases and tools for motif finding Several algorithms have been developed for discovery of DNA, Protein and Network motifs. Also, there exist several motif databases with dynamic search facility through web interface. Some of the databases and tools meant for motif discovery are discussed below. 4.1 Databases and tools for DNA motifs 4.1.1 Databases 4.1.1.1 TRANSFAC (TRANScription FACtors) TRANSFAC database has mutual cross-references with Flybase, EMBL and SwissProt databases. In appropriate cases, FACTORS are linked to PIR, and most CLASS entries point to the PROSITE database (Wingender et al., 1996). It is a widely used DNA motif database which contains information on transcription factors, their experimentally proven binding sites, nucleotide distribution matrices and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices (Matys et al., 2005). URL: http://www.gene-regulation.com/pub/databases.html 4.1.1.2 JASPAR JASPAR is an open access database of annotated matrix based transcription factor binding site proﬁles for multi-cellular eukaryotes. The proﬁles were derived exclusively from the sets of nucleotide sequences those are experimentally demonstrated to bind transcription factors. The web interface of the database is dynamically designed for browsing, searching and subset selection of DNA motifs. Besides, an online sequence analysis utility with a suite of programming tools for genome wide and comparative genomic analysis of regulatory regions are also incorporated in the website (Sandelin et al., 2004). URL: http://jaspar.cgb.ki.se 4.1.1.3 DBTBS (Database of Transcriptional regulation in Bacillus subtils) This database is enriched with experimentally validated gene regulatory relations and the corresponding transcription factor binding sites in upstream regions of Bacillus subtilis genes. DBTBS is based on 947 references and has a collection of the information on 120 binding factors and 1475 gene regulatory relations. For each promoter, all of its known cis-elements are listed according to their positions, while these cis-elements are aligned to illustrate the consensus sequence for each transcription factor. All probable transcription factors coded in the genome were classiﬁed using Pfam motifs (Sierro et al., 2008). URL: http://dbtbs.hgc.jp 4.1.2 Tools 4.1.2.1 BioProspector This tool is used for the discovery of conserved DNA regulatory sequence motifs in upstream regions of co-expressed genes. It is based on a C program with a Gibbs sampling strategy. It uses Markov background to model the base dependencies of nonmotif bases, which greatly improved the speciﬁcity of

244

the reported motifs. A new motif scoring function is adopted to allow each input sequences to contain zero to multiple copies of the motif. In addition, BioProspector can model gapped motifs and motifs with palindromic patterns, which are prevalent motif patterns in prokaryotes (Liu et al., 2001). URL: http://ai.stanford.edu/∼xsliu/BioProspector/ 4.1.2.2 MDscan(Motif Discovery scan) MDscan is used to ﬁnd out conserved transcription factor binding sites among upstream sequences from chromatin immune precipitation experiments (CHIP) as well as from single microarray experiments. The used algorithm is strictly deterministic. Hence, it always gives the globally best conserved motif and do not discover the local minima. It is faster than most other algorithms and nearly as sensitive as BioProspector (Liu et al., 2002). URL: http://ai.stanford.edu/∼xsliu/cgibin/MDsearch.cgi 4.1.2.3 YMF (Yeast Motif Finder) YMF is a program that detects statistically overrepresented motifs in DNA sequences. YMF uses enumerative search algorithm that matches the speciﬁed characteristics, scoring each motif for its signiﬁcance, and results the top motifs sorted by their signiﬁcance. The signiﬁcance of a motif is measured by the Z-score of its count in the input sequences (Sinha et al., 2000; Sinha et al., 2002). URL: http://bio.cs.washington.edu/software.html 4.1.2.4 Motif Cluster Alignment and Search Tool (MCAST) MCAST identiﬁes the statistically signiﬁcant clusters of non-overlapping “hits”(match) to the motifs in a query sequence from database search. The maximum allowed distance between the hits in a match is to be speciﬁed by the user. Two hits separated by more than the maximum allowed gap are reported in separate matches. Each match between the query and the subject sequences is assigned an E-value, and matches scoring below an E-value cut-oﬀ are displayed in order of increasing E-value. Bailey and Noble (2003) showed the predictive accuracy possible when MCAST is used to search for cis-regulatory modules in Drosophila and human sequences. URL: http://genome.jouy.inra.fr/doc/genome/analyse-desequences/mast/mcast.html 4.1.2.5 Weeder Weeder uses an algorithm for the automatic conserved motif discovery in a set of related regulatory DNA sequences. The identiﬁed motifs are in turn likely to be instances of binding sites for some transcription factor. The interface automatically starts diﬀerent runs of the program, each one with diﬀerent parameters, and provides an overall summary of the results to the user with suggestions on the statistical signiﬁcance of the resulted motifs (Pavesi et al., 2004). URL: http://www.pesolelab.it/

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

4.1.2.6 Improbizer It searches for motifs in DNA or RNA sequences occurring with improbable frequency (just by chance) by using a variation of the expectation maximization (EM) algorithm. It accepts either one sequence per line or in FA format. The time consumption for running Improbizer is directly proportional to the input sequence length and also depends on the advanced settings chosen. Improbizer is available as a Web based CGI program as well as command line based executable. The executable for the Improbizer is called “ameme” (Ao et al., 2004). URL: http:// users.soe.ucsc.edu/∼kent/improbizer/improbizer.html 4.1.2.7 PhyloGibbs PhyloGibbs is used to identify the binding site motifs for transcription factors in cis-regulatory sequences of DNA, based on the Gibbs sampling algorithm with the following augmentations: (i) it scientiﬁcally accounts for non-functional conservation due to phylogeny for the sequences of closely related species and accordingly modiﬁes the scoring. (ii) It circumvents the problems of estimating the number of motifs in the sequence, and of evaluating the signiﬁcance of identiﬁed motifs, by using a two-stage motifﬁnding strategy: a) “simulated annealing” phase that ﬁnds a few high-quality groups of binding sites representing a few diﬀerent motifs, and b) “tracking” phase keeps statistics on measures of togetherness of these groups and the other sites get co-clustered with these groups. The output consists of a list of putative binding sites and the fraction of time they were co-clustered in that group (Siddharthan et al., 2005). URL: http://www.phylogibbs.unibas.ch/cgibin/phylogibbs.pl 4.1.2.8 Self-Organizing Map for Biological Regulatory Element Recognition and Ordering (SOMBRERO) SOMBRERO is a SOM-based motif identiﬁer which ﬁnds regulatory binding sites as overrepresented motifs in a DNA sequence set. It simultaneously characterizes a complete set of motifs for a given dataset. It also helps in separating weak motif signals from large datasets and improves motif detection performance in real biological datasets. The extension of SOMBRERO was initialised using a previously trained SOM on a set of known transcription factor binding matrices. Using such prior knowledge for initializing SOMBRERO considerably improves the accuracy, provided (i) known motifs are found to be present in the input data set and (ii) the accuracy is not negatively aﬀected for the discovery of novel motifs. One of the specialities of SOMBRERO is that it can incorporate the entire transcription factor binding matrix databases as prior knowledge (Mahony et al., 2005). SOMBRERO v.1.1 with precompiled binaries for various operating systems is freely available for download.

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

It is released under the GNU General Public Licence. The source code for SOMBRERO and a parallelised version of SOMBRERO (written using Message Passing Interface) are also available for download. URL: http://bioinf.nuigalway.ie/sombrero/download.html 4.1.2.9 Motif finding using an UnSupervised Approach (MUSA) MUSA can be used either to autonomously ﬁnd over-represented complex motifs or to estimate search parameters for modern motif ﬁnders. This method is based on a bi-clustering algorithm which operates on a co-occurrence matrix of small motifs. The performance and accuracy of this method is independent of the composite structure of the motifs being sought, making few assumptions about their characteristics (Mendes et al., 2006). URL: http://www.yeastract.com/discoverer/musa.php 4.1.2.10 GIbbsMarkov with Significance ANalysis (GIMSAN) GIMSAN is a novel web-server tool for de novo motif discovery. It is a Hybrid Gibbs Sampler for motif ﬁnding which uses a Bayesian prior on the percentage of sequences containing sites. It uses a maximum likelihood approach on the motif matrix. It gives biologically realistic and reliable statistical signiﬁcance analysis using a 3-Gamma approximation scheme with factoring local sequence composition information and reports an approximate 95% conﬁdence interval of the point estimator of the motif p-value (Ng and Keich, 2006). URL: http://www.cs.cornell.edu/∼ppn3/gimsan/ 4.1.2.11 Suite for Computational identification Of Promoter Elements (SCOPE) SCOPE is a novel parameter-free method proposed by Carlson et al. (2007) for the de-novo identiﬁcation of potential regulatory motifs in sets of co-regulated genes. SCOPE identiﬁes the best candidate motifs from its three component algorithms using an ensemble learning approach. It identiﬁes motifs with a signiﬁcantly higher level of accuracy from experimentally determined datasets. As SCOPE has no modiﬁable parameters, the web server has a perceptive interface, which requires only a set of gene names or sequences in FASTA formats and a choice of species. The most signiﬁcant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. URL: http://genie.dartmouth.edu/scope/ 4.1.2.12 Binding-site Estimation Suite of Tools (BEST) BEST is a package of motif-ﬁnding programs, which includes 4 major programmes, viz., AlignACE, BioProspector, Consensus, and MEME. The optimization program BioOptimizer is conﬁgured for each of these programs. BEST is compiled on Linux, and thus it can only be run on Linux machines. Two compiled versions are provided by

245

BEST to satisfy diﬀerent linux versions (Che et al., 2007). URL: http://www.cs.uga.edu/∼che/BEST and http://www.fas.harvard.edu/∼junliu/BEST 4.1.2.13 WebMOTIFS The WebMOTIFS is an online tool for motif discovery, scoring, analysis, and visualization. It allows the user to combine diﬀerent programs, viz., Aligns Nucleic Acid Conserved Elements (AlignACE), Motif Discovery Scan (MDscan), MEME, and Weeder to search for DNA-sequence motifs, and evaluate the results. It is a combination of two tools, Tools for Analysis of MOtifs (TAMO) (Gordon et al., 2005) and THEME (MacIssac et al., 2006). TAMO uses a number of motif discovery programs viz. AlignACE, MDscan, MEME, Weeder on the input sequences, and then combines, analyzes, and clusters the results. THEME does Bayesian motif analysis, incorporating prior knowledge about plausible motifs (Romer et al., 2007). URL: http://fraenkel.mit.edu/webmotifs/form.html 4.1.2.14 Matlign (Matrix alignment) Sequence motifs representing transcription factor binding sites (TFBS) are normally encoded as position frequency matrices (PFM) or degenerate consensus sequences (CS). Matlign was developed to facilitate the post-processing of motifs, in both PFM and CS formats. These formats are used to represent the characterized TFBS proﬁles available in the transcription factor databases, as well as to represent the potential motifs predicted using computational methods. A combination of scoring functions is used to evaluate the similarity of motifs after aligning the sequences and the results are visualized by the hierarchical clustering. The alignment algorithm suitably aligns the motifs with an internal spacer by limiting the number of distinct gaps created. The best non-redundant motif set is selected from the merged repetitive motifs by cutting the hierarchical tree using silhouette values (Kankainen and Loytynoja, 2007). URL: http:// ekhidna.biocenter.helsinki.ﬁ/poxo/matlign/matlign 4.1.2.15 Coding Motif Identification Tool (COMIT) Coding nucleotide sequences include myriad functions independent of their encoded protein sequences. COMIT algorithm is used to detect functional noncoding motifs in coding regions from sequence conservation, explicitly separating nucleotide from amino acid eﬀects. COMIT concurs with various experimental datasets, including splicing enhancers, silencers, replication motifs, and microRNA targets, and predicts many novel functional motifs. Interestingly, COMIT scores are well-correlated to scores uncalibrated for amino acids, suggesting that nucleotide motifs often override peptide-level constraints (Kural et al., 2009). 4.1.2.16 (conserved) Evidence-Ranked Motif Identification Tool (cERMIT) cERMIT is used for analyzing the genome-wide

246

quantitative regulatory evidence, implementing an eﬃcient enumerative strategy for identifying cisregulatory elements to address the reformulation of the motif ﬁnding problem. It operates on a set of non-coding regulatory regions and their corresponding evidence of direct or indirect regulation. It utilizes the information across all the sequence regions to search for high-scoring motifs instead of pre-selecting appropriate candidate sequences. Georgiev et al. (2010) applied cERMIT on a range of direct binding and over-expression datasets, which substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets, and easily scales to mammalian ChIP-seq experiments with the data on thousands of non-coding regions. It was also validated extensively on curetted yeast datasets. The binary executable of cERMIT (compiled for Linux), the sample datasets, sample ChIP-chip, ChIP-seq, and microRNA datasets are available for download. URL: http://tools.genome.duke.edu/generegulation/ transcription/cERMIT/ 4.1.2.17 ModuleMaster It is a novel application for ﬁnding cis-regulatory modules (CRMs) in sets of co-expressed genes. It not only considers transcription factor binding information but also multivariate functional relationships between regulators and target genes to improve the identiﬁcation of CRMs. The program includes all necessary data and algorithms to perform every step to ﬁnd CRMs for a given result of a microarray and a subsequent clustering experiment. This workbench provides an easy-to-use GUI, together with job-processing and command-line options. The detected CRMs can be visualized and evaluated in various ways, i.e., generating GraphML (an XML-based ﬁle format for graphs), R-based whole regulatory Network visualizations and generating Systems Biology Markup Language (SBML) ﬁles for subsequent analytical processing and dynamic modeling. The tool is freely available to academic users as a webstart application including all comprehensive documentation (Wrzodek et al., 2010). URL: http://www.ra.cs.unituebingen.de/software/ModuleMaster/ 4.1.2.18 Automated Motif Discovery (AMD) Tool Using Stepwise Refinement of Gapped Consensuses Recently, Shi et al. (2011) developed an opensource software called Automated Motif Discovery (AMD) Tool, which allows for eﬀective and eﬃcient recognition of motifs in various datasets. AMD can eﬃciently discover motifs up to 20 nucleotide bases length, the most common size of the currently known motifs. AMD could be applied to aligned sequences to identify conserved motifs, regardless of whether the motifs are long or short and gapped or continuous. AMD also identiﬁes overrepresented

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

motifs in a given input sequence set and compared with a background sequence set. Here, the method is divided into ﬁve sequential steps: core motif ﬁltering, degeneration, extension, reﬁnement and redundancy removal. The selected probable motifs in each step are subjected to the next step for further processing. Motif discovery in these clustered genes requires the consideration of multiple clusters simultaneously. URL: http://www.plosone.org/article/info%3Adoi% 2F10.1371%2Fjournal.pone.0024576 4.2 Databases and tools for protein motifs 4.2.1 Databases 4.2.1.1 BLOCKS The blocks for this database are made automatically by considering the strongly conserved regions in groups of proteins from the Prosite Database. However, the Prosite pattern for a protein group is not used to make the BLOCKS Database and the pattern may or may not be contained in one of the blocks representing a group. These blocks are then calibrated against the SWISSPROT database to obtain a measure of the chance distribution of matches, and these calibrated blocks that build up the BLOCKS Database (Pietrokovski et al., 1996). URL: http://blocks.fhcrc.org/ 4.2.1.2 The Eukaryotic Linear Motif (ELM) server ELM is an online bio-computational server for the prediction of putative functional sites using speciﬁc patterns in eukaryotic proteins. It applies context-based rules and logical ﬁlters to reduce the amount of false positives. Here, the positionally conserved motifs are identiﬁed if known ELM instances and predictions of the sequences are found similar to ELM instance sequences. Functional sites which ﬁt to the description of linear motif are currently speciﬁed as patterns using Regular Expression rules. Context-based rules and logical ﬁlters are being developed to improve the predictive power and reduce the amount of false positives. It is currently the largest collection of linear motif classes with annotated and experimentally validated linear motif instances. The ELM structure ﬁlter aids users to assess the candidate motifs present in globular structural regions (Puntervoll et al., 2003). URL: http://www.elm.eu.orgn 4.2.1.3 PRINTS ‘PRINTS’ is a compendium of protein ﬁngerprints. A ﬁngerprint is referred as a group of conserved motifs used for the characterization of a protein family. The diagnostic power of PRINTS is reﬁned by the iterative scanning of a SWISS-PROT/TrEMBL composite. Generally the motifs do not overlap, but are separated along a sequence, in spite of being contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more ﬂexibly and powerfully than a single motif because the full diagnostic potency is deriving from the mutual context supplied by the motif

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

neighbors in a ﬁngerprint (Attwood et al., 2004). URL: http://www.bioinf.manchester.ac.uk/dbbrowser/ PRINTS/index.php 4.2.1.4 PROSITE PROSITE database consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and proﬁles to identify them. It includes a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It contains biologically signiﬁcant sites and patterns that help in the identiﬁcation and characterization of an unknown protein sequence. The signatures (patterns and proﬁles) of PROSITE have complete documentation that provides the background information on the structure and function of these proteins (Hulo et al., 2006). URL: http://prosite.expasy.org/scanprosite 4.2.1.5 Pfam (Protein Families) The Pfam is one of the important biological databases for protein classiﬁcation. The database categorises 75 % of known proteins to form a library of protein. Each entry in this database includes a protein sequence alignment as well as an accompanying HMM. Pfam allows users to analyse their sequence data and ﬁnd out the functional domain and motifs in the protein. Besides, it contains the information on higher level groupings of related protein families (clans), which are related by similarity of sequence, structure or by a statistical analysis of their associated HMM. The latest version of the Pfam database contains approaching 12,000 curated protein families (Finn et al., 2010). URL: http://www.sanger.ac.uk/resources/databases/pfam.html 4.2.2 Tools 4.2.2.1 MINER It is a web enabled tool for phylogenetic motif (PM) identiﬁcation. PMs are fragments of a sequence that is conserved in the overall familial phylogeny. PMs have been shown to correspond to an extensive diversity of catalytic regions, substrate-binding sites and protein interfaces, predicting them as ideal functional sites. MINER takes multiple sequence alignment (MSA) as an input and it also aligns the unaligned sequences, using ClustalW (Thompson et al., 1994).The output of MINER provides an interactive interface for the analysis of PM sequences and structural visualization. An algorithm for a sliding sequence window is used in MINER to quantitatively evaluate the phylogenetic similarity between each sequence region and the whole sequence. Distance-based trees are calculated for the whole alignment and each window. Tree topology based phylogenetic similarity is calculated using the partition metric algorithm (Penny et al., 1985). The partition metric counts the number of topological diﬀerences between the two trees. Partition metric values are recast as Z-scores. Overlapping sequence windows, scor-

247

ing some preset phylogenetic similarity Z-score (PSZ) threshold are identiﬁed as PMs (La et al., 2005). URL: http://www.pmap.csupomona.edu/MINER/ 4.2.2.2 Motif Identification Neural Design (MOTIFIND) The tool was developed for rapid and sensitive protein family identiﬁcation as an extension of gene classiﬁcation artiﬁcial neural system. It employs new designs to enhance the detection of distant relationships. The accuracy of MOTIFIND is comparable to the BLAST database search method, but its speed is 20 times faster than BLAST search. It also compares favorably with BLocks IMProved Searcher (BLIMPS)(Wallace and Henikoﬀ, 1992), the HMM and PROFILESEARCH (Gribskov et al., 1990) for discovering fragmentary sequences lacking complete motif regions and for identifying distant relationships, especially for the members of under-represented subgroups within a family (Wu et al., 1996). URL: http://diana.uthct.edu/ 4.2.2.3 ChloroP ChloroP is a neural Network based method to identify chloroplast transit peptides (cTP) and their corresponding cleavage sites. The performance accuracy of this tool is near about the only publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-ﬁnding algorithm. ChloroP method is useful for the identiﬁcation of putative transit peptides in genome-wide sequence data (Emanuelsson et al., 1999). The stand-alone version of ChloroP, ChloroP 1.1 software package is available with the same functionality as the online server of ChloroP. Ready-to-ship packages of the tool are compiled in UNIX platforms. It is free for the academic users and commercially available at CBS Software Package Manager. URL: http://www.cbs.dtu.dk/services/ChloroP/ 4.2.2.4 InterPro Scan InterProScan is a motif discovery tool which scans a given sequence of a protein against the protein signatures of the InterPro member databases. InterPro is an EMBL-EBI integrated database of predictive protein “signatures” (Hunter et al., 2009). InterProScan combines diﬀerent protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation. The number of signature databases and their associated scanning tools as well as the further reﬁnement procedures make the problem complex (Zdobnov and Apweiler, 2001). InterProScan is designed to be a scalable and extensible system with a robust internal architecture. The Perl-based InterProScan implementation is available from the EBI ftp server ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/. URL: http://www.ebi.ac.uk/ Tools/InterProScan 4.2.2.5 Motif Scan Motif Scan is an open source motif scanner available

248

in MyHits (http://myhits.isb-sib.ch) database. MyHits is a free database developed for protein domains and includes many tools for the investigation of the relationships between protein sequences and their motifs. These motifs are deﬁned by a heterogeneous collection of predictors, including regular expressions, generalized proﬁles and HMMs. Motif Scan ﬁnds all known motifs that occur in an input protein sequence from the well known motif sources such as PeroxiBase, HAMAP, PROSITE and Pfam (Pagni et al., 2007). URL: http://myhits.isb-sib.ch/cgi-bin/motif scan 4.3 Other important motif discovery tools This section includes the tools meant for both DNA and Protein motif discovery. Besides, here a Network motif discovery tool is also discussed. 4.3.1 Multiple Expectation Maximization for Motif Elicitation (MEME) MEME is one of the most commonly used tools for searching novel ‘signals’ in sets of biological sequences. It is applied for the discovery of new transcription factor binding sites and protein domains. MEME searches for repeated, ungapped sequence patterns occurring in the DNA or protein sequences supplied by the user. MEME searches can be performed at the web server hosted by the National Biomedical Computation Resource (Bailey et al., 1994). URL: http://meme.nbcr.net 4.3.2 Gibbs Motif Sampler The Gibbs Motif Sampler is a Bayesian model based motif discovery tool that allows the user to identify motifs and conserved regions in both DNA and protein sequences. It is a compendium of several analysis tools all developed from a basic algorithm called Gibbs sampling method (Neuwald et al., 1995). URL: http://bayesweb.wadsworth.org/gibbs/gibbs.html 4.3.3 gn It is both a tool and a domain speciﬁc language for motif discovery in inter-species gene Networks. gn identiﬁes motifs by comparing the input Network with a series of randomly generated networks having similar properties but it does not support Network randomization. The properties like, edge weights speciﬁc to the kind of inter-genome analysis or restricting the set of nodes from particular organisms are easy to specify with gn. The software is developed on Linux systems. It requires recent versions of flex and bison to generate its frontend. gn does all its work by querying data stored in a PostgreSQL database. To query, gene Network data must be ﬂattened into gn’s prescribed relational schema. It is an opensource software and licensed under the GNU General Public License, version 3. URL: http://www.cs.unh.edu/∼tfogal/gn/ A list of all the tools with approaches, principle, motif type, purpose, category, webpage and references, discussed in this study is presented in Table 1.

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

5 Discussion As biological data sets grow exponentially, tools for extraction of biologically relevant motifs have become more useful. Several types of motifs like regulatory motifs, transcription factor binding motifs, active site motifs, species speciﬁc and function speciﬁc motifs, Network motifs and highly conserved non coding motifs in a gene are vital from biological point of view. Motifs representing a particular functional class or family help in the annotation and characterization of novel nucleotide or protein sequences. Though, the discovery of such motifs is being done by many approaches, the enumerative and statistical approaches based on AI techniques are highly precise and used with a broad range of applicability. The Enumerative algorithms viz. EC/EA, SOM and PALS are suitable for ﬁnding an optimal solution for ﬁnding the relatively short motifs. However, they do not scale to larger motifs and continuous motif models. The Statistical approaches are best for ﬁnding continuous motifs by optimizing the parameters of the sequence data and using EM and Gibbs sampling algorithms. The EAs are based on the evolutionary search process and perform global search, which is ﬂexibly represented, concise and not inﬂuenced by speciﬁc heuristics. They are suitable for DNA regulatory motif discovery. An implementation of EA as SPEA2 further modiﬁed as MAGMA is a faster application in terms of approximating a set of optimal tradeoﬀ solutions for the large scale problems. GAs, a subtype of EAs has a great advantage over the preexisting ECs in terms of handling the problems related with long motifs and multiple similar motifs in a single sequence. These problems may be more easily tackled by the GPs, an improved extension of GAs. The SOMs are also important for ﬁnding DNA binding Motifs and familiar binding proﬁles but restricted to identiﬁcation of two indel (insertions and/or deletions) combinations. However, these algorithms are cost eﬀective and highly reliable. The PALS have high sensitivity and speciﬁcity and are applicable to derive the patterns closer to the known consensus patterns for real sequences. The statistical approaches viz. Clustering, Fuzzy Logic, HMM, EM and PWMs are widely used to solve the pattern recognition problems. Clustering is a good method for identiﬁcation and characterization of coexpressed genes based on the expression patterns where as Fuzzy models have the advantage over the other standard mathematical models in terms of capturing the imprecision and uncertainty in pattern recognition problems. HMMs can be used for motif discovery in DNA sequence in a user friendly way as it learn parameters from the input set of observations. PWMs calculate the positional conservedness of a residue and it also computes the threshold for the likelihood of a

Interdiscip Sci Comput Life Sci (2012) 4: 239–255 Table 1

249

List of tools for motif discovery arranged by type of motif Type

Name

Approaches

Principle

of

Tool/

Purpose

Database

Motif

Webpage

Reference

AMD

Enumerative

PWM and Clustering

DNA

Transcription factor binding site, Cisregulatory elements

Tool

http://www.plosone.org/ article/info%3Adoi% 2F10.1371%2Fjournal. pone.0024576

Shi et al., 2011

BioProspector

Statistical

Gibbs sampling

DNA

Regulatory sequence motifs

Tool

http://ai.stanford.edu/ ∼xsliu/BioProspector/

Liu et al., 2001

COMIT

Statistical

COMIT algorithm

DNA

Functional noncoding motif in coding region

Tool

http://www.genomebiology. Kural et com/2009/10/11/R133 al., 2009

DBTBS

Statistical

Clustering

DNA

TFBS motifs in upstream intergenic regions.

Database http://dbtbs.hgc.jp

Sierro et al., 2008

GIMSAN

Statistical

Gibbs sampling

DNA

De novo motif discovery

Tool

Ng and Keich, 2006

JASPAR

Statistical

PWM

DNA

Transcription factor binding sites

Database http://jaspar.cgb.ki.se

Sandelin et al., 2004

MATLIGN

Statistical

PWM

DNA

Transcription factor binding site

Tool

http://ekhidna.biocenter. helsinki.ﬁ/poxo/matlign/ matlign

Kankainen and Loytynoja, 2007

MCAST

Statistical

HMM

DNA

Detecting occurrences of regulatory modules in genomic DNA.

Tool

http://genome.jouy.inra. fr/doc/genome/analysedesequences/mast/mcast. html

Bailey and Noble, 2003

MDscan

Statistical

Bayesian statistical model

DNA

Transcription factor binding sites among upstream sequences

Tool

http://ai.stanford.edu/ ∼xsliu/cgi-bin/ MDsearch.cgi

Liu et al., 2002

Module Master

Statistical

Clustering and PWM

DNA

Cis-regulatory modules and TFBS

Tool

http://www.ra.cs.unituebingen.de/software/ ModuleMaster/

Wrzodek et al., 2010

MUSA

Statistical

Clustering algorithm

DNA

Identiﬁcation of de novo binding site consensus sequences

Tool

http://www.yeastract. com/discoverer/musa. php

Mendes et al., 2006

PhyloGibbs

Statistical

Gibbs sampling

DNA

Identify the binding site motifs for transcription factors in cis-regulatory sequences of DNA

Tool

http://www.phylogibbs. unibas.ch/cgi-bin/ phylogibbs.pl

Siddharthan et al., 2005

SCOPE

Statistical

PWM

DNA

Transcription factor binding sites

Tool

http://genie.dartmouth. edu/scope/

Carlson et al., 2007

SOMBRERO

Enumerative

SOM

DNA

Regulatory sites

Tool

http://bioinf.nuigalway. ie/sombrero/download. html

Mahony et al., 2005

TRANSFAC

Statistical

PWM

DNA

Transcription factors

Database http://www.generegulation.com/ pub/databases.html

Matys et al., 2005

WebMOTIFS

Statistical

Clustering

DNA

Identiﬁcation of regulatory sequence motif

Tool

http://fraenkel.mit.edu/ webmotifs/form.html

Romer et al., 2007

Weeder

Enumerative

Weeder algorithm

DNA

Searches for overrepresented nucleotide patterns

Tool

http://www.pesolelab.it/

Pavesi et al., 2004

binding

http://www.cs.cornell.edu/ ∼ppn3/gimsan/

250

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

Continued Type Name

Approaches

Principle

of

Tool/

Purpose

Database

Motif

Webpage

Reference

YMF

Enumerative

YMF algorithm

DNA

Detects statistically overrepresented motifs

Tool

http://bio.cs.washington. edu/software.html

Sinha et al., 2000; Sinha et al., 2002

cERMIT

Statistical

Clustering

DNA and RNA

Cis-regulatory ments

ele-

Tool

http://tools.genome. duke.edu/generegulation/ transcription/cERMIT/

Georgiev et al., 2010

Improbizer

Statistical

Expectation maximization (EM)

DNA and RNA

Finds motifs in DNA or RNA sequences occurring with improbable frequency

Tool

http://users.soe.ucsc.edu/ ∼kent/improbizer/ improbizer.html

Ao et al., 2004

BEST

Uses AlignACE, BioProspector, CONSENSUS and MEME based on users need

DNA

Binding site prediction

Tool

http://www.fas.harvard. edu/∼junliu/BEST

Che et al., 2007

BLOCKS

Statistical

Gibbs Sampling

Protein Collection of patterns or proﬁles

Database http://blocks.fhcrc.org/

Pietrokovski et al., 1996

ChloroP

Statistical

ANN

Protein Chloroplast Transit Peptides (cTP) and their corresponding cleavage sites.

Tool

Emanuelsson et al., 1999

ELM

Enumerative

Regular Expression (EA based)

Protein Predicts putative functional sites

Database http://www.elm.eu.orgn

Puntervoll et al., 2003

InterProScan

Enumerative

Sequence Retrieval System (SRS) based algorithm

Protein Protein signatures

Database http://www.ebi.ac.uk/ Tools/InterProScan

Zdobnov and Apweiler, 2001

MINER

Enumerative

Partition metric algorithm

Protein Phylogenetic motif (PM) identiﬁcation

Tool

La et al., 2005

Motif Scan

Statistical

HMM

Protein Protein motif identiﬁcation

Database http://myhits.isbsib.ch/cgibin/motif scan

MOTIFIND

Statistical

ANN

Protein Rapid and sensitive protein family identiﬁcation

Tool and database

Pfam

Statistical

HMM

Protein Protein classiﬁcation

Database http://www.sanger.ac.uk /resources/ databases/ pfam.html

Finn et al., 2010

PRINTS

Statistical

Probabilistic approach

Protein Protein ﬁngerprints

Database http://www.bioinf. manchester.ac.uk/ dbbrowser/PRINTS/ index.php

Attwood et al., 2004

PROSITE

Enumerative

Regular Expression (EA based)

Protein Collection of patterns or proﬁles

Database http://prosite.expasy.org /scanprosite

Hulo et al., 2006

http://www.cbs.dtu.dk/ services/ChloroP/

http://www.pmap. csupomona.edu/ MINER/

http://diana.uthct.edu/

Pagni et al., 2007 Wu et al., 1995

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

251 Continued

Type Name

Approaches

Principle

of

Tool/

Purpose

Webpage

Reference

Database

Motif Gibbs Motif Sampler

Statistical

Gibbs sampling

DNA and Protein

Identify motifs, conserved regions,

Tool

http://bayesweb.wadsworth.org/gibbs/gibbs. html

Neuwald et al., 1996

MEME

Statistical

EM

DNA and Protein

Transcription factor binding sites and protein domains

Tool

http://meme.nbcr.net

Bailey et al., 1994

gn

Enumerative

Subgraph Enumeration

Gene Network

Transcriptional regulation Network

Tool

http://www.cs.unh.edu/∼tfogal/gn/

predicted motif to be an observed motif. These scores are used to ﬁnd a consensus for a particular class of motifs as a characterization feature. Though several tools are reviewed above, there exists certain tools speciﬁc to identify a particular class of motifs. SCOPE is speciﬁc for the identiﬁcation of potential regulatory motifs in sets of co-regulated genes. Bioprospector is a DNA motif discovery tool dynamically designed for identiﬁcation of regulatory motifs. cERMIT is an eﬃcient enumerative strategy for identifying genome wide cis-regulatory elements. Besides, MCAST, ModuleMaster, PhyloGibbs are also well-known tools for the identiﬁcation of cis-regulatory motifs. For the identiﬁcation of the overrepresented motifs the suitable tools are MUSA, SOMBRERO, AMD, Weeder and YMF. In particular, SOMBRERO distinguishes the weak motifs form the strong motifs while AMD works independent of motif length (long or short) and nature (gapped or continuous). Two major DNA motif databases speciﬁc for TFBS are JASPAR and TRANSFAC whereas MEME, Matlign, PhyloGibbs, MDScan are eﬃcient tools for the identiﬁcation of the novel TFBS motifs. Though these tools are based on diﬀerent algorithms or models such as EM, PWM, Gibbs Sampling and Bayesian model, they all meant for TFBS motif discovery. Another tool based on Gibbs sampling is GIMSAN, which gives biologically realistic and reliable and statistically signiﬁcant motifs. COMIT is used to detect functional non-coding motifs in coding regions from sequence conservation. Improbizer is an eﬃcient tool used for identiﬁcation of nucleotide sequence motifs occurring with improbable frequency. DBTBS is a species speciﬁc DNA motif database of transcriptional regulation in Bascillus subtilis. Besides all these, BEST and WebMotifs are the two opensource, web-enabled software packages helps in motif ﬁnding using a combination of diﬀerent algorithms. In the context of protein motifs, PROSITE, Pfam

and BLOCKS are well known and most commonly used databases. Motif Scan is an HMM based motif ﬁnder which uses the available motif sources such as PeroxiBase, HAMAP, PROSITE and Pfam for protein motif identiﬁcation. Similarly, InterProScan is another motif scanner which uses protein signatures of InterPro database. MINER is a suitable tool based on MSA for the identiﬁcation of phylogenetic motifs. For discovery of the putative functional sites in eukaryotic proteins, ELM web server may be used. ChloroP has the speciality of identifying chloroplast Transit Peptides (cTP) and their corresponding cleavage sites. MOTIFIND, an ANN based faster program may be used for the identiﬁcation of rapid and sensitive protein family. PRINTS is well known database for identiﬁcation of a group of conserved motifs used in the characterization of proteins, as ﬁnger prints. MEME and Gibbs Motif Sampler are two widely used tools for both DNA and Protein motif identiﬁcation. Interestingly, gn is a tool as well as a domain speciﬁc language which helps in discovering the motifs in interspecies gene networks but it does not support Network randomization. In addition to the above discussed merits and speciﬁcity of the approaches/ algorithms, a comparative analysis of the advantages and limitations of most commonly used algorithms is given in Table 2 that can guide the reader to use a suitable motif ﬁnder of his interest. It is evident from the literature reviewed above that Artiﬁcial Intelligence techniques can be successfully applied to discover motifs. However, Evolutionary Computation (EC) approaches remain a niche activity within this ﬁeld, with practicing biologists preferring to use tried-and-tested approaches based around dynamic programming, statistical techniques and neural networks. The tools and databases reviewed above provide a guideline to the researchers and scientists for their use in motif discovery.

252

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

Table 2

List of major algorithms discussed in this study with their advantages and limitations

Algorithms

Approaches

Advantages

Limitations

References

EA

Enumerative

Non exhaustive and unbiased global search. Representational ﬂexibility. Independent of the way of solution derivation and scoring.

Do not scale to larger motifs and continuous motif models due to combinatorial growth in the search space.

Lones and Tyrell, 2005

SOM

Enumerative

Visualization of low-dimensional views of high-dimensional data. Find motifs with up to two insertions, deletions or their combinations. More accurate and costs less computation than existing algorithms.

Relative inﬂexibility while mapping sequence data to inputs. Diﬃculty of interpreting neural models, particularly when hidden layers are present within the Network.

Liu et al., 2005

PALS

Enumerative

High sensitivity and speciﬁcity. Eﬀective in deriving patterns close to the known consensus patterns for real sequences. Have small space complexities.

Slower for the datasets having number of sequences more than 1000 and sequence length more than 1000.

Ning and Leong, 2006

Clustering

Statistical

Automatically assign genotypes to distinct groups. Consensus of a particular family helps to identify a recognizable pattern as a functional motif.

Diﬃcult to know which distance metric to choose, especially for structured data such as images or sequences. Hard to access the reliability of a cluster, to compare to other models, to make predictions and cluster new data into an existing hierarchy.

Johnson and Wichern, 2007; Heller et al., 2005

Fuzzy Logic

Statistical

Quantify imprecision and uncertainty that is not easily captured by standard mathematical models, particularly in pattern recognition problems. Genomic sequence comparison

Lacks the capability of machine learning as well as a neural Network-type memory and pattern recognition. Veriﬁcation and validation of fuzzy based system is an expensive aﬀair. Diﬃcult to determine exact fuzzy rules and membership functions Stability is an important concern for fuzzy control.

Akerkar et al., 2009

HMM

Statistical

Ability to learn parameters directly from a set of observations. Freeing the user from having to set the parameters a priori.

Constant statistics within HMM state. Conditional independence of observations given the state sequence, without increasing the number of model parameters.

Zen et al., 2004

PWM

Statistical

Versatility to extract motifs from a wide range of data sets which lead to discovery of novel biologically relevant protein motifs directly from a raw protein

Assumes independence of positions within the binding motif. Predictions based on PWMs are usually not very speciﬁc to known functional sites.

Siddharthan et al., 2010

Gibbs Sampling

Statistical

Easier to implement Less dependent on initial parameters More versatile, easier to enhance with heuristics

More dependent on all sequences to exhibit the motif Less systematic search of initial parameter space Repeat DNAs can be confused as motifs

Krishnan, 2005

Expectation Maximization

Statistical

Simplicity and ease of implementation. Does not usually require heavy preparatory analytical work. Easy to program. Easily parallelized and its memory requirements tend to be modest compared to other methods. Numerically very stable.

Slow linear convergence in some cases. Increase the complexity of the implementation. Does not provide an estimate of the information matrix on the vector of parameters as a by-product.

Couvreur et al., 1996

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

Acknowledgements This study was supported by World Bank Funded – National Agricultural Innovation Project (NAIP), ICAR Grants NAIP/Comp4/C4/C-30033/2008-09 and 30(68)/2009/Bio Informatics/NAIP/ O&M.

References [1] Akerkar, R., Sajja, P. 2009. Fuzzy Logic, In Knowledge-Based Systems. Jones & Bartlett Learning, Burlington, USA. [2] Alon, U. 2006. An Introduction to Systems Biology: Design Principles of Biological Circuits. Boca Raton: CRC, FL, USA. [3] Ao, W., Gaudet, J., Kent, W.J., Muttumu, S., Mango, S.E. 2004. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305, 1743–1746. [4] Arredondo, T.V., Neelakanta, P.S., DeGroﬀ, D. 2005. Fuzzy Attributes of a DNA complex: Development of a fuzzy inference engine for codon-“junk” codon delineation. Artif Intell Med 35, 87–105.

253 [14] Couvreur, C. 1996. The EM Algorithm: A Guided Tour. In: Preprints of 2nd IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing (CMP’96), Pragues, Czech Rep., 115–120. [15] Cundari, T.R., Russo, M. 2001. Database mining using soft computing techniques. An integrated neural network-fuzzy logic-genetic algorithm approach. J Chem Inf Comp Sci 41, 281–287. [16] Emanuelsson, O., Nielsen, H., Heijne, G.V. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8, 978–984. [17] Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L., Eddy, S.R., Bateman, A. 2010. The Pfam protein families database. Nucl Acid Res 38, D211–D222. [18] Frith, M.C., Saunders, N.F.W., Kobe, B., Bailey, T.L. 2008. Discovering sequence motifs with arbitrary insertions and deletions. PLOS Comp Biol 4, e1000071. [19] Georgiev, S., Boyle, A.P., Jayasurya, K., Ding, X., Mukherjee, S., Ohler, U. 2010. Evidence-ranked motif identiﬁcation. Genome Biol 11, R19.

[5] Attwood, T.K., Bradley, P., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G. 2004. The PRINTS protein ﬁngerprint database: functional and evolutionary applications. In: Encyclopaedia of Genetics, Genomics, Proteomics & Bioinformatics. John Wiley and Sons, Chichester, UK.

[20] Gordon, D.B., Nekludova, L., McCallum, S., Fraenkel, E. 2005. TAMO: A ﬂexible, object-oriented framework for analyzing transcriptional regulation using DNAsequence motifs. Bioinformatics 21, 3164–3165.

[6] Bailey, T.L. 2007 Discovering sequence motifs. Method Mol Biol 395, 271–292.

[22] Heller, K.A., Ghahramani, Z. 2005. Randomized algorithms for fast bayesian hierarchical clustering. PASCAL Workshop on Statistics and Optimization of Clustering, Windsor, UK.

[7] Bailey, T.L., Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park California, 28–36.

[21] Gribskov, M., Luthy, R., Eisenberg, D. 1990 Proﬁle analysis. Method Enzymol 183, 146–159.

[23] Hubley, R.M., Zitzler, E., Roach, J.C. 2003. Evolutionary algorithms for the selection of single nucleotide polymorphisms. BMC Bioinformatics 4, 30.

[8] Bailey, T.L., Noble, W.S. 2003. Searching for statistically signiﬁcant regulatory modules. Oxford University Press, Seattle WA.

[24] Hudak, J., Mcclure, M.A. 1999. A comparative analysis of computational motif-detection methods. Paciﬁc Symposium on Biocomputing 4, 138–149.

[9] Blekas, K., Fotiadis, D.I., Likas, A. 2003. Greedy mixture learning for multiple motif discovery in biological sequences. Bioinformatics 19, 607–617.

[25] Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J.A. 2006. The PROSITE database. Nucl Acid Res 34, D227–D230.

[10] Carlson, J.M., Chakravarty, A., DeZiel, C.E., Gross, R.H. 2007. SCOPE: A web server for practical de novo motif discovery. Nucl Acid Res 35 (Suppl. 2), W259– W264. [11] Che, D., Jensen, S., Cai, L., Liu, J.S. 2005. BEST: Binding-site estimation suite of tools. Bioinformatics 21, 2909–2911. [12] Chudova, D., Smyth, P. 2002. Analysis of pattern discovery in sequences using a bayes error framework. Data Min Knowl Disc 7, 273–299. [13] Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. 2003. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci 100, 3339– 3344.

[26] Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J.A., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., Yeats, C. 2009. InterPro: The integrative protein signature database. Nucl Acid Res 37, 211–215. [27] Jacob, E., Sasikumar, R., Nair, K.N. 2005. A Fuzzy guided genetic algorithm for Operon Prediction. Bioinformatics 21, 1403–1407.

254 [28] Johnson, R.A., Wichern, D.W. 2007. Applied Multivariate Statistical Analysis, 6th Edition. Prentice Hall. Inc., New Jersey, USA. [29] Kankainen, M., Loytynoja, A. 2007. MATLIGN: A motif clustering, comparison and matching tool. BMC Bioinformatics 8, 189. [30] Kicinger, R., Arciszewski, T., De Jong, K.A. 2005. Evolutionary computation and structural design: A state of the art. Comput Struct 83, 23–24. [31] Kohonen, T., Honkela, T. 2007. Kohonen network. Scholarpedia 2, 1568. [32] Krishnan, A. 2005. CS262: Computational Genomics, Lecture 15. http://robotics.stanford.edu/∼ seraﬁm/cs262/Spring2003/Notes/ln14.pdf [33] Kural, D., Ding, Y., Wu, J., Korpi, A.M., Chuang, J.H. 2009. COMIT: Identiﬁcation of noncoding motifs under selection in coding sequences. Genome Biol 10, R133. [34] La, D., Livesay, D.R. 2005. MINER: Software for phylogenetic motif identiﬁcation. Nucl Acid Res 33, 267– 270. [35] Liu, J.S. 1994. The collapsed gibbs sampler in Bayesian computations with applications to a gene regulation problem. JASA 89, 958–966. [36] Liu, X., Brutlag, D.L., Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Paciﬁc Symposium on Biocomputing 6, 127–138. [37] Liu, X.S., Brutlag, D.L., Liu, J.S. 2002. An algorithm for ﬁnding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol 20, 835–839. [38] Liu, D., Xiong, X., Hou, Z.G., Dasgupta, B. 2005. Identiﬁcation of motifs with insertions and deletions in protein sequences using self-organizing neural networks. Neural Networks 18, 835–842. [39] Lones, M.A., Tyrell, A.M. 2005. The evolutionary computation approach to motif discovery in biological sequences. In: Rothlauf, F. (Ed.) Proceedings of GECCO Workshop Program, Workshop Biological Applications of Genetic and Evolutionary Computation, Washington, USA, 1–11. [40] Lones, M., Tyrell, A. 2007. Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE T Compt Biol Bioinformatics 4, 403–414. [41] MacIsaac, K.D., Gordon, D.B., Nekludova, L., Odom, D.T., Schreiber, J., Giﬀord, D.K., Young, R.A., Fraenkel, E. 2006. A hypothesis-based approach for identifying the binding speciﬁcity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22, 423–429. [42] Mahony, S., Benos, P.V., Smith, T.J., Golden, A. 2006. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Networks 19, 950– 962.

Interdiscip Sci Comput Life Sci (2012) 4: 239–255 [43] Mahony, S., Hendrix, D.V., Smith, T.J., Golden, A., Roshkar, D.S. 2005. Self-organizing maps of position weight matrices for motif discovery in biological sequences. Artif Intell Rev 24, 397–413. [44] Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E. 2006. TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucl Acid Res 34, D108–D110. [45] Mendes, N.D., Casimiro, A.C., Santos, P.M., Correia, I.S., Oliveira, A.L., Freitas, A.T. 2006. MUSA: A parameter free algorithm for the identiﬁcation of biologically signiﬁcant motifs. Bioinformatics 22, 2996–3002. [46] Narayanan, A., Keedwell, E., Olsson, B. 2003. Artiﬁcial intelligence techniques for bioinformatics. Appl Bioinformatics 1, 191–222. [47] Ng, P., Keichn, U. 2006. Apples to apples: Improving the performance of motif ﬁnders and their signiﬁcance analysis in the twilight zone. Bioinformatics 22, 393– 401. [48] Ning, K., Leong, H.W. 2006. Finding patterns in biological sequences by longest common subsequences and shortest common supersequences. In: Proceedings of the 6th IEEE Symposium and Bioengineering, Washington, USA, 53–60. [49] Pagni, M., Ioannidis, V., Cerutti, L., Zahn-Zabal, M., Jongeneel, C.V., Hau, J., Martin, O., Kuznetsov, D., Falquet, L. 2007. MyHits: Improvements to an interactive resource for analyzing protein sequences. Nucl Acid Res 35, 433–437. [50] Paul, T.K., Iba, H. 2006. Identiﬁcation of Weak Motifs in Multiple Biological Sequences Using Genetic Algorithm. Dept of Frontier Informatics, University of Tokyo, GECCO, Seattle, USA. [51] Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G. 2004. Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucl Acid Res 32, W199–W203. [52] Petsko, G.A., Ringe, D. 2004. Protein Structure and Function. New Science Press Ltd., London, UK. [53] Pietrokovski, S., Henikoﬀ, J.G., Henikoﬀ, S. 1996. The blocks database - a system for protein classiﬁcation. Nucl Acid Res 24, 197–200. [54] Puntervoll, P., Linding, R., Gem¨ und, C., ChabanisDavidson, S., Mattingsdal, M., Cameron, S., Martin, D.M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., K¨ uster, B., Helmer-Citterich, M., Hunter, W.N., Aasland, R., Gibson, T.J. 2003. ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucl Acid Res 31, 3625–3630. [55] Romer, K.A., Kayombya, G.R., Fraenkel, E. 2007. WebMOTIFS: Automated discovery, ﬁltering and scoring of DNA sequence motifs using multiple programs

Interdiscip Sci Comput Life Sci (2012) 4: 239–255

255

and Bayesian approaches. Nucl Acid Res 35, W217– W220.

[67] Torres, A., Nieto, J.J. 2003. The fuzzy polynucleotide space: Basic properties. Bioinformatics 19, 587–592.

[56] Sadegh-Zadeh, K. 2000. Fuzzy genomes. Artif Intell Med 18, 1–28.

[68] Wallace, J.C., Henikoﬀ, S. 1992. PATMAT: A searching and extraction program for sequence, pattern and block queries and databases. Comp Appl Biosci 8, 249– 254.

[57] Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W., Lenhard, B. 2004. JASPAR: An open-access database for eukaryotic transcription factor binding proﬁles. Nucl Acid Res 32, D91–D94. [58] SantaLucia, J., Hicks, D. 2004. The thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struc 33, 415–440. [59] Schwartz, D., Gygi, S.P. 2005. An iterative statistical approach to the identiﬁcation of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol 23, 1391–1398. [60] Shen-Orr, S.S., Milo, R., Mangan, S., Alon, U. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64–68. [61] Shi, J., Yang, W., Chen, M., Du, Y., Zhang, J., Wang, K. 2011. AMD, an automated motif discovery tool using stepwise reﬁnement of gapped consensuses. PLoS ONE 6, e24576. [62] Siddharthan, R. 2010. Dinucleotide weight matrices for predicting transcription factor binding sites: Generalizing the Position Wight Matrix. PLoS One 5, e9722. [63] Siddharthan, R., Siggia, E.D., Van-Nimwegen, E. 2005. PhyloGibbs: A Gibbs sampling motif ﬁnder that incorporates phylogeny. PLOS Comp Biol 1, e67. [64] Sierro, N., Makita, Y., de Hoon, M., Nakai, K. 2008. DBTBS: A database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucl Acid Res 36, D93–D96. [65] Sinha, S., Tompa, M. 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucl Acid Res 30, 5549–5560. [66] Stormo, G.D. 2000. DNA binding sites: Representation and discovery. Bioinformatics 16, 16–23.

[69] Wingender, E., Dietze, P., Karas, H., Knuppel, R. 1996. TRANSFAC: A database on transcription factors and their DNA binding sites. Nucl Acid Res 24, 238–241. [70] Wrzodek, C., Schroder, A., Drager, A., Wanke, D., Berendzen, K.W., Kronfeld, M., Harter, K., Zell, A. 2010. ModuleMaster: A new tool to decipher transcriptional regulatory networks. BioSystems 99, 79–81. [71] Wu, C.H., Zhao, S., Chen, H., Lo, C., McLarty, J. 1996. Motif identiﬁcation neural design for rapid and sensitive protein family search. Comp Appl Biosci 12, 109–118. [72] Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad, K., Lander, E.S., Kellis, M. 2005. Systematic discovery of regulatory motifs in human promoters and 3[prime] UTRs by comparison of several mammals. Nature 434, 338–345. [73] Xu, D., Bondugula, R., Popescu, M., Keller, J. 2006. Bioinformatics and fuzzy logic. In: Proceedings of the 15th IEEE International Conference on Fuzzy Systems, Vancouver Canada, 817–824. [74] Zdobnov, E.M., Apweiler, R. 2001. InterProScan an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848. [75] Zen, H., Tokuda, K., Kitamura, T. 2004. A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features. In: Proceedings of International Conference on Acoustics Speech and Signal Processing, Montreal, Canada, 837–840.

Computational approaches, databases and tools for in silico motif discovery

Recommend Documents