Interdiscip Sci Comput Life Sci (2012) 4: 239–255 DOI: 10.1007/s12539-012-0141-x
Computational Approaches, Databases and Tools for in silico Motif Discovery Tanmaya Kumar SAHU1 , A.R. RAO1∗ , Shuchi VASISHT2 , Nishtha SINGH1 , U.P. SINGH1 1
(Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi 110012, India) 2 (Delhi Institute of Advanced Studies, Guru Gobind Singh Indraprastha University, Sector-16 C, Dwarka, New Delhi 110075, India)
Received 17 October 2011 / Revised 12 April 2012 / Accepted 13 June 2012
Abstract: Motifs are the biologically significant fragments of nucleotide or peptide sequences in a specific pattern. Motifs are categorized as structural motifs and sequence motifs. These are discovered by phylogenetic studies of similar genes across species. Structural motifs are formed by three dimensional arrangements of amino acids consisting of two or more α helices or β strands whereas sequence motifs are formed by the nucleotide fragments appearing in the exons of a gene. The arrangement of residues in structural motifs may not be continuous while it is continuous in sequence motifs. Sequence motifs may encode to the structural motifs. The algorithms used for motif discovery are important part of the bio-computational studies. The purpose of motif discovery is to identify patterns in biopolymer (nucleotide or protein) sequences to understand the structure and function of the molecules and their evolutionary aspects. The main aim of this paper is to provide systematic compilation of a review on different approaches, databases and tools used in motif discovery. Key words: motif discovery, artificial intelligence, artificial neural networks, enumerative approaches, evolutionary algorithms, phylogeny, biopolymers.
1 Introduction Discovery of the regulatory motifs is an important topic in computational biology. Indeed, these motifs control the expression of genes. But direct experimental determination of these motifs is not practical or efficient in many biological systems (Conlon et al., 2003). The purpose of motif discovery is to discover patterns in bio-polymeric (nucleotide or peptide) sequences (Bailey et al., 2007) and connectivity-patterns in gene or protein interaction networks to understand and unravel the structure and function of the bio-molecules represented by these motifs. Algorithms meant for motif discovery are important for computational biologists to identify phylogeny and function of the gene products. It is well known that, most of the functional elements are not coding for proteins and are expected to include regulatory signals, genes for RNA and other structural elements. Although the largest and most conserved elements are readily identified, the vast majority of non-coding functional elements remain unknown (Xie et al., 2005), in particular, recognition of short motifs, such as the DNA binding sites for transcription ∗
Corresponding author. E-mail:
[email protected]
factors. Also, function prediction of unknown proteins is another challenging problem in functional genomics. Several computational approaches have been developed for detection of short sequences/motifs of amino acids that are conserved within a family of closely related protein sequences (Hudak and McClure, 1999). Conserved blocks/motifs in the sequences of a family can often highlight features responsible for structural similarity between proteins and can be used to predict the three dimensional structure of a protein as protein structures are vital from their functional view point. Motifs may also enclose powerful diagnostic features, generating rules for determining whether or not an unknown sequence belongs to a family and thus define a characteristic function for families (Blekas et al., 2003). Though the ultimate aim of motif discovery is to identify, classify and characterize an individual based on its bio-molecules or bio-molecular interaction networks, the motif structure varies based on the object considered for this purpose. DNA motifs generally function as regulatory elements in biological systems (e.g. transcription regulatory motifs) whereas protein motifs are generally the functional sites (e.g. catalytic and DNA binding sites). The DNA and protein motifs are the nucleotide or amino acid sequence based, while the network motifs are the species or family specific con-
240
nectivity patterns in the interaction (gene or protein) networks. Therefore, all the tools and algorithms for finding these motifs are different. DNA and Protein motif algorithms use different position based scoring systems or matrices whereas network motif algorithms are generally based on structural stability scoring systems. However exploration on the importance of these motifs, showed the necessity to provide a review on the approaches, tools and databases for identifying the sequence, structural and network motifs in the DNA, protein and their interaction networks respectively. Hence the aim of the paper is to provide an insight into the algorithms and approaches used for motif discovery and their applications in the development of online servers, offline tools and software. It also provides a brief review on the online databases for motif finding with the appropriate references and URLs.
2 Classification of motifs Motifs are generally classified into three major types, i.e. Sequence, Structure and Network motifs. Besides, motifs are also categorized based on the nature of bio molecules such as DNA, RNA and Protein motifs where the DNA and RNA motifs are often referred as nucleotide motifs and protein motifs as peptide motifs. It is to be noted that the sequence and structure motifs are the conserved patterns of either nucleotides or peptides but network motifs are conserved sub graphs of gene or protein interaction networks. 2.1 Sequence motifs Sequence motifs are nucleotide or peptide sequence patterns, presumed to have a biological significance. They are either DNA sequence motifs or protein sequence motifs. DNA motifs are short, recurring patterns often indicate sequence-specific binding sites for proteins like nucleases and transcription factors (TF). Few others are involved in many RNA level processes that include ribosome binding, mRNA processing and transcription termination. On the other hand, protein sequence motifs are signatures of protein families and often determine the function of proteins. These motifs are formed by the continuous amino acid sequences and often represents catalytic or active sites of metabolic enzymes and regulatory proteins. Protein sequence motifs can often be recognized by simple inspection of the amino-acid sequence of a protein, and when detected provide strong evidence for biochemical function. Sequence motifs explain and recognize the specific characteristics of DNA, RNA and protein sequences such as TF binding sites, splice junctions and protein-protein interaction sites (Martin et al., 2008). 2.2 Structure motifs In a protein or nucleic acid chain, a structural motif is recognised as a super-secondary structure. DNA structural motifs occur in unimolecular folded DNAs as
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
well as bimolecular hybridization. These structural motifs include a variety of loop motifs including hairpins, bulges, internal loops, and multi-branched loops (SantaLucia et al., 2004). In case of proteins, the structural motifs are formed by three dimensional arrangements of continuous or discontinuous amino acids and therefore distinguished from sequence motifs. These motifs often include loops of variable length and unspecified structure, which in effect create the “slack” necessary to bring together in space two elements that are not encoded by immediately adjacent DNA sequences in a gene. Certain structural motifs having a particular functional significance or having a contribution to an independently folded functional domain are referred to as functional motifs. For example, the helix-turn-helix motifs found in many DNA-binding proteins do not exist as a stably folded domain, if expressed separately from the functional domain. But when detected in a DNA binding domain of a protein, behaves like a candidate motif for the DNA binding proteins (Gregory et al., 2004). Beta hairpin, Greek key, Omega loop, Helixloop-helix, Zinc finger are the well known examples of such functional motifs. 2.3 Network motifs The network motifs are the connectivity-patterns (sub-graphs), which occur very commonly in specific types of gene and protein interaction networks. Each type of network seems to display its own set of characteristic motifs. Alon et al. (2006) first presented the concept of network motifs when such motifs were discovered in the gene regulation (transcription) network of the bacteria E. coli (Shen-Orr et al., 2002). Understanding the network motifs in gene regulatory networks is very important as these networks control the gene expression in response to biological signals. In the gene regulatory networks the genes represent the nodes, and directed edges represent the control of one gene by a transcription factor (regulatory protein that binds DNA) encoded by another gene. Thus, network motifs are patterns of genes regulating each other’s transcription rate. It is worth noted that the sequence motifs may or may not always be associated with structural and vice versa as two biopolymers may share the same motif without any considerable primary structure similarity and the existence of a sequence motif does not necessarily imply a distinctive structure.
3 Approaches for motif discovery The application of artificial intelligence (AI) in the field of bioinformatics is growing at an exponential rate. This is the area of computer science which especially deals with the problems through the use of heuristics and probabilistic approaches. AI approaches stand out when dealing with the problems having no require-
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
ment for a strong constraint but requirement of a weak constraint (Narayanan et al., 2003). Many problems in bioinformatics do not have strong constraints, thus there is plenty of scope for the application of AI techniques in Bioinformatics. In particular, these AI approaches can be divided into two categories: (i) enumerative approaches and (ii) statistical approaches. The former attempts to enumerate the search space of possible motifs and test each for significance, while the latter attempts to optimize the parameters of a motif model by statistical analysis of the sequence data. Enumerative algorithms find an optimal solution for the discrete representations of relatively short motifs, but do not scale to larger motifs and continuous models due to combinatorial growth in the search space (Lones et al., 2007). Statistical approach such as position frequency matrices is used for finding continuous motif models. The statistical approaches used in motif discovery tools are Expectation Maximization and Gibbs sampling algorithms. “Multiple Expectation Maximization for Motif Elicitation” (MEME) is perhaps a best known tool which uses the above approaches. 3.1 Enumerative approaches 3.1.1 Evolutionary computation approach/genetic algorithms Evolutionary Computation (EC) is a modern search technique which uses computational models of processes of evolution and selection. Concepts and mechanisms of Darwinian evolution and natural selection are encoded in evolutionary algorithms (EAs) and among computational paradigm, EC is now recognized as particularly appropriate for various traditional and novel computational applications in structural engineering (Kicinger et al., 2005). EAs are stochastic population based search algorithms that carry out evolutionary search process loosely modelled upon biological evolution. EAs require a user to define a cost function so that alternative solutions can be scored appropriately. Lones and Tyrell (2005) have highlighted one of the 3 characteristics of EA that make them suitable for motif discovery: global search that is neither exhaustive nor biased by specific heuristics, representational flexibility, and no dependence between the way in which solutions are derived and the way in which they are scored. Lones and Tyrell (2007) have described a novel EA for regulatory motif discovery in DNA promoter sequences by using data clustering to logically distribute the evolving population across the search space. Similarly, Hubley and Roach (2003) have implemented a modified version of the Strength-Pareto EA (SPEA2) in java i.e. Multiobjective Analyzer for Genetic Maker Acquisition (MAGMA) for the selection of Single Nucleotide Polymorphisms (SNPs) at motifs. This set is very useful for the design of large scale studies, involving disease identification, genetic mapping, population
241
studies and haplotype block elucidation. Broadly, EAs are sub-divided into: Evolution Strategies (ES), Evolutionary Programming (EP), and Genetic Algorithms (GAs). Evolution Strategies (ES) are probabilistic optimization techniques and EPs are meant for the evolution of a simple program to a very larger and highly reliable software system by making the code more robust. GAs are a family of computational models inspired by evolution which encode a potential solution to a specific problem on a simple chromosome like data structure and apply recombination operators to these structures so as to preserve critical information. These algorithms are often viewed as function optimizers being applied to a broad range of problems. However, in all the above EC methods, different methods of fitness calculation are used and only one motif per sequence is assumed. But, in a sequence, multiple similar motifs may exist, and identification of those motifs is equally important for identification of a single motif per sequence. Paul and Iba (2006) proposed a novel GA based method for identification of multiple motifs in each of the given sequences. This method can handle longer motifs and can identify multiple positions of the motif instances of a consensus motif. For a broader review of EC and structural design, the reader may refer to Kicinger et al. (2005). Another major EA developed recently is Genetic Programming (GP). GP is an extension of GA in which an initial population of computer programs composed of the primitive functions is generated and then each program in the population is executed and its fitness at solving the problem is evaluated. A new generation of programs is then created by applying reproduction and crossover operators. 3.1.2 Self Organizing Maps (SOM) SOM is a type of ANN, trained using unsupervised learning, to produce a low-dimensional and discretized representation of input space of the training samples. SOMs use a neighbour-hood function to conserve the topological properties of the input space. This makes SOM useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. Liu et al. (2005) used self-organizing neural networks for identifying motifs with insertions/deletions, by introducing a new definition for calculating the distance between a pair of subsequences. This algorithm can find motifs with up to two insertions, deletions or their combinations. The results showed that the performance of this algorithm is more accurate and costs less computation than existing algorithms. Self organizing neural networks have been applied to genome signature analysis, endogenous retrovirus clustering, various classification problems and efficient design of DNA microarrays for SNP analysis. Mahony et al. (2005) proposed an approach based on SOM where the output layer neuron weight vectors are replaced by Position Weight Matrices (PWMs). This
242
approach is used as an aid in overrepresented motif discovery as it characterises the features present in a set of sequences. Further, Mahony et al. (2006) evaluated the performance of three Self-organizing neural networks to solve the motif finding problem. Here, they integrated a SOM-based motif-finder, SOMBRERO, with a SOMbased method for automatic construction of generalized models for structurally related motifs and a selforganizing tree method to display the relationships between various motifs. Kohonen et al. (2007) described SOM and hence it is also known as Kohonen map. Selforganizing neural networks have wide applicability to the study of DNA-binding motifs and familiar binding profiles. Neural networks pose two significant problems when applied to motif discovery: (i) their relative inflexibility when mapping sequence data to inputs, (ii) the difficulty of interpreting neural models, particularly when hidden layers are present within the network. 3.1.3 Longest Common Subsequences (LCS) and their Shortest Common Supersequences (SCS) approach Under this approach, a set of related sequences is given as input for deriving patterns in a set of biological sequences and the goal is to find a set of patterns that are common to all or most of the sequences in the set. Ning and Leong (2006) proposed a PAtterns by LCS & SCS, (PALS) algorithm which discovers patterns in a biological sequences profile by first generating the results for LCS & SCS of sequences by heuristic method. Consequently, it derives patterns from the obtained results. PALS have high sensitivity & specificity, and they are effective in deriving patterns close to the known consensus patterns for real sequences. 3.2 Statistical approaches The successful applications of sophisticated statistical models in computational biology have been observed during the past decade. These statistical models have been used to find short repetitive pattern in a set of DNA or protein sequences based on either Probabilistic models or Bayesian models. 3.2.1 Clustering Clustering deals with the assignment of a set of observations into subsets or clusters, so that observations in the same cluster are similar in some as compared to the observations in the other clusters. A detailed description on different clustering methods and distance measures is given in Johnson and Wichern (2007). This method is based on unsupervised learning, and is a common technique used for the analysis of data in many fields like pattern recognition, image analysis and bioinformatics. As far as transcriptomics is concerned, clustering is used to build the groups of genes with related expression patterns which are called as co-expressed genes. Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated and expected
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
to have some common regulatory motifs. In sequence analysis, clustering is used to group homologous sequences into a particular gene family based on sequence similarity. In high-throughput genotyping platforms, clustering is used to automatically assign genotypes to distinct groups. Consensus of a particular family helps to identify a recognizable pattern as a functional motif. 3.2.2 Fuzzy Logic Approach Fuzzy set theory and fuzzy logic have been used in bioinformatics, but to much less extent than graph theory and machine learning. These are the ideal frameworks for describing some biological systems/objects and provide suitable computational methods for a broad spectrum of bioinformatics problems. Fuzzy models attempt to quantify imprecision and uncertainty that is not easily captured by standard mathematical models, particularly in pattern recognition problems. Arredondo et al. (2005) developed a fuzzy inference engine based on information theory for genomic sequences (DNA) to predict coding regions. Fuzzy approaches are also suitable to describe and compare molecular structures. Cundari et al. (2001) used fuzzy logic in the analysis of a database of small molecular structures. Sadegh-Zadeh et al. (2000) and Torres et al. (2003) used polynucleotides as fuzzy sets and introduced a means to measure dissimilitudes between them as points in a hypercube. They used this method as a tool to compare different genomic sequences. Jacob et al. (2005) used fuzzy scoring functions based on diverse biological information to predict operon, an important structure in bacterial genomes. Xu and Bondugula (2006) discussed the fuzzy concepts and methods in studying biological problem. They have presented two applications of fuzzy logic: one is fuzzy measurement of ontological similarity and its application in bioinformatics and the other is the application of fuzzy ‘k’ nearest neighbor algorithm in protein secondary structure prediction. 3.2.3 Hidden Markov Models (HMM) HMM is a statistical model where the system being modeled is assumed to be a Markov process with unobserved state. Normally, an HMM is considered as the simplest dynamic Bayesian Network. In a regular Markov model, the observer can directly visualize the state and therefore the state transition probabilities are the only parameters. In an HMM, the state is not directly visible, but output dependent on the state is visible. Each state has a probability distribution over the probable output tokens. Therefore the sequence of tokens generated by a HMM gives information on the sequence of states. Chudova et al. (2002) have derived the Bayes error rate for the problem of motif discovery in DNA sequences under a Markov assumption. They have used the first order Markov framework as their base model where there is a hidden layer for each posi-
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
tion in the pattern and a background state to model the background. They found that the detection of structured patterns in a Markov context presents a more difficult problem over the detection of random patterns. The advantage of using HMMs is in their ability to learn parameters directly from a set of observations, freeing the user from having to set the parameters a priori. 3.2.4
Position Weight Matrices (PWM)
PWM contains log odds weights for computing a match score. To check whether an input sequence matches the motif or not is based on a threshold. A PWM is designed from Position frequency matrix (PFM), which records the position-dependent frequency of each residue. ‘Systematic Evolution of Ligands by Exponential Enrichment (SELEX)’ experiments or computationally discovered tools such as MEME using Hidden Markov Models are used to determine the PFMs experimentally. Schwartz and Gygi (2005) presented an iterative statistical approach to identify protein phosphorylation motifs from large data sets of naturally occurring phosphorylation sites. They took two parallel sequence data sets for background probability calculations. These two data sets are converted into position-weight matrices where each matrix contained information on the frequency of all residues at 6 positions upstream and downstream of the phosphorylation site. They created a 3rd matrix, the binomial probability matrix using the phosphorylation matrix and background matrix (created from the background dataset). A greedy recursive search on the sequence space is then applied to extract a list of significant motifs. This algorithm has the versatility to extract motifs from a wide range of data sets and may lead to discovery of novel biologically relevant protein motifs directly from a raw protein. 3.2.5
Bayesian models
Bayesian models are based on the principles of Bayes theorem and used to capture significant patterns in genomic sequences and lead to more accurate motif prediction. The framework provides a basic infrastructure for hierarchical model dependence and for dealing with nuisance parameters without leading to over-whelming analytical complexity. The Bayesian concept allows a “learning” capability from “historical” data that can be used in modelling a new, but similar problem. Here, the priors can be obtained from partial information on known motifs thereby improving estimation of information on novel motifs. Gibbs sampling is commonly used as a means of Bayesian inference. This algorithm is used to generate a sequence of samples from the joint probability distribution of two or more random variables. It is applicable when the joint distribution is not known explicitly or is difficult to sample from directly, but the conditional distribution of each variable is known and is easy to sample from (Liu et al., 1994).
243
4 Databases and tools for motif finding Several algorithms have been developed for discovery of DNA, Protein and Network motifs. Also, there exist several motif databases with dynamic search facility through web interface. Some of the databases and tools meant for motif discovery are discussed below. 4.1 Databases and tools for DNA motifs 4.1.1 Databases 4.1.1.1 TRANSFAC (TRANScription FACtors) TRANSFAC database has mutual cross-references with Flybase, EMBL and SwissProt databases. In appropriate cases, FACTORS are linked to PIR, and most CLASS entries point to the PROSITE database (Wingender et al., 1996). It is a widely used DNA motif database which contains information on transcription factors, their experimentally proven binding sites, nucleotide distribution matrices and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices (Matys et al., 2005). URL: http://www.gene-regulation.com/pub/databases.html 4.1.1.2 JASPAR JASPAR is an open access database of annotated matrix based transcription factor binding site profiles for multi-cellular eukaryotes. The profiles were derived exclusively from the sets of nucleotide sequences those are experimentally demonstrated to bind transcription factors. The web interface of the database is dynamically designed for browsing, searching and subset selection of DNA motifs. Besides, an online sequence analysis utility with a suite of programming tools for genome wide and comparative genomic analysis of regulatory regions are also incorporated in the website (Sandelin et al., 2004). URL: http://jaspar.cgb.ki.se 4.1.1.3 DBTBS (Database of Transcriptional regulation in Bacillus subtils) This database is enriched with experimentally validated gene regulatory relations and the corresponding transcription factor binding sites in upstream regions of Bacillus subtilis genes. DBTBS is based on 947 references and has a collection of the information on 120 binding factors and 1475 gene regulatory relations. For each promoter, all of its known cis-elements are listed according to their positions, while these cis-elements are aligned to illustrate the consensus sequence for each transcription factor. All probable transcription factors coded in the genome were classified using Pfam motifs (Sierro et al., 2008). URL: http://dbtbs.hgc.jp 4.1.2 Tools 4.1.2.1 BioProspector This tool is used for the discovery of conserved DNA regulatory sequence motifs in upstream regions of co-expressed genes. It is based on a C program with a Gibbs sampling strategy. It uses Markov background to model the base dependencies of nonmotif bases, which greatly improved the specificity of
244
the reported motifs. A new motif scoring function is adopted to allow each input sequences to contain zero to multiple copies of the motif. In addition, BioProspector can model gapped motifs and motifs with palindromic patterns, which are prevalent motif patterns in prokaryotes (Liu et al., 2001). URL: http://ai.stanford.edu/∼xsliu/BioProspector/ 4.1.2.2 MDscan(Motif Discovery scan) MDscan is used to find out conserved transcription factor binding sites among upstream sequences from chromatin immune precipitation experiments (CHIP) as well as from single microarray experiments. The used algorithm is strictly deterministic. Hence, it always gives the globally best conserved motif and do not discover the local minima. It is faster than most other algorithms and nearly as sensitive as BioProspector (Liu et al., 2002). URL: http://ai.stanford.edu/∼xsliu/cgibin/MDsearch.cgi 4.1.2.3 YMF (Yeast Motif Finder) YMF is a program that detects statistically overrepresented motifs in DNA sequences. YMF uses enumerative search algorithm that matches the specified characteristics, scoring each motif for its significance, and results the top motifs sorted by their significance. The significance of a motif is measured by the Z-score of its count in the input sequences (Sinha et al., 2000; Sinha et al., 2002). URL: http://bio.cs.washington.edu/software.html 4.1.2.4 Motif Cluster Alignment and Search Tool (MCAST) MCAST identifies the statistically significant clusters of non-overlapping “hits”(match) to the motifs in a query sequence from database search. The maximum allowed distance between the hits in a match is to be specified by the user. Two hits separated by more than the maximum allowed gap are reported in separate matches. Each match between the query and the subject sequences is assigned an E-value, and matches scoring below an E-value cut-off are displayed in order of increasing E-value. Bailey and Noble (2003) showed the predictive accuracy possible when MCAST is used to search for cis-regulatory modules in Drosophila and human sequences. URL: http://genome.jouy.inra.fr/doc/genome/analyse-desequences/mast/mcast.html 4.1.2.5 Weeder Weeder uses an algorithm for the automatic conserved motif discovery in a set of related regulatory DNA sequences. The identified motifs are in turn likely to be instances of binding sites for some transcription factor. The interface automatically starts different runs of the program, each one with different parameters, and provides an overall summary of the results to the user with suggestions on the statistical significance of the resulted motifs (Pavesi et al., 2004). URL: http://www.pesolelab.it/
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
4.1.2.6 Improbizer It searches for motifs in DNA or RNA sequences occurring with improbable frequency (just by chance) by using a variation of the expectation maximization (EM) algorithm. It accepts either one sequence per line or in FA format. The time consumption for running Improbizer is directly proportional to the input sequence length and also depends on the advanced settings chosen. Improbizer is available as a Web based CGI program as well as command line based executable. The executable for the Improbizer is called “ameme” (Ao et al., 2004). URL: http:// users.soe.ucsc.edu/∼kent/improbizer/improbizer.html 4.1.2.7 PhyloGibbs PhyloGibbs is used to identify the binding site motifs for transcription factors in cis-regulatory sequences of DNA, based on the Gibbs sampling algorithm with the following augmentations: (i) it scientifically accounts for non-functional conservation due to phylogeny for the sequences of closely related species and accordingly modifies the scoring. (ii) It circumvents the problems of estimating the number of motifs in the sequence, and of evaluating the significance of identified motifs, by using a two-stage motiffinding strategy: a) “simulated annealing” phase that finds a few high-quality groups of binding sites representing a few different motifs, and b) “tracking” phase keeps statistics on measures of togetherness of these groups and the other sites get co-clustered with these groups. The output consists of a list of putative binding sites and the fraction of time they were co-clustered in that group (Siddharthan et al., 2005). URL: http://www.phylogibbs.unibas.ch/cgibin/phylogibbs.pl 4.1.2.8 Self-Organizing Map for Biological Regulatory Element Recognition and Ordering (SOMBRERO) SOMBRERO is a SOM-based motif identifier which finds regulatory binding sites as overrepresented motifs in a DNA sequence set. It simultaneously characterizes a complete set of motifs for a given dataset. It also helps in separating weak motif signals from large datasets and improves motif detection performance in real biological datasets. The extension of SOMBRERO was initialised using a previously trained SOM on a set of known transcription factor binding matrices. Using such prior knowledge for initializing SOMBRERO considerably improves the accuracy, provided (i) known motifs are found to be present in the input data set and (ii) the accuracy is not negatively affected for the discovery of novel motifs. One of the specialities of SOMBRERO is that it can incorporate the entire transcription factor binding matrix databases as prior knowledge (Mahony et al., 2005). SOMBRERO v.1.1 with precompiled binaries for various operating systems is freely available for download.
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
It is released under the GNU General Public Licence. The source code for SOMBRERO and a parallelised version of SOMBRERO (written using Message Passing Interface) are also available for download. URL: http://bioinf.nuigalway.ie/sombrero/download.html 4.1.2.9 Motif finding using an UnSupervised Approach (MUSA) MUSA can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method is based on a bi-clustering algorithm which operates on a co-occurrence matrix of small motifs. The performance and accuracy of this method is independent of the composite structure of the motifs being sought, making few assumptions about their characteristics (Mendes et al., 2006). URL: http://www.yeastract.com/discoverer/musa.php 4.1.2.10 GIbbsMarkov with Significance ANalysis (GIMSAN) GIMSAN is a novel web-server tool for de novo motif discovery. It is a Hybrid Gibbs Sampler for motif finding which uses a Bayesian prior on the percentage of sequences containing sites. It uses a maximum likelihood approach on the motif matrix. It gives biologically realistic and reliable statistical significance analysis using a 3-Gamma approximation scheme with factoring local sequence composition information and reports an approximate 95% confidence interval of the point estimator of the motif p-value (Ng and Keich, 2006). URL: http://www.cs.cornell.edu/∼ppn3/gimsan/ 4.1.2.11 Suite for Computational identification Of Promoter Elements (SCOPE) SCOPE is a novel parameter-free method proposed by Carlson et al. (2007) for the de-novo identification of potential regulatory motifs in sets of co-regulated genes. SCOPE identifies the best candidate motifs from its three component algorithms using an ensemble learning approach. It identifies motifs with a significantly higher level of accuracy from experimentally determined datasets. As SCOPE has no modifiable parameters, the web server has a perceptive interface, which requires only a set of gene names or sequences in FASTA formats and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. URL: http://genie.dartmouth.edu/scope/ 4.1.2.12 Binding-site Estimation Suite of Tools (BEST) BEST is a package of motif-finding programs, which includes 4 major programmes, viz., AlignACE, BioProspector, Consensus, and MEME. The optimization program BioOptimizer is configured for each of these programs. BEST is compiled on Linux, and thus it can only be run on Linux machines. Two compiled versions are provided by
245
BEST to satisfy different linux versions (Che et al., 2007). URL: http://www.cs.uga.edu/∼che/BEST and http://www.fas.harvard.edu/∼junliu/BEST 4.1.2.13 WebMOTIFS The WebMOTIFS is an online tool for motif discovery, scoring, analysis, and visualization. It allows the user to combine different programs, viz., Aligns Nucleic Acid Conserved Elements (AlignACE), Motif Discovery Scan (MDscan), MEME, and Weeder to search for DNA-sequence motifs, and evaluate the results. It is a combination of two tools, Tools for Analysis of MOtifs (TAMO) (Gordon et al., 2005) and THEME (MacIssac et al., 2006). TAMO uses a number of motif discovery programs viz. AlignACE, MDscan, MEME, Weeder on the input sequences, and then combines, analyzes, and clusters the results. THEME does Bayesian motif analysis, incorporating prior knowledge about plausible motifs (Romer et al., 2007). URL: http://fraenkel.mit.edu/webmotifs/form.html 4.1.2.14 Matlign (Matrix alignment) Sequence motifs representing transcription factor binding sites (TFBS) are normally encoded as position frequency matrices (PFM) or degenerate consensus sequences (CS). Matlign was developed to facilitate the post-processing of motifs, in both PFM and CS formats. These formats are used to represent the characterized TFBS profiles available in the transcription factor databases, as well as to represent the potential motifs predicted using computational methods. A combination of scoring functions is used to evaluate the similarity of motifs after aligning the sequences and the results are visualized by the hierarchical clustering. The alignment algorithm suitably aligns the motifs with an internal spacer by limiting the number of distinct gaps created. The best non-redundant motif set is selected from the merged repetitive motifs by cutting the hierarchical tree using silhouette values (Kankainen and Loytynoja, 2007). URL: http:// ekhidna.biocenter.helsinki.fi/poxo/matlign/matlign 4.1.2.15 Coding Motif Identification Tool (COMIT) Coding nucleotide sequences include myriad functions independent of their encoded protein sequences. COMIT algorithm is used to detect functional noncoding motifs in coding regions from sequence conservation, explicitly separating nucleotide from amino acid effects. COMIT concurs with various experimental datasets, including splicing enhancers, silencers, replication motifs, and microRNA targets, and predicts many novel functional motifs. Interestingly, COMIT scores are well-correlated to scores uncalibrated for amino acids, suggesting that nucleotide motifs often override peptide-level constraints (Kural et al., 2009). 4.1.2.16 (conserved) Evidence-Ranked Motif Identification Tool (cERMIT) cERMIT is used for analyzing the genome-wide
246
quantitative regulatory evidence, implementing an efficient enumerative strategy for identifying cisregulatory elements to address the reformulation of the motif finding problem. It operates on a set of non-coding regulatory regions and their corresponding evidence of direct or indirect regulation. It utilizes the information across all the sequence regions to search for high-scoring motifs instead of pre-selecting appropriate candidate sequences. Georgiev et al. (2010) applied cERMIT on a range of direct binding and over-expression datasets, which substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets, and easily scales to mammalian ChIP-seq experiments with the data on thousands of non-coding regions. It was also validated extensively on curetted yeast datasets. The binary executable of cERMIT (compiled for Linux), the sample datasets, sample ChIP-chip, ChIP-seq, and microRNA datasets are available for download. URL: http://tools.genome.duke.edu/generegulation/ transcription/cERMIT/ 4.1.2.17 ModuleMaster It is a novel application for finding cis-regulatory modules (CRMs) in sets of co-expressed genes. It not only considers transcription factor binding information but also multivariate functional relationships between regulators and target genes to improve the identification of CRMs. The program includes all necessary data and algorithms to perform every step to find CRMs for a given result of a microarray and a subsequent clustering experiment. This workbench provides an easy-to-use GUI, together with job-processing and command-line options. The detected CRMs can be visualized and evaluated in various ways, i.e., generating GraphML (an XML-based file format for graphs), R-based whole regulatory Network visualizations and generating Systems Biology Markup Language (SBML) files for subsequent analytical processing and dynamic modeling. The tool is freely available to academic users as a webstart application including all comprehensive documentation (Wrzodek et al., 2010). URL: http://www.ra.cs.unituebingen.de/software/ModuleMaster/ 4.1.2.18 Automated Motif Discovery (AMD) Tool Using Stepwise Refinement of Gapped Consensuses Recently, Shi et al. (2011) developed an opensource software called Automated Motif Discovery (AMD) Tool, which allows for effective and efficient recognition of motifs in various datasets. AMD can efficiently discover motifs up to 20 nucleotide bases length, the most common size of the currently known motifs. AMD could be applied to aligned sequences to identify conserved motifs, regardless of whether the motifs are long or short and gapped or continuous. AMD also identifies overrepresented
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
motifs in a given input sequence set and compared with a background sequence set. Here, the method is divided into five sequential steps: core motif filtering, degeneration, extension, refinement and redundancy removal. The selected probable motifs in each step are subjected to the next step for further processing. Motif discovery in these clustered genes requires the consideration of multiple clusters simultaneously. URL: http://www.plosone.org/article/info%3Adoi% 2F10.1371%2Fjournal.pone.0024576 4.2 Databases and tools for protein motifs 4.2.1 Databases 4.2.1.1 BLOCKS The blocks for this database are made automatically by considering the strongly conserved regions in groups of proteins from the Prosite Database. However, the Prosite pattern for a protein group is not used to make the BLOCKS Database and the pattern may or may not be contained in one of the blocks representing a group. These blocks are then calibrated against the SWISSPROT database to obtain a measure of the chance distribution of matches, and these calibrated blocks that build up the BLOCKS Database (Pietrokovski et al., 1996). URL: http://blocks.fhcrc.org/ 4.2.1.2 The Eukaryotic Linear Motif (ELM) server ELM is an online bio-computational server for the prediction of putative functional sites using specific patterns in eukaryotic proteins. It applies context-based rules and logical filters to reduce the amount of false positives. Here, the positionally conserved motifs are identified if known ELM instances and predictions of the sequences are found similar to ELM instance sequences. Functional sites which fit to the description of linear motif are currently specified as patterns using Regular Expression rules. Context-based rules and logical filters are being developed to improve the predictive power and reduce the amount of false positives. It is currently the largest collection of linear motif classes with annotated and experimentally validated linear motif instances. The ELM structure filter aids users to assess the candidate motifs present in globular structural regions (Puntervoll et al., 2003). URL: http://www.elm.eu.orgn 4.2.1.3 PRINTS ‘PRINTS’ is a compendium of protein fingerprints. A fingerprint is referred as a group of conserved motifs used for the characterization of a protein family. The diagnostic power of PRINTS is refined by the iterative scanning of a SWISS-PROT/TrEMBL composite. Generally the motifs do not overlap, but are separated along a sequence, in spite of being contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than a single motif because the full diagnostic potency is deriving from the mutual context supplied by the motif
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
neighbors in a fingerprint (Attwood et al., 2004). URL: http://www.bioinf.manchester.ac.uk/dbbrowser/ PRINTS/index.php 4.2.1.4 PROSITE PROSITE database consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. It includes a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It contains biologically significant sites and patterns that help in the identification and characterization of an unknown protein sequence. The signatures (patterns and profiles) of PROSITE have complete documentation that provides the background information on the structure and function of these proteins (Hulo et al., 2006). URL: http://prosite.expasy.org/scanprosite 4.2.1.5 Pfam (Protein Families) The Pfam is one of the important biological databases for protein classification. The database categorises 75 % of known proteins to form a library of protein. Each entry in this database includes a protein sequence alignment as well as an accompanying HMM. Pfam allows users to analyse their sequence data and find out the functional domain and motifs in the protein. Besides, it contains the information on higher level groupings of related protein families (clans), which are related by similarity of sequence, structure or by a statistical analysis of their associated HMM. The latest version of the Pfam database contains approaching 12,000 curated protein families (Finn et al., 2010). URL: http://www.sanger.ac.uk/resources/databases/pfam.html 4.2.2 Tools 4.2.2.1 MINER It is a web enabled tool for phylogenetic motif (PM) identification. PMs are fragments of a sequence that is conserved in the overall familial phylogeny. PMs have been shown to correspond to an extensive diversity of catalytic regions, substrate-binding sites and protein interfaces, predicting them as ideal functional sites. MINER takes multiple sequence alignment (MSA) as an input and it also aligns the unaligned sequences, using ClustalW (Thompson et al., 1994).The output of MINER provides an interactive interface for the analysis of PM sequences and structural visualization. An algorithm for a sliding sequence window is used in MINER to quantitatively evaluate the phylogenetic similarity between each sequence region and the whole sequence. Distance-based trees are calculated for the whole alignment and each window. Tree topology based phylogenetic similarity is calculated using the partition metric algorithm (Penny et al., 1985). The partition metric counts the number of topological differences between the two trees. Partition metric values are recast as Z-scores. Overlapping sequence windows, scor-
247
ing some preset phylogenetic similarity Z-score (PSZ) threshold are identified as PMs (La et al., 2005). URL: http://www.pmap.csupomona.edu/MINER/ 4.2.2.2 Motif Identification Neural Design (MOTIFIND) The tool was developed for rapid and sensitive protein family identification as an extension of gene classification artificial neural system. It employs new designs to enhance the detection of distant relationships. The accuracy of MOTIFIND is comparable to the BLAST database search method, but its speed is 20 times faster than BLAST search. It also compares favorably with BLocks IMProved Searcher (BLIMPS)(Wallace and Henikoff, 1992), the HMM and PROFILESEARCH (Gribskov et al., 1990) for discovering fragmentary sequences lacking complete motif regions and for identifying distant relationships, especially for the members of under-represented subgroups within a family (Wu et al., 1996). URL: http://diana.uthct.edu/ 4.2.2.3 ChloroP ChloroP is a neural Network based method to identify chloroplast transit peptides (cTP) and their corresponding cleavage sites. The performance accuracy of this tool is near about the only publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. ChloroP method is useful for the identification of putative transit peptides in genome-wide sequence data (Emanuelsson et al., 1999). The stand-alone version of ChloroP, ChloroP 1.1 software package is available with the same functionality as the online server of ChloroP. Ready-to-ship packages of the tool are compiled in UNIX platforms. It is free for the academic users and commercially available at CBS Software Package Manager. URL: http://www.cbs.dtu.dk/services/ChloroP/ 4.2.2.4 InterPro Scan InterProScan is a motif discovery tool which scans a given sequence of a protein against the protein signatures of the InterPro member databases. InterPro is an EMBL-EBI integrated database of predictive protein “signatures” (Hunter et al., 2009). InterProScan combines different protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation. The number of signature databases and their associated scanning tools as well as the further refinement procedures make the problem complex (Zdobnov and Apweiler, 2001). InterProScan is designed to be a scalable and extensible system with a robust internal architecture. The Perl-based InterProScan implementation is available from the EBI ftp server ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/. URL: http://www.ebi.ac.uk/ Tools/InterProScan 4.2.2.5 Motif Scan Motif Scan is an open source motif scanner available
248
in MyHits (http://myhits.isb-sib.ch) database. MyHits is a free database developed for protein domains and includes many tools for the investigation of the relationships between protein sequences and their motifs. These motifs are defined by a heterogeneous collection of predictors, including regular expressions, generalized profiles and HMMs. Motif Scan finds all known motifs that occur in an input protein sequence from the well known motif sources such as PeroxiBase, HAMAP, PROSITE and Pfam (Pagni et al., 2007). URL: http://myhits.isb-sib.ch/cgi-bin/motif scan 4.3 Other important motif discovery tools This section includes the tools meant for both DNA and Protein motif discovery. Besides, here a Network motif discovery tool is also discussed. 4.3.1 Multiple Expectation Maximization for Motif Elicitation (MEME) MEME is one of the most commonly used tools for searching novel ‘signals’ in sets of biological sequences. It is applied for the discovery of new transcription factor binding sites and protein domains. MEME searches for repeated, ungapped sequence patterns occurring in the DNA or protein sequences supplied by the user. MEME searches can be performed at the web server hosted by the National Biomedical Computation Resource (Bailey et al., 1994). URL: http://meme.nbcr.net 4.3.2 Gibbs Motif Sampler The Gibbs Motif Sampler is a Bayesian model based motif discovery tool that allows the user to identify motifs and conserved regions in both DNA and protein sequences. It is a compendium of several analysis tools all developed from a basic algorithm called Gibbs sampling method (Neuwald et al., 1995). URL: http://bayesweb.wadsworth.org/gibbs/gibbs.html 4.3.3 gn It is both a tool and a domain specific language for motif discovery in inter-species gene Networks. gn identifies motifs by comparing the input Network with a series of randomly generated networks having similar properties but it does not support Network randomization. The properties like, edge weights specific to the kind of inter-genome analysis or restricting the set of nodes from particular organisms are easy to specify with gn. The software is developed on Linux systems. It requires recent versions of flex and bison to generate its frontend. gn does all its work by querying data stored in a PostgreSQL database. To query, gene Network data must be flattened into gn’s prescribed relational schema. It is an opensource software and licensed under the GNU General Public License, version 3. URL: http://www.cs.unh.edu/∼tfogal/gn/ A list of all the tools with approaches, principle, motif type, purpose, category, webpage and references, discussed in this study is presented in Table 1.
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
5 Discussion As biological data sets grow exponentially, tools for extraction of biologically relevant motifs have become more useful. Several types of motifs like regulatory motifs, transcription factor binding motifs, active site motifs, species specific and function specific motifs, Network motifs and highly conserved non coding motifs in a gene are vital from biological point of view. Motifs representing a particular functional class or family help in the annotation and characterization of novel nucleotide or protein sequences. Though, the discovery of such motifs is being done by many approaches, the enumerative and statistical approaches based on AI techniques are highly precise and used with a broad range of applicability. The Enumerative algorithms viz. EC/EA, SOM and PALS are suitable for finding an optimal solution for finding the relatively short motifs. However, they do not scale to larger motifs and continuous motif models. The Statistical approaches are best for finding continuous motifs by optimizing the parameters of the sequence data and using EM and Gibbs sampling algorithms. The EAs are based on the evolutionary search process and perform global search, which is flexibly represented, concise and not influenced by specific heuristics. They are suitable for DNA regulatory motif discovery. An implementation of EA as SPEA2 further modified as MAGMA is a faster application in terms of approximating a set of optimal tradeoff solutions for the large scale problems. GAs, a subtype of EAs has a great advantage over the preexisting ECs in terms of handling the problems related with long motifs and multiple similar motifs in a single sequence. These problems may be more easily tackled by the GPs, an improved extension of GAs. The SOMs are also important for finding DNA binding Motifs and familiar binding profiles but restricted to identification of two indel (insertions and/or deletions) combinations. However, these algorithms are cost effective and highly reliable. The PALS have high sensitivity and specificity and are applicable to derive the patterns closer to the known consensus patterns for real sequences. The statistical approaches viz. Clustering, Fuzzy Logic, HMM, EM and PWMs are widely used to solve the pattern recognition problems. Clustering is a good method for identification and characterization of coexpressed genes based on the expression patterns where as Fuzzy models have the advantage over the other standard mathematical models in terms of capturing the imprecision and uncertainty in pattern recognition problems. HMMs can be used for motif discovery in DNA sequence in a user friendly way as it learn parameters from the input set of observations. PWMs calculate the positional conservedness of a residue and it also computes the threshold for the likelihood of a
Interdiscip Sci Comput Life Sci (2012) 4: 239–255 Table 1
249
List of tools for motif discovery arranged by type of motif Type
Name
Approaches
Principle
of
Tool/
Purpose
Database
Motif
Webpage
Reference
AMD
Enumerative
PWM and Clustering
DNA
Transcription factor binding site, Cisregulatory elements
Tool
http://www.plosone.org/ article/info%3Adoi% 2F10.1371%2Fjournal. pone.0024576
Shi et al., 2011
BioProspector
Statistical
Gibbs sampling
DNA
Regulatory sequence motifs
Tool
http://ai.stanford.edu/ ∼xsliu/BioProspector/
Liu et al., 2001
COMIT
Statistical
COMIT algorithm
DNA
Functional noncoding motif in coding region
Tool
http://www.genomebiology. Kural et com/2009/10/11/R133 al., 2009
DBTBS
Statistical
Clustering
DNA
TFBS motifs in upstream intergenic regions.
Database http://dbtbs.hgc.jp
Sierro et al., 2008
GIMSAN
Statistical
Gibbs sampling
DNA
De novo motif discovery
Tool
Ng and Keich, 2006
JASPAR
Statistical
PWM
DNA
Transcription factor binding sites
Database http://jaspar.cgb.ki.se
Sandelin et al., 2004
MATLIGN
Statistical
PWM
DNA
Transcription factor binding site
Tool
http://ekhidna.biocenter. helsinki.fi/poxo/matlign/ matlign
Kankainen and Loytynoja, 2007
MCAST
Statistical
HMM
DNA
Detecting occurrences of regulatory modules in genomic DNA.
Tool
http://genome.jouy.inra. fr/doc/genome/analysedesequences/mast/mcast. html
Bailey and Noble, 2003
MDscan
Statistical
Bayesian statistical model
DNA
Transcription factor binding sites among upstream sequences
Tool
http://ai.stanford.edu/ ∼xsliu/cgi-bin/ MDsearch.cgi
Liu et al., 2002
Module Master
Statistical
Clustering and PWM
DNA
Cis-regulatory modules and TFBS
Tool
http://www.ra.cs.unituebingen.de/software/ ModuleMaster/
Wrzodek et al., 2010
MUSA
Statistical
Clustering algorithm
DNA
Identification of de novo binding site consensus sequences
Tool
http://www.yeastract. com/discoverer/musa. php
Mendes et al., 2006
PhyloGibbs
Statistical
Gibbs sampling
DNA
Identify the binding site motifs for transcription factors in cis-regulatory sequences of DNA
Tool
http://www.phylogibbs. unibas.ch/cgi-bin/ phylogibbs.pl
Siddharthan et al., 2005
SCOPE
Statistical
PWM
DNA
Transcription factor binding sites
Tool
http://genie.dartmouth. edu/scope/
Carlson et al., 2007
SOMBRERO
Enumerative
SOM
DNA
Regulatory sites
Tool
http://bioinf.nuigalway. ie/sombrero/download. html
Mahony et al., 2005
TRANSFAC
Statistical
PWM
DNA
Transcription factors
Database http://www.generegulation.com/ pub/databases.html
Matys et al., 2005
WebMOTIFS
Statistical
Clustering
DNA
Identification of regulatory sequence motif
Tool
http://fraenkel.mit.edu/ webmotifs/form.html
Romer et al., 2007
Weeder
Enumerative
Weeder algorithm
DNA
Searches for overrepresented nucleotide patterns
Tool
http://www.pesolelab.it/
Pavesi et al., 2004
binding
http://www.cs.cornell.edu/ ∼ppn3/gimsan/
250
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
Continued Type Name
Approaches
Principle
of
Tool/
Purpose
Database
Motif
Webpage
Reference
YMF
Enumerative
YMF algorithm
DNA
Detects statistically overrepresented motifs
Tool
http://bio.cs.washington. edu/software.html
Sinha et al., 2000; Sinha et al., 2002
cERMIT
Statistical
Clustering
DNA and RNA
Cis-regulatory ments
ele-
Tool
http://tools.genome. duke.edu/generegulation/ transcription/cERMIT/
Georgiev et al., 2010
Improbizer
Statistical
Expectation maximization (EM)
DNA and RNA
Finds motifs in DNA or RNA sequences occurring with improbable frequency
Tool
http://users.soe.ucsc.edu/ ∼kent/improbizer/ improbizer.html
Ao et al., 2004
BEST
Uses AlignACE, BioProspector, CONSENSUS and MEME based on users need
DNA
Binding site prediction
Tool
http://www.fas.harvard. edu/∼junliu/BEST
Che et al., 2007
BLOCKS
Statistical
Gibbs Sampling
Protein Collection of patterns or profiles
Database http://blocks.fhcrc.org/
Pietrokovski et al., 1996
ChloroP
Statistical
ANN
Protein Chloroplast Transit Peptides (cTP) and their corresponding cleavage sites.
Tool
Emanuelsson et al., 1999
ELM
Enumerative
Regular Expression (EA based)
Protein Predicts putative functional sites
Database http://www.elm.eu.orgn
Puntervoll et al., 2003
InterProScan
Enumerative
Sequence Retrieval System (SRS) based algorithm
Protein Protein signatures
Database http://www.ebi.ac.uk/ Tools/InterProScan
Zdobnov and Apweiler, 2001
MINER
Enumerative
Partition metric algorithm
Protein Phylogenetic motif (PM) identification
Tool
La et al., 2005
Motif Scan
Statistical
HMM
Protein Protein motif identification
Database http://myhits.isbsib.ch/cgibin/motif scan
MOTIFIND
Statistical
ANN
Protein Rapid and sensitive protein family identification
Tool and database
Pfam
Statistical
HMM
Protein Protein classification
Database http://www.sanger.ac.uk /resources/ databases/ pfam.html
Finn et al., 2010
PRINTS
Statistical
Probabilistic approach
Protein Protein fingerprints
Database http://www.bioinf. manchester.ac.uk/ dbbrowser/PRINTS/ index.php
Attwood et al., 2004
PROSITE
Enumerative
Regular Expression (EA based)
Protein Collection of patterns or profiles
Database http://prosite.expasy.org /scanprosite
Hulo et al., 2006
http://www.cbs.dtu.dk/ services/ChloroP/
http://www.pmap. csupomona.edu/ MINER/
http://diana.uthct.edu/
Pagni et al., 2007 Wu et al., 1995
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
251 Continued
Type Name
Approaches
Principle
of
Tool/
Purpose
Webpage
Reference
Database
Motif Gibbs Motif Sampler
Statistical
Gibbs sampling
DNA and Protein
Identify motifs, conserved regions,
Tool
http://bayesweb.wadsworth.org/gibbs/gibbs. html
Neuwald et al., 1996
MEME
Statistical
EM
DNA and Protein
Transcription factor binding sites and protein domains
Tool
http://meme.nbcr.net
Bailey et al., 1994
gn
Enumerative
Subgraph Enumeration
Gene Network
Transcriptional regulation Network
Tool
http://www.cs.unh.edu/∼tfogal/gn/
predicted motif to be an observed motif. These scores are used to find a consensus for a particular class of motifs as a characterization feature. Though several tools are reviewed above, there exists certain tools specific to identify a particular class of motifs. SCOPE is specific for the identification of potential regulatory motifs in sets of co-regulated genes. Bioprospector is a DNA motif discovery tool dynamically designed for identification of regulatory motifs. cERMIT is an efficient enumerative strategy for identifying genome wide cis-regulatory elements. Besides, MCAST, ModuleMaster, PhyloGibbs are also well-known tools for the identification of cis-regulatory motifs. For the identification of the overrepresented motifs the suitable tools are MUSA, SOMBRERO, AMD, Weeder and YMF. In particular, SOMBRERO distinguishes the weak motifs form the strong motifs while AMD works independent of motif length (long or short) and nature (gapped or continuous). Two major DNA motif databases specific for TFBS are JASPAR and TRANSFAC whereas MEME, Matlign, PhyloGibbs, MDScan are efficient tools for the identification of the novel TFBS motifs. Though these tools are based on different algorithms or models such as EM, PWM, Gibbs Sampling and Bayesian model, they all meant for TFBS motif discovery. Another tool based on Gibbs sampling is GIMSAN, which gives biologically realistic and reliable and statistically significant motifs. COMIT is used to detect functional non-coding motifs in coding regions from sequence conservation. Improbizer is an efficient tool used for identification of nucleotide sequence motifs occurring with improbable frequency. DBTBS is a species specific DNA motif database of transcriptional regulation in Bascillus subtilis. Besides all these, BEST and WebMotifs are the two opensource, web-enabled software packages helps in motif finding using a combination of different algorithms. In the context of protein motifs, PROSITE, Pfam
and BLOCKS are well known and most commonly used databases. Motif Scan is an HMM based motif finder which uses the available motif sources such as PeroxiBase, HAMAP, PROSITE and Pfam for protein motif identification. Similarly, InterProScan is another motif scanner which uses protein signatures of InterPro database. MINER is a suitable tool based on MSA for the identification of phylogenetic motifs. For discovery of the putative functional sites in eukaryotic proteins, ELM web server may be used. ChloroP has the speciality of identifying chloroplast Transit Peptides (cTP) and their corresponding cleavage sites. MOTIFIND, an ANN based faster program may be used for the identification of rapid and sensitive protein family. PRINTS is well known database for identification of a group of conserved motifs used in the characterization of proteins, as finger prints. MEME and Gibbs Motif Sampler are two widely used tools for both DNA and Protein motif identification. Interestingly, gn is a tool as well as a domain specific language which helps in discovering the motifs in interspecies gene networks but it does not support Network randomization. In addition to the above discussed merits and specificity of the approaches/ algorithms, a comparative analysis of the advantages and limitations of most commonly used algorithms is given in Table 2 that can guide the reader to use a suitable motif finder of his interest. It is evident from the literature reviewed above that Artificial Intelligence techniques can be successfully applied to discover motifs. However, Evolutionary Computation (EC) approaches remain a niche activity within this field, with practicing biologists preferring to use tried-and-tested approaches based around dynamic programming, statistical techniques and neural networks. The tools and databases reviewed above provide a guideline to the researchers and scientists for their use in motif discovery.
252
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
Table 2
List of major algorithms discussed in this study with their advantages and limitations
Algorithms
Approaches
Advantages
Limitations
References
EA
Enumerative
Non exhaustive and unbiased global search. Representational flexibility. Independent of the way of solution derivation and scoring.
Do not scale to larger motifs and continuous motif models due to combinatorial growth in the search space.
Lones and Tyrell, 2005
SOM
Enumerative
Visualization of low-dimensional views of high-dimensional data. Find motifs with up to two insertions, deletions or their combinations. More accurate and costs less computation than existing algorithms.
Relative inflexibility while mapping sequence data to inputs. Difficulty of interpreting neural models, particularly when hidden layers are present within the Network.
Liu et al., 2005
PALS
Enumerative
High sensitivity and specificity. Effective in deriving patterns close to the known consensus patterns for real sequences. Have small space complexities.
Slower for the datasets having number of sequences more than 1000 and sequence length more than 1000.
Ning and Leong, 2006
Clustering
Statistical
Automatically assign genotypes to distinct groups. Consensus of a particular family helps to identify a recognizable pattern as a functional motif.
Difficult to know which distance metric to choose, especially for structured data such as images or sequences. Hard to access the reliability of a cluster, to compare to other models, to make predictions and cluster new data into an existing hierarchy.
Johnson and Wichern, 2007; Heller et al., 2005
Fuzzy Logic
Statistical
Quantify imprecision and uncertainty that is not easily captured by standard mathematical models, particularly in pattern recognition problems. Genomic sequence comparison
Lacks the capability of machine learning as well as a neural Network-type memory and pattern recognition. Verification and validation of fuzzy based system is an expensive affair. Difficult to determine exact fuzzy rules and membership functions Stability is an important concern for fuzzy control.
Akerkar et al., 2009
HMM
Statistical
Ability to learn parameters directly from a set of observations. Freeing the user from having to set the parameters a priori.
Constant statistics within HMM state. Conditional independence of observations given the state sequence, without increasing the number of model parameters.
Zen et al., 2004
PWM
Statistical
Versatility to extract motifs from a wide range of data sets which lead to discovery of novel biologically relevant protein motifs directly from a raw protein
Assumes independence of positions within the binding motif. Predictions based on PWMs are usually not very specific to known functional sites.
Siddharthan et al., 2010
Gibbs Sampling
Statistical
Easier to implement Less dependent on initial parameters More versatile, easier to enhance with heuristics
More dependent on all sequences to exhibit the motif Less systematic search of initial parameter space Repeat DNAs can be confused as motifs
Krishnan, 2005
Expectation Maximization
Statistical
Simplicity and ease of implementation. Does not usually require heavy preparatory analytical work. Easy to program. Easily parallelized and its memory requirements tend to be modest compared to other methods. Numerically very stable.
Slow linear convergence in some cases. Increase the complexity of the implementation. Does not provide an estimate of the information matrix on the vector of parameters as a by-product.
Couvreur et al., 1996
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
Acknowledgements This study was supported by World Bank Funded – National Agricultural Innovation Project (NAIP), ICAR Grants NAIP/Comp4/C4/C-30033/2008-09 and 30(68)/2009/Bio Informatics/NAIP/ O&M.
References [1] Akerkar, R., Sajja, P. 2009. Fuzzy Logic, In Knowledge-Based Systems. Jones & Bartlett Learning, Burlington, USA. [2] Alon, U. 2006. An Introduction to Systems Biology: Design Principles of Biological Circuits. Boca Raton: CRC, FL, USA. [3] Ao, W., Gaudet, J., Kent, W.J., Muttumu, S., Mango, S.E. 2004. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305, 1743–1746. [4] Arredondo, T.V., Neelakanta, P.S., DeGroff, D. 2005. Fuzzy Attributes of a DNA complex: Development of a fuzzy inference engine for codon-“junk” codon delineation. Artif Intell Med 35, 87–105.
253 [14] Couvreur, C. 1996. The EM Algorithm: A Guided Tour. In: Preprints of 2nd IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing (CMP’96), Pragues, Czech Rep., 115–120. [15] Cundari, T.R., Russo, M. 2001. Database mining using soft computing techniques. An integrated neural network-fuzzy logic-genetic algorithm approach. J Chem Inf Comp Sci 41, 281–287. [16] Emanuelsson, O., Nielsen, H., Heijne, G.V. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8, 978–984. [17] Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L., Eddy, S.R., Bateman, A. 2010. The Pfam protein families database. Nucl Acid Res 38, D211–D222. [18] Frith, M.C., Saunders, N.F.W., Kobe, B., Bailey, T.L. 2008. Discovering sequence motifs with arbitrary insertions and deletions. PLOS Comp Biol 4, e1000071. [19] Georgiev, S., Boyle, A.P., Jayasurya, K., Ding, X., Mukherjee, S., Ohler, U. 2010. Evidence-ranked motif identification. Genome Biol 11, R19.
[5] Attwood, T.K., Bradley, P., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G. 2004. The PRINTS protein fingerprint database: functional and evolutionary applications. In: Encyclopaedia of Genetics, Genomics, Proteomics & Bioinformatics. John Wiley and Sons, Chichester, UK.
[20] Gordon, D.B., Nekludova, L., McCallum, S., Fraenkel, E. 2005. TAMO: A flexible, object-oriented framework for analyzing transcriptional regulation using DNAsequence motifs. Bioinformatics 21, 3164–3165.
[6] Bailey, T.L. 2007 Discovering sequence motifs. Method Mol Biol 395, 271–292.
[22] Heller, K.A., Ghahramani, Z. 2005. Randomized algorithms for fast bayesian hierarchical clustering. PASCAL Workshop on Statistics and Optimization of Clustering, Windsor, UK.
[7] Bailey, T.L., Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the 2nd International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park California, 28–36.
[21] Gribskov, M., Luthy, R., Eisenberg, D. 1990 Profile analysis. Method Enzymol 183, 146–159.
[23] Hubley, R.M., Zitzler, E., Roach, J.C. 2003. Evolutionary algorithms for the selection of single nucleotide polymorphisms. BMC Bioinformatics 4, 30.
[8] Bailey, T.L., Noble, W.S. 2003. Searching for statistically significant regulatory modules. Oxford University Press, Seattle WA.
[24] Hudak, J., Mcclure, M.A. 1999. A comparative analysis of computational motif-detection methods. Pacific Symposium on Biocomputing 4, 138–149.
[9] Blekas, K., Fotiadis, D.I., Likas, A. 2003. Greedy mixture learning for multiple motif discovery in biological sequences. Bioinformatics 19, 607–617.
[25] Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P.S., Pagni, M., Sigrist, C.J.A. 2006. The PROSITE database. Nucl Acid Res 34, D227–D230.
[10] Carlson, J.M., Chakravarty, A., DeZiel, C.E., Gross, R.H. 2007. SCOPE: A web server for practical de novo motif discovery. Nucl Acid Res 35 (Suppl. 2), W259– W264. [11] Che, D., Jensen, S., Cai, L., Liu, J.S. 2005. BEST: Binding-site estimation suite of tools. Bioinformatics 21, 2909–2911. [12] Chudova, D., Smyth, P. 2002. Analysis of pattern discovery in sequences using a bayes error framework. Data Min Knowl Disc 7, 273–299. [13] Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. 2003. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci 100, 3339– 3344.
[26] Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R.D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A.F., Selengut, J.D., Sigrist, C.J.A., Thimma, M., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H., Yeats, C. 2009. InterPro: The integrative protein signature database. Nucl Acid Res 37, 211–215. [27] Jacob, E., Sasikumar, R., Nair, K.N. 2005. A Fuzzy guided genetic algorithm for Operon Prediction. Bioinformatics 21, 1403–1407.
254 [28] Johnson, R.A., Wichern, D.W. 2007. Applied Multivariate Statistical Analysis, 6th Edition. Prentice Hall. Inc., New Jersey, USA. [29] Kankainen, M., Loytynoja, A. 2007. MATLIGN: A motif clustering, comparison and matching tool. BMC Bioinformatics 8, 189. [30] Kicinger, R., Arciszewski, T., De Jong, K.A. 2005. Evolutionary computation and structural design: A state of the art. Comput Struct 83, 23–24. [31] Kohonen, T., Honkela, T. 2007. Kohonen network. Scholarpedia 2, 1568. [32] Krishnan, A. 2005. CS262: Computational Genomics, Lecture 15. http://robotics.stanford.edu/∼ serafim/cs262/Spring2003/Notes/ln14.pdf [33] Kural, D., Ding, Y., Wu, J., Korpi, A.M., Chuang, J.H. 2009. COMIT: Identification of noncoding motifs under selection in coding sequences. Genome Biol 10, R133. [34] La, D., Livesay, D.R. 2005. MINER: Software for phylogenetic motif identification. Nucl Acid Res 33, 267– 270. [35] Liu, J.S. 1994. The collapsed gibbs sampler in Bayesian computations with applications to a gene regulation problem. JASA 89, 958–966. [36] Liu, X., Brutlag, D.L., Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium on Biocomputing 6, 127–138. [37] Liu, X.S., Brutlag, D.L., Liu, J.S. 2002. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat Biotechnol 20, 835–839. [38] Liu, D., Xiong, X., Hou, Z.G., Dasgupta, B. 2005. Identification of motifs with insertions and deletions in protein sequences using self-organizing neural networks. Neural Networks 18, 835–842. [39] Lones, M.A., Tyrell, A.M. 2005. The evolutionary computation approach to motif discovery in biological sequences. In: Rothlauf, F. (Ed.) Proceedings of GECCO Workshop Program, Workshop Biological Applications of Genetic and Evolutionary Computation, Washington, USA, 1–11. [40] Lones, M., Tyrell, A. 2007. Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE T Compt Biol Bioinformatics 4, 403–414. [41] MacIsaac, K.D., Gordon, D.B., Nekludova, L., Odom, D.T., Schreiber, J., Gifford, D.K., Young, R.A., Fraenkel, E. 2006. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 22, 423–429. [42] Mahony, S., Benos, P.V., Smith, T.J., Golden, A. 2006. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Networks 19, 950– 962.
Interdiscip Sci Comput Life Sci (2012) 4: 239–255 [43] Mahony, S., Hendrix, D.V., Smith, T.J., Golden, A., Roshkar, D.S. 2005. Self-organizing maps of position weight matrices for motif discovery in biological sequences. Artif Intell Rev 24, 397–413. [44] Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E. 2006. TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucl Acid Res 34, D108–D110. [45] Mendes, N.D., Casimiro, A.C., Santos, P.M., Correia, I.S., Oliveira, A.L., Freitas, A.T. 2006. MUSA: A parameter free algorithm for the identification of biologically significant motifs. Bioinformatics 22, 2996–3002. [46] Narayanan, A., Keedwell, E., Olsson, B. 2003. Artificial intelligence techniques for bioinformatics. Appl Bioinformatics 1, 191–222. [47] Ng, P., Keichn, U. 2006. Apples to apples: Improving the performance of motif finders and their significance analysis in the twilight zone. Bioinformatics 22, 393– 401. [48] Ning, K., Leong, H.W. 2006. Finding patterns in biological sequences by longest common subsequences and shortest common supersequences. In: Proceedings of the 6th IEEE Symposium and Bioengineering, Washington, USA, 53–60. [49] Pagni, M., Ioannidis, V., Cerutti, L., Zahn-Zabal, M., Jongeneel, C.V., Hau, J., Martin, O., Kuznetsov, D., Falquet, L. 2007. MyHits: Improvements to an interactive resource for analyzing protein sequences. Nucl Acid Res 35, 433–437. [50] Paul, T.K., Iba, H. 2006. Identification of Weak Motifs in Multiple Biological Sequences Using Genetic Algorithm. Dept of Frontier Informatics, University of Tokyo, GECCO, Seattle, USA. [51] Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G. 2004. Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucl Acid Res 32, W199–W203. [52] Petsko, G.A., Ringe, D. 2004. Protein Structure and Function. New Science Press Ltd., London, UK. [53] Pietrokovski, S., Henikoff, J.G., Henikoff, S. 1996. The blocks database - a system for protein classification. Nucl Acid Res 24, 197–200. [54] Puntervoll, P., Linding, R., Gem¨ und, C., ChabanisDavidson, S., Mattingsdal, M., Cameron, S., Martin, D.M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A., Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L., K¨ uster, B., Helmer-Citterich, M., Hunter, W.N., Aasland, R., Gibson, T.J. 2003. ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins. Nucl Acid Res 31, 3625–3630. [55] Romer, K.A., Kayombya, G.R., Fraenkel, E. 2007. WebMOTIFS: Automated discovery, filtering and scoring of DNA sequence motifs using multiple programs
Interdiscip Sci Comput Life Sci (2012) 4: 239–255
255
and Bayesian approaches. Nucl Acid Res 35, W217– W220.
[67] Torres, A., Nieto, J.J. 2003. The fuzzy polynucleotide space: Basic properties. Bioinformatics 19, 587–592.
[56] Sadegh-Zadeh, K. 2000. Fuzzy genomes. Artif Intell Med 18, 1–28.
[68] Wallace, J.C., Henikoff, S. 1992. PATMAT: A searching and extraction program for sequence, pattern and block queries and databases. Comp Appl Biosci 8, 249– 254.
[57] Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W., Lenhard, B. 2004. JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucl Acid Res 32, D91–D94. [58] SantaLucia, J., Hicks, D. 2004. The thermodynamics of DNA structural motifs. Annu Rev Biophys Biomol Struc 33, 415–440. [59] Schwartz, D., Gygi, S.P. 2005. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol 23, 1391–1398. [60] Shen-Orr, S.S., Milo, R., Mangan, S., Alon, U. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64–68. [61] Shi, J., Yang, W., Chen, M., Du, Y., Zhang, J., Wang, K. 2011. AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses. PLoS ONE 6, e24576. [62] Siddharthan, R. 2010. Dinucleotide weight matrices for predicting transcription factor binding sites: Generalizing the Position Wight Matrix. PLoS One 5, e9722. [63] Siddharthan, R., Siggia, E.D., Van-Nimwegen, E. 2005. PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny. PLOS Comp Biol 1, e67. [64] Sierro, N., Makita, Y., de Hoon, M., Nakai, K. 2008. DBTBS: A database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucl Acid Res 36, D93–D96. [65] Sinha, S., Tompa, M. 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucl Acid Res 30, 5549–5560. [66] Stormo, G.D. 2000. DNA binding sites: Representation and discovery. Bioinformatics 16, 16–23.
[69] Wingender, E., Dietze, P., Karas, H., Knuppel, R. 1996. TRANSFAC: A database on transcription factors and their DNA binding sites. Nucl Acid Res 24, 238–241. [70] Wrzodek, C., Schroder, A., Drager, A., Wanke, D., Berendzen, K.W., Kronfeld, M., Harter, K., Zell, A. 2010. ModuleMaster: A new tool to decipher transcriptional regulatory networks. BioSystems 99, 79–81. [71] Wu, C.H., Zhao, S., Chen, H., Lo, C., McLarty, J. 1996. Motif identification neural design for rapid and sensitive protein family search. Comp Appl Biosci 12, 109–118. [72] Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad, K., Lander, E.S., Kellis, M. 2005. Systematic discovery of regulatory motifs in human promoters and 3[prime] UTRs by comparison of several mammals. Nature 434, 338–345. [73] Xu, D., Bondugula, R., Popescu, M., Keller, J. 2006. Bioinformatics and fuzzy logic. In: Proceedings of the 15th IEEE International Conference on Fuzzy Systems, Vancouver Canada, 817–824. [74] Zdobnov, E.M., Apweiler, R. 2001. InterProScan an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848. [75] Zen, H., Tokuda, K., Kitamura, T. 2004. A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features. In: Proceedings of International Conference on Acoustics Speech and Signal Processing, Montreal, Canada, 837–840.