Appl Bioinformatics 2005; 4 (2): 147-150 1175-5636/05/0002-0147/$34.95/0
APPLICATION NOTE
© 2005 Adis Data Information BV. All rights reserved.
ProLysED: An Integrated Database and Meta-server of Bacterial Protease Systems
Mohd Firdaus Raih,1,2 Hafiza Aida Ahmad,2 Mohd Yunus Sharum,2 Norazah Azizi2 and Rahmah Mohamed1,2
1 School of BioSciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
2 Malaysia Genome Institute, Universiti Kebangsaan Malaysia - Malaysia Technology Development Corporation (UKM-MTDC) Smart Technology Centre, Bangi, Malaysia
Abstract
Bacterial proteases are an important group of enzymes that have very diverse biochemical and cellular functions. Proteases from prokaryotic sources also have a wide range of uses, either in medicine as pathogenic factors or in industry and therapeutics. ProLysED (Prokaryotic Lysis Enzymes Database), our meta-server integrated database of bacterial proteases, is a useful, albeit very niche, resource. The features include protease classification browsing and searching, organism-specific protease browsing, molecular information and visualisation of protease structures from the Protein Data Bank (PDB) as well as predicted protease structures.
Availability: ProLysED is integrated into the ProLysES (Prokaryotic Lysis Enzymes Site) website at http://genome.ukm.my/prolyses/. Access to the ProLysED database is free for academic users upon registration.
Contact: M. Firdaus Raih ([email protected] or [email protected])
Proteases are a class of enzymes that catalyse the cleavage of peptide bonds in other proteins. Proteases have had a long history of applications in the food, detergent and leather industries. These proteins also represent one of the three largest groups of industrial enzymes and account for up to 60% of the total worldwide sale of enzymes.[1]
Bacterial proteases have a diverse range of functions and mechanisms of action. They can be responsible for complex processes under normal physiological circumstances as well as in abnormal patho-physiological conditions.
Current advances in DNA sequencing technology have made available a large number of unannotated and as yet unstudied datasets alongside well studied and well characterised models. These datasets, when integrated with current extrapolative tools and used in conjunction with well studied models, can yield useful information, which can then be used for comparative analysis leading to the laboratory refinement of experiments. We set out to make use of publicly available data on bacterial proteases for this purpose by adding value to information in these datasets and organising it into an integrated meta-server interface for electronic access and analysis. It is hoped that the resulting comparative analysis and related information will enable refinement of current work on bacterial proteases, as well as assist in generating new knowledge on bacterial proteases in general, be it for industrial applications, pharmaceuticals or academic research.
The resulting database, ProLysED (Prokaryotic Lysis Enzymes Database), was not intended to be an extensive exploration into proteolytic enzymology as a whole, such as the MEROPS database.[2] It was instead targeted as a niche resource in prokaryotic protease systems, which may include non-proteolytic associated datasets. An existing resource that similarly explores a niche protease dataset is the HIV Protease Database (http://mcl1.ncifcrf.gov/hivdb/). This archive is more focused on a dataset sourced from HIV-related crystallography experiments.
In this article we report on the ProLysED database, which is dynamically annotated, covering a dataset of bacterial protease systems. ProLysED is further enhanced by the use of a meta-server to provide constantly updated information by cross-referencing other relevant data resources, including the MEROPS database of Rawlings et al.[2]
[Figure 1: schematic with two parts. User module: User; ProLysED interface; ProLysED user database; Meta-server (BLAST®, GenBank®, Swiss-Prot, PDB, PDBsum, SCOP, CATH, MEROPS, PHD, 3D-PSSM, SAM-T99, PSIPRED). Dataset (MySQL®): Sequences; Structures; Annotation; Protein structure prediction (MODELLER). Administration module: PDB and Swiss-Prot searches restricted to protease, bacteria (organism); Manual check: all data from bacterial source; Initial dataset; Data updates; Update agent (Query dataset for new data); Manual annotation.]
Fig. 1. Organisation of the ProLysED database. The user module shows how a user accesses the database via a user database system. The administration module illustrates the relationship between the keyword-based searches and the scripts used for data mining during the initial database population, the scripts used for the update agent and the database curator interface. PDB = Protein Data Bank.
Database Organisation
Bacterial protease data were retrieved from the Swiss-Prot database (sequences and annotation)[3] and the Protein Data Bank (PDB) [sequence and structural information].[4] Retrieval used keyword searches restricted to bacterial species. The selected data were then filtered using a Perl script that confirmed the bacterial source by parsing the organism source field. These entries were further annotated with extrapolated information acquired using computational tools. Literature references were manually added to each data entry and each reference was linked to its PubMed abstract and the online version of the paper if available.
The overall organisation of the database is illustrated in figure 1. The MySQL® database (http://www.mysql.com) was used as the core relational database, whereas the interfacing was done mainly using PHP (http://www.php.net). ProLysED was constructed in a modular fashion with future expandability and portability in mind. ProLysED is structured according to interlinked databases for basic data, sequences, structures, internal annotation and user interface. The basic data database contains information for protease catalytic class, bacterial species, GenBank® accession number, basic description, Enzyme Commission (EC) number and data source. The sequence and structure databases contain sequence data in FASTA format and structural data as PDB-formatted files, respectively. The internal annotation database contains further annotation of protease activity, inhibitor information and any other extra annotation that is not available in the original data source. The user database controls user access and usage sessions.
Data without protein structure information can be submitted to the external services PHD,[5] 3D-PSSM,[6] SAM-T99[7] and PSIPRED[8] through user sessions, with results being returned by the respective services via email. A basic registration process is required in ProLysED to acquire this information from the meta-servers.
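As a rough illustration of the modular layout described above, the following Python sketch uses the standard-library sqlite3 module as a stand-in for MySQL®. It is only a sketch under stated assumptions: every table name, column name and function name is hypothetical, since the paper does not publish the actual schema, and the interlinked databases are collapsed here into tables of a single in-memory database. The query shows how entries that have a sequence but no structure, which are the candidates for submission to the external prediction services, might be listed.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only
# and are not taken from ProLysED itself.
SCHEMA = """
CREATE TABLE basic_data (
    entry_id        INTEGER PRIMARY KEY,
    catalytic_class TEXT,
    species         TEXT NOT NULL,
    genbank_acc     TEXT UNIQUE,
    description     TEXT,
    ec_number       TEXT,
    data_source     TEXT
);
CREATE TABLE sequences (
    entry_id INTEGER REFERENCES basic_data(entry_id),
    fasta    TEXT NOT NULL
);
CREATE TABLE structures (
    entry_id INTEGER REFERENCES basic_data(entry_id),
    pdb_text TEXT NOT NULL
);
CREATE TABLE internal_annotation (
    entry_id   INTEGER REFERENCES basic_data(entry_id),
    activity   TEXT,
    inhibitors TEXT,
    notes      TEXT
);
CREATE TABLE users (
    user_id             INTEGER PRIMARY KEY,
    login               TEXT UNIQUE NOT NULL,
    meta_server_enabled INTEGER NOT NULL DEFAULT 0
);
"""


def build_demo_db():
    """Create an empty in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def entries_without_structure(conn):
    """Return accessions of entries that have a sequence but no structure."""
    rows = conn.execute(
        """
        SELECT b.genbank_acc
        FROM basic_data AS b
        JOIN sequences AS s ON s.entry_id = b.entry_id
        LEFT JOIN structures AS t ON t.entry_id = b.entry_id
        WHERE t.entry_id IS NULL
        ORDER BY b.genbank_acc
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping sequences, structures and annotation in separate tables keyed on one entry identifier mirrors the modular structure described in the text, so a module can be extended without touching the others.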
Users who do not wish to register can opt to browse using a default ‘demo’ account that does not have the meta-server capability enabled.
The ProLysED database can currently be searched by any of the following categories: organism name; GenBank® accession number; protease class; protease definition (functional keywords and data description); or as an integration of all these fields. Default data listings for all available bacterial species and all entries with available PDB files are also included as part of a more general search.
The ProLysED database can be automatically updated using a Perl agent at regular intervals. Updated datasets are extracted in XML format using the specified search terms. The Perl script then parses the XML file for entries not currently in the database, authenticates organism source and automatically updates the MySQL® database (figure 1). Any further annotation is then carried out by human intervention, aided by reviews of existing literature. A curator uses an interface similar to the one accessed by the user, but with data editing functions available. Initial population of the database is done from the UNIX® command line by running the Perl script mentioned earlier.
Cross-references to GenBank®,[9] the PDB, the SCOP database,[10] MEROPS,[2] PDBsum[11] and the CATH structural database,[12] as well as an external BLAST® to Swiss-Prot, are also interfaced from ProLysED. A specific ProLysED BLAST®[13] server with the database limited to bacterial proteases and related systems is also provided. Data entries with corresponding structural data in the form of PDB-formatted files are also viewable via the Chime molecular viewer plug-in (http://www.mdli.com). These structures can be selected from the bacterial protease structure data integrated within ProLysED.
Data from GenBank® were further annotated for biochemical activity, cellular function and inhibitor information.
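The update cycle described above (parse the XML export, skip entries already held, confirm a bacterial source, insert the remainder) can be sketched as follows. This is not the authors' Perl agent: it is a minimal Python illustration in which the XML element names (entry, accession, organism, lineage, description), the entries table and the function names are all assumptions made for the example, and sqlite3 again stands in for MySQL®.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_db():
    """Create an in-memory stand-in database with a hypothetical 'entries' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "accession TEXT PRIMARY KEY, organism TEXT, description TEXT)"
    )
    return conn


def is_bacterial(lineage):
    """Accept an entry only if its taxonomic lineage starts with 'Bacteria'."""
    return lineage.strip().lower().startswith("bacteria")


def update_from_xml(conn, xml_text):
    """Insert entries not yet in the database; return the accessions added."""
    known = {row[0] for row in conn.execute("SELECT accession FROM entries")}
    added = []
    # The <entry> layout is assumed for illustration; real exports differ.
    for entry in ET.fromstring(xml_text).iter("entry"):
        acc = entry.findtext("accession", default="").strip()
        lineage = entry.findtext("lineage", default="")
        if not acc or acc in known or not is_bacterial(lineage):
            continue  # already stored, malformed, or not from a bacterial source
        conn.execute(
            "INSERT INTO entries (accession, organism, description) VALUES (?, ?, ?)",
            (
                acc,
                entry.findtext("organism", default="").strip(),
                entry.findtext("description", default="").strip(),
            ),
        )
        known.add(acc)
        added.append(acc)
    conn.commit()
    return added
```

Running the function twice on the same export adds nothing the second time, which is the property a periodic agent needs. Entries added this way would still await the manual annotation step described in the text.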
Corresponding EC numbers and protease classification were also added to datasets that were not initially assigned EC numbers and a protease classification. This process is again aided by the manual review of available literature cross-references, which have been manually attached to the data entries after the database population phase. Comparatively modelled structures were also generated for certain data entries with homologous templates using MODELLER version 6.2.[14]
A total of 447 data entries with corresponding EC numbers, which enabled classification of their catalytic mechanisms, were clustered on that basis using BLASTClust.[15] This subset represented the structurally active proteases within the dataset. Other members, such as those that represented signal peptides, inhibitors, chaperone proteins, secretory factors and other associated proteins that may form a part of the active protease system, were not included in this analysis. Clustering, when done for non-redundancy of sequences at 25% identity, yielded 61 serine, 17 metallo, 9 aspartic and 3 cysteine proteases for a total of 95 non-redundant sequences. At 30% identity, 19 metallo and 10 aspartic proteases were detected, while the numbers for serine and cysteine proteases remained the same. At cutoffs of 50%, 75% and 90% identity, the number of non-redundant sequences was 147, 241 and 279, respectively. Further representations and modelling of this, particularly for whole-genome data, may give an idea of the diversity of protease sequences, structures and functions present in the prokaryotic world.
Conclusions and Future Directions
ProLysED attempts to explore the organisation of niche datasets for a wide variety of purposes such as industrial applications, pharmaceutical and academic research. The organisation of such niche data is even more important in the post-genomic era to facilitate quick access to value-added specific datasets. It is hoped that reorganising the multitudes of presently available data into a dynamic interface will enable more specific studies to be carried out faster and more efficiently. This database is being constantly updated, with continuous development in terms of annotation and interfacing technologies. We believe that an important area to explore is the development of interfacing technologies that can be applied to currently available, well developed tools for use against a niche dataset. This is especially relevant when taking into consideration the quantity of data and tools currently available compared with the practically useful biological information derived from these enormous amounts of data.
Acknowledgements
We acknowledge the National Biotechnology and Bioinformatics Network (NBBnet), Malaysia, which is hosting ProLysED, and funding from the Microbial Genomics for Gene and Natural Product Discovery IRPATOPDOWN grant (09-02-02-002 BTK/TD/03) from the Ministry of Science, Technology and Innovation, Malaysia. We thank Fuad Muhammad and Noraslinda for miscellaneous technical input and feedback.
The authors have provided no information on conflicts of interest directly relevant to the content of this article.
References
1. Rao MB, Aparna AM, Ghatge MS, et al. Molecular and biotechnological aspects of microbial proteases. Microbiol Mol Biol Rev 1998; 62: 597-635
2. Rawlings ND, O’Brien EA, Barrett AJ. MEROPS: the protease database. Nucleic Acids Res 2002; 30: 343-6
3. Boeckmann B, Bairoch A, Apweiler R, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003; 31: 365-70
4. Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235-42
5. Rost B. PHD: predicting 1D protein structure by profile based neural networks. Methods Enzymol 1996; 266: 525-39
6. Kelley LA, MacCallum RM, Sternberg MJE. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299: 499-520
7. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998; 14: 846-56
8. McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics 2000; 16: 404-5
9. Benson DA, Karsch-Mizrachi I, Lipman DJ, et al. GenBank. Nucleic Acids Res 2003; 31: 23-7
10. Murzin AG, Brenner SE, Hubbard T, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995; 247: 536-40
11. Laskowski RA. PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res 2001; 29: 221-2
12. Orengo CA, Michie AD, Jones S, et al. CATH: a hierarchic classification of protein domain structures. Structure 1997; 5: 1093-108
13. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25: 3389-402
14. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993; 234: 779-815
15. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990; 215: 403-10
Correspondence and offprints: Dr Rahmah Mohamed, Malaysia Genome Institute, Heliks Emas Block, UKM-MTDC Smart Technology Centre, 43600 UKM Bangi, Malaysia. E-mail:
[email protected]