ISSN 0013-8738, Entomological Review, 2016, Vol. 96, No. 8, pp. 1008–1014. © Pleiades Publishing, Inc., 2016. Original Russian Text © S.G. Medvedev, R.G. Khalikov, 2016, published in Parazitologiya, 2016, Vol. 50, No. 6, pp. 471–482.
Experience in Application of Databases of Bloodsucking Insects to Zoological Studies
S. G. Medvedev and R. G. Khalikov
Zoological Institute, Russian Academy of Sciences, St. Petersburg, 199034 Russia
e-mail: [email protected]
Received March 18, 2016
Abstract. The paper summarizes our long-term experience of accumulating and summarizing faunistic information by means of separate databases (DB) and information-analytic systems (IAS), and considers the prospects of representing it in modern multi-user information systems. We analyze the experience obtained during the development and use of the IAS PARHOST1 for the study of the world flea fauna and during work with partial databases created for the study of bloodsucking insects (lice and blackflies). Research collection material of the type series of 57 species and subspecies of fleas of the fauna of Russia was made available via a multi-user information retrieval system on the Internet portal of the Zoological Institute of the Russian Academy of Sciences. The system provides the means of storing information in its authentic form as well as its gradual transformation, i.e., unification and structuring. To ensure continuous updating of the DB, the system allows operators with different levels of competence to work with it.
DOI: 10.1134/S0013873816080066
By now, the efforts of many generations of researchers have resulted in accumulation of extensive data on the taxonomic position, morphology, distribution, and ecology of various living organisms. This information is concentrated in numerous special publications, but a much greater part of primary material, kept in the vast museum collections, remains in little demand. For example, the main collection of the Zoological Institute of the Russian Academy of Sciences (ZIN RAS) includes over 60 million items pertaining to 260 thousand species of animals. Efficient tools for accumulation and analysis of the already available and newly obtained data are the collection databases (DB) and faunistic information-analytic systems (IAS). They can be used to solve a number of important problems. Databases not only provide the possibility to correct the data easily, but also make greater amounts of facts available for comprehension, as compared with “manual” data processing; they also ensure continuity of research. Broad access to Internet-based resources of the collections stimulates public interest in research, thus facilitating support of fundamental science. The collection DB are also of great importance for narrower applied tasks including, in particular, control of the naturofocal infections transmitted by bloodsucking and parasitic insects and ticks. Finally, DB promote the development of an integrated approach to study and analysis of great amounts of heterogeneous data.

Despite their obvious efficiency, DB are still represented by individual uncoordinated projects. The development of DB is hindered by the common assumption that a more or less complete DB would be impossible to build. In view of this, it is important to outline the conditions which would stimulate the development and application of DB in biological systematics. It is clear that a perfectly complete DB is an ideal case that can hardly be realized. The amount of the already accumulated and newly obtained data exceeds the capacities of the research community. Shortages of personnel, finances, and time that prevent a comprehensive study of a particular group of organisms are by no means temporary or isolated. In view of this, the development of DB should proceed by specific stages, each stage being oriented to meet the goals of a specific research project. The concrete results obtained with the help of IAS would stimulate their use not only for storage and exchange of information but also as a research tool.

One of the tasks set within the framework of Project 15-29-02457 of the Russian Foundation for Basic Research was the development of means of representing data on the type specimens of various animal species kept in the ZIN collections. For this purpose we selected the type specimens of 57 species and subspecies of fleas. The information contained in the labels and original descriptions was summarized, unified, and made publicly accessible for the first time. It was supplemented with the digitized images of the type specimen preparations which were required for further studies of the flea fauna in Russia.

In this communication we consider some specific features of the collection DB and faunistic IAS which should be taken into account during their development. In addition, we propose a method of consecutive processing of information that includes the stages of interpretation, unification, structuring, and analysis.

THE OBJECTIVES OF THE FAUNISTIC DATABASE OF BLOODSUCKING INSECTS

The need for new research tools for accumulation and analysis of great amounts of heterogeneous information becomes most evident when organisms of great practical importance are studied. The species affecting human health attract the greatest attention. One such example is the order of fleas (Siphonaptera). Fleas were intensively studied during the first three quarters of the XX century, starting with the discovery of their role as plague vectors. Fleas are also known to be important vectors of endemic typhus, myxomatosis, and other dangerous naturofocal diseases of man and animals. During the last decade the rates of description of new species have noticeably decreased, and presently about 2200 species and 900 subspecies are known in 242 genera of fleas (Medvedev, 2002, 2013). Extensive collections of fleas from all the continents have been accumulated in museums, universities, and research institutions. There are over 10 thousand publications devoted to the taxonomic diversity, morphology, distribution, ecology, and epidemiological significance of various representatives of this relatively small order.
To bring this vast and diverse information into analysis, an analytic tool is needed which would allow one to study the numerous facts with respect to both individual aspects and their interactions. The information-analytic system PARHOST1 of the world fauna of fleas has been developed at the Zoological Institute since 1996. This IAS combines the capabilities of the taxonomic, identification, and collection DB (Medvedev et al., 2004).

THE PRIMARY AND GENERALIZED DATA

The IAS PARHOST1 was originally designed for processing “generalized” data. Such data, contained in various publications, result from expert interpretation of primary facts. For example, the range of a species is characterized not by the list of individual finding localities but by reference to geographic objects or political and administrative units, such as countries and their regions.

In the study of parasitic organisms, for each collection specimen it should be known to what species the parasite belongs and on what host species, in what locality, when, and by whom it was found. This is the basic or primary information which is required for assessment of the composition of the regional fauna or ecological traits of individual species. The primary data should be provided on the label accompanying each sample. For such ectoparasites as fleas, the “sample” unites all the specimens collected (combed) off one mammal, which was captured by the collector on a certain date in a certain place. A sample may include specimens of one or both sexes of one or several species of fleas. Besides the temporary or field label, the primary information concerning each sample is entered under the corresponding number in the expedition journal, which also constitutes a documentary source. In the museum, the primary data are copied onto the label of the permanent collection specimen, and often additionally stored in record cards and journals.

The collection locality is traditionally specified by reference to the names of physiographic objects and/or the nearest populated places. A typical label may have the following text: “North foothills of Range L, 24 km northeast of settlement N, in District P, Province S, country R.” However, human settlements may disappear, and their names and especially administrative identity may change.
Therefore, such references are not exactly reliable and should be supplemented with geographic coordinates of the collection localities. Earlier, geographic coordinates could only be determined with the use of topographic maps, which were not always available to researchers. Now, with the development of satellite and computer technologies, digital maps, and GPS devices, finding the coordinates of the field research sites has become a routine procedure. However, the need for precise georeferencing of collection localities is often ignored by
researchers. At the same time, reliability is not the only reason why knowledge of the exact position of the finding localities is required. Indication of geographic coordinates of the finding localities makes the data on the distribution of a particular species available for various means of generalization. In particular, such data may be processed using the modern geoinformation systems (GIS). The mapping of collection localities on the various GIS layers allows one to assess the distribution and ecology of the species in relation to the geographic gradients of temperature and humidity, and also the climatic, hydrological, edaphic, and floristic diversity of individual territories. Precise georeferencing of collection localities is also important for practical needs, for example, for control of regional naturofocal infections.

Thus, the description of a species’ range should be supplemented with geographic coordinates of its finding localities, which should be included in publications or, in case of large amounts of data, made available online on specialized sites. In the great majority of earlier publications, the authors merely indicated the fact of finding of a particular species in a certain large region but did not specify the collection localities or biotopic associations.

THE AUTHENTIC PRIMARY DATA AND THEIR INTERPRETATION

When working with research collections, it should be borne in mind that they have an unlimited time of storage, i.e., they are not to be purposefully discarded or renewed. The demand for particular material is unpredictable, and individual acts of data retrieval may be separated by decades. In view of this, the structure of the collection DB and the formats of data storage should provide the means of repeated reconsideration and reinterpretation of primary data. For this purpose, information should be entered, stored, and represented in the DB in both the authentic and transformed formats.
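The idea of keeping both formats side by side can be pictured with a small data structure. The following is only a minimal sketch under our own assumptions: the class and field names (`AuthenticRecord`, `Interpretation`, `CollectionEntry`), the expert names, and the coordinates are hypothetical and are not taken from PARHOST1 or the ZIN portal. The verbatim label text is never modified, while dated interpretations, including coordinates, accumulate next to it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AuthenticRecord:
    """Primary data exactly as found in the source; frozen so it cannot be edited."""
    source_id: str                     # hypothetical identifier of a label or journal entry
    verbatim_text: str                 # label text in its original wording and language
    image_file: Optional[str] = None   # photograph of the specimen with its labels


@dataclass(frozen=True)
class Interpretation:
    """One expert's reading of an authentic record at a given time."""
    expert: str
    year: int
    locality: str
    latitude: Optional[float] = None   # decimal degrees, if the locality was georeferenced
    longitude: Optional[float] = None


@dataclass
class CollectionEntry:
    """Authentic record plus every interpretation made of it so far."""
    authentic: AuthenticRecord
    interpretations: List[Interpretation] = field(default_factory=list)

    def latest(self) -> Optional[Interpretation]:
        # Most recent interpretation, or None if the record is still uninterpreted.
        return max(self.interpretations, key=lambda i: i.year, default=None)


entry = CollectionEntry(AuthenticRecord(
    source_id="label-0001",
    verbatim_text="North foothills of Range L, 24 km northeast of settlement N, "
                  "in District P, Province S, country R.",
))
# Two hypothetical readings of the same label; the coordinates are placeholders.
entry.interpretations.append(Interpretation("Expert A", 1975, "Settlement N, District P"))
entry.interpretations.append(
    Interpretation("Expert B", 2016, "Former settlement N, now District Q", 55.0, 60.0)
)
```

Because earlier interpretations are retained rather than overwritten, a later user can still compare each reading with the unchanged label text.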
The authentic data are stored in the exact form of the primary source. They may be represented, for example, by a full quotation from a literary source in the original language or by photographs of the specimen together with its collection labels. The photographs should be made with sufficient resolution to make the contents of the labels clearly legible. The importance of storage of the authentic primary data is determined not only by the need to preserve the
original material. One more, seldom addressed problem is that the same set of primary data may be differently interpreted by different experts. The reasons for this are diverse; for example, the primary data in the labels may be incomplete, and the text itself may be illegible or even partially destroyed. Besides, as mentioned above, the finding localities may be referred to by obsolete geographic names. All these factors may result in different interpretations of the primary data by researchers, so that the existing conclusions may need to be repeatedly revised.

As a result of an expert's interpretation of the primary data, a body of secondary, or transformed, data is formed. Transformation in this case implies that the primary data are considered at the level of knowledge contemporary to the particular expert. Besides, transformation also includes unification and structuring of data, which will be discussed below. Thus, the DB should be able to store both types of information and to allow the user to compare the primary and transformed data.

INTERPRETATION AND UNIFICATION OF THE PRIMARY DATA

Interpretation of the primary data, i.e., their correlation with the contemporary names, is not always a trivial task. Processing of such information may be very time-consuming. For example, integration of data on the type specimens of 57 species and subspecies of fleas proved to be difficult because the specimens had been collected in the late XIX and in the first third of the XX centuries. The descriptions of species and information on their type series appeared in rare publications in various languages. The collection localities were designated by obsolete names of geographic objects or by names of settlements that no longer exist. The hosts were not identified to species (for example, some fleas were indicated to have been collected “off a mouse”) or were referred to by invalid and often long-forgotten Latin names. In addition, the specimens from the type series of fleas were later separated.
Besides ZIN RAS, they are presently kept in the I.G. Ioff National Collection of Fleas at the Stavropol Antiplague Research Institute and in the Hamburg Zoological Museum.

The next stage after interpretation of the primary data is their unification. This term refers to the procedure of recasting the secondary information into the unified format of the DB tables. For this purpose, the DB structure includes the substitution lists, or glossaries of names. The “free” input of names in the DB tables produces a virtually unlimited variation of their spelling in the full and abbreviated forms. For example, the names of such objects as natural areas, ranges, rivers, and especially the animal species can be worded in a variety of ways. However, without unification of the entries, the tables cannot be used for information retrieval, analysis, and integration.

THE STRUCTURING OF INFORMATION

The DB fields are heterogeneous if their cells contain not one but several elements of information. For example, data on the distribution of a species may be represented by a list of physiographic objects and administrative regions entered in one cell of the DB table. The functions of such a syncretic DB would be restricted to retrieval of individual elements, because the content of its cells is not structured.

To provide the means for analysis, the secondary information in the DB should be structured; for example, it may be represented by separate, further indivisible elements stored in separate cells of the corresponding field of the table. For instance, if we have many similar reports of “parasitic species a, b, c found on host species n, m in localities p, s during a certain period of time,” then after the sorting of elements of these reports into the fields “Parasite species,” “Host species,” “Locality,” “Time,” and “Author,” analysis will be restricted to selection of entries by the presence of a common element. For example, we may obtain the list of flea species recorded on the host n or, vice versa, the list of host species for the flea species a.

Much more complex integrations may be performed for unified elements with defined relations between them. For example, elements of the same type may be united in variously sized groups forming a hierarchical classification. In IAS PARHOST1 the relations between unified elements of the same type are defined by special classifier tables, each classifier being a hierarchical thesaurus (Medvedev et al., 2004).
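The simplest kind of structuring, with one indivisible element per cell, can be sketched with an in-memory SQLite table. This is an illustration only: the table name `finding`, its columns, and the rows (built from the placeholder symbols a, b, c, n, m, p, s used above, with an invented period and author) are assumptions and do not reproduce the actual PARHOST1 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE finding (
           parasite TEXT NOT NULL,   -- field "Parasite species"
           host     TEXT NOT NULL,   -- field "Host species"
           locality TEXT NOT NULL,   -- field "Locality"
           period   TEXT NOT NULL,   -- field "Time"
           author   TEXT NOT NULL    -- field "Author"
       )"""
)

# Each row is one atomic fact; values are placeholders, not real records.
rows = [
    ("a", "n", "p", "1990", "X"),
    ("b", "n", "p", "1990", "X"),
    ("a", "m", "s", "1991", "Y"),
    ("c", "m", "s", "1991", "Y"),
]
conn.executemany("INSERT INTO finding VALUES (?, ?, ?, ?, ?)", rows)


def fleas_on_host(host):
    """List of parasite species recorded on the given host."""
    cur = conn.execute(
        "SELECT DISTINCT parasite FROM finding WHERE host = ? ORDER BY parasite",
        (host,),
    )
    return [r[0] for r in cur]


def hosts_of_flea(parasite):
    """List of host species recorded for the given parasite."""
    cur = conn.execute(
        "SELECT DISTINCT host FROM finding WHERE parasite = ? ORDER BY host",
        (parasite,),
    )
    return [r[0] for r in cur]
```

With these rows, `fleas_on_host("n")` selects the entries sharing the host n and `hosts_of_flea("a")` does the reverse. This is exactly the "selection by a common element" described above, and it goes no further until relations between the elements are defined.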
IAS PARHOST1 contains classifiers of the taxa of fleas and their mammal and avian hosts covering their world faunas. In addition, the system includes hierarchical glossaries of political and administrative territories (3835 provinces, states, and districts and over 3565 names of regions) and physiographic objects of the world (1730 mountains, 1400 ranges, 1200 lakes, 2000 islands, 6300 rivers, etc.), together with the synonyms and spelling variants.
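How a glossary with synonyms and spelling variants supports unification can be shown with a small lookup. This is a sketch under stated assumptions: the glossary entries below are illustrative examples chosen by us, not an extract from the PARHOST1 glossaries, and the normalization rule (lower case, periods and extra spaces ignored) is a deliberately simple stand-in for whatever matching the real system uses.

```python
def normalize(name):
    """Reduce a free-text name to a comparison key (case, periods, spacing ignored)."""
    return " ".join(name.lower().replace(".", " ").split())


# Unified name -> known synonyms and spelling variants (illustrative entries only).
GLOSSARY = {
    "Saint Petersburg": ["St. Petersburg", "Petrograd", "Leningrad"],
    "Volga River": ["Volga", "R. Volga", "Wolga"],
}

# Reverse index: every normalized variant points to its unified name.
LOOKUP = {}
for unified, variants in GLOSSARY.items():
    for variant in [unified] + variants:
        LOOKUP[normalize(variant)] = unified


def unify(raw_name):
    """Return the unified glossary name, or None if the name needs expert review."""
    return LOOKUP.get(normalize(raw_name))
```

Names that return `None` are not guessed at; they remain in their authentic form until someone competent adds them to the glossary.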
All the hierarchical tables of IAS PARHOST1 are based on the ZOOCOD3 standard (Medvedev and Lobanov, 1999). The ZOOCOD standard was developed in the late 1980s by A.L. Lobanov (ZIN) for the purpose of conversion of hierarchical classifications into a flat relational table preserving all the data contained in the taxonomic list. The efficiency and flexibility of this standard make it possible to take into account the whole range of taxonomic categories including intermediate ones; to perform automated integration of data on subordinate taxa for the higher taxa; to use synonyms alongside valid names of taxa; and to introduce any level of detail in the hierarchical classification schemes.

Thus, the classifier tables are used to determine the taxonomic reference for each secondary data element. At the same time, the classifiers form the base for the entries of the factual tables. Information is entered in the factual tables of IAS PARHOST1 not in the form of arbitrary textual fragments but indirectly, by selection of the matching entry from the classifier. For example, the factual table of host-parasite associations describes the relation between a certain taxon of fleas and one of its host taxa and specifies the type of parasitic association and the source of data.

ANALYSIS OF THE STRUCTURED SECONDARY DATA

Systematics studies complex multifaceted phenomena. The specific features of structure, distribution, and trophic associations inherent in a given species or taxon are part of the diversity of the phenomena described by morphology, geography, and ecology, respectively. Structuring of secondary data with the use of classifiers reflecting the hierarchy of morphological structures, geographic objects, and types of parasitic relations allows these data to be integrated with respect to any of the above aspects and at any level of classification of each aspect (Medvedev, 2001, 2005).
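The integration of data on subordinate taxa for a higher taxon can be illustrated with a generic flat classifier table. The sketch below does not reproduce the ZOOCOD encoding, which is not described here; it assumes a plain parent-pointer layout with a separate pointer from a synonym to its valid name. The host records in `FACTS` are illustrative and are not taken from PARHOST1.

```python
# Flat classifier: (id, parent_id, rank, name, valid_id).
# valid_id is None for valid names and points to the valid taxon for synonyms.
TAXA = [
    (1, None, "order", "Siphonaptera", None),
    (2, 1, "family", "Pulicidae", None),
    (3, 2, "genus", "Pulex", None),
    (4, 3, "species", "Pulex irritans", None),
    (5, 2, "genus", "Xenopsylla", None),
    (6, 5, "species", "Xenopsylla cheopis", None),
    (7, 5, "species", "Pulex cheopis", 6),  # original combination, kept as a synonym
    (8, 1, "family", "Ceratophyllidae", None),
    (9, 8, "genus", "Nosopsyllus", None),
    (10, 9, "species", "Nosopsyllus fasciatus", None),
]
BY_ID = {t[0]: t for t in TAXA}
BY_NAME = {t[3]: t[0] for t in TAXA}

# Factual table: (flea name as entered, host name); illustrative records only.
FACTS = [
    ("Xenopsylla cheopis", "Rattus rattus"),
    ("Pulex cheopis", "Rattus norvegicus"),
    ("Pulex irritans", "Homo sapiens"),
    ("Nosopsyllus fasciatus", "Rattus norvegicus"),
]


def valid_id(taxon_id):
    """Replace a synonym by the valid taxon it refers to."""
    pointer = BY_ID[taxon_id][4]
    return taxon_id if pointer is None else pointer


def lineage(taxon_id):
    """Ids from the taxon itself up to the root of the classification."""
    chain = []
    while taxon_id is not None:
        chain.append(taxon_id)
        taxon_id = BY_ID[taxon_id][1]
    return chain


def hosts_of(taxon_name):
    """Hosts recorded for the taxon and for all of its subordinate taxa."""
    target = valid_id(BY_NAME[taxon_name])
    return sorted(
        {host for flea, host in FACTS if target in lineage(valid_id(BY_NAME[flea]))}
    )
```

A query for the genus `"Xenopsylla"` collects the records entered under both the valid name and its synonym, and a query for `"Pulicidae"` adds those of *Pulex irritans*. The factual table itself stores facts only at the level at which they were recorded.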
Since all the classifiers are built by the same principle and each of them reflects a strictly specific aspect, some complex algorithms can be applied to analysis of the factual tables. At present, the logical structure and size of IAS PARHOST1 provide the possibilities of the following types of analysis.

(1) Grouping of facts according to one specific aspect of a complex phenomenon: in particular, generation of lists of host taxa for the given taxon of fleas or lists of flea taxa associated with the given host taxa of specified ranks; generation of tables of the types of relations with different hosts for the given flea taxon or relations with different flea taxa for the given host taxon; classification of the types of ranges of species, genera, and families of fleas; generation of indices and dendrograms of faunistic similarity for 40 regions of the world and tables showing the number of taxa and composition of the regional flea faunas.

(2) Assessment of relations between two aspects of a complex phenomenon: for example, distribution of flea taxa over host taxa and vice versa; distribution of morphological features over flea taxa and the presence of particular morphological features in the given flea taxa.

(3) Assessment of relations between three aspects of a complex phenomenon: for example, distribution of flea species over host taxa within the given zoogeographic unit or distribution of host taxa by the presence of flea taxa within the given zoogeographic unit; distribution of flea taxa with a particular morphological type over host taxa.

(4) Assessment of the pathways of flea evolution by means of three-dimensional tables estimating the effect of morphological features of species and genera of fleas on the taxonomic range of their mammal and avian hosts. This procedure allows the researcher to detect the influence of a certain set of properties including the given character and its different states on the adaptive potential of the studied taxon.

THE PRINCIPLE OF STEPWISE INFORMATION PROCESSING

Our work with field and collection material as well as literary and original data on the structure, distribution, and host associations of fleas, mosquitoes, and blackflies has given us experience in the development of collection DB and data entry. Considering the large amounts of available information which could not be entered in the DB at once in its entirety, we developed the concept of stepwise data processing, with each stage focused on a relatively simple task leading to a certain intermediate result.
If the input of information in the DB tables is realized as a set of similar operations, the operator soon acquires the practical skills needed for processing extensive data. For example, the development of the classifier table of the flea taxa in the world fauna was divided into several stages yielding intermediate results. At the first stage, all the family-group taxonomic names of fleas
(from tribes to superfamilies) were entered in the table; at the second stage the family names were supplemented with the names of all the known genera and subgenera, and at the third stage the names of the genera were supplemented with those of species and subspecies. The genera which were considered the most important for the ongoing research were the first to be supplemented. Thus, the DB table had the potential of a functional information resource at any stage of its development. It accepted not only queries for the lists of taxa of a specified rank but also analytic queries about the number of genera and subgenera in tribes, subfamilies, and families, the number of taxa described in particular years, the number of taxa described by a given author in different years, etc.

Since processing of field and collection material is a virtually continuous task, the structure of the collection DB should also provide the possibility of stepwise improvement of its tables and fields. This would help keep the DB up to date and give impetus to its development. For example, identification of collected specimens is often postponed for an indefinite time due to the absence of an expert in a particular group of organisms. However, even though the material remains unidentified, the minimal information available in the field journals can and should be entered in the DB without delay. This stage is critical in the sense that it determines the fate of the primary data. Since the actual collection of material and its processing (identification and inclusion in the collections) are often separated by long periods of time, the field journals may be lost, or the data contained in them may become to a certain extent obscure to later researchers.

Considering the current research potentials, the input of taxonomic and faunistic information in the DB may be divided into four stages (Medvedev and Lyanguzov, 2003).
The goal of the first stage is to transform the data from field journals, record cards, collections, and literary sources from the manuscript form to the simplest DB. These tasks should be made sufficiently simple, so that they could be performed even by an operator with no competence in biology. At the first stage, the primary data are copied “as they are” in the original sources, i.e., labels, field journals, and collection records. Further processing involves the DB fields containing the simplest data which do not have to be emended. The operator enters the information linking the sample to a particular collection trip (for example, Novgorod Province, field season of 2003), and the inventory numbers, to a particular museum and place within its storage facilities. In addition, the total number of specimens (not yet identified to species) may be given for each sample or record unit. The indication of the collection locality should be supplemented with its geographic coordinates. The first stage may also include the different ways of interpretation of the collection locality, i.e., indication of the biotope type or its position relative to a certain physiographic object, landscape area, or administrative region.

At the second stage, the primary DB are processed by a qualified operator, whose task is to unify the tables, i.e., bring them in correspondence with the classifiers of the taxa and regions.

At the third stage, the unified information pertaining to individual specimens (samples, storage units, etc.) is aggregated by an expert biologist for the species as a whole and entered in the summary analytic tables.

At the fourth stage, the analytic tables (in IAS PARHOST1 these are the tables of relations between flea taxa and host taxa, flea taxa and geographic regions, flea taxa and characters of skeletal structures) are analyzed. Data analysis can be performed using SQL tools at all the stages, and using specially constructed queries, at the final stage.

During our work on clarification and typification of the ranges of mosquitoes, blackflies, biting midges, and horseflies in the Northwest of European Russia we had to study the primary data, namely the entries in field journals and the collection material. However, we realized that most of the material of field collections carried out over 30–40 years ago in Murmansk, Arkhangelsk, and Vologda provinces and probably in Karelia and Komi republics had been scattered or lost. Such examples of incompleteness of field data demonstrate the importance of storing data in the digital form, in databases or at least in Microsoft Excel tables (Medvedev, 2011, 2012). Excel tables are the most user-friendly to operate.
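The four stages described above can be sketched as a chain of small functions, each leaving a usable intermediate result. The sketch is purely illustrative: the field names, the two classifier extracts, and the placeholder taxa ("Flea a", "Host n") are our assumptions and do not correspond to real PARHOST1 tables.

```python
from collections import defaultdict

# Hypothetical classifier extracts: lower-case spelling variant -> unified name.
FLEA_CLASSIFIER = {"flea a": "Flea a", "fl. a": "Flea a", "flea b": "Flea b"}
HOST_CLASSIFIER = {"host n": "Host n", "h. n": "Host n", "host m": "Host m"}

TRIP = "Novgorod Province, field season of 2003"
FIELD_ENTRIES = [
    {"sample_no": 1, "trip": TRIP, "flea_text": "Fl. a", "host_text": "Host n"},
    {"sample_no": 2, "trip": TRIP, "flea_text": "flea A", "host_text": "H. n"},
    {"sample_no": 3, "trip": TRIP, "flea_text": "Flea a", "host_text": "host M"},
    {"sample_no": 4, "trip": TRIP, "flea_text": "", "host_text": "mouse"},  # unidentified
]


def stage1_copy(entries):
    """Stage 1: copy the entries as they are, with no correction or interpretation."""
    return [dict(e) for e in entries]


def stage2_unify(records):
    """Stage 2: add unified names; unresolved items stay None for later identification."""
    unified = []
    for r in records:
        u = dict(r)
        u["flea"] = FLEA_CLASSIFIER.get(r["flea_text"].strip().lower())
        u["host"] = HOST_CLASSIFIER.get(r["host_text"].strip().lower())
        unified.append(u)
    return unified


def stage3_aggregate(unified):
    """Stage 3: aggregate sample-level records for each flea species as a whole."""
    summary = defaultdict(lambda: {"hosts": set(), "samples": 0})
    for r in unified:
        if r["flea"] is None:
            continue  # unidentified samples stay in the primary table only
        summary[r["flea"]]["samples"] += 1
        if r["host"] is not None:
            summary[r["flea"]]["hosts"].add(r["host"])
    return dict(summary)


def stage4_hosts_of(summary, flea):
    """Stage 4: a simple analytic query on the summary table."""
    return sorted(summary.get(flea, {"hosts": set()})["hosts"])


raw = stage1_copy(FIELD_ENTRIES)
unified = stage2_unify(raw)
summary = stage3_aggregate(unified)
```

Sample 4 illustrates the point made earlier: it cannot yet be unified or aggregated, but its primary data are already preserved by stage 1 and can be picked up whenever an expert becomes available.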
In the initial phase, these tables may serve as a prototype DB and an efficient tool of its construction. The primary data of field collections may be gradually distributed over different columns of Excel tables, i.e., subdivided into further indivisible units of data. Although Excel data tables become too large and cumbersome when the number of columns and entries exceeds 1500, such digital means of information storage would at least partly help solve the problem of preservation of data and continuity of research. The complete solution of this problem, however, requires preservation of the collection facilities and development of the digital means of storage of the corresponding information.

CONCLUSIONS

During the last century, several hundred volumes in the Fauna of the USSR and Fauna of Russia series have been published by the Zoological Institute. These fundamental reviews, summarizing extensive data obtained from collection material and existing publications, are still relevant. Presently, however, reviews of this kind can hardly be published without supplementary electronic resources providing interactive access to data on the morphology and distribution of particular species. The collection DB and faunistic IAS act as constantly updated collectors of such data (Krivokhatsky et al., 2003). The effectiveness of their use for information retrieval and analysis is determined by the principles considered above. According to them, DB should provide the means of not only storage but also standardization and structuring of heterogeneous information on the morphology and distribution of species. Data stored in both authentic and structured formats create the basis for complex study and long-term monitoring of biological diversity.

ACKNOWLEDGMENTS

This work was financially supported by the Russian Foundation for Basic Research (grant 15-29-02457). The authors are grateful to A.V. Khalin for making the photos and to N.K. Brodskaya for help in collection of literary data and in data input.

REFERENCES

1. Krivokhatsky, V.A., Lobanov, A.L., Medvedev, G.S., Belokobylsky, S.A., Dianov, M.B., Smirnov, I.S., and Khalikov, R.G., “An Internet Based Information System on Entomological Collections,” Trudy Russkogo Entomologicheskogo Obshchestva 74, 59–70 (2003).
2. Medvedev, S.G., “An Attempt to Create a Database of the Morphology of Fleas (Siphonaptera),” Entomologicheskoe Obozrenie 80 (2), 527–539 (2001) [Entomological Review 81 (5), 511–519 (2001)].
3. Medvedev, S.G., “Distribution and Host Associations of Fleas (Siphonaptera). I,” Entomologicheskoe Obozrenie 81 (3), 737–753 (2002).
4. Medvedev, S.G., “On Systemic Analysis of the Evolution of the Order of Fleas (Siphonaptera),” in N.A. Kholodkovsky Memorial Lectures, Issue 57 (2) (2005), pp. 1–170 [in Russian].
5. Medvedev, S.G., “The Fauna of Bloodsucking Insects of the Gnus Complex (Diptera) of Northwest Russia. Analysis of Distribution,” Entomologicheskoe Obozrenie 90 (3), 527–547 (2011) [Entomological Review 91 (9), 1092–1107 (2011)].
6. Medvedev, S.G., “Problems of Development of Databases on Bloodsucking Insects (Diptera: Culicidae, Simuliidae, Tabanidae, Ceratopogonidae), by the Example of the Fauna of the Northwest of European Russia,” in Proceedings of the XIV Congress of the Russian Entomological Society (Russia, St. Petersburg, August 27 to September 1, 2012) (St. Petersburg, 2012), p. 280.
7. Medvedev, S.G., “The Palaearctic Centers of Taxonomic Diversity of Fleas (Siphonaptera),” Entomologicheskoe Obozrenie 92 (3), 684–702 (2013) [Entomological Review 94 (3), 345–358 (2014)].
8. Medvedev, S.G. and Lobanov, A.L., “Information-Analytic System of the World Fauna of Fleas (Siphonaptera): Results and Prospects,” Entomologicheskoe Obozrenie 78 (3), 732–748 (1999) [Entomological Review 79 (6), 654–665 (1999)].
9. Medvedev, S.G. and Lyanguzov, I.A., “Stages of Processing of Faunistic Information,” in Abstracts of Papers, the International Symposium “Information Systems on Biodiversity of Species and Ecosystems,” St. Petersburg, December 1–4, 2003 (Zool. Inst., St. Petersburg, 2003), pp. 17–18.
10. Medvedev, S.G., Lobanov, A.L., Lyanguzov, I.A., and Kunkova, E.V., “Information Processing by Database Facilities in Faunal and Taxonomic Studies,” Entomologicheskoe Obozrenie 83 (4), 924–936 (2004) [Entomological Review 84 (7), 825–834 (2004)].