ISSN 0147-6882, Scientific and Technical Information Processing, 2017, Vol. 44, No. 5, pp. 329–337. © Allerton Press, Inc., 2017. Original Russian Text © D.A. Devyatkin, R.E. Suvorov, I.V. Sochenkov, 2016, published in Iskusstvennyi Intellekt i Prinyatie Reshenii, 2016, No. 1, pp. 37–46.
An Information Retrieval System for Decision Support: An Arctic-Related Mass Media Case Study

D. A. Devyatkin*, R. E. Suvorov**, and I. V. Sochenkov***

Institute for Systems Analysis, Computer Science and Control Federal Research Center, Russian Academy of Sciences, Moscow, 119333 Russia
*e-mail: [email protected]
**e-mail: [email protected]
***e-mail: [email protected]

Abstract⎯This paper discusses the problem of building a comprehensive information retrieval system that facilitates decision making within a specified broad topic. We analyze the requirements for such a system, the types of information sources, and typical search queries, and propose an architecture and an integrated processing pipeline. We also present a case study in the field of Arctic exploration (oil and mining, ecological issues, etc.). The results, including vibrant topics and typical associations between entities, are presented.

Keywords: information retrieval, mass media monitoring, event detection, information extraction, relation extraction, knowledge base, decision support

DOI: 10.3103/S0147688217050033
INTRODUCTION

Information extracted from the mass media and social media is often used to inform decision making. Dedicated systems exist that address a rather narrow range of tasks, for example, supporting rescue operations or fighting crime in a specific region [1]. Because the existing approaches are specialized and analyze publications on each subject only by their frequency of appearance and their distribution over time intervals, there are no universal systems that allow complex queries. This paper explores intelligent methods to retrieve and identify descriptions of events within a given topic from information flows on the Internet, using the situation in the Arctic zone as a case study. An approach is proposed that integrates methods of intelligent data and text analysis to facilitate decision support based on the analysis of the mass media within a set topic. The novelty of the approach lies in the task we address and the integrated approach we employ. What is specific about the task is the opportunity for an expert to pose a wide range of queries to the database, for example: What events occurred in such a place and/or at such a time? Who was involved in these events? In what other events were the parties to the discussed event involved? The knowledge base integrates structured data retrieved by extracting named entities and the relationships between them with temporal and geographic references (by means of geotagging).
The sources of these information flows are the websites of newswires, blogs, social media, forums, and other Internet resources. A test data sample was formed from these sources and analyzed, and an experimental assessment of the proposed approach was conducted on this sample. Consideration was also given to the architecture of software tools for decision support based on mass media monitoring within a set topic.

RELATED WORKS

This paper belongs to the methodological literature on dedicated information retrieval systems that continuously monitor and analyze the mass media, which includes the retrieval of information on events, the actors involved (countries, organizations, and individuals), statements, agreements, affiliations, etc. The task of automated identification of events within a set topic based on information-flow analysis in the mass media was studied in [2]. That paper proposed a computationally efficient method to identify the first appearance of a target event in the mass media. The TEDAS system, which facilitates the identification and analysis of various high-profile events using Twitter posts, was described in [3]. To arrange focused data collection, this system relies on reinforcement learning. Many papers have focused on decision support for emergency response based on social media monitoring. As an example, [4] introduced a system that monitors information on emergencies using a focused web crawler. To facilitate the focused crawl and the channeling of data from information sources, the authors created an ontology of various emergencies. The Tweedr platform, which is used for the information support of rescue crews, was discussed in [5]. The system extracts named entities from Twitter feeds, including affected objects and types of emergencies. A classifier was also implemented in order to refer rescuers only to reports that mention infrastructure damage or injuries. Information retrieval from the texts of reports relies on conditional random fields [6]. The creation of decision-support systems often involves the use of subject-field ontologies [7, 8]. However, the creation of an ontology on a broad subject, such as the situation in the Arctic zone, is a nontrivial problem. In contrast to a subject area with well-defined terminology and lexis, such a theme may include information relating to several different subject areas. The broadness of such topics requires new methods of handling terminology and lexis to retrieve target information from texts. Manually created glossaries, thesauruses, and ontologies can only serve as a departure point for solving such a problem; one should take into account their incompleteness and their tendency to quickly become obsolete as new subtopics emerge and develop within the general topic. Let us consider the basic functional elements inherent in decision-support systems based on media monitoring. These elements include the subsystems for data collection, named entity extraction, and full-text search. One of the widely used approaches to focused data collection is the shark-search algorithm [9]. Its input is a search query and a set of web pages that are the origination points for information collection (seeds).
These pages are analyzed for relevance to the query and are also used to extract links to other pages. For each link, a measure of relevance to the query is calculated. The extracted links are grouped into a prioritized sequence, where the priority of each link is based on the relevance of its "parent" page and the relevance of its context. When the link relevance falls below a certain threshold, the link is omitted from the sequence, i.e., data collection in the respective direction halts. At the next step, another link is extracted from the sequence and the respective page is downloaded. The algorithm continues until the sequence is empty. Methods to improve the efficiency of this algorithm were given in [10]. Another approach to focused information collection is intelligent crawling [11]. The core of this approach is to search the original query for concepts of a pre-established subject-area ontology and to expand the query with semantically similar
concepts. The authors introduce a metric of semantic similarity, calculated from the similarity of concepts in the ontology, which is used to estimate the relevance of downloaded pages to the query. One more approach to focused information collection applies machine-learning methods, trained on large labeled sets of web pages, to predict correct transitions between pages during crawling. The use of hidden Markov models for this purpose was proposed in [12]. Quite often, it is not full pages that need to be considered, but rather their fragments (for example, articles and the accompanying comments). There are works (see, for example, [13]) where rules automatically generated from examples are used to extract relevant fragments of HTML pages. The extraction of named entities from texts often relies on supervised learning; [14] introduces such a supervised method. This approach enables high precision in entity extraction; however, it requires labeling considerable training data sets. This drawback becomes significant when entities need to be extracted from a multilingual set of documents. A partially supervised approach that avoids the requirement for labeled text corpora was introduced in [15]. It is arranged as follows. First, each word taken from Wikipedia [16] is assigned a distributed representation produced by a multi-layer neural network, which implicitly encodes the word's semantic and syntactic properties. Second, each Wikipedia article is used to extract the contexts of links to other pages; if the linked articles are identified in Freebase [17] as named entities, their contexts are marked as positive training examples for a classifier. The trained classifier can then be used for entity extraction. The most common approach to implementing full-text search is the application of inverted search indices [18].
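The inverted-index idea can be sketched as follows (a minimal illustration only; the document ids and texts are invented, and production indices such as those in [18, 19] also store positions, weights, and linguistic attributes):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Conjunctive (AND) query: ids of documents containing every query term."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "oil production on the Arctic shelf",
    2: "Arctic expedition studies climate change",
    3: "offshore oil and gas field exploration",
}
index = build_inverted_index(docs)
print(search(index, "arctic oil"))  # -> [1]
```

Query evaluation reduces to intersecting per-term posting lists, which is what makes inverted indices computationally efficient for full-text search.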
An improvement of inverted search indices that enables a computationally efficient search taking into account the syntactic and semantic relations between text elements was proposed in [19]. The paradigm of open information extraction can be useful in building a generalized information-support system for decision making. Open information extraction methods and systems are intended to extract binary relationships between entities from massive text collections, in the form of triplets 〈entity, relationship, entity〉. In this case, the set of relationships is not predetermined, and every verb is deemed a potential relationship identifier. A relationship-extraction system that employs statistically selected syntactic templates to extract triplets was considered in [20]. An approach to extracting triplets from the results of linguistic analysis, based on training a classifier that controls a finite state machine crawling the syntactic tree, was considered in [21]. The relevance of the
Fig. 1. The architecture of the information-support system for decision making. (The diagram shows a data-collection module with a crawler and a focused crawler, a data-filtering module, a linguistic analyzer, an entity extraction module, a relation extraction module, a topic-modeling module, an indexer with a full-text index, a data-integration module with a graph database and background knowledge, and a search module with a user interface.)
problem is shown by the recent rounds of public competitions on this subject, such as the competition in automated construction of the KBP (Knowledge Base Population) knowledge base [22]. The resolution of conflicts and the unification of synonymous relationships [23] is a major problem in open information extraction. Moreover, the recall and precision of relationship extraction are far from perfect, at 0.6–0.7. Therefore, despite the numerous works dedicated to specific stages of information processing in information-retrieval systems, the relevance of this paper is clear, given the lack of integrated methods and retrieval systems that offer information support for decision making within a broad topic (without being oriented toward particular types of events, as dedicated emergency-response systems are).

SOFTWARE ARCHITECTURE

The software for the information support of decision making based on media monitoring has the following major components (Fig. 1). A data-collection module downloads information resources on the target topic. Such resources can be the RSS feeds of newswires, media web sites, and popular science magazines. To download data from specialized resources only, a universal web crawler based on the Scrapy platform [24] will be used. The collection of data from broadly themed resources will be performed by a focused crawler. A filtering module will separate informative texts relevant to the given topic from auxiliary and advertising information. We propose to implement a rough classifier, which may be susceptible to type I errors (false positives) but is almost completely protected from false negatives (omissions of relevant information). A search for terminology, proper names, and keywords on the set topic can serve as such a classifier. The Exactus linguistic analyzer [25] is used to perform full semantic and syntactic analysis of texts. The result of this analysis is the representation of a text as an inhomogeneous semantic network. A named-entity-extraction module extracts various types of objects from texts, such as organizations, geographical locations, countries, military objects, natural resources, and people. We propose to solve this task with a combined method that employs both partially supervised learning [15] and additional linguistic resources (thesauruses) to extract entities such as organizations, geographical locations, people, military objects, and titles. A relationship-extraction module identifies various relationships between entities, including actions (relationships determined by verbs), attribution, and association.

Fig. 2. Major entities and relationships between them. (The diagram shows Person, Position, Organisation, Country, Military object, Resource, Location, Event, and Action entities, linked by relationships including Subordination, Jurisdiction, and Attribution.)

It is presumed that the module will use a combined algorithm that relies on the outcome of the syntactic and semantic analysis (elements of predicate structures) for initial relationship extraction and selects relationships using statistics [26]. The extracted relationships can be used either to build a fact base or for the semi-automated creation of a classifier from association rules, which will then be used to identify situations that are potentially of interest to experts. The Exactus indexer [19] processes texts and saves them to an inverted search index. The topic-modeling module extracts the main subjects of the analyzed texts. For this purpose, the latent Dirichlet allocation (LDA) method will be used [27], as well as the authors' original method, which is better suited to the analysis of large collections [28]. In addition, several iterations of clustering at different granularities will be conducted to improve the recall and precision of topic extraction from information flows. A data-integration module is responsible for aggregating the named-entity and relationship extraction results into a single database, which is a heterogeneous semantic network. The base only contains facts that can be extracted from texts directly or by matching entities to the entries of existing thesauruses (Background knowledge in Fig. 1). The main entities, their attributes, and the potential types of relationships between them are shown in Fig. 2. Not all relationships are marked where their types are obvious or depend on the situation. Dashed lines going out of the Action group of entities indicate that these entities participate in multiple relationships of various arity (at least two). At the same time, there can be multiple actions, each corresponding to a certain group of verbs that are similar in meaning and having its own range of characteristic relationships.
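As an illustration of the topic-modeling step, the following toy collapsed-Gibbs sampler sketches the LDA idea on invented token lists; it stands in for neither a library-grade LDA implementation [27] nor the authors' method [28]:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed-Gibbs LDA over a list of token lists.

    Returns the three most frequent words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Random initial topic assignment for every token, plus count tables.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]                # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic -> word counts
    nk = [0] * n_topics                                 # tokens per topic
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                # Remove the token, resample its topic, put it back.
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta) /
                           (nk[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[k], key=nkw[k].get, reverse=True)[:3]
            for k in range(n_topics)]

docs = [
    "oil gas field production shelf".split(),
    "gas oil production company shelf".split(),
    "ice climate warming expedition".split(),
    "climate ice expedition scientist".split(),
]
for topic in lda_gibbs(docs, n_topics=2):
    print(topic)
```

On a corpus this small the sampler merely illustrates the mechanics; a real run operates on thousands of documents and selects the number of topics per granularity level.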
Resources are understood to be objects that can become the subject of a conflict of interest or simply the objects of an action (oil, gas, rare species, etc.). Events are construed as meetings, conferences, contract signings, acquisitions, mergers, drilling, field exploration, etc. All relationships and all Event and Action entities are matched with the moments in time when the respective facts are thought to have occurred. Location entities are matched with geographic coordinates to represent their position on the map; the geographical tagging is done with the use of OpenStreetMap [29]. All entities and relationships have links to the documents in the full-text index where they were found. The fact base is maintained in the Titan graph database [30] coupled with Cassandra storage [31]. The advantage of this database is the organization of its internal index structures. In contrast to a relational database-management system, where a separate index covering all table entries is usually created for each query, Titan maintains several levels of indices: global indices for the retrieval of vertices and local indices for the retrieval of incident edges and neighboring vertices (stored separately for each vertex). Such an index structure enables computationally efficient deep traversals with multiple transitions, whereas a relational database-management system would experience the so-called join explosion and denial of service [32]. The search module transforms the descriptions of user-defined search criteria set in a graphical interface into a set of concrete queries to the fact base and the full-text index and merges the search results from the different sources. The Gremlin query language [33] is used for searching the fact base. All components either support concurrent operation or are specifically designed to run in a distributed environment (for example, Titan and Cassandra); therefore, the above structure enables the processing and retrieval of almost unlimited volumes of information.
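The kind of multi-hop traversal that the search module issues to the fact base (expressed in Gremlin in the real system) can be sketched over a plain in-memory triple store; the relation name and all entity and event names here are invented examples:

```python
# Toy fact base of (subject, relation, object) triples.
triples = [
    ("Gazprom", "participates_in", "field_exploration"),
    ("Rosneft", "participates_in", "field_exploration"),
    ("Rosneft", "participates_in", "contract_signing"),
    ("Norway",  "participates_in", "contract_signing"),
]

def participants(event):
    """Entities that took part in the given event (one hop backward)."""
    return {s for s, r, o in triples if r == "participates_in" and o == event}

def co_events(entity):
    """Events in which the given entity took part (one hop forward)."""
    return {o for s, r, o in triples if s == entity and r == "participates_in"}

def related_events(event):
    """'In what other events were the parties of this event involved?'"""
    out = set()
    for p in participants(event):
        out |= co_events(p) - {event}
    return out

print(related_events("field_exploration"))  # -> {'contract_signing'}
```

In Titan the same two-hop walk runs over vertex-centric edge indices instead of scanning a triple list, which is what keeps deep traversals efficient at scale.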
Fig. 3. The diagram of filling the information base of the decision-support system. (The diagram shows data collection from news feeds and research articles by a crawler and a topic crawler guided by a list of sources and thesauruses; data filtering by a classifier; linguistic analysis by the Exactus linguistic analyzer yielding a structured representation of texts; entity extraction using glossaries, types of entities, and a marked training data set; indexing into a full-text index; the building of hierarchical topic models by the latent Dirichlet allocation method; the building and filtering of association rules from rule templates in a rule-generation module; and the building of a heterogeneous network over a graph index in the data-integration module.)
THE GENERAL OPERATION ALGORITHM OF THE SOFTWARE TOOLS

Let us consider the process of extending the information bases (the full-text index and the graph database) in the discussed system (Fig. 3). Text data related to the studied subject area and collected from newsfeeds, media sites, and magazines are processed by the linguistic analyzer, which determines the morphological, syntactic, and semantic characteristics of each text element. Named entity extraction is then performed, identifying the text elements that correspond to the names of geographic locations, countries, military objects, natural resources, titles, and people. The next steps are the indexing of the text data and the extraction of the relationships that characterize the links between the entities. Topic models are generated that characterize the set of current themes in the analyzed subject area. The elements of the resulting knowledge base are then saved to the graph database.

The general algorithm of operation is quite typical for such systems; its primary distinction is the integration of intelligent crawling methods, deep text-data analysis, statistical rule-generation algorithms, and contemporary systems for storing large volumes of data.

A PRELIMINARY STUDY OF THE ARCTIC-RELATED MEDIA SITUATION

The main goal of the preliminary study is to map out the main elements of the media situation related to the Far North and the Arctic zone: stories, typical lexis, actors, etc. Another goal is to test the proposed architecture and algorithms. The method of the study partially echoes the described framework of the system; for a more expedient result, some processing stages were simplified (for example, syntactic and semantic analyses were not conducted). As part of the preliminary study, over 18000 documents were collected, in both Russian (14000) and English (4000). The sources were dedicated news, analytical, and scientific information portals, which made it possible to skip the filtering stage (the assumption was that the documents on these sites were related to the set subject area). The linguistic analysis of the documents included segmentation; the extraction of the names of geographic objects, organizations, and people with the Polyglot library [15]; lemmatization and the retrieval of substrings with the Exactus linguistic analyzer [25]; and the extraction of further entities (titles, countries, resources, military objects, and events) by a dedicated algorithm. The algorithm uses glossaries, lemmatization, and the retrieval of substrings using the
Aho–Corasick algorithm [34]. The glossaries were compiled partially with the involvement of experts, who set the reference lexis or selected existing resources useful for building such glossaries. Some glossaries (events, military objects, resources, and titles) were expanded with words that are close in their contexts of use by an algorithm [35] trained on Wikipedia. The tagging of the lexical elements was followed by filtering: lexical units that are part of named entities were not included as separate items in the resulting set. After processing, each document was represented as a sequence of references to entities. As an example, the sentence "The Defense Minister of the RF plans to assign nearly 7.19bn rubles to Alexandra Land for the construction of an airfield and stationary objects for the forces" was transformed into NOUN-ruble COUNTRY-Russia VERB-assign I-ORG-rf_defense_ministry POSITION-minister VERB-plan RES-island NOUN-billion NOUN-construction MIL_OBJ-forces MIL_OBJ-airfield I-LOC-alexandra_land NOUN-objects ADJ-stationary. The analysis of the available documents yielded samples of 13 576 sets (each representing the elements of an individual document) with a total count of more than 2 million elements for the Russian language, and 4315 sets of over 600000 elements for the English language. The data sets thus derived were used to build topic models at granularities of 2, 5, 10, 20, 50, and 100 topics. Below are the sets of the most relevant keywords with their most probable interpretations (at a granularity of ten topics):

⎯Hydrocarbon production in the Yamal-Nenets autonomous okrug and offshore in the Arctic Ocean: месторождение (field), компания (company), проект (project), нефть (oil), газ (gas), строительство (construction), добыча (production), спг (lng), миллион (million), газпром (gazprom), договор (contract), миллиард (billion), тонна (ton), янао (yanao), роснефть (rosneft), шельф (shelf).
⎯Transport issues, the Northern Sea Route: морской (marine), северный порт (northern port), путь (route), ледокол (ice-breaker), судно (ship), груз (cargo), навигация (shipping season), река (river), миллион (million), рыба (fish), атомный (nuclear), Сабетта (Sabetta), Севморпуть (Northern Sea Route), инфраструктура (infrastructure), флот (fleet), танкер (tanker ship).

⎯Public steps to support northern regions and indigenous peoples: арктический (arctic), развитие (development), народ (people), международный (international), регион (region), коренной (indigenous), форум (forum), сотрудничество (collaboration), совет (council), президент (president), губернатор (governor), проблема (problem), Россия (Russia), конференция (conference), правительство (government), малочисленный (low-numbered).

⎯Scientific expeditions to assess the scope of climate change: лед (ice), ученый (scientist), станция (station), арктический (arctic), исследование (study), климат (climate), полярный (polar), изменение (change), температура (temperature), потепление (warming), площадь (area), океан (ocean), таяние (melting), экспедиция (expedition), глобальный (global).

⎯Government support programs and infrastructure development: рубль (ruble), Якутия (Yakutia), чукотский (Chukotka), программа (program), правительство (government), округ (okrug), развитие (development), военно-транспортный самолёт (military transport aircraft), миллион (million), аэропорт (airport), Чукотка (Chukotka), средства (money), работа (work), связь (communications), бюджет (budget), вертолёт (helicopter), авиация (aviation), инфраструктура (infrastructure).

⎯Tourism development in the European north: проект (project), туризм (tourism), фестиваль (festival), культура (culture), конкурс (competition), пройти (be held), музей (museum), мурманская область (murmansk oblast), международный (international), архангельская область (arkhangelsk oblast), парк (park), выставка (exhibition), программа (program).

⎯Marine scientific expeditions: экспедиция (expedition), судно (ship), северный (northern), море (sea), исследование (study), ледокол (icebreaker), работа (work), полюс (pole), ученый (scientist), российский экипаж (russian crew), земля (land), научно-исследовательский (research), Новая земля (Novaya zemlya), Карский (Kara).

⎯Rare species protection issues: медведь (bear), белый (polar), животное (animal), олень (deer), вид (species), wwf, природа (nature), человек (human), территория (territory), популяция (population), морж (walrus), оленевод (deer herder), птица (bird).

⎯Deployment of the Arctic forces group: остров (island), северный флот (northern fleet), земля (land), сила (force), военный корабль (military watercraft), оборона (defence), архипелаг (archipelago), новый (new), пресс-служба (press office), территория (territory), учения (drills), вооружение (weapons).

⎯Political interaction over offshore area ownership: шельф (offshore shelf), российский (russian), страна (country), северный морской (northern sea), компания (company), США (USA), освоение (development), проект (project), ресурс (resource), море (sea), Норвегия (Norway), международный (international), вопрос (issue), развитие (development), нефть (oil), совет (council), континентальный (continental), граница (border), министр (minister).

A similar list of topics for the English language (at a granularity of two topics):
⎯Hydrocarbon production, military presence, and the Northern Sea Route: oil, russia, company, drill, Norway, Barents, project, region, minister, take, part.

⎯Ecological issues: ice, polar, climate, change, scientist, world, research, global, Jeffrey Gleason, bear, warm, climate change.

Topic analysis at other granularities confirms the adequacy of this mapping of the media situation by such themes (for the Russian language). The volume of information collected in English proved to be much smaller than that in Russian; therefore, a more detailed topic analysis returned repeated themes (oil production, the Northern Sea Route, military presence, and the Arctic Sunrise conflict). Thus, the international media paid more attention to conflict issues than to the development of infrastructure, tourism, or environmental protection, although the focal areas in general are quite similar. Besides the topic analysis based on the derived sets, association rules were generated in order to establish the types of relationships between entities, actions, etc. The Apriori algorithm was used for this purpose [26]. The parameters (the minimum coverage and confidence levels) were set by an exhaustive search with an exponentially varied step, starting from 0.9 for coverage and 0.9 for confidence, with steps of 0.5 and 0.7, respectively. The search ended when either the minimum acceptable level was reached for coverage (0.0001, which corresponds to ten documents for the Russian language) or for confidence (0.55), or when the number of generated rules exceeded 1000. Overall, 2533 rules were generated for the Russian-language materials and 2832 for the English-language materials.
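The support/confidence filtering that Apriori performs can be sketched with a brute-force enumeration (illustrative only: real Apriori prunes candidate itemsets level by level, and the per-document tag sets below are invented):

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """Enumerate association rules X -> Y that meet the support and
    confidence thresholds (brute force over all itemsets)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        s = set(itemset)
        return sum(1 for t in transactions if s <= t) / n

    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            sup = support(itemset)
            if sup < min_support:
                continue  # infrequent itemset: no rules from it
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    conf = sup / support(lhs)
                    if conf >= min_confidence:
                        rhs = tuple(i for i in itemset if i not in lhs)
                        rules.append((lhs, rhs, round(conf, 2)))
    return rules

docs = [  # each document reduced to its set of entity tags
    {"RES-oil", "RES-shelf", "EVENT-production"},
    {"RES-oil", "RES-gas", "EVENT-production"},
    {"RES-oil", "EVENT-production"},
    {"EVENT-expedition", "POSITION-scientist"},
]
rules = mine_rules(docs, min_support=0.5, min_confidence=0.9)
print((("RES-oil",), ("EVENT-production",), 1.0) in rules)  # -> True
```

Lowering the thresholds, as described above, admits progressively rarer co-occurrences until the rule count or the minimum acceptable levels are reached.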
Below are some examples of the generated rules:
⎯RES-oil RES-shelf COUNTRY-Russia RES-hydrocarbon RES-field ~ EVENT-production
⎯POSITION-scientist MIL_OBJ-ship EVENT-study ~ EVENT-expedition
⎯RES-oil RES-gas I-ORG-Gazprom ~ EVENT-production RES-field
⎯MIL_OBJ-station I-LOC-Murmansk ~ EVENT-expedition
⎯EVENT-expedition RES-wrangel_island POSITION-scientist COUNTRY-Russia ~ EVENT-study
⎯MIL_OBJ-exploration I-ORG-Rosneft ~ RES-shelf
⎯EVENT-meeting COUNTRY-USA ~ COUNTRY-Russia
⎯EVENT-exploration I-ORG-shell RES-climate I-LOC-Alaska ~ EVENT-drill
⎯MIL_OBJ-ship EVENT-arrest EVENT-protest ~ I-ORG-Greenpeace
⎯LOC-Murmansk EVENT-drill EVENT-protest ~ I-ORG-Greenpeace
⎯RES-oil RES-shelf COUNTRY-Russia RES-hydrocarbon RES-field ~ EVENT-production
⎯POSITION-scientist MIL_OBJ-ship EVENT-research ~ EVENT-expedition
⎯RES-oil RES-gas I-ORG-Gazprom ~ EVENT-production RES-field
⎯MIL_OBJ-station I-LOC-Murmansk ~ EVENT-expedition
⎯EVENT-expedition RES-wrangel_island POSITION-scientist COUNTRY-Russian ~ EVENT-study
⎯MIL_OBJ-exploration I-ORG-Rosneft ~ RES-shelf
⎯EVENT-meeting COUNTRY-USA ~ COUNTRY-Russia
⎯EVENT-exploration I-ORG-shell RES-climate I-LOC-Alaska ~ EVENT-drill
⎯MIL_OBJ-ship EVENT-arrest EVENT-protest ~ I-ORG-Greenpeace
⎯I-LOC-Murmansk EVENT-drill EVENT-protest ~ I-ORG-Greenpeace

Such rules can be used both to generate templates for the further detection of situations and to extend the fact base. For the latter, for example, an Action entity can be added for each verb or EVENT-tagged element, with relationships to the entities corresponding to the other elements of the rule. In doing so, it is reasonable to generate rules for limited overlapping time periods (the sliding-window principle).

CONCLUSIONS

This paper provides an analytical overview of the methods and systems of information support for decision making based on media analysis for a set problem. Focused data-collection methods are discussed, as well as methods of extracting information from natural-language texts, open information extraction, and relationship extraction. An approach is proposed to develop an integrated method and experimental software relying on intelligent data and text analysis for the information support of decision making, and the architecture of potential software implementing this approach is described. A preliminary analysis of the media situation was conducted and its goals were successfully reached. The viability of the proposed architecture and combination of algorithms was demonstrated. The major relevant themes were identified, as well as some association rules describing typical situations.
Further research will develop experimental software for the information support of decision making, which will integrate the methods of full-text search and ad hoc analysis of knowledge bases. There are also plans to develop a method for, and to carry out, a full-fledged experimental assessment of the quality of
the solution to the ad hoc search task with the involvement of experts in the subject area.

ACKNOWLEDGMENTS

This paper was supported by the Russian Foundation for Basic Research as part of research projects nos. 15-29-06031 ofi-m and 15-29-06082 ofi-m.
REFERENCES

1. Imran, M., et al., Processing social media messages in mass emergency: A survey, ACM Comput. Surv., 2015, vol. 47, no. 4, p. 67.
2. Petrovic, S., Real-Time Event Detection in Massive Streams, 2013.
3. Li, R., et al., Tedas: A twitter-based event detection and analysis system, 2012 IEEE 28th International Conference on Data Engineering (ICDE), 2012, pp. 1273–1276.
4. Zheng, L., Shen, C., Tang, L., et al., Disaster SitRep – A vertical search engine and information analysis tool in disaster management domain, Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI), 2012, pp. 457–465.
5. Ashktorab, Z., Brown, C., Nandi, M., and Culotta, A., Tweedr: Mining Twitter to inform disaster response, Proceedings of ISCRAM, 2014, pp. 354–358.
6. Liu, X., Zhang, S., Wei, F., and Zhou, M., Recognizing named entities in tweets, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2011, pp. 359–367.
7. Bhattacharya, A., Tiwari, M.K., and Harding, J.A., A framework for ontology based decision support system for e-learning modules, business modeling and manufacturing systems, J. Intell. Manuf., 2012, vol. 23, no. 5, pp. 1763–1781.
8. Rao, L., Mansingh, G., and Osei-Bryson, K.M., Building ontology based knowledge maps to assist business process re-engineering, Decis. Support Syst., 2012, vol. 52, no. 3, pp. 577–589.
9. Hersovici, M., et al., The shark-search algorithm. An application: Tailored web site mapping, Comput. Networks ISDN Syst., 1998, vol. 30, no. 1, pp. 317–326.
10. Chen, Z., et al., An improved shark-search algorithm based on multi-information, IEEE Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), 2007, vol. 4, pp. 659–658.
11. Su, C., et al., An efficient adaptive focused crawler based on ontology learning, IEEE Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 2005, p. 6.
12. Liu, H., Janssen, J., and Milios, E., Using HMM to learn user browsing patterns for focused web crawling, Data Knowl. Eng., 2006, vol. 59, no. 2, pp. 270–291.
13. Blanvillain, O., Kasioumis, N., and Banos, V., BlogForever Crawler: Techniques and algorithms to harvest modern weblogs, Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), ACM, 2014, p. 7.
14. Florian, R., et al., Named entity recognition through classifier combination, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, vol. 4, pp. 168–171.
15. Al-Rfou, R., et al., Polyglot-NER: Massive multilingual named entity recognition, Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, 2015.
16. Wikipedia. http://wikipedia.org. Cited January 20, 2016.
17. Bollacker, K., et al., Freebase: A collaboratively created graph database for structuring human knowledge, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.
18. Manning, C.D., et al., Introduction to Information Retrieval, Cambridge: Cambridge University Press, 2008, vol. 1, p. 496.
19. Sochenkov, I.V. and Suvorov, R.E., Services of full-text search in the information-analytical system (Part 1), Inf. Tekhnol. Vychisl. Sist., 2013, no. 2, pp. 69–78.
20. Takase, S., Okazaki, N., and Inui, K., Fast and Large-Scale Unsupervised Relation Extraction, 2015.
21. Angeli, G., Premkumar, M.J., and Manning, C.D., Leveraging linguistic structure for open domain information extraction, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL, 2015, pp. 26–31.
22. TAC Knowledge Base Population, NIST Information Technology Laboratory, 2015. http://www.nist.gov/tac/2015/KBP/. Cited January 20, 2016.
23. Hoffmann, R., et al., Knowledge-based weak supervision for information extraction of overlapping relations, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, vol. 1, pp. 541–550.
24. Scrapy. A Fast and Powerful Scraping and Web Crawling Framework. http://scrapy.org/. Cited January 20, 2016.
25. Osipov, G., et al., Relational-situational method for intelligent search and analysis of scientific publications, Proceedings of the Integrating IR Technologies for Professional Search Workshop, 2013, pp. 57–64.
26. Agrawal, R., Imielinski, T., and Swami, A., Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, vol. 22, pp. 207–216.
27. Blei, D.M., Probabilistic topic models, Commun. ACM, 2012, vol. 55, no. 4, pp. 77–84.
28. Devyatkin, D.A., Suvorov, R.E., and Sochenkov, I.V., A method of thematic clustering of large-scale collections of scientific and technical documents, Inf. Tekhnol. Vychisl. Sist., 2013, no. 1, pp. 33–42.
29. Haklay, M. and Weber, P., Openstreetmap: User-generated street maps, Pervasive Comput., 2008, vol. 7, no. 4, pp. 12–18.
30. Titan: Distributed Graph Database, DataStax, 2016. http://thinkaurelius.github.io/titan/. Cited January 20, 2016.
31. Lakshman, A. and Malik, P., Cassandra: A decentralized structured storage system, ACM SIGOPS Oper. Syst. Rev., 2010, vol. 44, no. 2, pp. 35–40.
32. Joishi, J. and Sureka, A., Vishleshan: Performance Comparison and Programming Process Mining Algorithms in Graph-Oriented and Relational Database Query Languages, 2015.
33. Rodriguez, M.A., The Gremlin graph traversal machine and language (invited talk), Proceedings of the 15th Symposium on Database Programming Languages, 2015, pp. 1–10.
34. Aho, A.V. and Corasick, M.J., Efficient string matching: An aid to bibliographic search, Commun. ACM, 1975, vol. 18, no. 6, pp. 333–340.
35. Al-Rfou, R., Perozzi, B., and Skiena, S., Polyglot: Distributed word representations for multilingual NLP, arXiv preprint arXiv:1307.1662, 2013.

Translated by N. Bokareva