Benchmark Investigation/Identification Project*

JEANNETTE G. NEAL, ELISSA L. FEIT and CHRISTINE A. MONTGOMERY

Calspan Corporation, 4455 Genesee Street, Buffalo, NY 14224, U.S.A. email: [email protected], (716) 631-6844
SUNY at Buffalo, C.S. Department, Buffalo, NY 14260, U.S.A. email: [email protected], (716) 636-3180
Language Systems, Inc., 6269 Variel Avenue, Suite F, Woodland Hills, CA 91367, U.S.A. email: [email protected], (818) 703-5034

* This research is supported by Rome Laboratory under Contract No. F30602-90-C0034. Jeannette G. Neal has been a Principal Scientist at Calspan Corporation since earning her Ph.D. in Computer Science from the State University of New York at Buffalo in 1985. Her research has focused on natural language processing, intelligent multi-media interfaces, and evaluation of natural language processing systems. Elissa L. Feit received her Master's in Computer Science at the State University of New York at Buffalo in 1990. She is currently at SUNY Buffalo working on her Ph.D. dissertation, tentatively titled "A Cognitive Linguistic Approach to Natural Language Understanding." Christine A. Montgomery, President of Language Systems, Inc., is a linguist whose particular NLP interest is text understanding. Her NLP career began with Russian-English MT research, and current work includes a project on machine-aided voice translation. Other work on evaluation includes comparisons of automated vs. human text extraction.
ABSTRACT: Under the Benchmark I/I program, an evaluation methodology is being developed for determining the linguistic competence of natural language processing (NLP) systems. The goal is a procedure that is, insofar as possible, independent of application, domain, and system type and that produces descriptive profiles of NLP systems. The methodology is embodied in an evaluation procedure which is based on a detailed classification of linguistic phenomena and currently includes over 450 test items. The Benchmark Procedure has been applied to three NL database query systems and two MUC-3 systems. This paper focuses on the methodology and lessons learned.

1. INTRODUCTION
As the number of applications requiring NLP technology increases, the need for reliable methods of evaluating the linguistic competence of such systems is an issue of growing concern among the producers and consumers of NLP products (Neal and Walter, 1991; Palmer, Finin, and Walter, 1989). While NLP system developers look to evaluation as a means of measuring capabilities and tracking improvements in their evolving systems, consumers are concerned with comparative evaluation of different NLP systems as a basis for selecting NLP systems that best fit the communication requirements of particular applications. NLP producers and
Machine Translation 8: 77-84, 1993. © 1993 Kluwer Academic Publishers. Printed in the Netherlands.
consumers - as well as sponsors of NLP research and development - also see evaluation as a means of assessing technical progress and growth in the field. Of the many dimensions along which NLP systems can be evaluated, this project focuses on linguistic phenomena, including lexical, syntactic, semantic, and discourse capabilities.

There are several problems in evaluating NLP systems with regard to linguistic competence. First is the need for clear definitions of the linguistic phenomena being tested and the classification scheme being used. For example, what is meant by the claim that a system handles "comparatives," "ellipsis" or "anaphoric references"? Since there are various types of each of these phenomena, do the users of these terms mean that their systems handle all types of the particular phenomenon, or just some?

Second, a problem that has occurred with most evaluation approaches to date is that they are restricted to a particular application (e.g., information extraction, database retrieval), domain (e.g., terrorism, Navy situation reports, company employee database information), and/or NLP system type (e.g., text understanding systems, interactive NL front ends to application systems). For example, a corpus-based evaluation approach is restricted to the specific domain(s) used in the corpus and the type of systems for which the corpus was designed. Examples of corpora that have been developed for evaluation of database query systems include the BBN corpus (BBN, 1988), the LADDER corpus (Hendrix et al. 1976), the Malhotra corpus (Malhotra, 1975), and the HP corpus (Flickinger et al. 1987). The recent MUC-3 (Message Understanding Conference #3) evaluation effort (Sundheim, 1991; Lehnert and Sundheim, 1991) was also restricted to a particular domain and NLP application: it used a domain restricted to terrorist activities and focused on the performance of text analysis systems for information extraction.

Application domain and system-type dependencies can cause several difficulties. First, for a domain-dependent evaluation method such as the MUCs, the task of porting NLP systems to the particular domain can be prohibitive. This may limit the number of participating systems. Also, if resources (e.g., person hours, time to accomplish the task) are lacking, the result may be a system that does not fully utilize, exercise, and demonstrate the functionality of the underlying processing components. Second, some linguistic phenomena may be ignored because they do not arise in the particular application domain or task of the evaluation exercise(s). As a result, linguistic capabilities that may be useful for other applications and for systems of the future may be ignored. This would become a significant problem if the computational linguistics research community focused too strongly on certain application domains and tasks to the exclusion of others.

Because of shortcomings such as those discussed above, the goal of the
Benchmark I/I project is to develop a procedure that is, insofar as possible, independent of application, domain, and NLP system type and is based on a database of terminology with associated definitions and a linguistic phenomena classification scheme. Read et al. (1988, 1990) also advocated this type of evaluation methodology in discussions of their work on the Sourcebook. Their Sourcebook provides a collection of exemplars of numerous types of linguistic phenomena, but it does not provide a procedure or method for evaluating systems' performance on processing these phenomena.

Thus, as part of the Benchmark Investigation/Identification (Benchmark I/I) Project (Neal et al. 1991), we are developing a methodology and procedure/tool for evaluating NLP systems that attempts to meet the goal stated above. First, we are developing a classification scheme for linguistic phenomena (it is evolving, with no claim of completeness). Second, we are developing a procedure/tool for testing a system's competence with regard to the linguistic phenomena in the classification hierarchy. This procedure/tool assists the evaluator in developing test sentences and provides for the recording of results/scores. Third, we have developed a capability for displaying a profile of evaluation scores across all the phenomena in the classification hierarchy. The profiling facility can describe a system's ability to process linguistic phenomena in terms of fairly coarse-grained, broadly defined classes of phenomena as well as in terms of detailed, fine-grained, narrowly defined classes.

The Benchmark I/I project also includes assessment of the Benchmark Procedure at the end of each of the three six-month development phases of the project. This assessment consists of having interface technologists, who are not linguists, apply the Benchmark Procedure to several well-developed NLP systems. The interface technologists have had no involvement in the development of the Benchmark Procedure, but have received detailed training in its application. The results of their assessments provide feedback to the developers of the procedure during its incremental development. Among the systems used thus far for the assessment activity are three commercially available NL database query systems and two of the MUC-3 participant systems.

2. THE CLASSIFICATION SCHEME

One of the problems with classification of linguistic phenomena is that the phenomena frequently cannot be assigned to unique categories, but should be classified in multiple categories. For example, comparatives commonly include the use of ellipsis, e.g., "Is John as old as Dave?" Should this type of linguistic phenomenon be classified as ellipsis, as comparatives, as both, or in an elliptical-comparatives class that may be a subclass of both ellipsis and comparatives?
Another problem is in determining the level of granularity in the classification scheme. For example, since at least four types of processing (lexical, syntactic, semantic, and discourse) are typically identified in NLP systems, should each of the four be separately identified for each linguistic phenomenon? Handling pronominal anaphoric references, for instance, would typically entail lexical capabilities such as recognizing/determining the lexical features of the pronoun, recognition of the pronoun as a syntactic component of the clause or phrase being processed, use of semantic knowledge about other entities in the discourse, and the use of discourse knowledge to select from among the candidate referents for the pronoun. Should each of these component capabilities be identified in the classification scheme (e.g., classes called pronoun anaphora - lexical, pronoun anaphora - syntactic, pronoun anaphora - semantic, and pronoun anaphora - discourse)?
Such a categorization scheme, however, is not appropriate when a black box approach is used, that is, when NLP systems are treated as whole systems and access to their components is not part of the evaluation. For example, if a system fails on a particular test item, such as pronoun anaphoric references, it is very difficult to determine which of the four component capabilities actually caused the processing failure. Since the current goals of the Benchmark I/I project are focused on evaluation of whole NLP systems rather than individual components, our approach has been to develop test items that address the detailed individual component capabilities to the best of our ability, while treating the systems as whole systems and not expecting to examine the inputs and outputs of any system's components. We have also opted for a tree-structured classification scheme that includes verbal pointers to other sections of the classification scheme for classes that could be classified under multiple parents.
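As an illustration only, and not the project's actual implementation, the following Python sketch shows one way such a tree-structured scheme with cross-reference pointers to classes filed under other parents might be represented; all class and field names here are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PhenomenonClass:
    """A node in a tree-structured classification of linguistic phenomena."""
    name: str                                    # e.g., "Comparatives"
    definition: str = ""                         # brief definition of the class
    children: List["PhenomenonClass"] = field(default_factory=list)
    see_also: List[str] = field(default_factory=list)  # "verbal pointers" to related
                                                        # classes filed under other parents

# Hypothetical fragment: elliptical comparatives are filed under Comparatives,
# with a pointer recording their relation to the Ellipsis class.
ellipsis = PhenomenonClass("Ellipsis", "Omission of material recoverable from context")
comparatives = PhenomenonClass(
    "Comparatives",
    "Comparison constructions",
    children=[
        PhenomenonClass(
            "Elliptical comparatives",
            'e.g., "Is John as old as Dave?"',
            see_also=["Ellipsis"],
        )
    ],
)
```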
3. THE BENCHMARK PROCEDURE

Since a goal of our Benchmark Procedure design effort is to achieve domain and application independence, insofar as possible, the procedure is being designed such that it neither includes nor relies on a particular corpus of natural language text or sentences. Instead, the test sentences or paragraphs to be processed by an NLP system are composed by the evaluator either during, or prior to, the administration of the evaluation procedure. The Benchmark Procedure is designed to assist the evaluator with the creation, modification, or tailoring of test sentences.

In order to make the Benchmark Procedure sensitive to individual linguistic capabilities, it is being developed so that each Procedure item tests just one untested NLP capability at a time, to the extent possible, and combinations are tested after the individual
capabilities are tested. The procedure is being designed to progress from very elementary sentence types containing simple constituents to more complex sentence (or sentence group) types. The idea is that each time a test sentence (group) is presented to the NLP system being evaluated, the sentence (group) should contain only one new (untested) linguistic capability or one new untested combination of tested capabilities. The other capabilities required for processing the input should already have been tested, and the NLP system should already have succeeded on them. In administering the Benchmark Procedure, the evaluator must avoid combining tests for several capabilities in the same test sentences, since the Benchmark would then be insensitive to the individual capabilities. For example, a test of ellipsis only in the context of question-answering dialogue would not be usable with a system that is not designed to handle questions (e.g., a text understanding system designed for an information extraction task, which typically processes declarative sentences, but not interrogatives or imperatives).

Each Procedure item includes:
- A brief explanation and definition of the linguistic phenomenon or capability being tested, along with any special instructions for testing.
- Patterns/descriptions that define the structure and features of the test sentences to be composed and input to the NLP system under evaluation.
- Example sentences to aid the evaluator in composing test sentences.
- A box for the evaluator's test sentences.
- A box for the evaluator's score.

For each test item in the Benchmark Procedure, the evaluator submits a NL input to the NLP system being evaluated and determines whether or not the response indicates that the system understood and processed the input correctly. The evaluator has four choices of scores to award for each test item, with the ability to split between the score types:
- Success: The system successfully processed the NL input and indicated by its response that it understood the input.
- Indeterminate: The evaluator was unable to determine whether the system understood the NL input, even after trying more than once to test the particular NL capability.
- No test input: The evaluator was unable to compose a NL test input for the procedure item (e.g., the NLP system lacks the vocabulary to express a test input).
- Failure: The system failed to understand the NL input.
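The paper does not describe the internal representation used by the procedure/tool, so the following Python sketch is only an assumption about how a Procedure item and a split score of this kind could be recorded; all field and function names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four score types named above; an evaluator may split one item's score
# across several of them.
SCORE_TYPES = ("Success", "Indeterminate", "No test input", "Failure")

@dataclass
class ProcedureItem:
    """One test item of the Benchmark Procedure (field names are illustrative)."""
    phenomenon: str                  # class in the linguistic classification hierarchy
    explanation: str                 # definition and any special testing instructions
    patterns: List[str]              # structure/features the test sentences must exhibit
    examples: List[str]              # example sentences to guide the evaluator
    test_inputs: List[str] = field(default_factory=list)   # sentences actually submitted
    score: Dict[str, float] = field(default_factory=dict)  # portions summing to 1.0

    def record_score(self, portions: Dict[str, float]) -> None:
        """Record a (possibly split) score, e.g. {"Success": 0.5, "Failure": 0.5}."""
        unknown = set(portions) - set(SCORE_TYPES)
        if unknown or abs(sum(portions.values()) - 1.0) > 1e-9:
            raise ValueError("portions must use the four score types and sum to 1")
        self.score = dict(portions)
```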
4. EVALUATION PROFILES
The Benchmark I/I Evaluation Procedure is designed to produce descriptive profiles that describe NLP systems in terms of the scores assigned for the processing of specified linguistic phenomena. The profiles are hierarchically organized according to the classification scheme discussed in Section 2 and can be viewed at any level of granularity (the levels of granularity corresponding to the number of hierarchy levels displayed). At any level other than the bottom level, the scores of the lower-level items or classes are combined in a weighted average to produce the score for the parent class or category. Table 1 shows a sample system profile consisting of only the top level of the hierarchy. The percentages are weighted averages of the scores produced for the sub-categories, and are not simple percentages based on the total number of items at the leaf nodes of the hierarchy.
Table 1. A system profile: top level only.

                            Success         Failure         No Test Input   Indeterminate
Category                    No.    WtAv     No.    WtAv     No.    WtAv     No.    WtAv
1. Basic Sentences          22     100.00   0      0.00     0      0.00     0      0.00
2. Simple Verb Phrases      7      100.00   0      0.00     0      0.00     0      0.00
3. Noun Phrases             82.5   84.94    26.5   14.26    0      0.00     3      0.80
4. Adverbials               5      100.00   0      0.00     0      0.00     0      0.00
5. Verbs & Verb Phrases     15.5   80.56    3.5    19.44    0      0.00     0      0.00
6. Quantifiers              49     58.91    23     29.73    0      0.00     8      11.36
7. Comparatives             29.5   45.00    31.5   49.00    3      6.00     0      0.00
8. Connectives              25.5   75.25    8.5    24.75    0      0.00     0      0.00
9. Embedded Sentences       3      60.00    2      40.00    0      0.00     0      0.00
10. Reference               6      50.00    4      33.33    1      8.33     1      8.33
11. Ellipsis                5      29.41    11     64.71    0      0.00     1      5.88
12. Event Semantics         19     48.85    17     43.82    0      0.00     3      7.33
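The paper does not give the exact weighting used when rolling sub-category scores up into figures such as those in Table 1, so the following Python sketch is only an assumption (each child weighted by its number of test items) of how such a roll-up could be computed; all names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SCORE_TYPES = ("Success", "Failure", "No test input", "Indeterminate")

@dataclass
class ProfileNode:
    """A category in the profile hierarchy (names are illustrative)."""
    name: str
    n_items: float = 0.0                        # number of test items scored (at leaves)
    percents: Dict[str, float] = field(default_factory=dict)   # score-type percentages
    children: List["ProfileNode"] = field(default_factory=list)

def rollup(node: ProfileNode) -> None:
    """Fill in each parent's percentages as an item-weighted average of its children.

    Weighting by item count is an assumption; the paper states only that a
    weighted average of the sub-category scores is used.
    """
    for child in node.children:
        rollup(child)
    if not node.children:
        return
    node.n_items = sum(c.n_items for c in node.children)
    node.percents = {
        s: (sum(c.n_items * c.percents.get(s, 0.0) for c in node.children) / node.n_items
            if node.n_items else 0.0)
        for s in SCORE_TYPES
    }
```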
5. LESSONS LEARNED FROM ASSESSMENT OF THE BENCHMARK PROCEDURE
As mentioned briefly in the Introduction, the Benchmark Procedure is being assessed periodically in order to provide feedback to the developers of the procedure and to determine whether the procedure is meeting its objectives. At each of the two previous assessment milestones, three interface technologists applied the procedure to each of three NLP systems, and the results were analyzed for consistency and errors. The interface technologists were also asked to comment on the individual items and on any difficulties in applying the procedure, and to suggest improvements. During the assessment activities to date, the procedure has been applied to a group of NLP systems that includes commercial systems and advanced
research prototypes that participated in MUC-3. The difficulties with the Benchmark I/I evaluation approach revealed by the assessment of the Benchmark Procedure seem to fall into several categories, including: the limitations of a black box methodology applied to whole systems (not their components), the procedure design and scoring method, the limitations imposed by the fixed forms of output that certain types of systems produce (e.g., MUC-3 templates), and the limitations imposed by the nature of the domains for which NLP systems are implemented.

A difficulty that arose from a combination of the particular template form of output produced by MUC-3 systems and the scoring choices of the Procedure occurred when testing possessive genitives with one of the MUC-3 systems. The test input was the sentence "terrorists kidnapped the mayor of Cuilapa's son." The particular MUC-3 system correctly extracted the string "the mayor of Cuilapa's son" as the descriptor of the victim, but also incorrectly determined that the victim was a government official (the mayor, one assumes), indicating a lack of understanding of the genitive construction.

Difficulties can also stem from the use of a black box methodology applied to a whole system. For example, when testing a database query system's handling of the progressive participle as a nominal, the system exhibited different behavior for the two inputs "Who does the hiring?" and "Who does the selling?" It processed the first query appropriately, but did not recognize the nominal "selling" at all. Since the procedure is black box, the evaluator had difficulty determining which of the two contradictory results was representative.

Based on lessons learned during the application and assessment of the Benchmark Procedure, we are making changes and refinements to the procedure, including: revising the scoring algorithm and choices, adding clear evaluation criteria to each procedure item, requiring multiple tests for each Procedure item, and improving evaluator training.

6. SUMMARY

This paper discusses an approach to evaluation of NLP systems that is designed to be independent of application domain and NLP system type. As part of the Benchmark I/I project, we are developing a set of well-defined terminology for describing systems' linguistic competencies, a procedure for determining the competencies, a scoring method, and a mechanism for producing hierarchically organized descriptive profiles of NLP systems. The procedure is being periodically assessed through application to several NLP systems, including both NL database query systems and MUC-3 systems. As a result of these assessment activities, the Procedure is being revised and refined.
REFERENCES

BBN Systems and Technologies Corp. (1988), Draft Corpus for Testing NL Data Base Query Interfaces, NL Evaluation Workshop, Wayne, PA.
Flickinger, D., Nerbonne, J., Sag, I., and Wasow, T. (1987), Toward Evaluation of Natural Language Processing Systems, Technical Report, Hewlett-Packard Laboratories.
Hendrix, G.G., Sacerdoti, E.D., and Slocum, J. (1976), Developing a Natural Language Interface to Complex Data, Technical Report, Artificial Intelligence Center, SRI International.
Lehnert, W. and Sundheim, B. (1991), A Performance Evaluation of Text-Analysis Technologies, AI Magazine, V. 12 (No. 3), pp. 81-94.
Malhotra, A. (1975), Design Criteria for a Knowledge-Based Language System for Management: An Experimental Analysis, MIT/LCS/TR-146.
Neal, J.G. and Walter, S.M. (1991), Proceedings of the 1991 Workshop on Evaluation of Natural Language Processing Systems, Rome Laboratory Technical Report on the Workshop held in Berkeley, CA, June 1991.
Neal, J.G., Feit, E.L., and Montgomery, C.A. (1991), The Benchmark Investigation/Identification Project: Phase I, Proceedings of the 1991 Workshop on Evaluation of Natural Language Processing Systems, Rome Laboratory Technical Report, pp. 41-69.
Palmer, M., Finin, T., and Walter, S.M. (1989), Workshop on the Evaluation of Natural Language Processing Systems, RADC-TR-89-302, RADC Technical Report on the Workshop held in Wayne, PA, December 1988.
Read, W., Quilici, A., Reeves, J., Dyer, M., and Baker, E. (1988), Evaluating Natural Language Systems: A Sourcebook Approach, COLING-88, pp. 530-534.
Read, W., Dyer, M., Baker, E., Mutch, P., Butler, E., Quilici, A., and Reeves, J. (1990), Natural Language Sourcebook, Office of Naval Research Technical Report.
Sundheim, B., ed. (1991), Proceedings of the Third Message Understanding Conference, Morgan Kaufmann Publishers.