Computers and the Humanities, Vol. 12, pp. 3-12 (1978).
Pergamon Press. Printed in the U.S.A.
0010-4817/78/010003-10$02.00/0 Copyright © 1978 Pergamon Press, Inc.
Some Considerations Concerning Encoding and Concording Texts
MICHAEL J. PRESTON and SAMUEL S. COLEMAN
Diverse needs exist among those who use computers in humanities research and instruction. That diversity ought to be respected, even encouraged, and yet at times it seems almost as if the reverse were common practice. Not long ago William Ingram pointed out that "a fairly clear set of desiderata [regarding concordances] has been in existence" for a number of years. 1 His statement is similar to that of Roy Wisbey a decade earlier: "The actual situation in which an individual compiler finds himself is in part dependent on the expectations of scholars in his field." 2 Our view is that the "expectations" of those in a traditional academic discipline--for example, German--differ substantially from the "set of desiderata" of those who have worked with computers for a number of years. The difficulty, of course, is that significance in one area is not necessarily significance in another. A computationally simple project may be of importance in a particular discipline and should not be assigned cavalierly to the computational hell of the "trivial." Analogously, experimental research may yield computational results with minimal application in a traditional discipline, but show promise of further development and eventual broad utility. Multiple and perhaps even contradictory viewpoints ought to be characteristic of those who use computers in humanistic research because the technology is being used for so many different purposes; this attitude is particularly encouraged by the current subsiding of the old fears about computers intruding into the humanities. Indeed, a kind of backlash in favor of computer research in the humanities seems to present the greatest danger--that of complacency--especially for
those of us at universities where there is a substantial group of computing humanists.
Unquestionably the best starting point for any project, as well as for this discussion of encoding and concording texts, is Wisbey's observation that it "appears sensible to adapt one's approach to the needs of a particular text," 3 being guided primarily by one's intimate knowledge of that text. This requires a much more flexible approach to humanistic computing than is usual. Too often we are dominated by what others have done. This reinforces the negative view of concordance-makers, common enough in the profession, which was recently formulated with obvious scorn: "There are many useful jobs which can be done by men who do not like to think. They can dig ditches, clean automobiles, and compile concordances." 4 Certainly professional "expectations" are clearly secondary to the requirements of a particular research project, as are the financial, mechanical, and technical obstacles which must be overcome as best one can in our less-than-ideal world.
The beginning of any project, once the flush of inspiration has passed, is the planning and implementation of what are generally rather rigid steps of action. One can get caught up in dreams of automatic scanners, thus putting the burden of data entry on a mechanical clerk, but this is so far from practical even for modern texts that those of us with primary interests in older or more problematic texts had best leave it as an intriguing possibility and set about the laborious business of getting our texts into machine-readable form. What in a text we want or need to encode depends in great part upon what we intend to do. 5
Michael J. Preston is Director of the Center for Computer Research in the Humanities (CCRH) at the University of Colorado at Boulder. Samuel S. Coleman is Computer Scientist for the Lawrence Livermore Laboratory in California and consultant to CCRH.
Certainly, whether working with a printed book or a manuscript, one must be faithful to one's text. If an edited text, it certainly must be the best edition, because anything less as a starting point means that one's work, however time-consuming and otherwise careful, is questionable to the extent that the text is flawed. 6 Having begun with a sound text, one must encode only to the extent that encoding will affect the results of the project. Unfortunately, it is extremely difficult to foresee all the uses to which machine-readable data will eventually be put. In 1968 nearly half the data base for the Concordance to the Middle English Shorter Poem 7 was keypunched (150,000 words of text), with just those graphic features encoded which could be represented on an extremely limited Control Data Model 501 line-printer. It required nearly six months to rectify that mistake. A mistake of such magnitude ought never be repeated, and yet the recurrent theme in informal discussions of encoding is how little is sufficient. If one is producing data for oneself, one can be as idiosyncratic as one pleases, but data so prepared is usually not worth borrowing. We agree fully with J.B. Smith that a professional attitude requires that one think of others as well as one's own immediate needs, especially since it rarely takes more than an additional ten percent effort to encode with reasonable thoroughness. 8
Specifically, how a text is encoded is a relative matter as long as it is thorough. Smith noted that there are always "trade-offs," 9 but this need not be an overriding concern when one knows his text well enough to determine what are the more important and what are the less important features. As long as a text is encoded so that it can be mechanically reproduced in a way that does not differ substantially from the original, it is adequate.
The method of actual data input differs considerably from institution to institution and from individual to individual. 10 Here is one area in which the economics of our institutions affect our work. Despite the growing use of terminals, we still rely heavily on keypunch machines, primarily because they are so readily available. This is reinforced by so many supposedly impossible accidents having happened over the years that keeping hard-copy backup data seems desirable until the more convenient means of storage become as reliable as they are claimed to be. 11 In the preparation of our data, one student
keyboards a text; this is proofread by a different student. The editor then keyboards the same text himself in the conviction that the closer one is to one's text on all levels the less probability there is for undetected anomalies. These decks are then verified with a low-level collating program. Working from the output, the editor "cannibalizes" the two decks to produce one nearly perfect deck; approximately 0.2% of the corresponding cards contain errors and require manual correction. The corrected deck is then printed on a "daisy-wheel" printer to reproduce the original text as closely as possible. Rarely are errors found at this stage, but those few tend to be of the most glaring nature, usually brought to light by the conversion from the input conventions to those which make full use of the daisy-wheel printer's graphic capabilities. After these corrections, a text as large as 100,000 words contains at most one or two errors. This level of accuracy is humanly possible when a text is input twice; repeated proofreadings of a text input but once generally do not result in similarly accurate data.
The set of conventions employed should be, as Smith points out, "well-defined and as unambiguous as possible . . . [but] the simplest set that gets the job done." 12 We prefer a single-character-for-single-character set of conventions, using escape characters and substitutions for differences in case, face, and kind. Thus we may use a +W for an uppercase w and a +9 for an uppercase p; the computer converts the simpler convention for keypunching (+9) to a more complex convention (+*9) for processing. It seems there ought to be a distinction made between how a text is initially encoded (simple and unambiguous within the text itself) and how it is encoded for processing (as complicated as necessary), particularly if one intends to combine data at a later time.
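The verification of two independently keyed decks, mentioned earlier in this section, can be illustrated with a short sketch. It is written in Python purely for illustration and is not the collating program the authors used; the names `collate` and `error_rate` and the sample cards are hypothetical, and the sketch assumes that the two decks correspond card for card.

```python
from itertools import zip_longest


def collate(deck_a, deck_b):
    """Compare two keyings card by card.

    Returns a list of (card_number, card_a, card_b) for every pair
    of cards that disagree. Trailing blanks are ignored, since they
    carry no information on a punched card.
    """
    mismatches = []
    pairs = zip_longest(deck_a, deck_b, fillvalue="")
    for number, (a, b) in enumerate(pairs, start=1):
        if a.rstrip() != b.rstrip():
            mismatches.append((number, a, b))
    return mismatches


def error_rate(deck_a, deck_b):
    """Fraction of card pairs that disagree and need manual correction."""
    total = max(len(deck_a), len(deck_b))
    return len(collate(deck_a, deck_b)) / total if total else 0.0


# Hypothetical sample: the second keying has a slip in card 2.
student = ["CHE SENTE ME *8E CHERYE", "*7 SO CHE DED *8E DOWE"]
editor = ["CHE SENTE ME *8E CHERYE", "*7 SO CHE DED *8E DOUE"]

for number, a, b in collate(student, editor):
    print(f"card {number}: {a!r} / {b!r}")
```

The editor would then resolve each reported pair against the source text, which is the "cannibalizing" step; the sketch only locates the disagreements.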
For the more complex conventions, one might reserve various escapes for different degrees--say, a plus for a change in case, a not-equal sign for a change in face, and an asterisk for a different "kind" of letter, such as a ȝ. Ambiguous uses of punctuation, such as the period following an abbreviation, can be input as an escape-period sequence and thus resemble a period while remaining distinguishable. An alternative would be to input a unique symbol and convert it to an escape-period. One can string escape characters together, such as for an italicized, superscript m, and thus represent virtually any graphic feature of a text without undue strain on those preparing the data.
45                                              ==+X+I+V
I Haue a ȝong suster fer be-ȝondyn þe se,       +≠I +HAUE A *3ONG SUSTER FER BE-*3ONDYN *8E SE,
many be þe drowryis þat che sente me.           MANY BE *8E DROWRYIS *8AT CHE SENTE ME.
che sente me þe cherye with-outyn ony ston,     CHE SENTE ME *8E CHERYE WITH-OUTYN ONY STON,
& so che ded þe dowe with-outyn ony bon.        *7 SO CHE DED *8E DOWE WITH-OUTYN ONY BON.
sche sente me þe brer with-outyn ony rynde,     SCHE SENTE ME *8E BRER WITH-OUTYN ONY RYNDE,
sche bad me loue my lemman with-oute longgyng.  SCHE BAD ME LOUE MY LEMMAN WITH-OUTE LONGGYNGE.
Figure 1. First twelve lines from "I have a ȝong suster."
Many of our views concerning encoding are embodied in our newly operational concording system, UNICORN. Written entirely in CDC FORTRAN IV, it has replaced our library of more restricted concordance-generators and text formatters. UNICORN can handle virtually any machine-readable text, but various options require particular input conventions of a somewhat general nature. Figure 1 shows the first twelve lines of "I haue a ȝong suster" from Rossell Hope Robbins' edition in Secular Lyrics of the XIV and XV Centuries 13 and, to the right, the text encoded with complex processing conventions. A text line per input line is assumed, as is a standard of one space between words. Excess spaces are squeezed out, but desired additional spaces may be retained by inputting escape-blank sequences. Continuation lines, in practice rather rare, are indicated by an equal sign anywhere to the right of the text data on the preceding line. Words are not to be split and hyphenated words are carried up to the preceding line. Text identifiers are imbedded in the text on separate lines preceding the section of text to which they apply; they are signalled as such by equal signs in columns one and two. With each text identifier, the line-count is reset to one unless the text identifier is one equal sign and one asterisk to signal continuous lineation. To allow for anomalies, line count can be changed at any point by an equal sign in column one followed immediately by the desired line number and the text line; an equal sign followed by a plus and a
number boosts the line count above what would be anticipated by the amount of the number. Despite the apparent complexity of imbedding this kind of information rather than making use of tables, numerous frustrating errors have convinced us of its desirability; and the more substantial one's text size, the more desirable it is to have the information imbedded. Additionally, since printing routines are, in the most obvious sense, dependent upon page or section specification and line number, any error of this nature becomes immediately apparent.
A convenient input convention is the virgule as a boundary-marker. This can be used as a sense-boundary for prose, or a verse-delimiter for verse either printed or written as prose. Since it can be automatically converted to a blank, it can be ignored when desired, or it may be carried along but treated as a blank, or used as a context-delimiter for a concordance. In output we make a similar use of the virgule as an optional line-end marker for those instances in which a line-boundary may be considered significant.
Although we have discussed in general the encoding of the graphic features of texts, there remains the debatable question of whether one should encode other kinds of information--for example, parts of speech--in addition to the boundaries which we have already advocated for particular classes of situations. Where this must be done on a word-for-word basis, we are rather against it in the initial inputting of a text despite Fr. Busa's recent opinion to the contrary. 14 We have found it extremely difficult to be consistent
on any level beyond the simple discrimination between noun and verb. There is the additional difficulty that coding on input gives one no more than what was put in and that is a direct reflection of one's preconception of what was worth encoding. On the other hand, once a rough concordance to a lexically uncoded text has been generated, one has all occurrences of the same form grouped together. It is then possible to tag the unambiguous graphic types rather than each token. When there is ambiguity, one has at hand all occurrences of the same form in the text which, as Wisbey noted, helps resolve those "ambiguities [which] cannot be passed over in silence as in a dictionary or in a commentary." 15
Discussions of concordances seem to us to be always somehow marred by the assumption that everyone knows what a concordance is. In practice there is a tendency to blur distinctions among concordances, dictionaries, and indexes, so it is perhaps best to define a concordance as an alphabetical compilation of all forms of the words occurring in a given text, complete with contexts quoted and locations given. If the contexts are omitted, such a compilation is an index. If words are defined, the compilation is a dictionary. These should be distinguished from an alphabetical list of words which is sometimes called a "dictionary." One often encounters concordances with lexically related forms grouped together and with indexed or even omitted entries for high frequency word forms. This practice of producing lexical concordances-cum-indexes may well result in the ideal reference work for a particular author or for a particular research task, but in another set of circumstances, both a dictionary and a concordance may be desirable. As in the encoding of a text, one must adapt one's approach to the situation.
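The distinction drawn above between a concordance (forms, contexts, and locations) and an index (forms and locations only) can be made concrete with a small sketch. It is a minimal Python illustration, not a description of UNICORN; the function names, the tokenizing pattern, and the use of plain code-point order for the alphabetical arrangement are all assumptions made for the example.

```python
import re
from collections import defaultdict


def concordance(lines):
    """Map each graphic word form to its occurrences.

    Each occurrence is (line_number, full line), so the verse line
    serves as the quoted context. Forms are not lemmatized: this is
    an unparsed compilation of forms exactly as they appear.
    """
    entries = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Letters and internal hyphens count as part of a form.
        for form in re.findall(r"[\w-]+", line.lower()):
            entries[form].append((number, line))
    return dict(sorted(entries.items()))


def index(lines):
    """The same compilation with contexts omitted: locations only."""
    return {form: [number for number, _ in occurrences]
            for form, occurrences in concordance(lines).items()}


text = [
    "che sente me þe cherye with-outyn ony ston,",
    "& so che ded þe dowe with-outyn ony bon.",
]

for form, occurrences in concordance(text).items():
    for number, context in occurrences:
        print(f"{form:<12}{number:>3}  {context}")
```

Because the sort is by code point, þ falls after z here; a real concording system would need an explicit collating sequence for such letters.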
Historically concordance-makers have generally attempted to separate lexically disparate but graphically identical forms and to combine lexically related but graphically distinct forms. John Donne's uses of protests in "And unto her protests protests protests" (Satyre IV, 212) and Chaucer's use of taille in "I am youre wyf; score it upon my taille" (Shipman's Tale 416) present interesting problems for the traditional concordance-maker. There is no difficulty in determining the primary meaning or function of either word form, but in a fully parsed concordance the shades of meaning and function are lost by the necessary rigidity of the practice.
Similarly, to intermingle the occurrences of ev'ry with every, and against with 'gainst in a concordance to Ben Jonson seems an almost willful turning of one's back on Jonson's interest in and use of metrics. 16 Particularly for medieval and Renaissance authors, this seems a dangerously anachronistic practice which reveals the lexical interests of most compilers of concordances. Our approach, particularly when there are word forms which we do not understand, some of which we are certain no one understands, has been to compile "brute force" concordances which, if they may be objected to for not being parsed, at least have the advantage of presenting all the textual evidence in a form in which anyone willing to take the trouble can find what he needs. Picking up the theme of the epic struggles of the old concordance-makers, particularly for concordances to medieval texts, we would like to see the distinction made between unparsed "primary" concordances and "secondary" or "derivative" concordances which have been edited toward particular applications. 17
Another concern is high-frequency word entries. These are generally of little lexical interest, but of stylistic importance. On the most obvious level, in English as and like are often linguistic markers for similes. 18 But such common items as and, then, but, and when are stylistic indicators on other levels. One need only contrast Jonson's rare use of the initial "then" with Marston's frequent use to see displayed graphically one of the reasons why the effect of Jonson's verse is so different from Marston's. Similarly one can contrast the Wakefield Master's use of these common conjunctions with the various authors of the Digby Plays. 19 In some instances the contrasts are so absolute that no tests need be applied, but this fertile area for statistical applications has apparently been ignored because the data is not kept by the compilers of concordances.
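The kind of comparison suggested above, such as how often a verse line opens with "then" in one author as against another, requires only that the high-frequency entries be kept rather than discarded. A minimal Python sketch follows; the function name, the default list of marker words, and the sample lines are invented for illustration and are not drawn from Jonson, Marston, or any concordance discussed here.

```python
from collections import Counter


def initial_word_rates(lines, markers=("and", "then", "but", "when")):
    """Share of non-empty verse lines that open with each marker word."""
    firsts = Counter()
    total = 0
    for line in lines:
        words = line.lower().split()
        if not words:
            continue  # skip blank lines
        total += 1
        # Strip surrounding punctuation from the line-initial word.
        first = words[0].strip(".,;:!?'\"")
        if first in markers:
            firsts[first] += 1
    return {m: (firsts[m] / total if total else 0.0) for m in markers}


# Invented sample lines, standing in for one author's verse.
sample = [
    "Then let it be so,",
    "And yet I doubt it.",
    "The day is long.",
    "Then go.",
]

print(initial_word_rates(sample))
```

Running the same function over two authors' texts gives rates that can be compared directly or passed to a statistical test where the contrast is less than absolute.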
So far we have not discussed "contexts" in concordances. Most of us think of traditional verse-context concordances for poetry, commonly called KWOC (Key Word Out of Context) concordances, as that in Figure 2; certainly this has been an assumption made in articles on concordance-making. 20 For prose, common practice has been to produce KWIC (Key Word In Context) concordances in which the concorded word is centered in a certain amount of text. This is sometimes, as in Figure 3, sorted
SYDE (cont.)
  W36    The lengthe of þe yerys in my ryght syde be
  W37    Ande in my lefte syde ryches, joy, and prosperyte.
  W510   Change þat syde aray. I yt dyfye.
  W565   Truthe on syde I lett hym slyppe.
  M671   Master Myscheff, hys syde gown may be tolde.
  M714   ȝe must haue be yowr syde a long da pacem,
  E523   In thy moost nede to go by thy syde.
SYE (2)
  C449   I sye sore and grysly grone
  C1866  I snowre, I sobbe, I sye sore.
SYEST (1)
  C1299  Why syest þou and sobbyst sore?
SYGHES (1)
  E184   Alas, I may well wepe with syghes depe!
SYGHT (9)
  W68    That hys syght from them neuer can remowe?
  W335   Presumynge in Godys syght,
  W573   A woman me semyth a hewynly syght.
  W992   My lyff pleyn schewenge to here syght.
  W1097  Wyth my syght I se þe people vyolent,
  M531   To blench hys syght I hope to haue hys fote-mett.
  E25    Of ghostly syght the people be so blynde,
  E77    His syght to blynde, and fro heuen to departe--
  E180   And now out of thy syght I wyll me hy.
SYGHTE (1)
  W118   Ande dammyde to derknes from Godys syghte.
SYGHTYS (1)
  W1086  In þe tweyn syghtys of yowr ey
SYGNYFICACYON (1)
  W13    Therfor þe belowyde Sone hathe þis sygnyficacyon
SYGNYFYE (1)
  W149   Thes tweyn do sygnyfye
SYH (1)
  C1390  Þei þat syh in synnynge,
SYHE (2)
  C1404  Sertys for synne I syhe sore.
  C3005  Myn hert brekyth, I syhe sore.
SYINGE (1)
  C1308  Wyth sore syinge vndyr sunne?
SYKE (5)
  C425   Myth I ryde be sompe and syke
  M777   Yf ȝe wyll haue hym, goo and syke, syke, syke!
  M778   Syke not ouerlon, for losynge of yowr mynde!
SYKENESSE (1)
  E620   And am delyuered of my sykenesse and wo.
SYLENCE (2)
  W435   Kepe sylence, wepe, and surphettys eschewe,
  M589   Ande euer ȝe dyde, for me kepe now yowr sylence.
SYLENS (1)
  C2185  And kepe in sylens.
SYLFE (1)
  M838   The prowerbe seyth 'þe trewth tryith þe sylfe.' Alas, I hawe mech care.
Figure 2. A sample page from A Concordance to Four "Moral" Plays, a conventional KWOC concordance. Word forms are listed alphabetically together with their frequencies of occurrence. An abbreviation for play title, the line number, and complete verse context is given for each occurrence.
FLOWE (3)
 121   wurs than him was tho!/The see bigan to flowe/And Horn child to rowe./The see that shup so faste
1515      stiward of his hus./The see began to flowe/And Horn gan to rowe./Hi gunne for arive/Ther King
1103    Shup him wolde bringe./He segh the sea flowe/And Horn nowar rowe./He sede upon his songe/"Horn, nu
FLUR (1)
  15    bright so the glas;/He was whit so the flur;/Rose-red was his colur./He was fair and eke bold/And
FODE (1)
1352        the gode,/Min owene child, my leve fode./Ef Horn child is whol and sund/And Athulf withute wund
FOLE (3)
 593      yede to stable./Thar he tok his gode fole/Also blak so eny cole./The fole shok the brunye/That al
 597    brunye/That al the curt gan denye./The fole bigan to springe/And Horn murye to singe./Horn rod in a
 595  his gode fole/Also blak so eny cole./The fole shok the brunye/That al the curt gan denye./The fole
FOLIE (1)
 692      /Fikenhild hadde envye/And sede thes folie:/"Aylmar, ich thee warne/Horn thee wule berne!/Ich
FOLK (7)
1031    his shup stonde/And yede to londe./His folk he dude abide/Under wude side./Horn him yede alone/Also
1533   his quene,/So hit mighte well beon./All folk hem mighte rewe/That loveden hem so trewe./Nu ben hi
 262    alle/Ne nowhar in non othere stede./Of folk heo hadde drede;/By daye ne by nighte/With him speke ne
  65    londe/And neme hit in here honde./That folk hi gunne quelle/And churchen for to felle./Ther ne
1384    flee."/Horn gan his horn to blowe;/His folk hit gan y-knowe;/Hi comen ut of stere/Fram Hornes
 622    the laste./Ne mighte no man telle/That folk that he gan quelle./Of alle that were alive/Ne mighte
  47   /And him well some answarede:/"Thy lond folk we shulle slon/And alle that Christ luveth upon/And
FOND (11)
1234  sclavin falle./The quen yede to bure/And fond Athulf in ture./"Athulf," heo sede, "be blithe/And to
  39      bute two--/All too fewe ware tho!/He fond by the stronde/Arived on his londe/Shipes fiftene/With
 765     him sette/And fot on stirop sette./He fond by the weye/Kinges sones tweye--/That on him het Harild
 372  /Athelbrus wende hine fre./Horn in halle fond he tho/Bifore the King on benche/Win for to shenche./
 985     ighe/If heo oght of Horn y-sighe./Tho fond heo the knave adrent/That heo hadde for Horn y-sent/And
1173     to bure/With hire maidenes foure./Tho fond heo what heo wolde,/A ring y-graven of golde/That Horn
1189   feor by yond weste/To seche my beste./I fond Horn child stonde/To shupeward in londe./He sede he
 709    gan turne/Well mody and well murne./He fond Horn in arme/On Rymenhilde barme./"Awey ut," he sede,
 635       my dubbing,/So I rod on my pleing/I fond o shup rowe/Mid watere al beflowe/All with Sarazines
 601  Horn rod in a while/More than a mile./He fond o shup stonde/With hethene hunde./He axede what hi
1455    they ne knewe/For he was so newe./Horn fond sittinde Arnoldin,/That was Athulfes cosin,/That ther
FONDE (4)
 734     /In to uncuthe londe/Well more for to fonde;/I shall wune there/Fulle seve yere./At seve yeres
 838      /Agen one hunde/Thre Christen men to fonde./Sire, "I shall alone/Withute more y-mone/With my
 155   lond arived her./And seye that hi shall fonde/The dent of mine honde."/The children yede to tune/By
1526   wide./He arivede in Irlonde,/Ther he wo fonde;/Ther he dude Athulf child/Wedden maide Reynild./Horn
Figure 3. A sample page from a KWIC concordance to King Horn, produced by Cathy M. Orr. Word forms are listed alphabetically together with their frequencies of occurrence. Line references to the key word are given. Context is comprised of as much context as will fill available space, with verse-line boundaries indicated by virgules. The text following the key words has been sorted so that similar phrases appear together.
beyond the key word to bring related phrases together. But our usual practices are not necessarily ideal. It is certainly possible to produce concordances to prose texts with contexts defined by hand, either absolutely on input 21 or interactively on a word-by-word basis later in the concording process. 22 Verse concordances with either hand-defined contexts or in KWIC format can be readily produced. There are so many degrees of difference possible from the implied polar opposites of the KWIC and KWOC formats that perhaps the terms have become obsolete, maybe to be subsumed someday under a more general term such as KWAC (Key Word And Context). The utility of the research tool is what must be considered, not our relatively short-lived conventions. To our knowledge, few have even considered publishing KWIC concordances to
verse texts (see Figure 3), even though Bartlett's Concordance to Shakespeare contains pages which function strikingly like KWIC entries. 23 We might do well to reflect on Fr. Busa's statement that "using the computer to prepare concordances . . . with the same format and the same features as before is a poor use of a computer. I feel sympathetic to anyone in scholarly research who still thinks of using a computer just to do things easier and faster." 24 We have found that KWIC concordances, sorted well beyond the key word so that verbally similar phrases appear together, are by far more useful, whether based on prose or verse texts, than more conventional KWOC concordances. One project, involving the British folk plays, has virtually grown out of what has been gleaned from and suggested by a KWIC concordance to
156 complete texts and 38 fragments. Not only have there been the usual publications, but this concordance has been a proving-ground for the development of various automated and semiautomated post-concordance routines. 25 Of so much utility has this concordance been that it is currently being expanded to include the more than one thousand additional play texts acquired since 1970. Interestingly enough, a KWOC concordance to the same texts generated in 1970 has, for the most part, gathered dust.
A parallel project is the production of a single KWIC concordance to all Middle English drama. Byproducts of this have so far been conventional KWOC concordances to traditional manuscript groupings and cycles. 26 Although some preliminary critical work is nearing completion, the critical side of the Middle English drama project will lag behind the folk play project until that reaches its logical end and tapers off. In additional experimentation, a corpus of some 4500 bawdy limericks has been transformed into a KWIC concordance with the ultimate aim of describing the traditional phraseology and controlling the ambiguous forms. 27 Cathy M. Orr has recently completed a similar study of "Traditional Comparisons from Colorado." 28 Other projects involving Old Norse, 29 Chaucer, 30 Samuel Beckett, 31 and Ralph Waldo Emerson 32 are in process. What is important in all of this is that a number of primarily literary researchers have discovered in the KWIC concordance and its derivatives a research tool of unquestionable utility for certain kinds of projects. Each concordance project has grown out of a perceived critical need rather than a wish to generate output.
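A KWIC arrangement of the kind praised above, with verse-line boundaries shown by virgules and entries sorted beyond the key word so that similar phrases fall together, might be sketched as follows. This is a hedged Python illustration, not the program that produced Figure 3; the function name `kwic`, the fixed context width, and the choice to sort on letters only are assumptions. The sample lines are phrases visible in the King Horn contexts of Figure 3, arranged here only for the example, so the line numbers printed are positions in the sample and not the poem's lineation.

```python
import re

WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def kwic(lines, width=40):
    """Return (form, line_number, left, right) rows.

    Rows are ordered by key word and then by the words that follow
    it, so verbally similar phrases appear together.
    """
    text = "/".join(lines)  # virgules mark verse-line boundaries
    rows = []
    offset = 0
    for line_number, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            pos = offset + match.start()
            left = text[max(0, pos - width):pos]
            right = text[pos:pos + width]
            # Sort on letters only, so punctuation and virgules do
            # not disturb the grouping of similar phrases.
            following = WORD.findall(right.lower())
            rows.append((match.group().lower(), following,
                         line_number, left, right))
        offset += len(line) + 1  # +1 for the virgule just added
    rows.sort(key=lambda row: (row[0], row[1]))
    return [(form, n, left, right) for form, _, n, left, right in rows]


sample = [
    "The see began to flowe",
    "And Horn gan to rowe.",
    "The see bigan to flowe",
    "And Horn child to rowe.",
]

for form, n, left, right in kwic(sample):
    print(f"{n:>3}  {left:>40}{right}")
```

In this sample the two occurrences of flowe come out with "flowe/And Horn child" ahead of "flowe/And Horn gan", the reverse of their order in the input, which is the effect of sorting beyond the key word.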
In contrast to those who lean towards mechanizing some of the tasks of content analysis by making use of such systems as the "General Inquirer," we have found that a cautionary statement in a manual essentially defines literary needs: Perhaps the chief advantage of Inquirer is that it virtually automates the tedious, costly and time-consuming process of coding (assigning tags according to content). The advantages are obvious. It is well, however, to be aware that although computer-coding is entirely reliable and systematic, it is largely insensitive to subtleties of style, content and nuance which would be apparent to most human coders; irony, humour and allusive reference also go unremarked. The characterisation of the text
is therefore liable to be general and somewhat gross. 33
In general, "somewhat gross" characterization of texts is the polar opposite of what we are concerned with. Concordances of various kinds for various texts serve as the crucial intermediate step--linking the text and its "pedestrian" succession of word after word and letter after letter--with the world of critical ideas and insights. One can have a concordance arranged to make synchronic analysis of a text considerably easier just as one can have a different arrangement to make diachronic analysis easier. 34
The broad implication of all this is not whether we have utilized the concordances so many of us set out to compile in the 1960's, but whether what we set out to do really serves our needs. Concordances in different forms certainly seem to provide different kinds of information. Traditional objections to KWIC concordances--that they are ugly--are of little more substance than the objection to the early KWOC concordances. The situation is that, in attempting to match the standards of the hand-made concordances, most of us have simply produced more and more attractive KWOC concordances to verse texts, leaving the KWIC for prose. There has not been the thought given to sophisticating the KWIC format that has been lavished on the KWOC.
As far as the actual making of concordances is concerned, it appears that we are still in the "black box" era. If we have learned that computers are not black boxes, programs are, or appear to be, at least if one judges from snip references to "the computer program" or "the concordance program" made in concordance prefaces in the last three years. Concording a text involves many processes which can be separated so that a fuller interaction with one's text, requiring "more human work, more mental effort," 35 can result in a more meaningful product.
A highly modular approach to producing concordances, with human intervention practical at various points in the process, seems the only possible way to produce concordances which are significantly more revealing of our languages than those produced a decade and more ago. One of the dominant views in humanistic computing during the last several years is that humanists must all be their own programmers. We strongly disagree with this in most instances, because, on the one hand, keeping informed and
active in one's own fields of specialty is necessary to avoid the critical naiveté which makes any computer-based research pointless, while on the other it is functionally impossible to acquire the level of experience necessary to compete fully with programmers with years of training and still more years of on-the-job experience. One pitfall is best illustrated by an advocate of do-it-yourself programming: "if [an escape character] is placed before the word, all designated words, if the text is ever sorted, will sort out into a clump somewhere in the vocabulary sequence, usually at the top." William Ingram has explained that this problem in sorting exists because "Most modern concordances are prepared by the splitout technique, in which the individual words of the text itself, no matter what their form, serve as the headwords for the sorting process." 36 But "splitout" headwords need not be sorted upon. With UNICORN, a distinct sort-key, whether a headword or a phrase, is generated. This can differ from the word or phrase in the text; likewise the headword itself can differ from the word in the text. UNICORN allows multiple characters to be substituted in the sort key for any character or escape-character sequence and a similar number substituted for any character in the headword. An example is an escape-seven in the text representing an ampersand which may be sorted as AND, put out as ET in the headword, but remain & in the context, or even modified to something still different in the context, such as ande. A simpler and more practical example is an escape-six for a ð, sorted as TH, put out in the headword as a þ, while remaining a ð in the context. But even with this flexibility there are limits. One must recognize that certain forms are so "perverse" that they can never be sorted into their desired position without human intervention, whether interactively during the concording process or in post-editing.
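The separation of sort key, headword, and context described above can be sketched with three substitution tables applied to the same encoded word. The sketch below is in Python and is only an illustration of the idea, not UNICORN's implementation (which the text says is written in CDC FORTRAN IV). The table names and the function names `render` and `entry` are hypothetical. The escape-seven values follow the example in the text; the thorn and eth glyphs given for escape-six are an assumption about the printed characters.

```python
# Hypothetical tables: escape sequence -> replacement, one per role.
SORT_KEY = {"*7": "AND", "*6": "TH"}
HEADWORD = {"*7": "ET", "*6": "þ"}   # glyph for *6 is an assumption
CONTEXT = {"*7": "&", "*6": "ð"}     # glyph for *6 is an assumption


def render(encoded, table):
    """Replace each escape sequence ('*' plus one character).

    Sequences that are missing from `table` pass through unchanged.
    """
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "*" and i + 1 < len(encoded):
            sequence = encoded[i:i + 2]
            out.append(table.get(sequence, sequence))
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


def entry(encoded_word):
    """The three renderings of one encoded word."""
    return {
        "sort_key": render(encoded_word, SORT_KEY),
        "headword": render(encoded_word, HEADWORD),
        "context": render(encoded_word, CONTEXT),
    }


words = ["BON", "*7", "A"]
print(sorted(words))                                        # raw sort
print(sorted(words, key=lambda w: render(w, SORT_KEY)))     # keyed sort
print(entry("*7"))
```

Sorting the raw encoded forms puts the escaped ampersand at the top, which is the "clump" problem quoted above; sorting on the generated key places it between A and BON, where AND belongs.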
But there is no excuse for not having one's concordance the way one would like it. One must, of course, recognize that we are still far away from a purely formal treatment of language and literature. It is possible, however, granting sufficient flexibility of one's particular computing context, to produce far more useful concordances than is common practice. One need step back from his work and look at the assumptions inherent in his particular institution, his discipline, and his equipment. Encoding is a
rather routine task, but is rarely carried out with a greater degree of thoroughness than is immediately necessary. Consider, for example, three common output devices: an all-caps line-printer, an extended-character-set line-printer, and an I/O Selectric. Codes adequate for all of these devices can be carried through a concording system, with codes inappropriate to the particular device screened out or converted immediately before output. One usually must make do with what devices are available, but the amount of time spent on many concordance projects justifies looking beyond one's present equipment toward what may be purchased one year or ten years from now. Thus one would never be locked out of using better or different facilities when they become available. Additionally, one's data, if accurate, could be of use to a number of others who most probably have access to different devices.
Consideration of concordances should probably end with a consideration of their publication.37 This is an area in which professional expectations and personal needs are often most obviously at loggerheads. The consideration of whether or not to publish a particular concordance has frequently gotten swallowed up by the personal need to publish or perish. Once again, here is where professional attitude must come in. Somewhere there is a publisher who will publish anything. Other publishers have high standards, far above their practical need to consider the economics of publishing a particular volume. If one ever thinks of formal publication, he should question whether that volume really deserves to be in his institution's library and on the bookshelves of the various specialists in his particular area of research. This is a hard question when one is facing a tenure decision, but numerous ill-planned, dead-end concordances have been published, and their compilers possibly granted tenure, while many concordance-makers have published very little concordance-aided research.38
Huge concordance projects may be an end in themselves, but many concordances are based on relatively small texts. One is constantly plagued by the feeling that their compilers, if they do not have a need for the concordance, are more attracted to the idea of having published a "standard reference work" than to any service to their profession or even their own scholarship. One ought to consider alternatives to formal publication. Five years ago there was much talk
of "publishing on magnetic tape." Not only has this not been generally accepted, but it is based on a kind of elitist thinking that seems to undermine the efforts of those of us who still identify with a traditional humanistic discipline. If this were the only alternative to publishing--and it is one good alternative--in practice it would exclude those in the profession who do not have access to adequate and compatible computing facilities. Publishing on microfilm or microfiche may be a better alternative. We "publish" through Xerox University Microfilms concordances which are of interest but which we think ought not be formally published. This may not be the ideal middle ground between publishing and not publishing, but it is an alternative available to all. Here is a home for one's most idealistic efforts where cost never restricts even redundant publication of concordances in alternate forms. The greatest difficulty, beyond getting through to the right department to order a copy, is that there is no real quality control. The obvious benefit is that Xerox publications never go out of print and one can always obtain either a film copy or a hard copy of a particular item. An additional consideration is that even though tenure committees are not overly impressed by informal publications, however lengthy, the only responsible view must be that publication of a concordance in any form is a service to one's profession, but anything less than one's best effort is no service at all.
NOTES
1. William Ingram, "Concordances in the Seventies," Computers and the Humanities, 8 (1974), 273.
2. Roy Wisbey, "The Analysis of Middle High German Texts by Computer: Some Lexicographical Aspects," Transactions of the Philological Society 31 (1963), 44.
3. Wisbey, p. 45.
4. James Willis, Latin Textual Criticism (Urbana, 1972), p. 3.
5. Todd Bender, in "Literary Texts in Electronic Storage: The Editorial Potential," Computers and the Humanities, 10 (1976), 193-198, argues against the idea of a single text for nineteenth-century novels which were rewritten at various times. Although he ignores the supposed artistic integrity of the various states of a novel, his is an exciting suggestion. Nineteenth-century novels present a problem similar to oral texts (cf. infra) of which all recorded versions are variants; there is no single definitive text.
6. One ought think seriously about the implications of such articles as Michael J. Warren, "Repunctuation as Interpretation in Editions of Shakespeare," English Literary Renaissance, 7 (1977), 155-169. Too often minor graphic features are ignored in the rush to get texts into machine-readable form.
7. Compendia 6 (Leeds: W. S. Maney & Son, Ltd., 1975).
8. J.B. Smith, "Encoding Literary Texts: Some Considerations," ALLC Bulletin, 4 (1976), 191.
9. Smith, p. 192.
10. See the discussion of CURSOR in L.A. Cummings, "The Electronic Humanist: Computing at Waterloo in Canada," ALLC Bulletin, 3 (1975), 229-230. This is the kind of verifying-collating program we would prefer, but we have not been able to obtain an adequate CRT terminal.
11. The accidents that happen can perhaps be best illustrated by a letter from Mr. Jess Stein of Random House: "Unfortunately, the tapes for the Random House Dictionary are no longer available because of the carelessness of a research group to whom we loaned them."
12. Smith, p. 191.
13. Rossell Hope Robbins, Secular Lyrics of the XIVth and XVth Centuries (Oxford, 1952, 2nd ed. 1955, repr. 1964).
14. R. Busa, S. J., "Guest Editorial: Why Can a Computer Do So Little?" ALLC Bulletin, 4 (1976), 3.
15. Wisbey, p. 34. Different texts are ambiguous to different degrees. Although expository prose is generally unambiguous, at least in context, limericks are highly ambiguous, with up to seven per cent of the running text made up of ambiguous forms.
16. For a larger discussion of Ben Jonson and a description of our former concordance-generating system, see the preface to my unpublished Ph.D. thesis, "A Complete Verse Concordance to the Non-Dramatic Poetry of Ben Jonson," (University of Colorado, 1975).
17. The terms are borrowed from C.S. Lewis, A Preface to Paradise Lost (New York: Oxford University Press, 1942).
18. See Thomas Schneider's unpublished Ph.D. thesis, "'Kind' and 'Blume:' Rilkes Vergleiche," (University of Colorado, 1976), a diachronic study of Rilke's similes based on U.K. Goldsmith's forthcoming Rainer Maria Rilke: A Verse Concordance of His Complete Lyrical Poetry. The high frequency function words, excluded from the formally published concordance, are available with complete contexts from Xerox University Microfilms.
19. See A Concordance to "The Digby Plays," Ann Arbor: Xerox University Microfilms, 1977, and A Complete Concordance to the "Wakefield Pageants in the Towneley Cycle," Ann Arbor: Xerox University Microfilms, 1977.
20. For example, by Ingram, 273-277.
21. As in Jacques Barchilon, et al., A Concordance to Charles Perrault's Tales. Volume One: "Contes de Ma Mère L'Oye," Philadelphia: Norwood Editions, 1977.
22. Eugene F. Irey is completing a concordance to Moby-Dick in which all functional ambiguities are resolved and all contexts hand-edited.
23. A Complete Concordance or Verbal Index to Words, Phrases and Passages in the Dramatic Works of Shakespeare with a Supplementary Concordance to the Poems, (New York: 1894), particularly such headword phrases as "All-abhorred" through "All yourself," "Almost a fault" through "Almost yield," "Another age" through "Another such," and "Better a ground" through "Better wrestler." In his prefatory note, Bartlett seemed quite proud of this feature of his concordance: "Two or more words are sometimes given together as Index-words in connection with those to which they are immediately joined in the text, to show more directly the particular use of a word."
24. Busa, p. 3.
25. See "The Robin Hood Folk Plays of South-Central England," Comparative Drama 10 (1976), 91-100; "The Revesby Sword Play," The Journal of American Folklore 85 (1972), 51-57; and especially "Computers in Folklore," Tennessee Folklore Society Bulletin XLIII (1977), 14-22; and "Solutions to Classic Problems in the Study of Oral Literature," in Computing in the Humanities: Proceedings of the Third International Conference on Computing in the Humanities, ed. Serge Lusignan and John S. North, (Waterloo: University of Waterloo Press, 1977), pp. 117-132.
26. In addition to those already mentioned, there is also A Concordance to Four "Moral" Plays: "The Castle of Perseverance," "Wisdom," "Mankind," and "Everyman," Ann Arbor: Xerox University Microfilms, 1975. A Concordance to "The Chester Plays" should be complete by late 1978. Special thanks are due the Manchester University Press and the Council of the Early English Text Society for their kind and broad permissions to make use of their editions without payment of fees.
27. Special thanks are also due to Gershon Legman for his permission to use his edition of The Limerick: [First Series] (New York: Bell Publishing Company, 1964) and The New Limerick: [Second Series] (New York: Crown Publishers, 1977), also without fees. Legman's outspoken criticism of some of the excesses of our research methods, scattered throughout his books, is particularly thought-provoking in the context of his willingness to aid and abet our undertaking.
28. Cathy M. Orr, "Folk Comparisons from Colorado," Western Folklore XXXV (1976), 175-208.
29. This is being carried on by a group of Old Norse specialists directed by L. Michael Bell.
30. A series of concordances is being produced in conjunction with the Chaucer Variorum project at the University of Oklahoma, directed by Paul G. Ruggiers. Donald C. Baker (C.U.) and especially Thomas W. Ross (Colorado College) have been directly involved in the production of the prototype concordances.
31. Rubin Rabinovitz (C.U.) is currently writing a critical study of Beckett, focusing on Molloy, Malone Dies, and The Unnamable, which have been fully concorded.
32. Eugene F. Irey is working towards a concordance to all of Emerson's work. One of his particular interests is Emerson's repeated use of his own writing throughout his career. Conventional methods of locating parallel passages have proven inadequate.
33. A.P.M. Coxon and H.R. Trappes-Lomax, Inquirer III (Edinburgh Version). User's Guide, Inter-University/Research Councils Series, Report No. 29, University of Edinburgh: Program Library Unit (January 1977), p. 1.
34. A semiformal seminar, devoted to the uses of concordances in literary criticism, is held each semester at the home of Lewis Sawin, who began humanistic computing at the University of Colorado. Since the seminars are generally dominated by those who use concordances, or would use them if they existed, the individual compiler is forced away from the comfortable isolation of his own work to confront the basic question of the utility of what he compiles. Much of our recent work, including the design of UNICORN, has benefitted heavily from these continuing discussions.
35. Busa, p. 3.
36. Ingram, p. 276.
37. A related discussion occurs in L.A. Cummings, "A Homily on Wulfstan's Homilies: Concordance Making and Publishing," ALLC Bulletin 5 (1977), 113-118.
38. For an entertaining account of concordance-making, see Robert S. Wachal, "Humanities and Computers: A Personal View," North American Review, 8 (1971), 30-33.