Computers and the Humanities 37: 77–96, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.
The Time Course of Language Change PATRICK JUOLA Department of Mathematics and Computer Science, Duquesne University, Pittsburgh, PA 15282, USA E-mail:
[email protected]
Abstract. This paper presents a numeric and information-theoretic model for measuring language change, without specifying the particular type of change. It is shown that this measurement is intuitively plausible and that meaningful measurements can be made from as few as 1000 characters. The measurement technique is extended to the task of determining the “rate” of language change, based on an examination of brief excerpts from the National Geographic Magazine and a determination both of their linguistic distance from one another and of their number of years of temporal separation. A statistical analysis of these results shows, first, that language change can be measured, and second, that the rate of language change has not been uniform: in particular, the period 1939–1948 saw particularly slow change, while 1949–1958 and 1959–1968 saw particularly rapid change.

Key words: information theory, KL-distance, language change, linguistic distance, mathematics of language
1. Introduction

This paper concerns the development of a technique for automatically assessing the “rate” of language change in quantitative terms. At first glance, this may appear to be a fool’s errand, given the wide variety of ways in which language can vary and the difficulty of combining these ways. For example, the Oxford English Dictionary is an encyclopedic text describing the development of new lexical forms and meanings. Similarly, Biber et al. (1998) describe the change in the (syntactic) use of modal verbs over a several-hundred-year period. It would be a simple step to move from a verbally descriptive to a numerical model and to give the rate of change of modal verb use. It is not clear, however, that a numerical model of modal use would be easily extensible to incorporate the use of new lexical items described by the OED. The applications may also be limited for a model capable only of describing changes occurring in “the seventeenth century” and not over smaller periods. The techniques proposed in this paper, however, derive from a very general information-theoretic model of language and can be shown to be sensitive not only to any form of language variation, but sensitive enough to detect differences from samples of only a few thousand characters or fewer – small enough that one can take measurements over very short time periods (years or decades).
2. Technical Background

2.1. INFORMATION THEORY

The fundamentals of information theory were laid down in Shannon (1948, 1951). “Information”, in Shannon’s definition, is simply the inverse of unpredictability: it is what allows people or systems to make accurate predictions about the world. For example, knowing absolutely nothing about tonight’s baseball games, I can still predict a win for the home team with about 50 percent confidence. Knowing more – as I write, my home team has won about one game in three this season – would (probably) let me improve my guess above the base 50/50. A really “informed” person, who also knew that my home team hasn’t had a winning season in several years, that the injury list reads like a page from the phone book, and that the Vegas line favors the visitors by six points, could make an even more accurate prediction. Codifying this notion of “accuracy” allows one to quantify “information” and its effects. A simple children’s game may help to illustrate this: “I’m thinking of a number between one and a million.” By asking a few yes/no questions, you have to determine which number I’m thinking of. A skillfully chosen question (e.g. “Is your number more than five hundred thousand?”) can eliminate as many as half the possibilities – an obvious best-possible result. Since one million is less than 2^20, at most twenty questions will eventually eliminate all but one (correct) possibility. We can thus say that choosing one number out of a million involves a maximum of twenty yes/no questions (or “bits”, in technical phrasing) of information. Similarly, choosing one person from the eight or so billion people who have ever lived should not require more than thirty-three bits of information. Here, however, psychology and skill in asking questions begin to play a role. Although one can in theory choose any person, living or dead, from any point in history, in practical terms, one will only choose people one is familiar with.
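The arithmetic of the guessing game is just a base-2 logarithm; a minimal sketch (the eight-billion figure follows the text's estimate, not a demographic claim of this sketch):

```python
from math import ceil, log2

def questions_needed(choices: int) -> int:
    """Minimum number of yes/no questions (bits) guaranteed to isolate
    one item out of `choices` equally likely possibilities."""
    return ceil(log2(choices))

print(questions_needed(1_000_000))      # 20, since 2**20 = 1,048,576
print(questions_needed(8_000_000_000))  # 33, since 2**33 > 8 billion
```

Each well-chosen question halves the remaining possibilities, which is why the count grows only logarithmically with the size of the search space.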
Some people are thus more probable choices than others. A person familiar with me and with my background could then make guesses and assumptions about what sort of people I would choose. As a US citizen and resident, I probably know the names of most of the past presidents of the United States. As an English speaker, I am more likely to know English royalty than the grand dukes of Kiev. Assumptions like these, if made correctly, can reduce the overall unpredictability – while, if made incorrectly, they can actually increase it (perhaps I’m a grad student in Russian history and you didn’t realize it). The techniques described in the following section demonstrate how one can quantify both the inherent unpredictability of the system and the effects of one’s assumptions and prior knowledge on this predictability. In Shannon (1951), Shannon illustrated how these ideas can be applied to language. Consider the following sequence of characters: “THEREISNOREVERSEONAMOTO”. What character do you think follows this sequence? Perhaps obviously, if these characters were generated at random by a monkey at a typewriter, one has no better chance than one in twenty-six of guessing correctly.
Most readers, however, will assume that the sequence is not random, but a sequence of words (thus, one should guess a common letter such as E, T, A, O, I, or N). In fact, if this is a grammatical (and meaningful) sentence, one should probably guess R, to continue a word like MOTORCYCLE or MOTORBOAT, vehicles that, in fact, don’t usually have a reverse. (In Shannon’s experiment, this would, in fact, have been a correct guess.) Although choosing from a set of twenty-six uniformly and independently chosen symbols requires nearly five bits per symbol, the phonology, syntax, semantics, and pragmatics of English – our linguistic “information” – can be shown to reduce our average uncertainty to approximately two bits per letter (Shannon, 1951; Schneier, 1996; Brown et al., 1992). The details of the mathematical development, under the term “entropy”, can be found in Shannon (1948), Khinchin (1957), and Li and Vitányi (1997). Informally, entropy can be used to develop a measurement of linguistic divergence that behaves just like a distance one would measure with a ruler – distance is never negative; the distance between any object and itself is zero; two objects that are more noticeably distinct have greater distance between them; et cetera.
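The “nearly five bits” figure is simply log2 26 ≈ 4.70. Even the crudest model – single-letter frequencies, which capture only a fraction of the redundancy Shannon measured – already pushes the per-letter figure below that bound. A sketch (the sample string is an arbitrary illustration, not the text's data):

```python
from collections import Counter
from math import log2

def unigram_entropy(text: str) -> float:
    """Shannon entropy (bits/symbol) of the observed single-letter
    distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(log2(26))  # ~4.70 bits/letter: 26 equiprobable, independent symbols
sample = "THEREISNOREVERSEONAMOTORCYCLE"
print(unigram_entropy(sample))  # already below the uniform upper bound
```

Conditioning on preceding letters (syntax, semantics, pragmatics) drives the figure down further, toward Shannon's estimate of roughly two bits per letter.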
2.2. MEASURING DISTANCE

To summarize the detailed mathematics omitted here: Shannon’s “entropy” measures the inherent predictability of a system or process given the best possible model of that system. The difference between the measurement with a best model and with an inferior model (the Kullback divergence, Kullback–Leibler distance, or informally KL-distance) can be used to measure the quality (or lack thereof) of the model used. Of course, the quality of the model depends both upon the amount of data available and upon the sophistication of the modelling technique applied. Much research has been done on the proper development of these models and on the efficient estimation of the probability distribution; an excellent recent paper (Brown et al., 1992) calculates the entropy of a statistical model of English that was produced by training a computer on literally billions of observations comprising a huge corpus of written English. Wyner (1996) has suggested that one can determine the entropy with nearly as good accuracy from much smaller sample sizes (he claims accuracy to within about 10% of Brown’s based on a few thousand characters), but it remains an open research question how much text is needed to determine various properties. At billions of observations per test, it is obviously impractical to determine document-level properties (such as, for instance, authorship, register, difficulty of reading, or even the language in which a novel document is written), but if the tests can be made sufficiently sensitive to work with small texts, tests like this may be practical. The heart of this work is a relatively efficient method for entropy estimation developed by Wyner (Farach et al., 1995; Wyner, 1996) based on match length within a database. Wyner defines the match length L_n(x) of a sequence x_1, x_2, . . . as the length of the longest prefix of the sequence x_{n+1}, . . . that matches
a contiguous substring of x_1, x_2, . . . , x_n. In other words, one regards the first n observations of a sequence as a database of observations, and then counts how much of the rest of the sequence (the elements after the first n) is exactly contained somewhere in the database. For example, if the data sequence started bananasundae . . ., and n were 4, then the database would be bana and the rest of the sequence would be nasundae . . .. The prefix na is wholly contained in the database, but nas is not, thus the match length L_4 is 2. Wyner demonstrated that the limit, as n increases, of log n / L_n is the entropy H of the system, the predictability using the best possible model (for the limited amount of data available). Using this technique, one can estimate the entropy of a sequence or corpus by using a sliding window of n observations, calculating L at each point in the data stream, and thus obtaining the mean match length L̂ and the estimated entropy Ĥ. Using the example above, one would then calculate that L_4 of ananasundae . . . is 1, that L_4 of nanasundae . . . is 0, and so forth.1 The application of this to the measurement of KL-distance is relatively straightforward. A “database” of n observations is compiled for each language of interest, and each successive symbol of the message stream of interest is used as the starting point for the maximal prefix to be found within the database. Thus, one could approximate the cross-entropy between one sample (bananasundae) and a second one (its German equivalent, Bananen-Sundae) by comparing the prefixes of the sub-sequences Bananen-Sundae, ananen-Sundae, nanen-Sundae, anen-Sundae, nen-Sundae, en-Sundae, et cetera, against the first sample. One then can calculate the KL-distance by taking the average prefix match length, calculating Ĥ, and subtracting the true entropy of the first sample, derived from analyzing the first sample against its own database.
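The sliding-window procedure can be sketched in a few lines. This is a deliberately naive illustration: the substring search is quadratic where Wyner's scheme uses suffix-tree machinery, and the ratio log2(n) / L̂ is applied without the finite-sample corrections a serious estimator would need:

```python
from math import log2

def match_length(database: str, continuation: str) -> int:
    """Wyner's L_n: length of the longest prefix of `continuation`
    appearing as a contiguous substring of `database`."""
    k = 0
    while k < len(continuation) and continuation[: k + 1] in database:
        k += 1
    return k

def estimate_entropy(text: str, n: int) -> float:
    """Slide a window of n symbols along `text`, compute the mean
    match length L-hat, and return H-hat = log2(n) / L-hat."""
    lengths = [match_length(text[i - n : i], text[i:])
               for i in range(n, len(text))]
    mean_len = sum(lengths) / len(lengths)
    return log2(n) / mean_len

# The worked example from the text, with n = 4:
print(match_length("bana", "nasundae"))  # 2: "na" matches, "nas" does not
print(match_length("anan", "asundae"))   # 1
print(match_length("nana", "sundae"))    # 0
```

Longer matches mean the database predicts the continuation well, so L̂ grows and the entropy estimate Ĥ falls, exactly as the theory requires.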
Alternatively, since the true entropy of English is approximately constant, one can simply compare two Ĥ measurements (taken from English). Since Ĥ differs from the KL-distance by the fixed entropy of English, absolute differences in Ĥ reflect absolute differences in KL-distance. In this admittedly contrived example, we expect the average L to be rather high (and therefore H to be rather low) – but note that it is high exactly because the German phrase for a banana sundae is a close cognate of the English phrase, which in turn indicates a close relationship between the languages (either familial or through borrowing). It is thus reasonable to inquire what size of n is necessary for reasonable accuracy in this categorization – and, of course, to ask the implicit question of whether any value will give meaningful results. Although to a certain extent the answer is probabilistic, previous investigations (detailed below) suggest that as few as a few hundred characters are enough to allow remarkably subtle inferences about categorization to be drawn.
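The cross-entropy variant can be sketched the same way: one sample is held fixed as the database, and every suffix of the other sample is matched against it. (A minimal illustration; the +1 smoothing on the mean match length is an assumption of this sketch, added so that all-zero matches cannot cause a division by zero, and lower-casing the German phrase is likewise a simplification.)

```python
from math import log2

def match_length(database: str, continuation: str) -> int:
    """Longest prefix of `continuation` occurring contiguously in `database`."""
    k = 0
    while k < len(continuation) and continuation[: k + 1] in database:
        k += 1
    return k

def cross_entropy(database: str, message: str) -> float:
    """Estimated cross-entropy (bits/symbol) of `message` measured
    against a fixed database compiled from another sample."""
    lengths = [match_length(database, message[i:]) for i in range(len(message))]
    mean_len = sum(lengths) / len(lengths) + 1  # +1: crude zero-match guard
    return log2(len(database)) / mean_len

english = "bananasundae"
german = "bananen-sundae"  # the German cognate, lower-cased for comparability
print(match_length(english, german))  # 5: "banan" matches, "banane" does not
print(cross_entropy(english, german))
```

Long matches between cognate phrases drive the estimate down; the KL-distance would then be this figure minus the self-entropy of the English sample, though in practice two Ĥ figures can be compared directly, as noted above.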
2.3. PRIOR APPLICATIONS OF KL-DISTANCE

The theory and algorithm discussed in the previous section have been applied to a number of different linguistic problems, ranging in scope and difficulty from language identification to questions of authorship and language family identification. These studies and findings can be briefly summarized as follows.

2.3.1. Language identification

Given a short excerpt of text, in what language is it written? This is a simple test problem in classification which admits of a variety of solutions, including via KL-distance. In a recent experiment (Juola, 1997), 373 texts, taken from the ECI/MCI CD-ROM (including Danish, Dutch, English, French, German, and Spanish), were classified. Perhaps oddly, one good baseline text for multi-lingual corpus linguistics is the Bible. The Bible has been widely and well translated, usually by very good scholars and usually without the aid of modern machine translation techniques (which can introduce artifacts). By holding the subject matter and style constant, one can to a certain extent control for (extralinguistic) semantic and/or pragmatic similarity. The database used was a nine-language sample consisting of the biblical text of Psalms 1:1–5. Initial segments of 250 and 500 characters were used to compile a database of Danish, Dutch, English, Finnish, French, German, Hungarian, Maori, and Spanish samples. Of the 746 trials, one German document was misclassified as Dutch when analyzed with the 250-character samples (and correctly classified when using the 500-character samples). No other errors were made.

2.3.2. Authorship analysis

Traditionally, authorship of disputed documents has been resolved by close reading, looking for key linguistic features characteristic of the authors under discussion. Authorship attribution has also been a proving ground and incubator for many techniques in humanities computing.
More recent and mathematically intense techniques have involved the examination of subtle, often unconscious, features. The number and types of these examinations has been little short of staggering; Holmes (1994) describes and evaluates no fewer than thirteen different techniques and potential features. Further discussions (Holmes, 1998; Holmes et al., 2001) yield at least another half-dozen methods. Close examination of the various techniques, however, shows wide situational variation in their applications, particularly in the amount of text and of human analysis required; one of the better modern techniques (Burrows, 1989) examines the 50–75 most common words in a document for statistical variation. Implicit in this is that there is a sample from which 50–75 common words can be found. Depending upon the documents in question, this may take an impractical amount of text, or may require (as in Holmes et al. (2001)) the conjunction of the very documents whose authorship is in dispute.
Juola (1997) applied the KL-distance techniques discussed above to the identification of the author(s) of The Federalist Papers, a standard touchstone problem in authorship identification. In particular, samples of only 1000 characters were sufficient to allow the computer to determine the correct author with significantly higher probability than chance, and to correctly assign authorship of the disputed papers to Madison – a finding consistent with most contemporary scholarship, but obtained from much smaller text sample lengths.

2.3.3. Sublanguage analysis

Somers’ (1998) primary finding was that the techniques he described, at that time believed to be useful for determining authorship, were only somewhat useful for determining genre or sublanguage – whether a document was taken from a newspaper article, a recipe collection, a film script, and so forth. Using samples of fourteen different sublanguages, he showed that relatively few of the useful key features would reliably select the correct sublanguage, and that no (tested) technique would identify more than about eight of the fourteen genres under discussion. By contrast, the algorithm described above performed without error on thirteen of thirteen of Somers’ data sets,2 a finding both strongly above chance and above Somers’ performance with more traditional techniques.

2.3.4. Language family identification

Again, techniques for determining the familial relationships between any two languages have long been available. The standard method (Swadesh, 1955) involves collecting a list of approximately 100 basic lexical items, representing concepts fundamental to human life, presumably long-lived, and unlikely to be borrowed from neighboring languages. By looking at the apparent phonological changes and their ordering, one can attempt to determine when and where daughter languages (or groups) branched from their parents, and thus a family tree for the languages under study.
This technique was refined and improved by Warnow (1997) but, although more mathematically sophisticated, it is still essentially a word-list-based technique. The KL-distance technique described above was applied in Juola (1998) to determine the effective distance between languages and to ask whether the effective distances, which include syntactic change as well as borrowings, are similar to the phonological family trees obtained via more traditional methods. For this experiment, a sample of fifty languages was obtained from the Bodleian declaration (see Figure 1), as translated by Oxford University. An unofficial project of the Bodleian Library has been to translate this 68-word oath, traditionally administered to new members of the Bodleian Library for access to the books, into as many languages as possible, including every language ever spoken in Europe as well as at least one official language of every UN member country.
Please read the following aloud: I hereby undertake not to remove from the Library, or to mark, deface, or injure in any way, any volume, document, or other object belonging to it or in its custody; not to bring into the Library or kindle therein any fire or flame, and not to smoke in the Library; and I promise to obey all the rules of the Library. Figure 1. Bodleian declaration in English.
Pairwise distances were computed for every translation in the study, and the closest pair (in distance) were considered to be the most closely related. This process was repeated until all languages had been connected into one overarching “family tree”. The results are interesting. Although there are some definite discrepancies with the more traditional trees, some groupings, such as the Romance subfamily of Indo-European, are readily apparent. This grouping is especially interesting when one realizes that only about three hundred to four hundred characters are available for each language in this experiment.

2.4. SUMMARY

The technical details aside, the previous work demonstrates that KL-distance, as implemented above, provides a robust, sensitive, and meaningful measure of the nebulous concept of “linguistic distance”; in cases where one expects samples to be meaningfully “farther” apart, our intuitions are supported by an increased numerical measurement. Furthermore, this technique appears to be sensitive to any sort of variation in language where one might expect linguistic information, in the non-technical sense, to be useful as a guide to judgement.

3. Current Research

3.1. LANGUAGE CHANGE

Language changes. This fact is undeniable to anyone who has struggled with Romeo and Juliet in high school. A description of the ways in which seventeenth-century English differs from that of the twenty-first would practically constitute a catalog of the ways in which language can vary, ranging from purely phonological through orthographic, lexical, and syntactic change up to pragmatic changes in subject matter and “style”. It is not necessary to invoke four hundred years of time in order to notice language change. Johnson (1996) presents examples of significant change in lexical use that have occurred in only sixty years.
Even more interesting are her comments on some causes of these changes: Urbanization, industrialization, and technological advances have produced changes in occupation and in the implements used in the workplace and the home, which have led to changes in vocabulary . . . Questions about farming, in particular, more frequently elicited “No Response” in the 1990 interviews, as
the number of farms in the South declined from 2.1 million in 1950 to 722,000 in 1975. Thus, as familiarity with farming declined, the number of speakers who admitted to lexical gaps in that domain increased.3 The examples of such words that she cites include calls to cows in pasture, corn cribs, and rail fences, but also window shades and attics, which are hardly exclusive to farming. In addition to the changes driven by technology, she also discusses possible lexical effects of changes in the local economy (with increased trading and decreased economic autonomy), education (increased on average by 5.1 years between 1940 and 1980), and the availability of information via the media. An obvious question, then, is over how short a period of time language change can be detected – and whether or not language change as a result of time can be distinguished from simple effects of, for instance, topic choice. If language change can not only be detected but measured, then another question arises, that of the rate of change – is this measurable, is the measurement meaningful, and (assuming an affirmative answer to the two previous questions) is the rate of change uniform over time? The theory developed in the previous section suggests that KL-distance is, in fact, a true measure of linguistic “distance” in an abstract language space, and thus that “rate of change” can be meaningfully addressed in terms of change of distance divided by elapsed time.

3.2. NATIONAL GEOGRAPHIC

In order to take meaningful measurements of language change over time, it is necessary to have suitable samples of language situated in time. This task is surprisingly difficult. First, the samples need to be comparable in a meaningful sense – otherwise, systematic differences in style or authorship may dominate subtle temporal effects.
(Consider the differences between a 1930s-era recipe and a 1950s-era Rodgers and Hammerstein lead sheet; it is hardly fair to make diachronic comparisons of stress patterns on this basis.) Second, the samples need to be accurately datable. This can be trickier than it appears – a novel published in 1991 might have been written twenty years earlier and left in a desk drawer,4 or might simply be a revision of a passage written and re-written a dozen times over the years. An article in a 1995 encyclopedia might be a holdover first written for the 1945 edition. Third, the documents themselves must cover a sufficient range of time to make useful comparisons possible. Computer-based analyses such as the present study require machine-readable, or at least machine-transcribable, corpora. And, finally, for those of us on a budget, affordability is always a nice feature. Fortunately, the advent of the ubiquitous home computer and reference library has made this much easier. Research groups such as the Oxford Text Archive and Project Gutenberg have made it much easier to obtain, for example, historic documents and novels in machine-readable form. However, these suffer from the dating problems discussed above. These problems can be solved, though, by using one of the several periodicals, among them the National Geographic, which have
been made available as machine-readable collections on sets of CD-ROMs at quite reasonable prices (c. US$100). Two teams of students were assigned the task of collecting the data from the CD-ROMs. The first group gathered an excerpt from one article from every January issue over the period 1982–2000, with excerpt lengths varying from 10,000 to 14,500 characters. The second team compiled much shorter excerpts (from 700–5000 characters) from articles covering the period 1939–1979, with between two and seven excerpts from a single article from each year. Various minor technical problems abounded, chiefly having to do with the availability and quality of the images on the CDs. The NG publishers made their issues available only as images (presumably photographs of actual back issues stored in the corporate archives), and relatively low-quality JPEGs at that. Furthermore, the back issues themselves are of variable quality, and time has wreaked its inevitable atrocities upon the older magazines; in some cases it was necessary to visit the library and make higher-quality photocopies of physical back issues. Finally, OCR (optical character recognition) processing inevitably produces errors and artifacts, but since the OCR-analysis process was held constant throughout, the errors and artifacts should be systematic and should not influence comparisons such as are described in the following section.
3.3. EXPERIMENTAL HYPOTHESES

The first hypothesis to be tested is simply the claim that language change is detectable. Underlying this hypothesis are several assumptions, among them that language itself changes and that the changes are undirected but cumulative. The physical analogy of the meanderings of a drunkard may be illustrative. Our hypothetical drunkard may wander around for a while and find himself some distance from the bar where he started his wanderings. When he continues his wanderings, he continues from his (current) location, not restarting from the bar. Furthermore, although he may not – in fact, probably will not – continue in the exact direction he has been travelling, he is also unlikely to retrace his steps back to the bar. Similarly, if we take English as written [in NG] in 1950 as a baseline, language in 1955 is likely to be different, and language in 1960 is likely to be different from the 1955 sample, and even further removed from our baseline of 1950. As time passes, we expect two samples of language to be more different (as measured via information theory) the greater their temporal separation. This constitutes a testable (and falsifiable) prediction. A second hypothesis follows almost immediately from the first; if a significant correlation of linguistic with temporal separation can be established, it is reasonable to start curve-fitting. Here one can apply the notion of “rate of change”; if points at five years’ distance display X bits of linguistic separation, while points at ten years’ distance display 2X bits, then (naively) the rate of change is 2X/10 bits per year. Unfortunately, here mathematical troubles again rear their ugly head,
since language change is (assumed to be) undirected and therefore not additive. (Staggering one mile north and then one mile east does not put you two miles from the bar you started from.) The topology of a drunkard’s walk is somewhat complex. Minimally, two rates obtained by a linear curve fit to two distinct time scales are not directly comparable. However, one can easily test whether the rate of language change is itself uniform (another falsifiable hypothesis) over comparable time scales. Both of these hypotheses have been tested (and results follow below). The process by which the NG sample documents were obtained is described in the preceding section. For every pair of documents, the linguistic difference between the two was computed, as was the number of years of separation. (For example, two articles from 1941 would have zero years of separation. An article from 1941 and an article from 1945 would have four years, and so forth.) No document was compared with itself – the zero linguistic difference would have introduced artifacts into this analysis. Analysis was performed both over decades5 and over the larger 41- or 18-year periods. For each period, the points were plotted and fit to a linear model. The hypotheses to be tested can be expressed formally as the conjectures that, first, the slope of the best linear fit is greater than zero (reflecting the cumulative nature of the process of language change), and second, that the slopes of the best linear fits per decade will differ (reflecting that language changes at a non-uniform rate). Of course, as discussed above, random walks within an abstract high-dimensional space are not well modelled with linear curve fittings (despite being done here), nor is it clear that such random walks and such a space are the best model for language change.
In particular, a drunkard wandering in a plane (or on a line) is expected to be √N paces away from his starting point after N steps (Weisstein, 1999); this, however, assumes that his paces – the amount of language change from year to year – are all the same size, an assumption expected to be contradicted explicitly by the experimental findings.
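The √N behaviour is easy to confirm by simulation; a sketch (the step and trial counts are arbitrary choices, and strictly it is the root-mean-square displacement that equals √N exactly – the mean displacement is slightly smaller):

```python
import math
import random

def walk_distance(steps: int, rng: random.Random) -> float:
    """Straight-line distance from the origin after `steps` unit-length
    paces in uniformly random directions in the plane."""
    x = y = 0.0
    for _ in range(steps):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += math.cos(theta)
        y += math.sin(theta)
    return math.hypot(x, y)

rng = random.Random(0)  # fixed seed for reproducibility
steps, trials = 400, 2000
mean_sq = sum(walk_distance(steps, rng) ** 2 for _ in range(trials)) / trials
print(math.sqrt(mean_sq))  # close to sqrt(400) = 20 paces
```

If the pace length varies from step to step – as the experimental results suggest it does for language – the displacement after N steps is no longer a simple function of N, which is exactly why rates fitted over different time scales cannot be compared directly.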
3.4. RESULTS, PART 1

The plots from the experiments described above are presented below as Figures 2 to 6, with the numerical values of the regression slopes presented in Table I. In all cases, the expected results were obtained. The first major result is that language change does occur, and is detectable by the techniques described above. In particular, notice that the slope of the regression lines is in all cases non-negative. This indicates that two documents distant in time are at worst equidistant in language. The converse, that two documents distant in time are expected to be close in style, is of course both topologically and intuitively nonsensical. Given our a priori rejection of the acceptability of such a finding, a one-tailed test for positive slope is appropriate – and in all cases, we have grounds to reject the null hypothesis of zero slope and no detectable change. In fact, over
Figure 2. Temporal distance vs. measured linguistic distance, with regression line.
Figure 3. Temporal distance vs. measured linguistic distance, with regression line.
Figure 4. Temporal distance vs. measured linguistic distance, with regression line.
Figure 5. Temporal distance vs. measured linguistic distance, with regression line.
Figure 6. Temporal distance vs. measured linguistic distance, with regression line.

Table I. Average rates of language change for various periods

  Period       Average rate
  1939–1979    0.0011 bits/year
  1939–1948    0.0039 bits/year (* p = 0.0741)
  1949–1958    0.0178 bits/year
  1959–1968    0.0167 bits/year
  1969–1978    0.0111 bits/year
  1982–2000    0.0045 bits/year
all periods except the decade of the 1940s (1939–1948), a more conservative two-tailed test rejects this hypothesis. Over the period of 1982–2000, language changed at an average measured rate of 0.0045 bits/year. In practical terms, then, a Rip Van Winkle with perfect knowledge of English who had fallen asleep in 1982 could have awakened in January 2000 still with a very good practical knowledge of English. However, his background would have been sufficiently out of date that he would have played Shannon’s language game relatively badly, averaging about 0.08 questions per letter poorer than his less well-rested contemporaries, reflecting his lack of knowledge of
current concerns and idioms. (E.g. the phrase “MONICALE” would probably be meaningless to him.) Similarly, over the period 1939–1979, language changed at an average measured rate of 0.0011 bits/year, while the rate of change for individual decades ranges from a low of 0.0039 bits/year (for the 1940s) to a high of 0.0178 (for the 1950s). In statistical tests, all measured rates were significantly (p < 0.0001) different from zero except for the 1940s (1939–1948). This period was marginally significant (t = 1.7888, p = 0.0741) when a two-sided test was applied, and significant (p = 0.0371) when one applied a one-sided test excluding the possibility of a negative slope. However, the measured rates of change for the various decades differ, in most cases significantly so. Due to the topological factors discussed earlier, periods of different lengths cannot be compared directly. However, one can easily observe that the 1940s had significantly less change than the 1970s, the 1970s had significantly less change than the 1950s and 1960s, and, of course, the rate for the 1940s was significantly smaller than those for the 1950s and 1960s. The difference between the rates of change of the 1950s and 1960s was not significant, although it may suggest directions for future and more elaborate experiments.

4. Discussion

4.1. FINDINGS

We have, then, that language does change, and that that change can be algorithmically perceptible, even over periods as small as a decade and using samples of a few thousand characters. Even in isolation, this is a relatively important development and suggests a new addition to the toolbox of language scholars interested in quantifying language change and variation. From a sociological perspective, however, the finding that language change is not uniform over time is more immediately interesting. Given that language appeared to change relatively quickly between 1949 and 1968 and relatively slowly between 1939 and 1948, why did that occur?
More broadly, what’s the difference between the 1940s and the 1950s? And, of course, a historian, faced with a question of such tremendous breadth, could only respond by writing a book (or a series). Not being a historian, the author can offer only the most naive suggestions based on gross-scale perception of cultural pressure. For instance, the 1940s were the period of the Second World War, one of the most significant political events of the twentieth century, especially in terms of direct, personal effects on the “average” American. For the first time in a generation, people were dragged from home and hearth and placed in the theater of world opinion, along with several million of their countryfolk from thousands of miles away. That this event would not somehow leave its mark in the language of those millions of people is implausible. In the course of this war, these millions of people would be exposed to new experiences, new ideas, new technologies, and sometimes an entirely new linguistic
environment. Five years later, the veterans would be taking their new experiences home and retelling them to the people who had not gone – and the advertisers and journalists, in the new peacetime prosperity, would be writing to the veterans in a language they believed would be effective for these people, not necessarily the same language they would have used earlier. Of course, we have no reason to assume that this change would necessarily have been instantaneous, but the cumulative pressures of a billion new experiences could have caused tremendous change. If, in fact, it took more than three years after V-J Day for the “new” language of the veterans to reach the relatively conservative National Geographic, one would expect tremendous change in the 1950s, relative to the 1940s.

At the same time, the world political situation had changed radically over the same period of time. The idea that the United States could comfortably ignore events happening in Asia and Europe was out of fashion, and even a moderately-educated person was expected to know enough about foreign affairs and policy to distinguish between the Communist East and Capitalist West. At the same time, new technological developments had changed the linguistic environment substantially for the average citizen. The introduction of radio in the 1920s and 1930s brought non-local dialects into the home on a scale previously unimaginable; the development of television in the 1950s did the same at an even greater level. The technological message that TV brought did not merely include itself, but also the other technological developments that piggybacked onto the advertising and news messages conveyed. And, of course, it’s a commonplace observation that wars themselves drive technological change; many of the new consumer and industrial goods of the 1950s were directly or indirectly the results of the increased pace of wartime research (e.g. radar, the computer, nuclear power, jet airliners).
Focusing for a moment on the purely technological questions, it is clear that war itself can drive technological change. In addition to direct technological advances to improve one’s prowess on the battlefield, “the cutting off of an accustomed source of supply during wartime has often been an important stimulus for the development of new techniques. Thus France’s early commercial leadership in the production of synthetic alkalis (utilizing the Leblanc Process) was, in large measure, a result of her loss of access to her traditional supplies of Spanish barilla during the Napoleonic wars. The Haber nitrogen fixation process was developed by the Germans during World War I when the British blockade deprived them of their imports of Chilean nitrates. The loss of Malayan natural rubber as a result of Japanese occupation in World War II played a critical role in the rapid emergence of the American synthetic rubber industry” (Rosenberg, 1972, p. 21). Even in the prosaic area of farming, Rosenberg notes (p. 137) that “World War II serves also to mark a transition to substantially higher yields of output per acre, a rise which greatly exceeds anything in our earlier historical experience” and was, in his view, directly attributable to the rise of chemical engineering and power production. Along with the increased productivity, presumably, came social change as farmers
discovered they needed only half the labor to produce the same number of bushels, and thereby freed up their farmhands to join the urban sprawls.

Could any of these be explanations for the observed stability of language in the 1940s and the relatively tremendous change in the 1950s? And can they explain the intermediate levels of change? In particular, we focus on the possibility of a technological explanation for at least part of the rate of linguistic change; that new technology (as developed in the war) was responsible for part of the linguistic innovation of the post-war period, and that a relative lack of technological innovation can explain a period of relative stability in language. Intuitively, the claim is plausible – discussing new inventions forces people to talk differently about them – but numerical confirmation would require additional experimentation.
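The decade-by-decade rates discussed in this section are, in essence, slopes of least-squares lines fitted to (year, linguistic distance) measurements, tested against the null hypothesis of zero slope. The following Python sketch illustrates that computation; the fitting function is a generic ordinary-least-squares fit, and the data points are invented for demonstration only, not the paper’s measurements:

```python
# Illustrative sketch only: a rate of language change expressed as the
# slope (bits/year) of a least-squares line through (year, distance)
# pairs, with a t-statistic for whether the slope differs from zero.
import math

def rate_of_change(years, distances):
    """Return (slope in bits/year, t-statistic for slope != 0)."""
    n = len(years)
    mean_y = sum(years) / n
    mean_d = sum(distances) / n
    sxx = sum((y - mean_y) ** 2 for y in years)
    sxy = sum((y - mean_y) * (d - mean_d) for y, d in zip(years, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_y
    # residual sum of squares, with n - 2 degrees of freedom
    sse = sum((d - (intercept + slope * y)) ** 2
              for y, d in zip(years, distances))
    stderr = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / stderr

# Hypothetical distances (bits/char) measured against a 1982 baseline.
years = [1982, 1986, 1990, 1994, 1998, 2000]
dists = [0.000, 0.019, 0.035, 0.056, 0.071, 0.080]
slope, t = rate_of_change(years, dists)
print(f"rate = {slope:.4f} bits/year, t = {t:.1f}")
```

With these invented numbers the fitted slope comes out near the 0.0045 bits/year figure reported for 1982–2000, which is the sense in which such a rate summarizes many pairwise distance measurements.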
4.2. Possible Confounds

Of course, there are other, less useful explanations of the findings above. Aside from the obvious objections to any empirical study – the findings are merely a statistical fluke, the sample sizes are too small for one to be confident, the National Geographic is not sufficiently representative of language as a whole, and so forth – there are other directed factors that might be significant confounds. For example, as discussed above, the images from the National Geographic are of variable but generally low quality, and the quality gets, in general, worse the older the issue in question is. Poor-quality images translate relatively directly into noisy and error-ridden OCR texts. As the oldest texts, the samples from the 1940s are thus the most error-prone, and the relatively low rates of change could thus be an artifact of a low signal/noise ratio, and the high rate of change of the 1950s might simply measure improving image quality.

To address this concern, some preliminary studies involving different corpora have been performed, and the results, while not yet conclusive, are intriguing and support the reality of the effects described here. The Historic Pittsburgh project, a joint project of the University of Pittsburgh Digital Research Library and the Historical Society of Western Pennsylvania, has been digitizing and storing materials of historical interest for Pittsburgh and western Pennsylvania. Included in their materials is the Full-Text Collection, a set of newly electronicized non-fiction and reference material published in the 19th and early 20th century, covering, according to their project description, “the growth and development of Pittsburgh and the surrounding Western Pennsylvania area from the period of exploration and settlement to the period of industrial revolution and modernization”.
Under the direction of the coordinator, Edward Galloway, these texts have been subject to high-quality OCR and post-editing to produce the best possible machine-readable version of (to date) nearly 400 books. From these 400 books, 68 books of appropriate style, published between 1900 and 1939, were selected. (Stylistic selection was necessary because many of the books are, for example, business directories, surveyor’s reports, annual reports
Table II. Average rates of language change for early 20th century

Period       Average rate
1900–1909     0.0233 bits/year
1910–1919    –0.0011 bits/year
1920–1929     0.0450 bits/year
1930–1939     0.0180 bits/year
of organizations, and similar atypical examples of written English.) These were broken down by decade and analyzed for the rate of change, as before. Omitting statistical detail from these preliminary findings, similar patterns emerged; Table II shows that (with one exception, perilously close to zero and within the bounds of noise) all changes were positive, but that the rate varied from period to period. Furthermore, the findings are suggestive in a historical sense, as in both studies, the period with the lowest measured rate of change was during a major war, while the highest measured rate of change was immediately afterwards.

The Historic Pittsburgh documents are unfortunately not directly comparable to the National Geographic texts. Not only do they come from an altogether different period, but they lack the close editing, editorial consistency, and commonalities in staff authorship. In addition, the dating of individual documents is problematic; a book published in 1920 may well have been written in drafts since the Civil War. Despite these differences, similar patterns of change can be observed, suggesting that these patterns are not the result of (independent) artifacts in two entirely dissimilar sets of data.

5. Conclusions and Future Work

From a methodological standpoint, the most important conclusion is simply: the technique described here works for measuring language change and variation. That language changes is and has been unassailable; how fast it changes has not been the subject of much agreement. This paper has demonstrated how to make a direct quantitative measurement of the amount of language difference from one document to another, even from samples of only a few thousand characters, and yet obtain meaningful measurements. The technique has proven to be useful for observing variation and classifying documents in a variety of ways, including both diachronic and synchronic variation. Opportunities for future work abound.
In addition to the obvious replication and extensions (extend this analysis for a longer period of time, replicate this work for different magazines in the same period, or magazines in different countries/languages, or different genres of writing altogether), there are also possible applications in the area of automatic dating of documents, forgery detection,
and document categorization. Conventional literary scholarship might be empowered to provide numeric answers to questions such as “Over what time period did a given author develop her characteristic style?” and “How did the writings of this author influence his contemporaries over this time period?”

From a technical perspective, the Wyner technique described herein probably admits of improvement. Recent work (Chater and Hahn, 1997; Hutchens and Adler, 1998) suggests that even primitive and psycholinguistically implausible data compression techniques can radically improve our representation of corpus data; corresponding modifications to Wyner’s technique should result in more accurate measures corresponding to a better model of human processing of ambiguous language data. Application of psychologically plausible principles, such as those suggested by Slobin (1979), should improve both the accuracy and the interpretability of results.

Also from a technical perspective should be included the possibility of improving the mathematical models, such as the linear curve-fitting, underlying this work. The linear curve fitting, in particular, explicitly assumes both that language change is directed and that it occurs at a uniform rate, assumptions directly contradicted by the current findings. If even a bad and admittedly implausible model uncovers some findings of significance, however, this suggests both the soundness of the underlying theory and the need for improved and more sensitive models.

From a linguistic, or a psycholinguistic, perspective, much additional work is necessary to explain the numeric findings. “Yes,” one can say, “language changed more in the 1950s than other decades.” However, what form did this change take? Is technologically-driven change primarily lexical, as suggested above? Is the rate of lexical innovation different from the rate of syntactic innovation?
Does this represent merely a pragmatic difference in what people choose to write/talk about, or is there a fundamental difference in the representation of language going on in people’s heads? In addition to requiring close analysis of the relevant documents, new techniques may need to be developed to test these conjectures.

And, finally, from a historical perspective, this may suggest a new indicator of cultural changes and perhaps a new technique to spot previously unsuspected sources for linguistic and cultural pressures. At the very least, information of this sort can be a finger pointing at new information to be read, evaluated, and explained. The current work strongly suggests that language change is related to technological change. However, technological change is clearly not the only factor in language change. A similar investigation could, and should, be performed on any other proposed factors that engender or hinder linguistic change. But merely by allowing language change to be accurately measured, one can use this as a tool to unpack these components of society and examine them individually.
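The match-length idea behind the Wyner technique of entropy estimation can be sketched in miniature: the cross-entropy of a sample relative to a “database” text is approximated by the logarithm of the database length divided by the mean length of the longest database match at each sample position. The following Python sketch is an illustration of that general idea only; the function names, test strings, and exact estimator form are assumptions for demonstration, not the procedure used in this paper:

```python
# Minimal sketch of a Wyner-style match-length entropy estimator:
# familiar text (long matches against the database) yields a low
# estimate; unfamiliar text (short matches) yields a high one.
import math

def match_length(sample, pos, database):
    """Length of the longest prefix of sample[pos:] found in database."""
    length = 0
    while (pos + length < len(sample)
           and sample[pos:pos + length + 1] in database):
        length += 1
    return length

def cross_entropy_estimate(sample, database):
    """Approximate cross-entropy of sample given database, in bits/char."""
    # the +1 keeps positions with no match from zeroing the denominator
    mean_ml = sum(match_length(sample, i, database) + 1
                  for i in range(len(sample))) / len(sample)
    return math.log2(len(database)) / mean_ml

database = "the quick brown fox jumps over the lazy dog " * 20
print(cross_entropy_estimate("the quick brown fox", database))  # low: familiar
print(cross_entropy_estimate("zzqqxxjjkkvv", database))         # high: unfamiliar
```

The asymmetric quantity this yields is the sense in which two documents can be compared: a sample drawn from a similar linguistic environment to the database scores measurably fewer bits per character than one drawn from a different environment, which is what makes diachronic distance measurements of the sort reported above possible.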
Acknowledgements

This work has been aided tremendously by a hard-working team of students at Duquesne University, including Nicola Adamchik, Michael Ahlers, Dean Backeris, Sabrina Foster, Michelle Iztel, and Lance Myers. Duquesne University herself was very supportive, both emotionally and financially, through the medium of an NEH Endowment Grant. Dr. Galloway and the Historic Pittsburgh project have put thousands of hours into making high-quality OCRed texts available to researchers. Jodi Affuso was invaluable as an informant historian and research assistant. Finally, this research has been helped by a score of fruitful discussions with my associates in the International Quantitative Linguistics Association (IQLA).
Notes

1 From which one would expect to derive a mean between 1 and 2 with a long enough English sample.
2 The author was unable to obtain the fourteenth set from Somers due to data transmission problems. (Results from Juola, unpublished ms.)
3 Johnson, 1996, p. 86.
4 Cf. A Confederacy of Dunces, by Toole.
5 Pseudo-decades. The period herewith referred to as “the 1940s” is actually the period 1939–1948; the period referred to as “the 1950s” is 1949–1958, and so forth. Caveat lector.
References

Biber D., Conrad S., Reppen R. (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, Cambridge.
Brown P.F., Cocke J., Della Pietra S.A., Della Pietra V.J., Jelinek F., Lafferty J.D., Mercer R.L., Roossin P.S. (1990) A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), pp. 79–85.
Brown P.F., Della Pietra V.J., Mercer R.L., Della Pietra S.A., Lai J.C. (1992) An Estimate of an Upper Bound for the Entropy of English. Computational Linguistics, 18(1), pp. 31–40.
Burrows J.F. (1989) An Ocean where each Kind . . .: Statistical Analysis and Some Major Determinants of Literary Style. Computers and the Humanities, 23(4–5), pp. 309–321.
Chater N., Hahn U. (1997) Representational Distortion, Similarity, and the Universal Law of Generalization. In Proceedings of the Interdisciplinary Workshop on Similarity and Categorization (SimCat 97), University of Edinburgh, pp. 31–36.
Farach M., Noordewier M., Savari S., Shepp L., Wyner A., Ziv J. (1995) On the Entropy of DNA: Algorithms and Measurements Based on Memory and Rapid Convergence. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, pp. 48–57.
Holmes D.I. (1994) Authorship Attribution. Computers and the Humanities, 28(2), pp. 87–106.
Holmes D.I. (1998) The Evolution of Stylometry in Humanities Computing. Literary and Linguistic Computing, 13(3), pp. 111–117.
Holmes D.I., Robertson M., Paez R. (2001) Stephen Crane and The New-York Tribune: A Case Study in Traditional and Non-Traditional Authorship Attribution. Computers and the Humanities, 35(3), pp. 315–331.
Hutchens J.L., Adler M.D. (1998) Finding Structure via Compression. In Powers D.M.W. (ed.), Proceedings of New Methods in Language Processing 3 and Computational Natural Language Learning, ACL, Sydney, Australia, pp. 79–82.
Johnson E. (1996) Lexical Change and Variation in the Southeastern United States 1930–1990. University of Alabama Press, Tuscaloosa, Alabama.
Juola P. (1997) What Can We Do With Small Corpora? Document Categorization Via Cross-Entropy. In Proceedings of an Interdisciplinary Workshop on Similarity and Categorization, Department of Artificial Intelligence, University of Edinburgh, Edinburgh, UK.
Juola P. (1998) Cross-Entropy and Linguistic Typology. In Powers D.M.W. (ed.), Proceedings of New Methods in Language Processing 3 and Computational Natural Language Learning, ACL, Sydney, Australia.
Khinchin A.I. (1957) Mathematical Foundations of Information Theory. Dover Publications, New York.
Li M., Vitányi P. (1997) An Introduction to Kolmogorov Complexity and Its Applications. Graduate Texts in Computer Science. Springer, New York, 2nd edition.
Rosenberg N. (1972) Technology and American Economic Growth. Harper Torchbooks, New York.
Schneier B. (1996) Applied Cryptography, Second Edition: Protocols, Algorithms and Source Code in C. John Wiley and Sons, Inc, New York.
Shannon C.E. (1948) A Mathematical Theory of Communication. Bell System Technical Journal, 27(4), pp. 379–423.
Shannon C.E. (1951) Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), pp. 50–64.
Slobin D.I. (1979) Psycholinguistics. Scott, Foresman, and Company, Glenview, Ill., second edition.
Somers H. (1998) An Attempt to Use Weighted Cusums to Identify Sublanguages. In Powers D.M.W. (ed.), Proceedings of New Methods in Language Processing 3 and Computational Natural Language Learning, ACL, Sydney, Australia.
Swadesh M. (1955) Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics, 21, pp. 121–137.
Warnow T. (1997) Mathematical Approaches to Comparative Linguistics. Proceedings of the National Academy of Sciences of the USA, 94, pp. 6585–6590.
Weisstein E.W. (1999) CRC Concise Encyclopedia of Mathematics. Chapman and Hall/CRC, Boca Raton.
Wyner A.J. (1996) Entropy Estimation and Patterns. In Proceedings of the 1996 Workshop on Information Theory.