State of Lexicography Orin Hargraves Spring 2021

Defining Moments

Orin Hargraves

I wrote my first definition for money in 1991. Here it is now 2021! Some things have changed, some have stayed the same during those 30 years.

1991-95: Paper Gives Way to Pixels

My first paid lexicography gig was on the Longman Dictionary of English Language and Culture. It was a first-edition learner’s dictionary, largely already written by British lexicographers. Four UK-resident Americans were recruited for the job, which was to add American content and Americanize the standing definitions enough so that the book could be marketed to learners of American English.

None of us had or were expected to have a computer at home, and the internet was a novelty we’d barely heard of. Batches of work came to us in the mail, printed on A4 sheets. Along with each batch came a packet of paper-clipped index cards, which represented the cross-references to other entries in the dictionary from that batch. We edited the British definitions (adding American senses where needed), and added new words from American vocabulary that fell alphabetically within our batch. If a new cross-reference was required, we added an index card. All of our work was mailed back to Longman in Harlow, where their people transferred our handwritten work to their database.

There was no corpus or anything else for us to look at; only other dictionaries. We were given pretty clear guidelines about what amount of “inspiration” we might draw from looking at the work of others who had already invented the particular wheel that we were working on. We relied largely on our intuitions and on each other; the four of us were often on the phone, even though we lived in different parts of Britain, and we got together in person when we could.

The biggest challenge of the project was staying within the bounds of Longman’s Defining Vocabulary, the list of 3000 words which could be used in definition text. If you needed to use a hard or unfamiliar word in a definition it was set in small caps as a cross-reference, and this was discouraged. We Americans soon found that the British defining vocabulary was less than ideal for American English. We lobbied for and got some changes to it. Cricket, lord, parliament, and railway were out; baseball, high school, inch, and railroad were in, among many other changes, which eventually resulted in Longman having two defining vocabularies, one for British and one for American English. I was reminded of this not too long ago when I worked on the Oxford 3000 and the Oxford 5000, which also exist in two versions: British and American.

I continued to work for Longman during the next four years on various titles for the American market. By the end of that time, they had rented me a computer (£50/month) on which I now did the work in flat files and sent it to them on 3-inch floppies.

1996-2000: Dictionary Software, Corpora, and the Migration Online

I moved back to the US for the first time in 1992 and soon after attended my first DSNA conference in 1993. There I met the great and charming Sidney Landau, who agreed to take me on for an upcoming project with CUP. By the time it got fully underway (early 1995) I was back in the UK but this proved not to be an impediment. Paul Heacock was visiting Cambridge and he came down to London one day to get me up to speed. Together we loaded CUP’s editing software (I think it was an off-the-shelf XML editor) and a corpus—all onto Longman’s rented computer! It took more than an hour; the software and corpus were on a boatload of 3-inch disks that had to be loaded one by one. Paul showed me how to use the corpus, write and edit definitions on the computer, and also how to import and export packages of work via FTP. This project was the Cambridge Dictionary of American English, another Americanization project for which the underlying data was the Cambridge International Dictionary of English.

The next year I was back in the States, and the Cambridge database was on PubMan, a product developed by Stephen Perkins for dictionary content management. PubMan answered all the needs of the Cambridge data. You could edit and augment the data online or off, but mostly off, using an XML editor. It was a great product for basic dictionaries and of all the systems I have worked on, the one I still like the best.

During this period I also worked on EWED, the Encarta World English Dictionary. Different software and editing environment, but essentially the same setup of doing work offline but having the ability to look into other parts of the dictionary online, which helped greatly with cross-references and with checking for inclusions and omissions.

2001-05: Word Sketches and Prepositions

Following close on the heels of EWED, I worked on the Macmillan English Dictionary (MED) with a lot of the cast of characters who had worked on EWED. It was a new, from-scratch learner’s dictionary. But now here was something novel: a collection of CD-ROMs on which were loaded Word Sketches. Sketch Engine wasn’t online yet but the vision of it was incubating in Adam Kilgarriff’s capacious mind and he had already put together Word Sketches for thousands of high-frequency words, using the British National Corpus. If a word had a Word Sketch, we were to use it in crafting our definitions for the MED.

I can’t adequately describe what a revolution Word Sketches were for streamlining the work of the lexicographer. So I’ll borrow some words from Buddhist scripture: “Magnificent! Just as if one were to place upright what was overturned, to reveal what was hidden, to show the way to one who was lost, or to set out a lamp in the darkness so that those with eyes could see forms, in the same way has the Blessed One made the truth clear.” Only here it was not the Blessed One but rather the Word Sketch that made the truth clear. Hours of painstaking and mind-numbing study of corpus examples could be saved by simply looking at a Word Sketch and absorbing a digest of a word’s behavior patterns extracted from thousands of language samples.

The gravy train of remote defining for various publishers that had kept me going all through the 1990s suddenly came to a halt – not long after I’d bought a house and a car in 1998. Along with the hat-in-hand emails that I regularly sent to everyone I’d ever worked for to see if there was more work about, I looked around for other things to do. This led me to Ken Litkowski, who lived about an hour away from me and who was working on what became the Preposition Project: an attempt to characterize the semantic and syntactic features of prepositions in a way that would be usable in natural language processing (NLP). I wrote about it briefly in the February 2020 Newsletter.

What I did for Ken over the course of four years was pretty much the converse of defining: you start with a sense inventory (we used the one from the New Oxford Dictionary of English) and your job is to map sentences that instantiate a particular word sense to that sense in the inventory. If the usage represented a sense that was not in the inventory, I expanded the inventory to account for the undocumented sense by writing a new definition. This proved to be necessary only infrequently.

Spending most of my working life with prepositions for four years took harmless drudgery to a new level but the experience was invaluable, primarily in introducing me to the nuts and bolts of NLP: the never-ending business of training computers to deal competently with natural language.

2006-2010: Software is King

As dictionary publishers dropped like flies and more and more in-house lexicographers got their pink slips, I reflected that perhaps I had made a good choice in never becoming one, despite occasional temptations. There were no long-term defining projects around during this period: only odd jobs of a few weeks to a few months for OUP, CUP, HarperCollins in Glasgow, and Merriam-Webster. I never turned down an offered job: doing any tiresome old task was better than having no work. The benefit of this was having to regularly learn new software and develop the ability to jump quickly from one platform to another. These skills were now indispensable; anything I knew about lexicography would have found no takers if I couldn’t quickly master new software as well.

OUP and HarperCollins had both started using software from IDM (DWS, their Dictionary Writing System). It was PubMan on steroids: many more bells and whistles, it required a lot more training time to master, a lot more things could go wildly wrong, and for any given dictionary entry, there were a lot more things the lexicographer had to input or check.

Through Ken I was introduced to Roberto Navigli of the University of Rome, who was developing computational models for word sense disambiguation. Roberto gave me projects that required me to map instances of usage (nouns, verbs, adjectives, and adverbs) in a corpus to various dictionary sense inventories. This involved one delightful trip to Rome but was mostly done online, using interfaces that Roberto and his team wrote. Work with him eventually led to our paper for the 2007 SemEval Coarse-Grained English All-Words Task, which still racks up an occasional citation today.

2011-2021: The LSA Throws Open the Door to NLP

In 2011 the DSNA sponsored me to teach the lexicography course at the LSA Summer Institute at CU-Boulder. Martha Palmer, professor of Linguistics and Computer Science at CU, ran the Institute. Owing to Ken Litkowski, Adam Kilgarriff, and Roberto being in academic circles that overlapped with hers, she was acquainted with my capacity for computational harmless drudgery.

By this time I had seen the writing on the wall about the viability of continuing to earn a living from contract lexicography and the writing said: “not anymore, chump.” It was already in my mind to return to Colorado (where I grew up) to spend more time with my ailing mother and I mentioned this to Martha. She said: “Come to Boulder; I’ll fix you up with something.” This is how I know that she is my fairy godmother.

The next year I went to work for her, expanding coverage in VerbNet in order to make it more usable in the NLP community. We needed to sense-map thousands of verb usages from various corpora to the VerbNet inventory in order to identify gaps in the inventory generally, and also to discover important missing senses of polysemous verbs. If a new verb or new sense needed to be added to VerbNet, Martha’s team of grad students and I found a home for it in the hierarchy (based initially on Beth Levin’s English Verb Classes and Alternations) and tried to nail down the limits on its syntactic behavior. This process was often long and fraught, and exactly like trying to determine how many genuinely distinct senses a verb has and how they should be divided. After years of working with prepositions, verbs felt technicolor and fascinating. I still find them so.

Through Martha I got onto a project for the Technische Universität Darmstadt. The challenge there was to see if a database of text from Yahoo Answers could be mapped in detail to FrameNet. Yahoo Answers is a bit like Quora, if you subtract grammaticality, factuality, good spelling, and any pretense of authority. The annotation interface required mapping someone’s Yahoo Answer to a specific frame in FrameNet, and then individually mapping the sentence constituents to whatever frame elements they represented. This work was one of the inputs for Knowledge-based Supervision for Domain-adaptive Semantic Role Labeling, the whopping 268-page dissertation of Dr. Silvana Hartmann that I know I should read someday.

All of these NLP projects were fun and challenging and they all extended my lexicographic mind in directions I had never anticipated. I think the chief take-home was something that I already strongly suspected from two decades of defining: the more polysemous a word is, the more arbitrary is the division of its meanings into discrete senses. In the end, you have a job to do, whether it’s turning out a definition or assigning a word usage to a definition, and you’re nearly always working against the clock. Agonizing about nuances of difference between particular senses, even though it is the cherished pastime of the lexicographer and the semanticist, is rarely productive. It’s hard to find anyone who will pay you for it.

There are a couple of larger take-homes from all of the foregoing: first, it is an irony that today, owing to the internet and computer technology, lexicographers have at their disposal unsurpassed resources for writing good definitions that reflect real language usage. But the internet and computer technology have effectively collapsed the commercial dictionary market, and so the need for lexicographers to define is now greatly diminished.

The second take-home is that lexicography, the longer you have the privilege and good fortune to practice it, gives you a valuable facility with language that is still useful and relevant today, even as the writing of English dictionaries has become a quiet backwater.