Tour de CLARIN: Historical Thesaurus of English

Submitted by Jakob Lenardič on 21 October 2020

Written by Fraser Dallachy and Marc Alexander

The Historical Thesaurus of English is an invaluable resource for research into the semantics of English, from the study of individual concepts up to a perspective on the language as a whole, from its beginnings to the present day. It contains every sense of every word in the language as recorded by the Oxford English Dictionary and other sources, including Old English dictionaries, sorted into semantic categories which are themselves ordered into a comprehensive hierarchy of ideas. The work for the first edition was conducted by research staff and students at the University of Glasgow over the course of nearly half a century, beginning in 1964 and reaching publication in 2009. The Thesaurus is now freely available for consultation through the University of Glasgow’s webpages and accessible via the CLARIN-UK webpages.

Since completion, Thesaurus data has been explored by a number of daughter projects, most notably the Mapping Metaphor project, also based at the University of Glasgow. Mapping Metaphor looked for evidence of systemic repurposing of words from one semantic field into another – a phenomenon known as a conceptual metaphor – such as the use of words related to travelling (“It’s been a long road” or “We had a bumpy start”) to describe life experiences. By comparing the contents of every individual category in the Historical Thesaurus against all other categories, the project mapped those areas of the English vocabulary which are strongly connected by such metaphorical borrowing of words. These are displayed on the Metaphor Map of English, which was awarded ‘Best DH data visualization’ in the 2015 DH Awards.
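The category-against-category comparison described above can be illustrated with a minimal Python sketch. It is not the Mapping Metaphor project's actual procedure: the category names, the word lists, the `shared_vocabulary` helper and the `min_shared` threshold are all hypothetical, and shared word forms are treated here only as a rough first signal that two semantic fields may be connected.

```python
from itertools import combinations

# Hypothetical toy categories; the real Thesaurus holds far more
# categories and word senses than this illustration.
categories = {
    "Travel": {"road", "journey", "path", "bumpy", "crossroads"},
    "Life": {"journey", "road", "birth", "crossroads", "age"},
    "Surface texture": {"bumpy", "rough", "smooth"},
}


def shared_vocabulary(cats, min_shared=2):
    """Return category pairs sharing at least `min_shared` word forms."""
    links = []
    # Compare every category against every other category exactly once.
    for name_a, name_b in combinations(sorted(cats), 2):
        shared = cats[name_a] & cats[name_b]
        if len(shared) >= min_shared:
            links.append((name_a, name_b, sorted(shared)))
    return links


links = shared_vocabulary(categories)
for name_a, name_b, words in links:
    print(f"{name_a} <-> {name_b}: {', '.join(words)}")
```

In this toy data only the Life and Travel categories pass the threshold. An overlap of this kind is at best a candidate link; deciding whether it reflects a genuine metaphor still requires human judgement.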

Figure 1: The Historical Thesaurus entry for the noun spirituality, showing related concepts sorted by date of attestation.

The organization of lexis into meaning categories forms an ideal knowledge base for semantic annotation software, which attempts to tag words in text with a label representing their meaning. Such semantically annotated text allows the use of concepts as search terms (for example the concept HAPPINESS rather than the word “happiness”) and can feed into deeper analysis of the structuring of information within sentences and texts. This use of the data was explored by a multi-institutional team including experts from the University of Lancaster whose previous work included the UCREL Semantic Analysis System (USAS) and the VARD spelling normalization tool. The project developed the Historical Thesaurus Semantic Tagger and released two semantically annotated linguistic corpora, Semantic Hansard and Semantic EEBO, both freely accessible through English-corpora.org (registration required). The Hansard Corpus contains records of debates in the Houses of Commons and Lords of the British parliament from 1803 to 2005 CE (approximately 1.6 billion words) and is the largest parliamentary corpus in the CLARIN Resource Families, whilst EEBO (Early English Books Online) contains open-source transcriptions of early modern (roughly 1470–1700 CE) printed material (approximately 755 million words in 25,000 texts).
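The idea of searching by concept rather than by word form, mentioned above, can be sketched in a few lines of Python. This is a toy illustration and not the Historical Thesaurus Semantic Tagger: the `LEXICON` mapping, the concept labels and the `tag` and `search_concept` helpers are hypothetical, and the sketch assigns at most one label per word, whereas a real tagger must also choose between a word's competing senses.

```python
import re

# Hypothetical mini-lexicon mapping word forms to concept labels.
LEXICON = {
    "happiness": "HAPPINESS",
    "joy": "HAPPINESS",
    "glad": "HAPPINESS",
    "road": "TRAVEL",
    "journey": "TRAVEL",
}


def tag(text):
    """Return (token, concept-or-None) pairs for each word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(tok, LEXICON.get(tok)) for tok in tokens]


def search_concept(sentences, concept):
    """Return the sentences containing a token tagged with `concept`."""
    return [
        s for s in sentences
        if any(label == concept for _, label in tag(s))
    ]


sentences = [
    "Her joy was plain to see.",
    "It has been a long road.",
    "I am glad you came.",
]
hits = search_concept(sentences, "HAPPINESS")
print(hits)
```

Searching for the concept HAPPINESS retrieves the first and third sentences even though neither contains the word “happiness”, which is the practical benefit of annotating text with meaning labels.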

The final major project to use Thesaurus data thus far is Linguistic DNA, which primarily sought to explore regular word groupings in EEBO texts as evidence for the emergence and development of concepts in the early modern period. Thesaurus data was analysed for evidence that particular semantic fields experienced sudden rises, falls, or other remarkable behaviour in the focal period, and a tool for exploring these is under development, with a test version available through the Thesaurus website.

Figure 2: The list of concepts and associated words related to spirituality.

The Historical Thesaurus is a rich resource whose exploration has really only just begun. There are a number of other projects working with the data and that of its sister project, A Thesaurus of Old English, with highly anticipated results. A second edition of the Thesaurus, incorporating data from the 3rd edition of the Oxford English Dictionary, is due to launch in autumn 2020 and is currently keeping the editorial team very busy! Nonetheless, researchers interested in using Thesaurus data are encouraged to get in touch.