The Project
This project illustrates how changes in language in response to a crisis can be traced, almost in real time, using a monitor newspaper corpus. The study ‘Contagious “Corona” Compounding by Journalists in a CLARIN Newspaper Monitor Corpus’ examines the linguistic changes that occurred in the Norwegian language during the first wave of the COVID-19 pandemic.
Methodology
This study used the Norwegian Newspaper Corpus as its data source. All occurrences of words starting with corona/korona in the period from 9 January 2020 to 8 March 2021 were retrieved using the Corpuscle corpus management and search system, and then downloaded as a tab-separated file with keywords, newspaper codes and dates. The final word list, after some forms and errors had been removed, contained 167,957 tokens. Pre-processing, analysis and plotting were performed with a shell script that called programs written in Awk, Python and R.
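The pre-processing step described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual script: the column order (keyword, newspaper code, date), the ISO date format, the newspaper codes and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical sample in the assumed layout:
# keyword <TAB> newspaper code <TAB> ISO date. Not real corpus data.
SAMPLE_TSV = (
    "coronavirus\tAP\t2020-01-09\n"
    "coronaviruset\tAP\t2020-01-09\n"
    "Koronavirus\tVG\t2020-02-26\n"
    "koronatelt\tBT\t2020-03-15\n"
    "koronavirus\tDB\t2020-03-15\n"
)


def read_hits(handle):
    """Parse tab-separated hits into (word, source, date) tuples.

    Words are lower-cased; rows that are malformed or do not start
    with corona/korona are skipped.
    """
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip malformed lines
        word = row[0].strip().lower()
        source = row[1].strip()
        day = row[2].strip()
        if not word.startswith(("corona", "korona")):
            continue
        hits.append((word, source, date.fromisoformat(day)))
    return hits


hits = read_hits(io.StringIO(SAMPLE_TSV))
token_count = len(hits)                              # running words
type_counts = Counter(word for word, _, _ in hits)   # distinct word forms
```

With a real download, the StringIO object would be replaced by an opened file; the token count corresponds to the size of the word list, and the keys of type_counts to the distinct compounds.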
Outcome
Not only was the frequency of new compounds with the stem corona/korona in the studied timeframe very high, but the speed of vocabulary growth and the diversity of new words were also noteworthy. The earliest relevant compounds in the Norwegian Newspaper Corpus appeared on 9 January 2020: coronavirus and its definite form coronaviruset. Initially, the use of these and other compounds remained modest. However, on 26 February of the same year, when the virus was detected in Norway, there was a marked increase.
Many of the new compounds are heavily context-dependent: for instance, koronatelt (corona tent), koronautsettelsene (corona postponements), coronalov (corona law) and coronakompensasjon (corona compensation). Several of the compounds are metaphorical and have emotional connotations, such as those with the final parts knekken (breakdown), knipen (pinch), spøkelset (ghost), tsunamien (tsunami) and tabu (taboo).
The study also shows that the creation of new compounds did not stop or slow during the studied timeframe; new words continued to emerge throughout the entire period.
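One way to make sustained vocabulary growth visible is to count, for each date, how many distinct word forms have been seen so far. The sketch below is only an illustration of that idea under stated assumptions: the function name vocabulary_growth is hypothetical, and the dated hits are invented sample data, not results from the study.

```python
from datetime import date

# Hypothetical (word, date) hits; real input would come from the corpus download.
HITS = [
    ("coronavirus", date(2020, 1, 9)),
    ("coronaviruset", date(2020, 1, 9)),
    ("coronavirus", date(2020, 2, 26)),
    ("koronatelt", date(2020, 3, 15)),
    ("koronautsettelsene", date(2020, 6, 2)),
    ("koronatelt", date(2020, 6, 2)),
]


def vocabulary_growth(hits):
    """Return (date, cumulative number of distinct word forms) per date with hits."""
    by_day = {}
    for word, day in hits:
        by_day.setdefault(day, []).append(word)

    seen = set()
    curve = []
    for day in sorted(by_day):
        seen.update(by_day[day])       # only unseen forms enlarge the set
        curve.append((day, len(seen)))
    return curve


growth = vocabulary_growth(HITS)
```

A curve that keeps rising until the end of the period indicates that new compounds are still being coined; a curve that flattens would indicate that coinage has slowed.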
In terms of spelling, there was, perhaps surprisingly, a rapid shift: in January 2020 the spelling with c- was still very dominant, but the majority of the media had adopted the new spelling with k- within about a month of the intervention by the Language Council of Norway. However, despite the initial sharp rise, the change was never completed: the share of k- spellings plateaued at about seventy to eighty per cent.
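The share of k- spellings over time can be computed along the following lines. This is a hedged sketch rather than the study's own analysis: the monthly grouping, the function name k_share_by_month and the sample hits are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (word, date) hits, not real corpus data.
HITS = [
    ("coronavirus", date(2020, 1, 9)),
    ("coronaviruset", date(2020, 1, 20)),
    ("koronavirus", date(2020, 2, 27)),
    ("coronavirus", date(2020, 2, 28)),
    ("koronatelt", date(2020, 3, 15)),
    ("koronavirus", date(2020, 3, 16)),
    ("coronalov", date(2020, 3, 18)),
    ("koronaviruset", date(2020, 3, 20)),
]


def k_share_by_month(hits):
    """Return {(year, month): fraction of tokens spelled with k-}."""
    counts = defaultdict(lambda: [0, 0])  # month -> [k- tokens, all tokens]
    for word, day in hits:
        form = word.lower()
        if not form.startswith(("corona", "korona")):
            continue  # ignore anything outside the two spellings
        key = (day.year, day.month)
        counts[key][1] += 1
        if form.startswith("k"):
            counts[key][0] += 1
    return {month: k / total for month, (k, total) in sorted(counts.items())}


shares = k_share_by_month(HITS)
```

Plotting these fractions per month, or per week for finer resolution, would show both the sharp initial rise and the level at which the share settles.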
CLARIN Tools and Resources
This study used the Norwegian Newspaper Corpus as its data source. The corpus is part of the CLARIN Resource Family ‘Newspaper Corpora’. It is updated every night by harvesting publicly accessible articles from ten major Norwegian online newspapers. At every automatic update, boilerplate is removed so that nearly clean text is left, and each article is tagged with the date and the source.
The corpus was accessed through the Corpuscle corpus management and search system, which was developed at the CLARINO Bergen Centre. This system has a user-friendly interface and a powerful and efficient query system. It allows arbitrary start and end dates to be specified in queries, and matching strings, with optional annotation features, can be downloaded to a file with tab-separated values.
Views on CLARIN
Koenraad De Smedt, Professor of Computational Linguistics, Department of Linguistic, Literary and Aesthetic Studies, University of Bergen, Norway