Helsinki Digital Humanities Hackathon 2021: ‘Parliamentary Debates in COVID Times’

The Project

Organised by the University of Helsinki, the online hackathon ‘Parliamentary Debates in COVID Times’ was a short, intense project that took place from 19 to 28 May 2021. Inspired by the recently completed ParlaMint dataset, this multilingual, interdisciplinary project brought together a team of social scientists, computational anthropologists, digital historians, linguists and computer scientists. The main focus of the project was the parliamentary transcripts from the period of the COVID-19 pandemic from four European countries: Italy, Poland, Slovenia and the UK. The team analysed the data in order to determine how the parliamentary debates during the pandemic differed from those of the pre-COVID period, and to identify the differences and similarities between the four countries.

‘Compiling a corpus is already a big project, so being able to skip this step was a huge privilege. Also, knowing that the corpus was granted permission to be included in the CLARIN repository already gives you some idea of its quality.’

Kristina Pahor de Maiti

t-SNE plot with perplexity 20 and exaggeration

Methodology

As their main data source, the team used the ParlaMint 2.1 dataset, a multilingual set of uniformly annotated corpora of parliamentary proceedings.

For keyword analysis and collocations, the team used the NoSketch Engine tool. With the help of the ‘word list’ function, the team compiled a list of the top fifty keywords for each language. The keywords – those words more likely to appear in the COVID subcorpus than in the reference subcorpus – were determined by calculating the keyness score.
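The text does not say which keyness formula was applied. As a minimal sketch, the following assumes the 'simple maths' score commonly associated with Sketch Engine tools: the ratio of the two per-million frequencies, each increased by a smoothing constant. The function names, the smoothing value of 1 and the counts in the example are illustrative assumptions, not figures from the hackathon.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int,
            smoothing: float = 1.0) -> float:
    """'Simple maths' keyness: (fpm_focus + n) / (fpm_ref + n).

    The smoothing constant n avoids division by zero for words that are
    absent from the reference subcorpus; n = 1 is an assumed default.
    """
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)


# Invented counts: a word seen 1,200 times in a 40-million-token COVID
# subcorpus and 15 times in a 60-million-token reference subcorpus.
example_score = keyness(1200, 40_000_000, 15, 60_000_000)
print(round(example_score, 2))
```

A score well above 1 marks a word as typical of the COVID subcorpus, while a score close to 1 means the word is about equally frequent in both.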

The ‘collocations’ functionality was used to create lists of collocations, which were then sorted by the logDice score, a measure of how strongly the two words in a collocation are associated. However, in order to achieve a more meaningful result, which correlated with the specific terms used in the parliamentary debates, the team established collocation networks for specific time periods, based on the seed term ‘virus’.
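For readers unfamiliar with logDice, the sketch below uses the standard definition, 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence count and f_x and f_y are the frequencies of the two words. The candidate words are taken from this report, but all counts are invented for illustration and the helper names are hypothetical.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score; the theoretical maximum is 14."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(seed_freq: int, candidates: list) -> list:
    """Sort (word, co-occurrence count, word frequency) tuples by logDice."""
    scored = [(word, log_dice(f_xy, seed_freq, f_y))
              for word, f_xy, f_y in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented counts for a seed term that occurs 1,000 times.
toy_candidates = [
    ("time", 60, 20_000),
    ("outbreak", 50, 200),
    ("crisis", 40, 300),
]
ranking = rank_collocates(1000, toy_candidates)
for word, score in ranking:
    print(f"{word}: {score:.2f}")
```

In this toy example 'time' co-occurs with the seed most often but ranks last, because it is also very frequent elsewhere in the corpus. This is why sorting by logDice rather than by raw co-occurrence tends to surface more topic-specific collocates.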

Collocation Network for seed term VIRUS in 2020-05 (Corpus: GB)

In order to identify which keywords occurred across all four countries, and which were country-specific, the hackathon team then manually selected the top twenty COVID-related keywords from each parliament and translated them into English. The fastText embedding model and t-SNE visualisations from Orange were used to retrieve and map word vectors.
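Once the keyword lists have been translated, the split into shared and country-specific keywords can be expressed as a simple set comparison. The sketch below covers only that comparison and does not reproduce the embedding or t-SNE steps. The lists are toy data made up for illustration, not the team's actual selections, and the function name is hypothetical.

```python
from collections import Counter


def split_shared_and_specific(keywords_by_country: dict):
    """Return (keywords found in every country, keywords unique to one country)."""
    # Count in how many countries each keyword appears (once per country).
    country_counts = Counter(
        keyword
        for keywords in keywords_by_country.values()
        for keyword in set(keywords)
    )
    n_countries = len(keywords_by_country)
    shared = {kw for kw, n in country_counts.items() if n == n_countries}
    specific = {
        country: {kw for kw in keywords if country_counts[kw] == 1}
        for country, keywords in keywords_by_country.items()
    }
    return shared, specific


# Toy lists (already translated into English); purely illustrative.
toy_keywords = {
    "IT": ["virus", "pandemic", "mask", "quarantine"],
    "PL": ["virus", "pandemic", "mask", "ventilator"],
    "SI": ["virus", "pandemic", "mask", "infection"],
    "GB": ["virus", "pandemic", "mask", "lockdown"],
}
shared, specific = split_shared_and_specific(toy_keywords)
print(sorted(shared))
print({country: sorted(words) for country, words in specific.items()})
```

Exact string matching is a simplification: translated keywords that are near-synonyms would count as different items here, which is one reason to map them in an embedding space instead.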

Using ggplot2, the team then plotted timelines of word frequencies using relative occurrences, and added a curve indicating the number of COVID cases, thus illustrating the relation between the parliamentary debates and the epidemiological situation in each country.
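The 'relative occurrences' behind such a timeline can be computed by dividing the keyword count in each period by the total number of tokens spoken in that period. The team's plots were made in R with ggplot2; the Python sketch below only illustrates the normalisation step. Monthly granularity and scaling per million tokens are assumptions, and all numbers are invented.

```python
def relative_frequencies(keyword_counts: dict, total_tokens: dict,
                         per: int = 1_000_000) -> dict:
    """Keyword occurrences per `per` tokens for each period.

    Periods without any tokens are skipped, so months with no sittings
    do not cause a division by zero.
    """
    return {
        period: keyword_counts.get(period, 0) * per / total
        for period, total in sorted(total_tokens.items())
        if total > 0
    }


# Invented monthly figures for one parliament.
toy_counts = {"2020-03": 50, "2020-04": 200}
toy_totals = {"2020-02": 500_000, "2020-03": 1_000_000, "2020-04": 2_000_000}
timeline = relative_frequencies(toy_counts, toy_totals)
print(timeline)
```

Normalising in this way keeps months with many sittings from dominating the timeline simply because more words were spoken.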

Slovenia - relative proportion of keywords over time

‘It was really nice to have such a well-structured dataset of this size. It’s great that the dataset spans several years and that it’s well-annotated, so that it offers lemmas, part-of-speech tags and named entities – it offers a lot of opportunities for researchers.’

Ajda Pretnar Žagar

Outcome

The results showed that the majority of the top fifty keywords for all countries were related to the pandemic. In addition, there was a strong overlap among the manually selected top twenty COVID-related keywords across the four countries, with keywords falling into two broad semantic clusters: the pandemic itself (for example, virus, pandemic, infection) and the reaction to the pandemic (quarantine, ventilator, mask).

In terms of the most prominent collocations, there were also clear parallels – both for the pandemic itself (such as outbreak, crisis, death, cause, time, infection, emergency, global, impact) and also for the measures that were taken (against, response, preparedness, handle, recovery, fund, reform, stability, guideline, reopen).

Collocation Network for seed term VIRUS in 2020-04 (Corpus: IT)

The collocation networks offered useful insight into the relationship between key terms in the parliamentary discussions, especially when viewed against a timeline. Although some words and clusters referred to country-specific debates, overall the four countries exhibited similarities in terms of the themes that emerged. At first, in March 2020, debates in all countries focused on crisis response, but in subsequent months the discussions increasingly centred on the measures needed to contain the virus, such as lockdowns and quarantine. Other themes that emerged included the polarisation of public opinion and vaccines.

When comparing the timelines of word frequencies against the epidemiological situation, the team noted that during the first wave of the pandemic in the spring of 2020, the increase in COVID-related parliamentary discussions mirrored the rise in the number of cases. However, during the second wave, this was not the case – the increase in COVID-related debates was less pronounced than the actual increase in infections.

Poland - relative proportion of keywords and number of new cases per 50 million population over time

CLARIN Tools and Resources

The project was based on the recently published ParlaMint 2.1 dataset. This resource includes transcripts of parliamentary sessions for seventeen parliaments in sixteen languages, with around 500 million words in total. The sessions in the corpora are marked as either belonging to the COVID-19 period (after 1 November 2019), or as ‘reference’ (before that date). The corpora contain extensive metadata, such as the speaker’s name, gender and party affiliation, as well as linguistic annotations of the transcripts, such as named entities and lemmas.
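The split into COVID and reference sessions comes down to a comparison of dates. The sketch below shows that logic for a session whose date is already known. It is not code from the ParlaMint distribution, where the label is already supplied with each session. The names are hypothetical, and treating sessions held on 1 November 2019 itself as part of the COVID period is an assumption.

```python
from datetime import date

# Cut-off stated in the text; including the day itself is an assumption.
COVID_START = date(2019, 11, 1)


def subcorpus_for(session_date: date) -> str:
    """Label a session as 'COVID' or 'reference' by its date."""
    return "COVID" if session_date >= COVID_START else "reference"


print(subcorpus_for(date(2019, 10, 31)))
print(subcorpus_for(date(2020, 3, 15)))
```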

Access ParlaMint 2.1

Views on CLARIN

‘I really like the comparative perspective that the ParlaMint dataset offers, making it possible to compare different national parliaments. It would be great to have everything available in English, so any researcher could approach this data not just from a linguistic perspective, but from a content perspective. We’re currently experimenting with machine translation, to enable machine-translated parliamentary corpora for future use. This would open up all sorts of opportunities for researchers in political science, sociology, social sciences, history and more.’

Ajda Pretnar Žagar

‘What was great was that everything was standardised across the countries – this was really crucial. Another excellent feature is the metadata – the additional information about the speakers, such as their ages and their gender. It’s public domain knowledge, so anyone could find and match this data, but it would take so much time and effort to collect it. The ParlaMint team did all that work, so that’s wonderful.’

Marta Kołczyńska

For a comprehensive discussion of the hackathon, see the blog post

HELDIG - Helsinki Centre for Digital Humanities

To watch a video about the hackathon, go to CLARIN Café

A linguistically marked-up version of the corpus is available here.

For a recently published, multimedia tutorial on how to conduct high-quality corpus analysis via concordancers without the need for programming skills, see here.

Contributors

Isabella Calabretta, Digital Product Manager at Cambridge University Press & Assessment

Courtney Dalton, MLIS student at Simmons University, Boston, Massachusetts

Richard Griscom, PhD, Postdoc, Centre for Linguistics, Leiden University

Marta Kołczyńska, PhD, Assistant Professor, Institute of Political Studies, Polish Academy of Sciences

Matej Klemen, Young Researcher, Faculty of Computer and Information Science, University of Ljubljana

Kristina Pahor de Maiti, Research Assistant, Faculty of Arts, University of Ljubljana

Ajda Pretnar Žagar, PhD, Researcher, Faculty of Computer and Information Science, University of Ljubljana, and Institute of Contemporary History

Ruben Ros, Doctoral Researcher, Centre for Contemporary and Digital History, University of Luxembourg