What's on the agenda? Topic modelling parliamentary debates before and during the COVID-19 pandemic

Goals and Objectives

The main goal of this tutorial is to introduce basic text mining concepts to digital humanities beginners by applying Latent Dirichlet Allocation (LDA) topic modelling to a specific use case.

Learning Outcomes

By following this tutorial, the students will learn to:

independently perform topic modelling on new data, typically on a comparable corpus of parliamentary debates;
understand the pitfalls of topic modelling and know when to apply the method and when not to.

Author(s)

Ajda Pretnar Žagar

Researcher

Institute of Contemporary History

Privoz 11, 1000 Ljubljana, Slovenia

Other contributors

Kristina Pahor de Maiti: result interpretation, theoretical part on parliamentary debates, testing

Darja Fišer: conceptual design, testing

Description of the Training Materials

(Sub)discipline & language(s)	Topics: Digital Humanities; Language: English
Keywords	Topic modelling, LDA, parliamentary debates, text mining
Project URL	The tutorial is available at https://sidih.github.io/agenda/index.html. The files can be downloaded from: https://sidih.si/20.500.12325/2178.
CLARIN resources	ParlaMint-GB annotated corpus
Target audience	Beginners in digital humanities, specifically anyone who is interested in parliamentary corpora or topic modelling
Facilities required	No specific requirements other than a laptop with admin rights. The student will have to install Orange (open-source software) and have about 8 GB of RAM for the analysis to run fairly smoothly. We provide additional materials for students with less processing power (preprocessed corpora, subsets).
Format	PDF and online (XML)
Licence and (re)use	CC-BY-SA
Creation date	12.04.2022
Last modification date	30.05.2022

Experience with Using CLARIN Resources in Teaching

Using the ParlaMint corpora is fairly straightforward. They are well annotated in the standard CoNLL-U format. The data was easy to find in the repository, from where it was downloaded and used for the analysis. The rich speaker metadata is well suited to detailed analyses.
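
For readers who want to inspect CoNLL-U files outside Orange, the following is a minimal sketch of how lemmas could be read from such a file with plain Python. It is not part of the tutorial workflow. The function name `sentence_lemmas`, the choice to drop punctuation, and the two sample sentences are illustrative assumptions; the sample is invented and is not taken from ParlaMint-GB.

```python
from typing import Iterable, List, Tuple


def sentence_lemmas(
    lines: Iterable[str], skip_upos: Tuple[str, ...] = ("PUNCT",)
) -> List[List[str]]:
    """Collect lemmas per sentence from CoNLL-U lines (hypothetical helper).

    CoNLL-U stores one token per line in 10 tab-separated columns:
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    Lines starting with '#' are comments; a blank line ends a sentence.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level metadata, not a token
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # not a well-formed token line
        token_id, lemma, upos = fields[0], fields[2], fields[3]
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if not token_id.isdigit():
            continue
        if upos in skip_upos:
            continue
        current.append(lemma)
    if current:
        sentences.append(current)
    return sentences


# Invented two-sentence sample in CoNLL-U layout (assumption, not corpus data).
SAMPLE = "\n".join(
    [
        "# sent_id = example-1",
        "# text = The House debated vaccines.",
        "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_",
        "2\tHouse\tHouse\tPROPN\t_\t_\t3\tnsubj\t_\t_",
        "3\tdebated\tdebate\tVERB\t_\t_\t0\troot\t_\t_",
        "4\tvaccines\tvaccine\tNOUN\t_\t_\t3\tobj\t_\t_",
        "5\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
        "",
        "# sent_id = example-2",
        "# text = We agree.",
        "1\tWe\twe\tPRON\t_\t_\t2\tnsubj\t_\t_",
        "2\tagree\tagree\tVERB\t_\t_\t0\troot\t_\t_",
        "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
        "",
    ]
)

if __name__ == "__main__":
    for lemmas in sentence_lemmas(SAMPLE.splitlines()):
        print(" ".join(lemmas))
```

Lists of lemmas like these are a common input for topic models; whether to filter further by part of speech depends on the research question.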

Reusability Notes

The materials include links for independent work (workflows, data, software references). They can easily be reused in two ways:

Topic modelling on a different data set, for example on a ParlaMint corpus from a different country
Expanding on the techniques used in the tutorial, for example, semantic analysis of the corpus, longitudinal comparison, or applying a different topic model

All the procedures used in the tutorial are language-agnostic, so no additional changes need to be made for non-English corpora.

Cite this Work

Pretnar Žagar, Ajda, Kristina Pahor de Maiti, and Darja Fišer. 2022. What's on the agenda? Topic modelling parliamentary debates before and during the COVID-19 pandemic.

Contact Information

Teachers who reuse and adapt this training material are invited to share their feedback via training[at]clarin[dot]eu.