Goals and Objectives
The main goal of this tutorial is to introduce basic text mining concepts to digital humanities beginners by applying Latent Dirichlet Allocation (LDA) topic modelling to a specific use case.
Learning Outcomes
- independently perform topic modelling on new data, typically on a comparable corpus of parliamentary debates;
- understand the pitfalls of topic modelling and know when to apply the method and when not to.
Author(s)
Researcher
Institute of Contemporary History
Other contributors
Kristina Pahor de Maiti: result interpretation, theoretical part on parliamentary debates, testing
Darja Fišer: conceptual design, testing
Description of the Training Materials
(Sub)discipline & language(s) | Topics: Digital Humanities | Language: English |
Keywords | Topic modelling, LDA, parliamentary debates, text mining |
Project URL | The tutorial is available at https://sidih.github.io/agenda/index.html. The files can be downloaded from https://sidih.si/20.500.12325/2178. |
CLARIN resources | ParlaMint-GB annotated corpus |
Target audience | Beginners in digital humanities, specifically anyone who is interested in parliamentary corpora or topic modelling |
Facilities required | No specific requirements other than a laptop with admin rights. Students will need to install Orange (open-source software) and have about 8 GB of RAM for the analysis to run fairly smoothly. We provide additional materials for students with less processing power (preprocessed corpora, subsets). |
Format | PDF and online (XML) |
Licence and (re)use | CC-BY-SA |
Creation date | 12.04.2022 |
Last modification date | 30.05.2022 |
Experience with Using CLARIN Resources in Teaching
Reusability Notes
The materials include links for independent work (workflows, data, software references) and can easily be reused in two ways:
- Topic modelling on a different data set, for example on a ParlaMint corpus from a different country
- Expanding on the techniques used in the tutorial, for example, semantic analysis of the corpus, longitudinal comparison, or applying a different topic model
All the procedures used in the tutorial are language-agnostic, so no additional changes need to be made for non-English corpora.