ParlaMint, a CLARIN flagship project, resulted in the creation of comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities.
The second stage of the project, ParlaMint II, enhanced the encoding infrastructure, the use of GitHub, the production of individual corpora, the common pipeline for producing their distribution, and the use of CLARIN services for dissemination. Qualitative additions made within ParlaMint II include metadata localisation; new metadata, such as the political orientation of political parties; machine translation of the corpora into English, with the translations tagged with semantic classes; and the production of pilot speech corpora.
ParlaMint Corpora
ParlaMint corpora are openly available under the CC BY license, as well as freely available for analysis and browsing through noSketch Engine and TEITOK. The latest version of the corpora is 4.1:
- Tomaž Erjavec et al. (2024) Multilingual comparable corpora of parliamentary debates ParlaMint 4.1. http://hdl.handle.net/11356/1912
- Tomaž Erjavec et al. (2024) Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1. http://hdl.handle.net/11356/1911
- Taja Kuzman et al. (2024) Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.1. http://hdl.handle.net/11356/1910
The ParlaMint project also has a GitHub repository, where samples of the corpora, the XML schema and corpus processing and validation scripts are available.
Showcases
Echoes of the Chambers: Studying Democracy through Parliamentary Speeches. Vaibhav Agarwal, Hugo Bonin, Kai Ferragallo-Hawkins, Matthes Fürst, Ekaterina Glazacheva, Jani Marjanen, Marko Milošev, Elma Nevala, Niklas Oetken, Tuukka Puonti, Olli Rousu, Risto Turunen, Artur Voit-Antal, and Johan Wahlsten, Helsinki Digital Humanities Hackathon 2024 (#DHH24).
Explainable AI - Understanding Political Orientations in Slovenian Parliament. Bojan Evkoski and Senja Pollak, CLARIN Impact Stories, 2023.
Networks of Power - Gender Analysis in European Parliaments. Jure Skubic, Alexandra Bruncrona, Jan Angermeier, Bojan Evkoski and Larissa Leiminger, CLARIN Impact Stories, 2023.
ParlaMint - A Resource for Democracy. Dario Del Fante and Virginia Zorzi, 'Who Is the Enemy Now?', CLARIN Impact Stories, 2023.
Emotions Running High? Gül M. Kurtoğlu Eskişar and Çağrı Çöltekin, ParlaCLARIN III at LREC2022.
ParlaMint and ParlaMeter: How Standardised Data Formats Empower End Users. Filip Dobranić, CLARIN Café: ParlaMint Unleashed, 2021.
Tutorials
Tutorial by Darja Fišer and Kristina Pahor de Maiti, Voices of the Parliament: A Corpus Approach to Parliamentary Discourse Research.
This tutorial shows how corpora can be used to investigate language use and communication practices in a specialised socio-cultural context of political discourse in order to explore socio-cultural phenomena. It demonstrates the potential of a richly annotated diachronic corpus of Slovenian parliamentary debates for investigating the characteristics and dynamics of the representation of women and their language use in the Slovenian Parliament.
Tutorial by Barbora Hladká, Use Case: Leaving the European Union in the UK Parliament.
The tutorial was part of a workshop aimed at economic history undergraduates, to encourage the students to use data in their projects. The tutorial guides the students in using the collections of parliamentary data in ParlaMint to answer two simple research questions. It provides a step-by-step demonstration of information extraction in KonText and data processing in Google Sheets. The accompanying assignment is available here.
Publications and Presentations
- Tomaž Erjavec et al. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 2022. https://doi.org/10.1007/s10579-021-09574-0
- Skubic, Jure, Angermeier, Jan, Bruncrona, Alexandra, Evkoski, Bojan and Larissa Leiminger. (2022). "Networks of Power: Gender Analysis in Selected European Parliaments." In: Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis (CPSS-2022), Potsdam, Germany. (https://old.gscl.org/en/arbeitskreise/cpss/cpss-2022/workshop-proceedings-2022)
- Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden (2022): ParlaMint II: The Show Must Go On. In: Proceedings of the LREC 2022 ParlaCLARIN III Workshop on Creating, Enriching and Using Parliamentary Corpora, pp. 1-6, European Language Resources Association (ELRA), Paris, France, ISBN 979-10-95546-85-6 (http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.1.pdf)
- Skubic, Jure, and Darja Fišer. "Parliamentary discourse research in sociology: Literature review." In Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pp. 81-91. 2022. (https://aclanthology.org/2022.parlaclarin-1.12/)
- Agnoloni T., Bartolini R., Frontini F., Montemagni S., Marchetti C., Quochi V., Ruisi M. and Venturi G. (2022) “Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus”, Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, Marseille, France, 20/06/2022, published by European Language Resources Association ELRA (Paris, FRA), pp. 117-124. (https://aclanthology.org/2022.parlaclarin-1.17.pdf)
- Per Erik Solberg, Pierre Beauguitte, Per Egil Kummervold, Freddy Wetjen (2023) A Large Norwegian Dataset for Weak Supervision ASR. In: Dana Dannélls, Simon Dobnik, Nikolai Ilinykh, Beáta Megyesi, Felix Morger, Joakim Nivre (eds.) Proceedings from The Second Workshop on Resources and Representations for Under-Resourced Languages and Domains, May 22, 2023, Tórshavn, Faroe Islands, pp. 48-52, ©2023 Association for Computational Linguistics, ISBN 978-1-959429-73-9. (https://aclanthology.org/2023.resourceful-1.7/)
- Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Dargis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Dorte Haltrup Hansen, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Maarten Marx, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, István Üveges, Ruben van Heusden, Giulia Venturi. Fišer D., Pahor de Maiti K., Osenova P., Ogrodniczuk M. (202x). Parliaments in focus: Language, Gender and the Pandemic. Gender and Language. (SUBMITTED FOR REVIEW)
- Tomaž Erjavec, Matyáš Kopp, and Katja Meden (2023). "Experience of remote collaborative work in the ParlaMint project using Git". In: TwinTalks Workshop at DH2023, book of abstracts. Graz, Austria. (https://www.clarin.eu/event/2023/twintalks-workshop-dh2023)
- Tomaž Erjavec, Katja Meden and Jure Skubic (2023). "Adding political orientation metadata to ParlaMint corpora". CLARIN annual conference 2023 (in print).
- Maciej Ogrodniczuk, Petya Osenova, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Çağrı Çöltekin, Matyáš Kopp, Katja Meden and Taja Kuzman. (2023). "The ParlaMint Project: Ever-growing Family of Comparable and Interoperable Parliamentary Corpora". CLARIN annual conference 2023 (in print).
- Michal Mochtak, Peter Rupnik, and Nikola Ljubešić. 2024. The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16024–16036, Torino, Italia. ELRA and ICCL.
- Invited talk by Tomaž Erjavec at the 1st Workshop on Computational Linguistics for Political Text Analysis at KONVENS2021.
- Maciej Ogrodniczuk: The Impact of Parliamentary Datasets for Society and (Data) Science. At: DH Forum, European Parliament's STOA Panel, Brussels, April 26, 2023. https://www.europarl.europa.eu/cmsdata/268842/STOA_26042023_Maciej%20Ogroduniczuk.pdf
- CLARIN Café on ParlaMint, 30 January 2024
- Marilina Pisani. (2022) Árboles, Gráficos y Matrices de Datos. Codificación en TEI de un Corpus de Interacciones Parlamentarias con Python. Final Master Thesis supervised by Núria Bel. Máster en Humanidades y Patrimonio Digitales. Universidad Autónoma de Barcelona. (https://github.com/marilinapisani/)
- Pieters M. (2021). A Comparative Analysis on the ParlaMint Corpus. MSc thesis.
- A Return of Science? Mapping Attitudes Towards Science and Expertise in COVID-19 Parliamentary Debates by Ruben Ros for CLARIN Café: ParlaMint Unleashed, June 2021. GitHub repository with code and research report.
- A Comparative Analysis on the ParlaMint Project by Miguel Pieters for CLARIN Café: ParlaMint Unleashed, June 2021.
- ParlaMint II: The show must go on presented at the CLARIN Annual Conference 2022.
- ParlaMint: Towards Comparable Parliamentary Corpora presented at the Virtual CLARIN Annual Conference 2020.