Introduction
Parliamentary corpora are an important multidisciplinary language resource that can be approached from many research perspectives, including political science, sociology, history, psychology, and applied approaches to linguistics such as critical discourse analysis. The ready availability of parliamentary proceedings in digitized form, together with the right of access to public information in EU countries, has motivated a number of national and international initiatives to compile, process and analyse parliamentary corpora.
The CLARIN ERIC infrastructure offers access to 26 parliamentary corpora, covering almost all of the languages spoken in countries that are either members or observers in CLARIN. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.
Below we first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
Note that in 2020 the ParlaMint project was launched, focussing on parliamentary debates on the COVID-19 outbreak and the policy measures taken in response to it. More details can be found below and on the project page for ParlaMint.
For comments, changes to the existing content or the inclusion of new corpora, send us an email.
This website was last updated on 26 July 2021.
Parliamentary corpora in the CLARIN infrastructure
Corpus | Language | Description | Availability |
---|---|---|---|
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1 Size: 3.7 million utterances, 495 million words |
Bulgarian, Croatian, Czech, Danish, Dutch, English, French, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Polish, Slovenian, Spanish, Turkish |
ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging either to the COVID-19 period (after October 2019) or to the "reference" period (before that date). The corpora have extensive metadata about the speakers (name, gender, party affiliation, MP status) and are structured into time-stamped terms, sessions and meetings, with each speech marked with its speaker and their role (chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpus is available for download from the CLARIN.SI repository and through the concordancer noSketchEngine. Note that the version of the corpus without linguistic mark-up is available for download under a separate CLARIN.SI entry. |
|
Croatian parliamentary corpus ParlaMeter-hr9 1.0 Size: 14.1 million tokens |
Croatian |
The corpus contains minutes of the National Assembly of the Republic of Croatia and currently covers its VIth mandate from 15 November 2016 to 21 November 2018. The corpus contains speaker metadata (gender, age, education, party affiliation). The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine, as well as through a dedicated webpage. |
|
Size: 88 hours, 0.5 million tokens |
Czech |
The corpus contains recordings of the parliamentary sessions as well as corresponding transcriptions. The corpus is available for download from LINDAT and through the concordancer KonText. |
|
The Danish Parliament Corpus 2009 - 2017, v2 Size: 40.6 million words |
Danish |
The corpus contains Danish parliamentary debates from 2009 to 2017. The corpus is available for download from the DK-CLARIN repository. |
|
Size: 1.6 billion tokens |
English |
The corpus contains British parliamentary debates from 1803 to 2005. It is semantically tagged with the USAS semantic tagger and the Historical Thesaurus Semantic Tagger (HTST). The corpus is available through a dedicated concordancer. For the relevant publication, see Rayson et al. (2015) |
|
Parliamentary Debates on Europe at the House of Commons (1998-2015) Size: 190,000 tokens |
English |
The corpus contains British parliamentary debates from 1998 to 2015. The corpus is available for download from Ortolang. |
|
Transcripts of Riigikogu (Estonian Parliament) Size: 13 million tokens |
Estonian |
The corpus contains Estonian parliamentary debates from 1995 to 2001. The corpus is available for download from a dedicated webpage and through a concordancer on the same webpage. |
|
Plenary Sessions of the Parliament of Finland Size: 22.4 million tokens |
Finnish |
The corpus contains Finnish parliamentary debates from 2008 to 2016. The corpus is available through the concordancer Korp. |
|
Parliamentary Debates on Europe at the Assemblée nationale (2002-2012) Size: 137,000 tokens |
French |
The corpus contains French parliamentary debates from 2002 to 2012. The corpus is available for download from Ortolang. |
|
Parliamentary Debates on Europe at the Bundestag (1998-2015) Size: 417,000 tokens |
German |
The corpus contains German parliamentary debates from 1998 to 2015. The corpus is available for download from Ortolang. |
|
German Political Speeches Corpus Size: 15,240 speeches, 27 million tokens |
German |
The corpus contains speeches by 200 important political figures for the period between 1982 and 2020. A large part of the corpus contains speeches by the holders of the four highest German state offices: the Federal President, the Federal Chancellor, the President of the Bundestag and Foreign Ministers with terms of office between 1982 and 2020. The corpus is available for online browsing through the DWDS platform, and a subset of 6,685 speeches up to 2019, encoded in XML, can be downloaded. For the relevant publication, see Barbaresi (2018) |
|
Size: 75.2 million tokens |
German (Austrian) |
This corpus contains Austrian parliamentary proceedings from 1996 to 2017. Currently in development, ParlAT is planned to be a monitor corpus with new material added over time. For the relevant publication, see Wissik and Pirker (2018) |
|
Hellenic Parliament Minutes (1989-1994, 1997-2018) Size: 181 million words |
Greek |
The corpus contains Greek parliamentary debates for two periods: 1989-1994 and 1997-2018. The corpus is available for download from the CLARIN:EL repository. |
|
Speeches of Politicians in the Greek Parliament Size: 258,036 words |
Greek |
This corpus contains speeches delivered by 5 members of parliament: Dimitris Anagnostakis, Nikos Tsoukalis, Paros Koukoulopoulos, Niki Founta, and Panayiotis Kammenos. The corpus is available for download from the CLARIN:EL repository. |
|
European Parliament Proceedings Parallel Corpus 1996-2011, parallel corpus Greek-English Size: 31.9 million words (English), 1.2 million sentences (Greek) |
Greek-English |
This corpus is a bilingual Greek-English subset of the Europarl parallel corpus. The corpus is available for download from the CLARIN:EL repository. |
|
The Icelandic Parliamentary Corpus Size: 238 million tokens |
Icelandic |
This corpus contains debates in the Icelandic parliament (Alþingi) from 1911 to 2017. The corpus is available for download from CLARIN-IS (as a part of the Icelandic Gigaword Corpus) and for search through the concordancer Korp. For the relevant publication, see Steingrímsson et al. (2018) |
|
Lithuanian Parliament Corpus for Authorship Attribution Size: 23.9 million tokens |
Lithuanian |
The corpus contains Lithuanian parliamentary debates from 1990 to 2013. It is annotated with Lemuoklis (morphological analyzer for lemmatization) and MaltParser (generation of dependency tags). The corpus is available for download from the repository of CLARIN-LT. |
|
Proceedings of Norwegian Parliamentary Debates Size: 29 million tokens |
Norwegian |
The corpus contains Norwegian parliamentary debates from 2008 to 2015. The corpus is available through the concordancer Corpuscle. |
|
Size: 63.8 million tokens |
Norwegian |
The corpus contains Norwegian parliamentary debates from 1998 to 2016. The corpus is available for download from the CLARINO repository. For the relevant publication, see Lapponi et al. (2018) |
|
Size: 300 million tokens |
Polish |
The corpus contains Polish parliamentary debates from 1991 to 2017. It is annotated with Morfeusz SGJP (morphological analyser), Pantera (disambiguating tagger), Spejd (shallow parser), and Nerf (named entity recognizer). The corpus is available for download from a dedicated webpage and through the concordancer NKJP. For the relevant publications, see Ogrodniczuk (2012) and Ogrodniczuk (2018) |
|
Size: 1 million tokens |
Portuguese |
The corpus contains Portuguese parliamentary debates from 1970 to 2008. It is annotated with LX-Tokenizer, LX-Tagger, MBT, MBLEM (lemmatisation). The corpus is available for download from the ELRA catalogue. For the relevant publication, see Généreux et al. (2012) |
|
Slovenian parliamentary corpus siParl 2.0 Size: 239.7 million tokens |
Slovenian |
The corpus contains Slovenian parliamentary debates from 1990 to 2018. It differs from the SlovParl 2.0 corpus (listed below) in that it contains only basic metadata about the speakers, a typology of sessions, and structural and editorial annotations. The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine. |
|
Slovenian parliamentary corpus SlovParl 2.0 Size: 3.2 million tokens |
Slovenian |
For the relevant publication, see Pančur and Šorn (2016) |
|
Slovenian parliamentary corpus ParlaMeter-sl 1.0 Size: 41 million tokens |
Slovenian |
The corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers the VIIth mandate from 1 August 2014 to 22 June 2018. The corpus contains speaker metadata (gender, age, education, party affiliation). The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine, as well as through a dedicated webpage. For the relevant publication, see Ljubešić et al. (2018) |
|
Size: 1.25 billion tokens |
Swedish |
The corpus contains Swedish parliamentary debates from 1971 to 2016. It is annotated with Sparv. The corpus is available for download from Språkbanken (all entries with "Riksdag's Open Data" in the subtitle) and through the concordancer Korp. For the relevant publication, see Borin et al. (2016) |
|
Europarl: European Parliament Proceedings Parallel Corpus 1996-2011 Size: 33.7 million tokens |
21 languages |
This corpus contains parliamentary debates from the European Parliament from 1996 to 2011. The corpus is available for download from a dedicated webpage. |
Other parliamentary corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Korpusbasierte Analyse österreichischer Parlamentsreden Size: 1.2 million tokens |
German (Austrian) |
The corpus contains Austrian parliamentary debates from 2013 to 2015. It is annotated with the Stanford Tagger. The corpus is currently not available. For the relevant publication, see Sippl et al. (2016) |
|
Size: 6.3 million parliamentary speeches |
German, Czech, Danish, Dutch, English, Spanish, Swedish |
The corpus contains complete parliamentary speeches in the key legislative chambers of Austria, the Czech Republic, Germany, Denmark, the Netherlands, New Zealand, Spain, Sweden, and the United Kingdom, covering periods between 21 and 32 years. The corpus is available for download from the Harvard Dataverse repository. |
|
Corpus of Bulgarian Political and Journalistic Speech Size: 10 million tokens |
Bulgarian |
The corpus contains Bulgarian parliamentary debates from 2006 to 2012. The corpus is available through a dedicated concordancer. |
|
The Chinese/English Political Interpreting Corpus (CEPIC) Size: 6.5 million words |
Chinese, English |
The CEPIC consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington and London, as well as their translated/interpreted texts. The main speech types of CEPIC include the reading of government reports such as policy addresses and budget speeches, Q&A at press conferences, parliamentary debates, as well as remarks delivered at bilateral meetings. The corpus features a parallel display of up to six versions of the same speech segment, aligned at paragraph level. The corpus is available for online querying through a dedicated concordancer. For the relevant publication, see Pan (2019) |
|
Size: 81.9 million tokens |
Czech |
The corpus contains Czech parliamentary debates from 1993 to 2010. It is annotated with ajka. The corpus is available through the Sketch Engine. For the relevant publication, see Jakubíček and Kovář (2010) |
|
Size: 800 million tokens |
Dutch |
The corpus contains Dutch parliamentary debates from 1814 to 2014. It is annotated with Frog. See also the information on the schema used. The corpus is available for download (the authors need to be contacted) and is also accessible online through the Political Mashup environment. For the relevant publication, see Marx and Schuth (2010) |
|
HanDeSeT: Hansard Debates with Sentiment Tags Size: 1251 motion-speech units taken from 129 separate debates |
English |
This corpus contains English parliamentary debates from 1997 to 2017. The corpus is available for download from a dedicated webpage. For the relevant publication, see Abercrombie and Batista-Navarro (2018) |
|
Size: 354,400 tokens |
English |
This corpus contains British parliamentary debates of the House of Commons from 2013 to 2016. The corpus is available for download from Google Drive. For the relevant publication, see Nanni et al. (2018) |
|
Size: Only a small sample available |
German |
A small sample is available for download from the GitHub webpage of the corpus. |
|
|
French |
The Archives parlementaires is a chronologically ordered edited collection of sources on the French Revolution. It was conceived in the mid-19th century as a project to produce a definitive record of parliamentary deliberations and also includes letters, reports, speeches, and other first-hand accounts from a great variety of published and archival sources. FRDA currently contains the AP volumes covering the years 1787-1794, which can be searched using ARTFL's PhiloLogic 4 open source software platform. The texts have been marked up so that speakers, places, dates, and terms in the published index can be easily found. Users can view either scanned images of the AP pages or just the texts. |
|
Size: 12.5 million tokens |
Latvian |
The corpus contains Latvian parliamentary debates from 1993 to 2016. The corpus is available through noSketchEngine. |
CLARIN-funded project: ParlaMint
CLARIN is currently funding the project ParlaMint: Towards Comparable Parliamentary Corpora. This project aims to provide multilingual, standardised and linguistically processed resources for focused observations on trends, opinions and decisions with respect to lockdowns and restrictive measures in times of emergency. To this end, two types of parliamentary corpora are envisaged: a contemporary one focused on COVID-19 issues (Nov. 2019 - July 2020) and a reference one for comparison (2015 - Oct. 2019). The data will be made available through concordancers and monitoring tools.
Additional materials
CLARIN-PLUS Workshop "Working with Parliamentary Records". 27-29 March 2017, Sofia, Bulgaria. [html]
Videolectures of the CLARIN-PLUS workshop. [html]
ParlaCLARIN@LREC2018. 7 May 2018, Miyazaki, Japan. [html]
Videolectures of the ParlaCLARIN workshop. [html]
Darja Fišer, Maria Eskevich, Franciska de Jong (eds.), Proceedings of the ParlaCLARIN Workshop at LREC2018. [pdf]
Parthenos Video: Working with Parliamentary Data, interview with Federico Nanni. [html]
Parthenos Video: Working with Parliamentary Data, interview with Andreas Blätte. [html]
Darja Fišer, Maria Eskevich, Franciska de Jong (eds.), Proceedings of the ParlaCLARIN II Workshop at LREC2020. [pdf]
Publications on the parliamentary corpora
[Abercrombie and Batista-Navarro 2018] Gavin Abercrombie and Riza Theresa Batista-Navarro. 2018. A Sentiment-labelled Corpus of Hansard Parliamentary Debate Speeches.
[Barbaresi 2018] Adrien Barbaresi. 2018. A corpus of German political speeches from the 21st century. Proceedings of LREC2018, 792–797.
[Borin et al. 2016] Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, Anne Schumacher. 2016. Sparv: Språkbanken’s corpus annotation pipeline infrastructure.
[Branco and Silva 2006] António Branco and João Silva. 2006. A Suite of Shallow Processing Tools for Portuguese: LX-Suite.
[Généreux et al. 2012] Michel Généreux, Iris Hendrickx, Amália Mendes. 2012. A Large Portuguese Corpus On-Line: Cleaning and Preprocessing.
[Lapponi et al. 2018] Emanuele Lapponi, Martin G. Søyland, Erik Velldal, and Stephan Oepen. 2018. The Talk of Norway: a richly annotated corpus of the Norwegian parliament, 1998–2016.
[Ljubešić et al. 2018] Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, and Filip Dobranić. 2018. The Parlameter corpus of contemporary Slovene parliamentary proceedings.
[Jakubíček and Kovář 2010] Miloš Jakubíček, Vojtěch Kovář. 2010. CzechParl: Corpus of Stenographic Protocols from Czech Parliament.
[Marx and Schuth 2010] Maarten Marx and Anne Schuth. 2010. DutchParl: The Parliamentary Documents in Dutch.
[Nanni et al. 2018] Federico Nanni, Mahmoud Osman, Yi-Ru Cheng, Simone Paolo Ponzetto, Laura Dietz. 2018. UKParl: A Data Set for Topic Detection with Semantically Annotated Text.
[Ogrodniczuk 2012] Maciej Ogrodniczuk. 2012. The Polish Sejm Corpus.
[Ogrodniczuk 2018] Maciej Ogrodniczuk. 2018. The Polish Parliamentary Corpus.
[Pan 2019] Jun Pan. 2019. The Chinese/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters.
[Pančur and Šorn 2016] Andrej Pančur, Mojca Šorn. 2016. Smart Big Data: use of Slovenian parliamentary papers in digital history.
[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora.
[Sippl et al. 2016] Colin Sippl, Manuel Burghardt, Christian Wolff, Bettina Mielke. 2016. Korpusbasierte Analyse österreichischer Parlamentsreden.
[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus.
[Wissik and Pirker 2018] Tanja Wissik and Hannes Pirker. 2018. ParlAT beta: Corpus of Austrian Parliamentary Records.