Legal corpora contain legislation, legal acts, transcriptions of court decisions, and other kinds of materials related to national or supranational law. Such corpora are an important resource for anyone who practises or researches law, as they can be used to investigate issues such as legal phraseology and terminology, variation in legal discourse, legal translation, register and genre perspectives on legal discourse, legal discourse in forensic contexts, and evaluative language in judicial settings (Goźdź-Roszkowski 2021).
The CLARIN infrastructure gives access to 33 legal corpora, most of which are richly annotated both linguistically (e.g., syntactic dependency parsing in addition to PoS-tagging and lemmatisation) and at various domain-specific metalinguistic levels, such as the speaker roles in courtroom proceedings (e.g., judge, defendant, prosecutor). As most CLARIN members are European countries, many of the legal corpora are drawn from the so-called Acquis Communautaire, which refers to the legislation, legal acts and court decisions constituting the law of the European Union.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Legal Corpora in the CLARIN Infrastructure
Monolingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Annotated Corpus of Czech Case Law for Reference Recognition Tasks Annotation: legal references (identifier of court decision; author of law book or article, etc.) |
Czech |
This corpus consists of 350 manually annotated decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Each decision has been manually annotated by two trained annotators; the corpus was developed primarily as training and testing material for reference recognition tasks. See also the variant of this corpus annotated for segmentation tasks. The corpus is available for download from LINDAT. For the relevant publication, see Harašta et al. (2018). |
Download |
Czech Court Decisions Corpus (CzCDC 1.0) Size: 460 million words |
Czech |
This corpus consists of around 237,000 court decisions from three top-tier courts (Supreme, Supreme Administrative, and Constitutional) in Czechia, published between 1993 and 2018. The corpus is available for download from LINDAT. For the relevant publication, see Novotná and Harašta (2019) |
Download |
Czech Legal Text Treebank 2.0 Size: 1128 sentences |
Czech |
This corpus consists of two legal documents: the Accounting Act (563/1991 Coll., as amended) and the Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended). The corpus is available for download from LINDAT and for online browsing through the treebank viewer PML-TQ and the concordancer KonText. For the relevant publication, see Kríž and Hladká (2018). |
|
META-NORD Acquis Danish Treebank Size: 102 sentences; 1799 words |
Danish |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is available for download and online browsing through INESS (CLARINO). |
|
Size: 5,856 texts |
Dutch |
This corpus contains legal texts from 1814 to 1989, compiled year by year. The corpus is available for online browsing on a dedicated webpage. For the relevant publication, see de Does et al. (2017). |
Browse |
CABank English SCOTUS Oral Arguments Corpus Annotation: speaker segmentation, sociolinguistic annotation |
English |
This corpus consists of transcripts and recordings of oral arguments at the Supreme Court of the United States. The transcripts and audio recordings are aligned at the utterance level; the utterances are annotated based on speaker role (the primary one being Justice) and name, as well as gender. The corpus is part of the CABank collection and available for download from and online browsing through TalkBank. For the relevant publication, see Johnson and Goldman (2009) |
|
The English Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts Size: 359,874 tokens |
English |
This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties. The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution). |
Browse |
Size: 34.6 million tokens |
English |
This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into English. The corpus is available for download from PORTULAN. For the relevant publication, see Steinberger et al. (2006). |
Download |
Size: 24.4 million words |
English (Late Modern) |
This historical corpus consists of the Proceedings of the Old Bailey, London’s central criminal court between 1674 and 1913. The corpus covers texts from 1720 to 1913 and carries detailed utterance-level sociolinguistic annotation at three levels: sociobiographical speaker information (gender, age, occupation, social class), pragmatic information (speaker role in the courtroom, such as judge or witness), and metatextual information (the scribe, printer, and publisher of the individual Proceeding). The corpus is available for download from CLARIN-D (Saarland University) and for online browsing through CQPWeb. |
|
Size: 11 million tokens |
Estonian |
This corpus contains Estonian laws (1.8 million tokens) as well as European legislation (9.6 million tokens) translated into Estonian. The corpus is available for download from a dedicated webpage hosted by CLARIN Estonia. |
Download |
META-NORD Acquis Estonian Treebank Size: 78 sentences; 1443 words |
Estonian |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is available for download and online browsing through INESS (CLARINO). |
|
The Finnish Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts Size: 1.5 million tokens |
Finnish |
This is the Finnish subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish. The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution). |
Browse |
The Finnish Sub-corpus of the JRC-Acquis Multilingual Parallel Corpus, Downloadable Version Size: 44.1 million tokens |
Finnish |
This is the legal subcorpus of the Helsinki Korp Version of the Finnish TreeBank 3. The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution) and for download from the Finnish Language Bank. |
|
META-NORD Acquis Finnish Treebank Size: 122 sentences; 1464 words |
Finnish |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is syntactically parsed using the FinnTreeBank 2 schema and is available for download and online browsing through INESS (CLARINO). |
|
The German Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts Size: 198,035 tokens |
German |
This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties. The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution). |
Browse |
Corpus of Judicial Rhetoric: cases of rapes and homicides Licence: CC BY-NC-ND 4.0 |
Greek |
This corpus consists of transcriptions of defendants’ and witnesses’ speeches in criminal cases of rape, attempted rape, murder, and attempted murder. The corpus is available for download from the CLARIN:EL repository. |
Download |
META-NORD Acquis Icelandic Treebank Size: 73 sentences; 1880 words |
Icelandic |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is available for download and online browsing through INESS (CLARINO). |
|
IGC-Laws-21.05 (The Icelandic Gigaword Corpus: Law, bills and proposals) Size: 2.2 million sentences; 40.6 million words |
Icelandic |
IGC-Laws is a subcorpus of the Icelandic Gigaword Corpus (see also CLARIN reference corpora). IGC-Laws contains 1) the Icelandic laws, 2) explanatory reports and observations extracted from bills submitted to Althingi, and 3) parliamentary proposals and resolutions. The corpus comes in two formats: one contains the texts untokenized and untagged, while the other has been tokenized, PoS-tagged and lemmatized. The corpus is available for download from the CLARIN-IS repository. For the relevant publication, see Steingrímsson et al. (2018). |
Download |
Corpus of Legal Acts of the Republic of Latvia (Likumi) Size: 116 million tokens; 73 million words |
Latvian |
The corpus contains all legal acts of the Republic of Latvia published on the website likumi.lv (until February 2022). The corpus is available for download from the CLARIN.LV repository. |
Download |
Lithuanian Corpus of the EU Primary and Secondary Law Acts of the Period 2015–2017 Size: 274,460 words |
Lithuanian |
This corpus contains primary and secondary European law acts (32 texts) translated into Lithuanian. The corpus is available for download from CLARIN-LT. |
Download |
Size: 20.9 million tokens |
Maltese |
This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into Maltese. The corpus is available for download from PORTULAN. For the relevant publication, see Steinberger et al. (2006). |
Download |
META-NORD Acquis Norwegian Treebank Size: 101 sentences; 1862 words |
Norwegian |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is available for download and online browsing through INESS (CLARINO). |
|
Norwegian Acquis Communautaire Size: 14 million words |
Norwegian (Bokmål and Nynorsk) |
This corpus contains Norwegian translations of 5414 documents in the Acquis Communautaire. The corpus is available for download from the Norwegian Language Bank. |
Download |
Legal Documents from Norwegian Nynorsk Municipalities Size: 127 million words |
Norwegian (Nynorsk and Bokmål) |
This corpus contains 50,000 legal documents and meeting minutes collected with the web crawler Veidemann. Around 88.5 million words are in Nynorsk, while the rest are in Bokmål. The corpus is available for download from the Norwegian Language Bank. |
Download |
The Russian Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts Size: 198,035 tokens |
Russian |
This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties. The corpus can be accessed online through the concordancer Korp (FIN-CLARIN Distribution). |
Browse |
The Russian Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts Size: 1.2 million tokens |
Russian |
This is the Russian subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish. The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution). |
Browse |
META-NORD Acquis Swedish Treebank Size: 102 sentences; 1982 words |
Swedish |
This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is available for download and online browsing through INESS (CLARINO). |
Multilingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
JRC EU DGT Translation Memory Parsebank DGT-UD Size: 2.1 billion tokens |
Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Irish, Italian, Latvian, Lithuanian, Modern Greek (1453-), Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish |
This is a 23-language parallel, syntactically parsed corpus consisting of the JRC DGT translation memory of European law, automatically annotated with UDPipe 1.2 using Universal Dependencies 2.0 models. The corpus is available for download from the CLARIN.SI repository and for online browsing through the KonText and noSketch Engine concordancers. |
|
The JRC-Acquis Corpus, version 3.0 Size: 1 billion words |
Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish |
This is a parallel corpus of the Acquis Communautaire, the total body of European Union law applicable in the EU member states. Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Because it contains a large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for cross-language research, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction). The sentence-level alignment was done using the hunalign tool. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Steinberger et al. (2006). |
Download |
COVID-19 EUR-LEX dataset. Bilingual (EN-PT) Size: 21,000 units |
English, Portuguese |
This is a parallel corpus of European Union law pertaining to the COVID-19 period. The corpus is available for download from the PORTULAN repository. |
Download |
Legal texts from Estonian Ministry of Justice (Processed) Size: 47,000 units |
Estonian-English |
This corpus contains Estonian-English translations of the Acts of Estonian law. The corpus is available for download from PORTULAN. |
Download |
MultiEURLEX Annotation: conceptual annotation |
Finnish, Slovak, Lithuanian, Croatian, Slovenian, Estonian, Latvian, Maltese, English, German, French, Italian, Spanish (Castilian), Polish, Romanian (Moldavian, Moldovan), Dutch (Flemish), Modern Greek (1453-), Hungarian, Portuguese, Czech, Swedish, Bulgarian, Danish |
This corpus consists of 65,000 European Union laws in 23 official EU languages. Each law has been annotated with EuroVoc concept labels. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Chalkidis et al. (2021). |
Download |
COVID-19 EUR-LEX dataset. Multilingual (CEF languages) Size: 475,931 translation pairs |
Maltese, Hungarian, Lithuanian, Latvian, Polish, Portuguese, English, Slovenian, Modern Greek, Spanish (Castilian), Romanian, Slovak, Moldavian, Swedish, Bulgarian, Italian, German, Croatian, French, Dutch (Flemish), Czech, Finnish, Danish, Irish, Estonian |
This is a multilingual corpus of European Union law pertaining to the COVID-19 period. The corpus is available for download from the PORTULAN repository. |
Download |
Publications on Legal Corpora
[Chalkidis et al. 2021] Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv.
[de Does et al. 2017] Jesse de Does, Jan Niestadt, and Katrien Depuydt. 2017. Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, 151–165. London: Ubiquity Press.
[Goźdź-Roszkowski 2021] Stanisław Goźdź-Roszkowski. 2021. Corpus Linguistics in Legal Discourse. International Journal for the Semiotics of Law 34: 1515–1540.
[Harašta et al. 2018] Jakub Harašta, Jaromír Šavelka, František Kasl, Adéla Kotková, Pavel Loutocký, Jakub Míšek, Daniela Procházková, Helena Pullmannová, Petr Semenišín, Tamara Šejnová, Nikola Šimková, Michal Vosinek, Lucie Zavadilová, and Jan Zibner. 2018. Annotated Corpus of Czech Case Law for Reference Recognition Tasks. In TSD 2018, volume 11107. Springer, Cham.
[Johnson and Goldman 2009] Timothy R. Johnson and Jerry Goldman, eds. 2009. A Good Quarrel: America’s Top Legal. University of Michigan Press.
[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0. Proceedings of LREC 2018, 4501–4505.
[Novotná and Harašta 2019] Tereza Novotná and Jakub Harašta. 2019. The Czech Court Decisions Corpus (CzCDC): Availability as the First Step. arXiv pre-print.
[Steinberger et al. 2006] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of LREC 2006, 2142–2147.
[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.