
Legal Corpora

Legal corpora contain legislation, legal acts, transcriptions of court decisions, and other kinds of materials related to national or supranational law. Such corpora are an important resource for anyone who practises or researches law, as they can be used to investigate issues such as legal phraseology and terminology, variation in legal discourse, legal translation, register and genre perspectives on legal discourse, legal discourse in forensic contexts, and evaluative language in judicial settings (Goźdź-Roszkowski 2021).

The CLARIN infrastructure gives access to 33 legal corpora, most of which are richly annotated both linguistically (e.g., syntactic dependency parsing in addition to PoS-tagging and lemmatisation) and at various domain-specific metalinguistic levels, such as speaker roles in courtroom proceedings (e.g., judge, defendant, prosecutor). As CLARIN's members are mostly European countries, many of the legal corpora contain texts from the so-called Acquis Communautaire, which refers to the legislation, legal acts and court decisions constituting the law of the European Union.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] (email).


Monolingual Corpora

Corpus Language Description Availability

Annotated Corpus of Czech Case Law for Reference Recognition Tasks

Annotation: legal references (identifier of court decision; author of law book or article, etc.)
Licence: CC BY 4.0


This corpus consists of 350 decisions of Czech top-tier courts (Supreme Court, Supreme Administrative Court, Constitutional Court). Each decision has been manually annotated by two trained annotators; the corpus was developed primarily as training and testing material for reference recognition tasks. See also the variant of this corpus annotated for segmentation tasks.

The corpus is available for download from LINDAT.

For the relevant publication, see Harašta et al. (2018)


Czech Court Decisions Corpus (CzCDC 1.0)

Size: 460 million words
Annotation: unannotated
Licence: CC BY-NC 4.0


This corpus consists of around 237,000 court decisions from three top-tier courts (Supreme, Supreme Administrative, and Constitutional) in Czechia, published between 1993 and 2018.

The corpus is available for download from LINDAT.

For the relevant publication, see Novotná and Harašta (2019)


Czech Legal Text Treebank

Size: 1128 sentences
Annotation: manual syntactic annotation; manual annotation of entities from the accounting domain and of the relations definition, obligation, and right
Licence: CC BY-NC-SA 4.0


This corpus consists of two legal documents: the Accounting Act (563/1991 Coll., as amended) and the Decree on Double-entry Accounting for undertakers (500/2002 Coll., as amended).

The corpus is available for download from LINDAT and online browsing through the treebank viewer PML-TQ and the concordancer KonText.

For the relevant publication, see Kríž and Hladká (2018)




META-NORD Acquis Danish Treebank

Size: 102 sentences; 1799 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0


This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).



Corpus Juridisch Nederlands

Size: 5,856 texts
Annotation: lemmatised, PoS-tagged


This corpus contains legal texts from 1814 to 1989, compiled year by year.

The corpus is available for online browsing on a dedicated webpage.

For the relevant publication, see de Does et al. (2017)


CABank English SCOTUS Oral Arguments Corpus

Annotation: speaker segmentation, sociolinguistic annotation
Licence: CC BY-NC-SA 3.0


This corpus consists of transcripts and recordings of oral arguments at the Supreme Court of the United States.

The transcripts and audio recordings are aligned at the utterance level; the utterances are annotated based on speaker role (the primary one being Justice) and name, as well as gender.

The corpus is part of the CABank collection and available for download from and online browsing through TalkBank.

For the relevant publication, see Johnson and Goldman (2009)



The English Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 359,874 tokens
Annotation: lemmatised, MSD-tagged
Licence: CC BY-ND


This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution).


English Acquis Communautaire

Size: 34.6 million tokens
Licence: MIT (academic)


This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into English.

The corpus is available for download from PORTULAN.

For the relevant publication, see Steinberger et al. (2006)


Old Bailey Corpus

Size: 24.4 million words
Annotation: sociolinguistic annotation
Licence: CC BY-NC-SA 4.0

English (Late Modern)

This historical corpus consists of Proceedings of the Old Bailey; the Old Bailey was London’s central criminal court between 1674 and 1913. The corpus consists of texts from 1720 to 1913 and carries detailed utterance-level sociolinguistic annotation at the following three levels: sociobiographical speaker information (gender, age, occupation, social class), pragmatic information (speaker role in the courtroom such as judge, witness, etc.), and metatextual information (the scribe, printer, and publisher of the individual Proceeding).

The corpus is available for download from CLARIN-D (Saarland University) and for online browsing through CQPWeb.



Corpus of Estonian law texts

Size: 11 million tokens


This corpus contains Estonian laws (1.8 million tokens) as well as European legislation (9.6 million tokens) translated into Estonian.

The corpus is available for download from a dedicated webpage hosted by CLARIN Estonia.


META-NORD Acquis Estonian Treebank

Size: 78 sentences; 1443 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0


This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).



The Finnish Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

Size: 1.5 million tokens
Annotation: lemmatised, MSD-tagged
Licence: CC BY-ND


This is the Finnish subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution).


The Finnish Sub-corpus of the JRC-Acquis Multilingual Parallel Corpus, Downloadable Version

Size: 44.1 million tokens
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY


This is the legal subcorpus of the Helsinki Korp Version of the Finnish TreeBank 3.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution) and for download from the Finnish Language Bank.



META-NORD Acquis Finnish Treebank

Size: 122 sentences; 1464 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0

Finnish

This is a subcorpus of the META-NORD Acquis Parallel Treebank. The corpus is syntactically parsed using the FinnTreeBank 2 schema and is available for download and online browsing through INESS (CLARINO).



The German Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 198,035 tokens
Licence: CC BY-ND


This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN Distribution).


Corpus of Judicial Rhetoric: cases of rapes and homicides

Licence: CC BY-NC-ND 4.0


This corpus consists of transcriptions of defendants’ and witnesses’ speeches in criminal cases of rape, attempted rape, murder, and attempted murder.

The corpus is available for download from the CLARIN:EL repository.


META-NORD Acquis Icelandic Treebank

Size: 73 sentences; 1880 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0


This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).



IGC-Laws-21.05 (The Icelandic Gigaword Corpus: Law, bills and proposals)

Size: 2.2 million sentences; 40.6 million words
Annotation: lemmatised, MSD-tagged
Licence: CC BY 4.0


IGC-Laws is a subcorpus of the Icelandic Gigaword Corpus (see also CLARIN reference corpora). IGC-Laws contains 1) the Icelandic laws, 2) explanatory reports and observations extracted from bills submitted to Althingi, and 3) parliamentary proposals and resolutions. The corpus comes in two formats: one contains the texts untokenised and untagged, while the other has been tokenised, PoS-tagged and lemmatised.

The corpus is available for download from the CLARIN-IS repository.

For the relevant publication, see Steingrímsson et al. (2018)


Corpus of Legal Acts of the Republic of Latvia (Likumi)

Size: 116 million tokens; 73 million words
Licence: CC BY 4.0


The corpus contains all legal acts of the Republic of Latvia published on the website (until February 2022).

The corpus is available for download from the CLARIN.LV repository.


Lithuanian Corpus of the EU Primary and Secondary Law Acts of the Period 2015–2017

Size: 274,460 words


This corpus contains primary and secondary European law acts (32 texts) translated into Lithuanian.

The corpus is available for download from CLARIN-LT.


Maltese Acquis Communautaire

Size: 20.9 million tokens
Licence: MIT (academic)


This corpus contains selected texts from the Acquis Communautaire between the 1950s and today, translated into Maltese.

The corpus is available for download from PORTULAN.

For the relevant publication, see Steinberger et al. (2006)


META-NORD Acquis Norwegian Treebank

Size: 101 sentences; 1862 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0


This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).



Norwegian Acquis Communautaire

Size: 14 million words
Licence: CC BY-NC 4.0

Norwegian (Bokmål and Nynorsk)

This corpus contains Norwegian translations of 5414 documents in the Acquis Communautaire.

The corpus is available for download from the Norwegian Language Bank.


Legal Documents from Norwegian Nynorsk Municipalities

Size: 127 million words
Licence: CC0 1.0 Universal

Norwegian (Nynorsk and Bokmål)

This corpus contains 50,000 legal documents and meeting minutes collected with the web crawler Veidemann. Around 88.5 million words are in Nynorsk, while the rest are in Bokmål.

The corpus is available for download from the Norwegian Language Bank.


The Russian Sub-corpus of MULCOLD, Multilingual Parallel Corpus of Legal Texts

Size: 198,035 tokens
Annotation: lemmatised, MSD-tagged
Licence: CC BY-ND


This corpus, which is a subcorpus of MULCOLD (see also the Parallel corpora resource family), contains international conventions and treaties.

The corpus can be accessed online through the concordancer Korp (FIN-CLARIN Distribution).


The Russian Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

Size: 1.2 million tokens
Annotation: lemmatised, MSD-tagged
Licence: CC BY-ND


This is the Russian subcorpus of FiRuLex, which contains juridical texts in Russian and Finnish.

The corpus is available for online browsing through the concordancer Korp (FIN-CLARIN distribution).


META-NORD Acquis Swedish Treebank

Size: 102 sentences; 1982 words
Annotation: syntactically parsed (constituency); sentence/phrase/word segmentation
Licence: CC BY 4.0


This is a subcorpus of the META-NORD Acquis Parallel Treebank.

The corpus is available for download and online browsing through INESS (CLARINO).



Multilingual Corpora

Corpus Language Description Availability

JRC EU DGT Translation Memory Parsebank DGT-UD

Size: 2.1 billion tokens
Annotation: syntactically parsed (Universal Dependencies)
Licence: CC BY 4.0

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Irish, Italian, Latvian, Lithuanian, Modern Greek (1453-), Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish

This is a 23-language parallel, syntactically parsed corpus, which consists of the JRC DGT translation memory of European law, automatically annotated with UDPipe 1.2 using Universal Dependencies 2.0 models.

The corpus is available for download from the CLARIN.SI repository and for online browsing through the KonText and noSketch Engine concordancers.




The JRC-Acquis Corpus, version 3.0

Size: 1 billion words
Annotation: paragraph and sentence alignment
Licence: CC BY 4.0

Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish

This is a parallel corpus of the Acquis Communautaire, which is the total body of European Union law applicable in the EU member states.

Most texts have been manually classified according to the EUROVOC subject domains, so the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for cross-language research of all types, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction). The sentence-level alignment was done using the hunalign tool.

The corpus is available for download from the CLARIN:EL repository.

For the relevant publication, see Steinberger et al. (2006)
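The description above notes that the JRC-Acquis texts are encoded in TEI-style XML and classified by EUROVOC subject domain. As a rough illustration of how such a document might be read, here is a minimal Python sketch using only the standard library. The sample fragment, its element names (classCode, p), the scheme attribute and the function name read_document are assumptions made for illustration; the actual corpus files may be structured differently, so the corpus documentation should be checked before adapting it.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment. Element and attribute names are
# illustrative assumptions, not the verified JRC-Acquis schema.
SAMPLE = """<TEI>
  <teiHeader>
    <profileDesc>
      <textClass>
        <classCode scheme="eurovoc">1234</classCode>
        <classCode scheme="eurovoc">5678</classCode>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def read_document(xml_string):
    """Return (subject labels, numbered paragraphs) from one document."""
    root = ET.fromstring(xml_string)
    # Collect the subject-domain codes attached to the document.
    labels = [
        code.text.strip()
        for code in root.iter("classCode")
        if code.get("scheme") == "eurovoc" and code.text
    ]
    # Collect each paragraph with its number, flattening nested markup.
    paragraphs = [
        (p.get("n"), "".join(p.itertext()).strip())
        for p in root.iter("p")
    ]
    return labels, paragraphs


labels, paragraphs = read_document(SAMPLE)
print(labels)
print(paragraphs)
```

The label list extracted this way is the kind of per-document target that a multi-label classifier would be trained on, while the numbered paragraphs are the units that paragraph-level alignment operates over.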


COVID-19 EUR-LEX dataset. Bilingual (EN-PT)

Size: 21,000 units
Licence: CC BY

English, Portuguese

This is a parallel corpus of European Union law pertaining to the COVID-19 period.

The corpus is available for download from the PORTULAN repository.


Legal texts from Estonian Ministry of Justice (Processed)

Size: 47,000 units
Licence: CC BY


This corpus contains Estonian-English translations of the Acts of Estonian law.

The corpus is available for download from PORTULAN.



MultiEURLEX

Annotation: conceptual annotation
Licence: CC BY

Finnish, Slovak, Lithuanian, Croatian, Slovenian, Estonian, Latvian, Maltese, English, German, French, Italian, Spanish (Castilian), Polish, Romanian (Moldavian, Moldovan), Dutch (Flemish), Modern Greek (1453-), Hungarian, Portuguese, Czech, Swedish, Bulgarian, Danish

This corpus consists of 65,000 European laws in 23 official EU languages. Each law has been annotated with EuroVoc concept labels.

The corpus is available for download from the repository of CLARIN:EL.

For the relevant publication, see Chalkidis et al. (2021)


COVID-19 EUR-LEX dataset. Multilingual (CEF languages)

Size: 475,931 translation pairs
Licence: CC BY

Maltese, Hungarian, Lithuanian, Latvian, Polish, Portuguese, English, Slovenian, Modern Greek, Spanish (Castilian), Romanian, Slovak, Moldavian, Swedish, Bulgarian, Italian, German, Croatian, French, Dutch (Flemish), Czech, Finnish, Danish, Irish, Estonian

This is a multilingual corpus of European Union law pertaining to the COVID-19 period.

The corpus is available for download from the PORTULAN repository.


Publications on Legal Corpora

[Chalkidis et al. 2021] Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. 2021. MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv pre-print.

[de Does et al. 2017] Jesse de Does, Jan Niestadt, and Katrien Depuydt. 2017. Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, 151–165. London: Ubiquity Press.

[Goźdź-Roszkowski 2021] Stanisław Goźdź-Roszkowski. 2021. Corpus Linguistics in Legal Discourse. International Journal for the Semiotics of Law 34: 1515–1540. 

[Harašta et al. 2018] Jakub Harašta, Jaromír Šavelka, František Kasl, Adéla Kotková, Pavel Loutocký, Jakub Míšek, Daniela Procházková, Helena Pullmannová, Petr Semenišín, Tamara Šejnová, Nikola Šimková, Michal Vosinek, Lucie Zavadilová, and Jan Zibner. 2018. Annotated Corpus of Czech Case Law for Reference Recognition Tasks. In TSD 2018, volume 11107. Springer, Cham. 

[Johnson and Goldman 2009] Timothy R. Johnson and Jerry Goldman, eds. 2009. A Good Quarrel: America’s Top Legal Reporters Share Stories from Inside the Supreme Court. University of Michigan Press.

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0. Proceedings of LREC 2018, 4501–4505. 

[Novotná and Harašta 2019] Tereza Novotná and Jakub Harašta. 2019. The Czech Court Decisions Corpus (CzCDC): Availability as the First Step. arXiv pre-print.

[Steinberger et al. 2006] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of LREC 2006, 2142–2147.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.