
Reference Corpora

According to the linguist Geoffrey Leech (2002), a "reference corpus is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora, CMC-corpora) in that they are comprehensive with respect to genre inclusion, typically sampling a diverse set of primarily written genres. 

The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

Reference corpora in the CLARIN infrastructure

Corpus Language Description Availability

AbNC: Abkhaz National Corpus

Size: 10 million words 
Annotation: MSD-tagged, lemmatized 
Licence: CLARIN_PUB-BY-NC-ND

Abkhaz

This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in .

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution).

For the relevant publication, see Meurer (2018)

Concordancer

Bulgarian National Reference Corpus (BNRC)

Size: 70 million tokens 
Annotation: tokenized, PoS-tagged 
Licence: Individual terms of agreement

Bulgarian

This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002.

The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request.

For the relevant publication, see Simov et al. (2004)

Concordancer

Croatian language corpus Riznica 0.1

Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts 
Annotation: sentence segmented, PoS-tagged, lemmatized 
Licence: CC BY-NC-SA 4.0

Croatian

This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%).

The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository.

For the relevant publication, see Ćavar and Brozović Rončević (2012)

noSketchEngine

KonText

Download

Croatian National Corpus

Size: 101 million tokens

Croatian

This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction.

The corpus is available for online browsing through the noSketch Engine.

For the relevant publication, see Tadić (2002)

Concordancer

SYN2005: balanced corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

SYN2010: balanced corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2005 and 2009. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

SYN2015: representative corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2010 and 2014. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

DK-CLARIN Reference Corpus of General Danish

Size: 45.1 million words 
Annotation: PoS-tagged, sentence and paragraph segmentation, lemmatized 
Licence: CLARIN ACA-NC

Danish

This corpus includes Danish texts published between 2008 and 2011.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR

Size: 500 million words 
Annotation: PoS-tagged, lemmatized, named entities; coreference annotation and annotation of spatial and temporal relations for the manually annotated SoNaR-1 subset 
Licence: Terms of Agreement

Dutch

This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication).

Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA.

The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL).

Concordancer

Download subset 1

Download subset 2

Corpus of Contemporary American English – Kielipankki version

Size: 440 million words, 190,000 texts 
Annotation: PoS-tagged, lemmatized 
Licence: CLARIN ACA (online version), CLARIN RES (downloadable version)

English (American)

This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012.

The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Download

British National Corpus

Size: 100 million words 
Annotation: PoS-tagged, lemmatized 
Licence: BNC User Licence (restricted for the downloadable version)

English (British)

This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993.

The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).

Concordancer

Download

Estonian National Corpus 2019

Size: 1.5 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-SA

Estonian

This corpus includes Estonian texts published between 1990 and 2019. Among its subcorpora is the Estonian Reference Corpus.

The corpus is available for download from (CELR distribution).

Download

Estonian Reference Corpus

Size: 175 million words 
Annotation: MSD-tagged, lemmatized 
Licence: free for non-commercial use

Estonian

This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI.

The corpus is available for online browsing through a dedicated concordancer and is available for download from CELR.

Concordancer

Download

DeReKo

Size: 31.7 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-SA

German

This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information.

Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform.

For the relevant publication, see Kupietz et al. (2018)

Concordancer

Download

Corpus of Greek Texts

Size: 27.6 million words 
Licence: CC-BY-NC, ACA

Greek

This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Goutsos (2010)

Concordancer

Diachronic corpus of Greek of the 20th century

Size: 20 million words 
Licence: CC BY-NC

Greek

This corpus includes Greek texts published in the 20th century.

The corpus is available for download from CLARIN:EL.

Download

Hellenic National Corpus

Size: 47 million words 
Annotation: sentence segmented 
Licence: proprietary

Greek

This corpus includes Greek texts published from 1990 onwards.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Gavrilidou (2002)

Concordancer

Hungarian National Corpus

Size: 190 million tokens 
Annotation: PoS-tagged 
Licence: free after registration

Hungarian

This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents).

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Váradi (2002)

Concordancer

The Icelandic Gigaword Corpus

Size: 1.9 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY and a special user licence

Icelandic

This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017.

The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence).

For the relevant publication, see Steingrímsson et al. (2018)

Concordancer

Download subset 1

Download subset 2

Balanced Corpus of Modern Latvian (LVK2022)

Size: 122.9 million tokens 
Annotation: MSD-tagged, lemmatized

Latvian

This corpus includes texts from journalism, fiction, science, Wikipedia, legal documents, parliamentary transcripts, and subtitles.

The corpus is available for online browsing through the noSketch Engine concordancer.

Concordancer

Corpus of the Contemporary Lithuanian Language

Size: 208.4 million tokens 
Annotation: MSD-tagged, lemmatized 
Licence: CLARIN RES

Lithuanian

This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008.

The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

Concordancer

The Lexicographic Corpus for Norwegian Bokmål (LBK)

Size: 100 million tokens 
Annotation: PoS-tagged, lemmatized 
Licence: CLARIN_ACA-NC-LOC-ND

Norwegian (Bokmål)

This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013.

The corpus is available for online browsing through the concordancer Glossa (CLARINO).

For the relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013)

Concordancer

Norsk Ordboks Nynorskkorpus (NNK)

Size: 107.8 million words 
Annotation: MSD-tagged, lemmatized 
Licence: CLARIN_RES-NC-DEP

Norwegian (Nynorsk)

This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML.

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO).

Concordancer

National Corpus of Polish

Size: 1.8 billion tokens 
Annotation: MSD-tagged, lemmatized

Polish

This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Przepiórkowski et al. (2012)

Concordancer

Corpus of combined Slovenian corpora metaFida 1.0

Size: 6 billion tokens 
Annotation: MSD-tagged (MULTEXT-East), lemmatised, normalised 
Licence: various

Slovenian

This corpus combines a number of existing Slovenian corpora available through the CLARIN.SI concordancers and thus provides a unified search across all the included corpora. metaFida contains over 4.7 billion words (6 billion tokens) from 15 million texts published between 1584 and 2022, drawn from 34 corpora.

In the metaFida corpus we keep only information that is common to most of the selected corpora. The structure is nested very shallowly (text and paragraph), as it is then easier to create subcorpora or limit the search to individual text types. All metaFida positional attributes (word, normalised form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space.

Concordancer (noSketchEngine)

Concordancer (KonText)

Download
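The multi-valued positional attributes described for metaFida can be illustrated with a minimal Python sketch. It assumes a tab-separated token line with one column per attribute; the column order, the column names, and the sample values are hypothetical and not taken from the corpus documentation. The only property drawn from the description above is that each attribute may hold several values separated by a space.

```python
from typing import Dict, List

# Hypothetical column order for one token line; the actual export may differ.
COLUMNS = ["word", "norm", "lemma", "msd_sl", "msd_en"]


def parse_token(line: str) -> Dict[str, List[str]]:
    """Split a tab-separated token line into attributes, each a list of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    # Every attribute is treated as multi-valued: values are separated by a space.
    return {name: value.split() for name, value in zip(COLUMNS, fields)}


# Placeholder values only, invented for illustration.
token = parse_token("w\tn1 n2\tl1 l2\tTAG1 TAG2\tTAG3 TAG4")
print(token["lemma"])  # ['l1', 'l2']
```

Treating every attribute as a list, even when it holds a single value, keeps downstream code uniform across the included corpora.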

Spoken corpus Gos 2.0

Size: 1,534 texts; 127,604 utterances; 2,462,368 words 
Annotation: PoS-tagged, lemmatised, phonetically and orthographically transcribed 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains transcripts of radio and TV shows, school lessons, private conversations, and business meetings. It is composed of three sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), and a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 million words).

The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer.

For the relevant publication, see Verdonik and Zwitter-Vitez (2011)

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

Written corpus ccGigafida 1.0

Size: 126.9 million tokens, 103.2 million words, 31,722 texts 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-NC-SA 4.0

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar Berginc (2012)

Download

Written corpus ccKres 1.0

Size: 12.2 million tokens, 9.8 million words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar Berginc (2012)

Download

Written corpus Gigafida 2.0

Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts 
Annotation: MSD-tagged, lemmatized 
Licence: Individual terms of agreement

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine.

For the relevant publication, see Krek et al. (2016)

noSketchEngine

Concordancer

Written corpus Kres 1.0

Size: 99 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Individual terms of agreement

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011.

This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Krek et al. (2016)

Concordancer

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh

Size: 11 million words 
Licence: CC BY-NC-SA 4.0

Welsh

This corpus contains spoken, written and digital (e-language) Welsh. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

The corpus is available for online browsing through a dedicated webpage and by request.

For the relevant publication, see Knight et al. (2020)

Request

Concordancer

Publications

[Ćavar and Brozović Rončević 2012] Damir Ćavar and Dunja Brozović Rončević. 2012. Riznica: the Croatian Language Corpus. Prace filologiczne, 63: 51–65. 

[Knight et al. 2020] Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E-M. (2020). Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh – A community driven approach to linguistic corpus construction: Project Report.

[Erjavec and Logar Berginc 2012] Tomaž Erjavec and Nataša Logar Berginc. 2012. Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In Zbornik Osme konference Jezikovne tehnologije, 57–62.

[Gavrilidou 2002] Maria Gavrilidou. 2002. The Hellenic National Corpus on-line. Revue belge de Philologie et d'Histoire, 80 (3): 1003–1015. 

[Goutsos 2010] Dionysis Goutsos. 2010. The Corpus of Greek Texts: a reference corpus for Modern Greek. Corpora, 5 (1): 29–44. 

[Hnátková et al. 2014] Milena Hnátková, Michal Kren, Pavel Procházka, and Hana Skoumalová. 2014. The SYN-series corpora of written Czech. In Proceedings of LREC 2014, 160–164.

[Krek et al. 2016] Simon Krek, Polona Gantar, Špela Arhar Holdt, and Vojko Gorjanc. 2016. In Proceedings of the Conference on Language Technologies and Digital Humanities, 200–202.

[Kupietz et al. 2018] Marc Kupietz, Harald Lüngen, Pawel Kamocki, and Andreas Witt. 2018. The German Reference Corpus DeReKo: New Developments – New Opportunities. In Proceedings of LREC 2018, 4353–4360.

[Lain Knudsen and Vatvedt Fjeld 2013] Rune Lain Knudsen and Ruth Vatvedt Fjeld. 2013. LBK2013: A balanced, annotated national corpus for Norwegian Bokmål. In Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, 12–20.

[Leech 2002] Geoffrey Leech. 2002. The Importance of Reference Corpora.

[Meurer 2017] Paul Meurer. 2017. The Morphosyntactic Analysis of Georgian.

[Meurer 2018] Paul Meurer. 2018. The Abkhaz National Corpus. In Proceedings of LREC 2018, 2456–2460.

[Przepiórkowski et al. 2012] Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego.

[Simov et al. 2004] Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, and Dimitar Doikoff. 2004. A Language Resources Infrastructure for Bulgarian. In Proceedings of LREC 2004, 1685–1688.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.

[Tadić 2002] Marko Tadić. 2002. Building the Croatian National Corpus. In Proceedings of LREC 2002, 441–446.

[Váradi 2002] Tamás Váradi. 2002. The Hungarian National Corpus. In Proceedings of LREC 2002, 385–389.