According to the linguist Geoffrey Leech (2002), a "reference corpus is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora, CMC-corpora) in that they are comprehensive with respect to genre inclusion, typically sampling a diverse set of primarily written genres.
The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Reference corpora in the CLARIN infrastructure
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 10 million words |
Abkhaz |
This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in . The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution). For the relevant publication, see Meurer (2018) |
Concordancer |
Bulgarian National Reference Corpus (BNRC) Size: 70 million tokens |
Bulgarian |
This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002. The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request. For the relevant publication, see Simov et al. (2004) |
Concordancer |
Croatian language corpus Riznica 0.1 Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts |
Croatian |
This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%). The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository. For the relevant publication, see Ćavar and Brozović Rončević (2012) |
|
Size: 101 million tokens |
Croatian |
This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction. The corpus is available for online browsing through the noSketch Engine. For the relevant publication, see Tadić (2002) |
Concordancer |
SYN2005: balanced corpus of written Czech Size: 100 million words |
Czech |
This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014) |
|
SYN2010: balanced corpus of written Czech Size: 100 million words |
Czech |
This corpus includes Czech fiction, professional literature, newspapers etc. published between 2005 and 2009. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014) |
|
SYN2015: representative corpus of written Czech Size: 100 million words |
Czech |
This corpus includes Czech fiction, professional literature, newspapers etc. published between 2010 and 2014. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014) |
|
DK-CLARIN Reference Corpus of General Danish Size: 45.1 million words |
Danish |
This corpus includes Danish texts published between 2008 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication. The corpus is available for download from the CLARIN-DK repository. |
Download |
Size: 500 million words |
Dutch |
This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication). Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA. The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL). |
|
Corpus of Contemporary American English – Kielipankki version Size: 440 million words, 190,000 texts |
English (American) |
This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012. The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution). |
|
Size: 100 million words |
English (British) |
This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993. The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK). |
|
Size: 1.5 billion words |
Estonian |
This corpus includes Estonian texts published between 1990 and 2019. Amongst others, this corpus contains the Estonian Reference Corpus as a subcorpus. The corpus is available for download from (CELR distribution). |
Download |
Size: 175 million words |
Estonian |
This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI. The corpus is available for online browsing through a dedicated concordancer and is available for download from CELR. |
|
Size: 31.7 billion words |
German |
This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information. Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform. For the relevant publication, see Kupietz et al. (2018) |
|
Size: 27.6 million words |
Greek |
This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Goutsos (2010) |
Concordancer |
Diachronic corpus of Greek of the 20th century Size: 20 million words |
Greek |
This corpus includes Greek texts published in the 20th century. The corpus is available for download from CLARIN:EL. |
Download |
Size: 47 million words |
Greek |
This corpus includes Greek texts published from 1990 onwards. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Gavrilidou (2002) |
Concordancer |
Size: 190 million tokens |
Hungarian |
This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents). The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Váradi (2002) |
Concordancer |
Size: 1.9 billion words |
Icelandic |
This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017. The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence). For the relevant publication, see Steingrímsson et al. (2018) |
|
Balanced Corpus of Modern Latvian (LVK2022) Size: 122.9 million tokens |
Latvian |
This corpus includes texts from journalism, fiction, science, Wikipedia, legal documents, parliamentary transcripts, and subtitles. The corpus is available for online browsing through the noSketch Engine concordancer. |
Concordancer |
Corpus of the Contemporary Lithuanian Language Size: 208.4 million tokens |
Lithuanian |
This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008. The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer. |
Concordancer |
The Lexicographic Corpus for Norwegian Bokmål (LBK) Size: 100 million tokens |
Norwegian (Bokmål) |
This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013. The corpus is available for online browsing through the concordancer Glossa (CLARINO). For the relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013) |
Concordancer |
Norsk Ordboks Nynorskkorpus (NNK) Size: 107.8 million words |
Norwegian (Nynorsk) |
This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML. The corpus is available for online browsing through the Corpuscle concordancer (CLARINO). |
Concordancer |
Size: 1.8 billion tokens |
Polish |
This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Przepiórkowski et al. (2012) |
Concordancer |
Corpus of combined Slovenian corpora metaFida 1.0 Size: 6 billion tokens |
Slovenian |
This corpus combines a number of existing Slovenian corpora available through the CLARIN.SI concordancers and thus provides a unified search across all the included corpora. metaFida contains over 4.7 billion words, or 6 billion tokens, from 15 million texts published between 1584 and 2022 and drawn from 34 corpora. The metaFida corpus keeps only information that is common to most of the selected corpora. The structure is nested very shallowly (text and paragraph), which makes it easier to create subcorpora or limit the search to individual text types. All metaFida positional attributes (word, normalised form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space. |
|
Size: 1534 texts; 127,604 utterances; 2,462,368 words |
Slovenian |
This corpus contains transcripts of radio and TV shows, school lessons, private conversations, and business meetings. It is composed of three sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), and a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 million words). The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer. For the relevant publication, see Verdonik and Zwitter-Vitez (2011) |
|
Size: 126.9 million tokens, 103.2 million words, 31,722 texts |
Slovenian |
This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository. For the relevant publication, see Erjavec and Logar Berginc (2012) |
Download |
Size: 12.2 million tokens, 9.8 million words |
Slovenian |
This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository. For the relevant publication, see Erjavec and Logar Berginc (2012) |
Download |
Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts |
Slovenian |
This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine. For the relevant publication, see Krek et al. (2016) |
|
Size: 99 million words |
Slovenian |
This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Krek et al. (2016) |
Concordancer |
CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh Size: 11 million words |
Welsh |
This corpus contains spoken, written and digital (e-language) Welsh. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels. The corpus is available for online browsing through a dedicated webpage and by request. For the relevant publication, see Knight et al. (2020) |
Publications
[Ćavar and Brozović Rončević 2012] Damir Ćavar and Dunja Brozović Rončević. 2012. Riznica: the Croatian Language Corpus. Prace filologiczne, 63: 51–65.
[Knight et al. 2020] Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E-M. (2020). Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh – A community driven approach to linguistic corpus construction: Project Report.
[Erjavec and Logar Berginc 2012] Tomaž Erjavec and Nataša Logar Berginc. 2012. Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In Zbornik Osme konference Jezikovne tehnologije, 57–62.
[Gavrilidou 2002] Maria Gavrilidou. 2002. The Hellenic National Corpus on-line. Revue belge de Philologie et d'Histoire, 80 (3): 1003–1015.
[Goutsos 2010] Dionysis Goutsos. 2010. The Corpus of Greek Texts: a reference corpus for Modern Greek. Corpora, 5 (1): 29–44.
[Hnátková et al. 2014] Milena Hnátková, Michal Kren, Pavel Procházka, and Hana Skoumalová. 2014. The SYN-series corpora of written Czech. In Proceedings of LREC 2014, 160–164.
[Krek et al. 2016] Simon Krek, Polona Gantar, Špela Arhar Holdt, and Vojko Gorjanc. 2016. In Proceedings of the Conference on Language Technologies and Digital Humanities, 200–202.
[Kupietz et al. 2018] Marc Kupietz, Harald Lüngen, Pawel Kamocki, and Andreas Witt. 2018. The German Reference Corpus DeReKo: New Developments – New Opportunities. In Proceedings of LREC 2018, 4353–4360.
[Lain Knudsen and Vatvedt Fjeld 2013] Rune Lain Knudsen and Ruth Vatvedt Fjeld. 2013. LBK2013: A balanced, annotated national corpus for Norwegian Bokmål. In Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, 12–20.
[Leech 2002] Geoffrey Leech. 2002. The Importance of Reference Corpora.
[Meurer 2017] Paul Meurer. 2017. The Morphosyntactic Analysis of Georgian.
[Meurer 2018] Paul Meurer. 2018. The Abkhaz National Corpus. In Proceedings of LREC 2018, 2456–2460.
[Przepiórkowski et al. 2012] Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego.
[Simov et al. 2004] Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, and Dimitar Doikoff. 2004. A Language Resources Infrastructure for Bulgarian. In Proceedings of LREC 2004, 1685–1688.
[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.
[Tadić 2002] Marko Tadić. 2002. Building the Croatian National Corpus. In Proceedings of LREC 2002, 441–446.
[Váradi 2002] Tamás Váradi. 2002. The Hungarian National Corpus. In Proceedings of LREC 2002, 385–389.