Reference Corpora | CLARIN ERIC

According to the linguist Geoffrey Leech (2002), a "reference corpus is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora, CMC-corpora) in that they are comprehensive with respect to genre inclusion, typically sampling a diverse set of primarily written genres.

The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; they are also well annotated, typically with rich morphosyntactic annotation.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Reference corpora in the CLARIN infrastructure

Corpus	Language	Description	Availability
AbNC: Abkhaz National Corpus Size: 10 million words Annotation: MSD-tagged, lemmatized Licence: CLARIN_PUB-BY-NC-ND	Abkhaz	This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in . The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution). For the relevant publication, see Meurer (2018)	Concordancer
Bulgarian National Reference Corpus (BNRC) Size: 70 million tokens Annotation: tokenized, PoS-tagged Licence: Individual terms of agreement	Bulgarian	This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002. The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request. For the relevant publication, see Simov et al. (2004)	Concordancer
Croatian language corpus Riznica 0.1 Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts Annotation: sentence segmented, PoS-tagged, lemmatized Licence: CC BY-NC-SA 4.0	Croatian	This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%). The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository. For the relevant publication, see Ćavar and Brozović Rončević (2012)	noSketchEngine KonText Download
Croatian National Corpus Size: 101 million tokens	Croatian	This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction. The corpus is available for online browsing through the noSketch Engine. For the relevant publication, see Tadić (2002)	Concordancer
SYN2005: balanced corpus of written Czech Size: 100 million words Annotation: MSD-tagged, lemmatized Licence: Czech National Corpus (Shuffled Corpus Data)	Czech	This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014)	Concordancer Download
SYN2010: balanced corpus of written Czech Size: 100 million words Annotation: MSD-tagged, lemmatized Licence: Czech National Corpus (Shuffled Corpus Data)	Czech	This corpus includes Czech fiction, professional literature, newspapers etc. published between 2005 and 2009. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014)	Concordancer Download
SYN2015: representative corpus of written Czech Size: 100 million words Annotation: MSD-tagged, lemmatized Licence: Czech National Corpus (Shuffled Corpus Data)	Czech	This corpus includes Czech fiction, professional literature, newspapers etc. published between 2010 and 2014. The corpus is encoded in XML. The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository. For the relevant publication, see Hnátková et al. (2014)	Concordancer Download
DK-CLARIN Reference Corpus of General Danish Size: 45.1 million words Annotation: PoS-tagged, sentence and paragraph segmentation, lemmatized Licence: CLARIN ACA-NC	Danish	This corpus includes Danish texts published between 2008 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication. The corpus is available for download from the CLARIN-DK repository.	Download
SoNaR Size: 500 million words Annotation: PoS-tagged, lemmatized, named entities; coreference annotation and annotation of spatial and temporal relations for the manually annotated SoNaR-1 subset Licence: Terms of Agreement	Dutch	This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication). Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA. The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL).	Concordancer Download subset 1 Download subset 2
Corpus of Contemporary American English – Kielipankki version Size: 440 million words, 190,000 texts Annotation: PoS-tagged, lemmatized Licence: CLARIN ACA (online version), CLARIN RES (downloadable version)	English (American)	This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012. The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution).	Concordancer Download
British National Corpus Size: 100 million words Annotation: PoS-tagged, lemmatized Licence: BNC User Licence (restricted for the downloadable version)	English (British)	This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993. The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).	Concordancer Download
Estonian National Corpus 2019 Size: 1.5 billion words Annotation: MSD-tagged, lemmatized Licence: CC-BY-SA	Estonian	This corpus includes Estonian texts published between 1990 and 2019. Amongst others, this corpus contains the Estonian Reference Corpus as a subcorpus. The corpus is available for download from (CELR distribution).	Download
Estonian Reference Corpus Size: 175 million words Annotation: MSD-tagged, lemmatized Licence: free for non-commercial use	Estonian	This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI. The corpus is available for online browsing through a dedicated concordancer and is available for download from CELR.	Concordancer Download
DeReKo Size: 31.7 billion words Annotation: MSD-tagged, lemmatized Licence: CC-BY-SA	German	This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information. Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform. For the relevant publication, see Kupietz et al. (2018)	Concordancer Download
Corpus of Greek Texts Size: 27.6 million words Licence: CC-BY-NC, ACA	Greek	This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Goutsos (2010)	Concordancer
Diachronic corpus of Greek of the 20th century Size: 20 million words Licence: CC BY-NC	Greek	This corpus includes Greek texts published in the 20th century. The corpus is available for download from CLARIN:EL.	Download
Hellenic National Corpus Size: 47 million words Annotation: sentence segmented Licence: proprietary	Greek	This corpus includes Greek texts published from 1990 onwards. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Gavrilidou (2002)	Concordancer
Hungarian National Corpus Size: 190 million tokens Annotation: PoS-tagged Licence: free after registration	Hungarian	This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents). The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Váradi (2002)	Concordancer
The Icelandic Gigaword Corpus Size: 1.9 billion words Annotation: MSD-tagged, lemmatized Licence: CC-BY and a special user licence	Icelandic	This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017. The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence). For the relevant publication, see Steingrímsson et al. (2018)	Concordancer Download subset 1 Download subset 2
Balanced Corpus of Modern Latvian (LVK2022) Size: 122.9 million tokens Annotation: MSD-tagged, lemmatized	Latvian	This corpus includes texts from journalism, fiction, science, Wikipedia, legal documents, parliamentary transcripts, and subtitles. The corpus is available for online browsing through the noSketch Engine concordancer.	Concordancer
Corpus of the Contemporary Lithuanian Language Size: 208.4 million tokens Annotation: MSD-tagged, lemmatized Licence: CLARIN RES	Lithuanian	This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008. The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer.	Concordancer
The Lexicographic Corpus for Norwegian Bokmål (LBK) Size: 100 million tokens Annotation: PoS-tagged, lemmatized Licence: CLARIN_ACA-NC-LOC-ND	Norwegian (Bokmål)	This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013. The corpus is available for online browsing through the concordancer Glossa (CLARINO). For the relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013)	Concordancer
Norsk Ordboks Nynorskkorpus (NNK) Size: 107.8 million words Annotation: MSD-tagged, lemmatized Licence: CLARIN_RES-NC-DEP	Norwegian (Nynorsk)	This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML. The corpus is available for online browsing through the Corpuscle concordancer (CLARINO).	Concordancer
National Corpus of Polish Size: 1.8 billion tokens Annotation: MSD-tagged, lemmatized	Polish	This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. Aside from written materials, the corpus also includes transcriptions of spoken language. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Przepiórkowski et al. (2012)	Concordancer
Corpus of combined Slovenian corpora metaFida 1.0 Size: 6 billion tokens Annotation: MSD-tagged (MULTEXT-East), lemmatised, normalised Licence: various	Slovenian	This corpus combines a number of existing Slovenian corpora available through the CLARIN.SI concordancers and thus provides a unified search across all the included corpora. metaFida contains over 4.7 billion words, or 6 billion tokens, from 15 million texts published between 1584 and 2022 in 34 corpora. Only information that is common to most of the selected corpora is kept. The structure is nested very shallowly (text and paragraph), which makes it easier to create subcorpora or limit the search to individual text types. All metaFida positional attributes (word, normalised form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space.	Concordancer (noSketchEngine) Concordancer (KonText) Download
Spoken corpus Gos 2.0 Size: 1,534 texts; 127,604 utterances; 2,462,368 words Annotation: PoS-tagged, lemmatised, phonetically and orthographically transcribed Licence: CC BY-SA 4.0	Slovenian	This corpus contains transcripts of radio and TV shows, school lessons, private conversations, and business meetings. It is composed of three different sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), and a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 million words). The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer. For the relevant publication, see Verdonik and Zwitter-Vitez (2011)	Concordancer (noSketchEngine) Concordancer (KonText) Download
Written corpus ccGigafida 1.0 Size: 126.9 million tokens, 103.2 million words, 31,722 texts Annotation: MSD-tagged, lemmatized Licence: CC-BY-NC-SA 4.0	Slovenian	This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author. This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository. For the relevant publication, see Erjavec and Logar Berginc (2012)	Download
Written corpus ccKres 1.0 Size: 12.2 million tokens, 9.8 million words Annotation: MSD-tagged, lemmatized Licence: CC-BY	Slovenian	This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author. This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository. For the relevant publication, see Erjavec and Logar Berginc (2012)	Download
Written corpus Gigafida 2.0 Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts Annotation: MSD-tagged, lemmatized Licence: Individual terms of agreement	Slovenian	This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine. For the relevant publication, see Krek et al. (2018)	noSketchEngine Concordancer
Written corpus Kres 1.0 Size: 99 million words Annotation: MSD-tagged, lemmatized Licence: Individual terms of agreement	Slovenian	This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. The corpus is available for online browsing through a dedicated concordancer. For the relevant publication, see Krek et al. (2018)	Concordancer
CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh Size: 11 million words Licence: CC BY-NC-SA 4.0	Welsh	This corpus contains spoken, written and digital (e-language) Welsh. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels. The corpus is available for online browsing through a dedicated webpage and by request. For the relevant publication, see Knight et al. (2020)	Request Concordancer
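
The metaFida entry above notes that every positional attribute may hold several values separated by a space. A minimal sketch of how such a token could be unpacked after export is given below. The attribute names (`word`, `norm`, `lemma`, `msd_en`), the sample values, and the helper `split_multivalue` are illustrative assumptions, not part of any CLARIN.SI tool or documented export format.

```python
from typing import Dict, List


def split_multivalue(attrs: Dict[str, str]) -> Dict[str, List[str]]:
    """Split each space-separated attribute string into a list of values."""
    # str.split() with no argument splits on any run of whitespace
    # and returns an empty list for an empty string.
    return {name: value.split() for name, value in attrs.items()}


# Hypothetical token: one written word whose normalised form,
# lemma, and MSD each carry two values (illustrative data only).
token = {
    "word": "nebi",
    "norm": "ne bi",
    "lemma": "ne biti",
    "msd_en": "Q Va-c",
}

values = split_multivalue(token)
print(values["lemma"])  # ['ne', 'biti']
```

Treating every attribute as a list, even when it has a single value, keeps downstream code uniform: a query for a lemma can test membership in the list instead of comparing whole strings.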

Corpus

Language

Description

Availability

AbNC: Abkhaz National Corpus

Size: 10 million words
Annotation: MSD-tagged, lemmatized
Licence: CLARIN_PUB-BY-NC-ND

Abkhaz

This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in .

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution).

For the relevant publication, see Meurer (2018)

Concordancer

Bulgarian National Reference Corpus (BNRC)

Size: 70 million tokens
Annotation: tokenized, PoS-tagged
Licence: Individual terms of agreement

Bulgarian

This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002.

The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request.

For the relevant publication, see Simov et al. (2004)

Concordancer

Croatian language corpus Riznica 0.1

Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts
Annotation: sentence segmented, PoS-tagged, lemmatized
Licence: CC BY-NC-SA 4.0

Croatian

This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%).

The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository.

For the relevant publication, see Ćavar and Brozović Rončević (2012)

noSketchEngine

KonText

Download

Croatian National Corpus

Size: 101 million tokens

Croatian

This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction.

The corpus is available for online browsing through the noSketch Engine.

For the relevant publication, see Tadić (2002)

Concordancer

SYN2005: balanced corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

SYN2010: balanced corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2005 and 2009. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

SYN2015: representative corpus of written Czech

Size: 100 million words
Annotation: MSD-tagged, lemmatized
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus includes Czech fiction, professional literature, newspapers etc. published between 2010 and 2014. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)

Concordancer

Download

DK-CLARIN Reference Corpus of General Danish

Size: 45.1 million words
Annotation: PoS-tagged, sentence and paragraph segmentation, lemmatized
Licence: CLARIN ACA-NC

Danish

This corpus includes Danish texts published between 2008 and 2011.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR

Size: 500 million words
Annotation: PoS-tagged, lemmatized, named entities; coreference annotation and annotation of spatial and temporal relations for the manually annotated SoNaR-1 subset
Licence: Terms of Agreement

Dutch

This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication).

Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA.

The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL).

Concordancer

Download subset 1

Download subset 2

Corpus of Contemporary American English – Kielipankki version

Size: 440 million words, 190,000 texts
Annotation: PoS-tagged, lemmatized
Licence: CLARIN ACA (online version), CLARIN RES (downloadable version)

English (American)

This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012.

The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Download

British National Corpus

Size: 100 million words
Annotation: PoS-tagged, lemmatized
Licence: BNC User Licence (restricted for the downloadable version)

English (British)

This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993.

The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).

Concordancer

Download

Estonian National Corpus 2019

Size: 1.5 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-SA

Estonian

This corpus includes Estonian texts published between 1990 and 2019. Amongst others, this corpus contains the Estonian Reference Corpus as a subcorpus.

The corpus is available for download from (CELR distribution).

Download

Estonian Reference Corpus

Size: 175 million words
Annotation: MSD-tagged, lemmatized
Licence: free for non-commercial use

Estonian

This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI.

The corpus is available for online browsing through a dedicated concordancer and is available for download from CELR.

Concordancer

Download

DeReKo

Size: 31.7 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-SA

German

This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information.

Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform.

For the relevant publication, see Kupietz et al. (2018)

Concordancer

Download

Corpus of Greek Texts

Size: 27.6 million words
Licence: CC-BY-NC, ACA

Greek

This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Goutsos (2010)

Concordancer

Diachronic corpus of Greek of the 20th century

Size: 20 million words
Licence: CC BY-NC

Greek

This corpus includes Greek texts published in the 20th century.

The corpus is available for download from CLARIN:EL.

Download

Hellenic National Corpus

Size: 47 million words
Annotation: sentence segmented
Licence: proprietary

Greek

This corpus includes Greek texts published from 1990 onwards.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Gavrilidou (2002)

Concordancer

Hungarian National Corpus

Size: 190 million tokens
Annotation: PoS-tagged
Licence: free after registration

Hungarian

This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents).

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Váradi (2002)

Concordancer

The Icelandic Gigaword Corpus

Size: 1.9 billion words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY and a special user licence

Icelandic

This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017.

The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence).

For the relevant publication, see Steingrímsson et al. (2018)

Concordancer

Download subset 1

Download subset 2

Balanced Corpus of Modern Latvian (LVK2022)

Size: 122.9 million tokens
Annotation: MSD-tagged, lemmatized

Latvian

This corpus includes texts from journalism, fiction, science, Wikipedia, legal documents, parliamentary subscripts, and subtitles.

The corpus is available for online browsing through the noSketch Engine concordancer.

Concordancer

Corpus of the Contemporary Lithuanian Language

Size: 208.4 million tokens
Annotation: MSD-tagged, lemmatized
Licence: CLARIN RES

Lithuanian

This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008.

The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

Concordancer

The Lexicographic Corpus for Norwegian Bokmål (LBK)

Size: 100 million tokens
Annotation: PoS-tagged, lemmatized
Licence: CLARIN_ACA-NC-LOC-ND

Norwegian (Bokmål)

This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013.

The corpus is available for online browsing through the concordancer Glossa (CLARINO).

For the relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013)

Concordancer

Norsk Ordboks Nynorskkorpus (NNK)

Size: 107.8 million words
Annotation: MSD-tagged, lemmatized
Licence: CLARIN_RES-NC-DEP

Norwegian (Nynorsk)

This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML.

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO).

Concordancer

National Corpus of Polish

Size: 1.8 billion tokens
Annotation: MSD-tagged, lemmatized

Polish

This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Przepiórkowski et al. (2012)

Concordancer

Corpus of combined Slovenian corpora metaFida 1.0

Size: 6 billion tokens
Annotation: MSD-tagged (MULTEXT-East), lemmatised, normalised
Licence: various

Slovenian

This corpus contains a number of existing Slovenian corpora available through the CLARIN.SI concordances and thus provides a unified search across all the included corpora. metaFida contains over 4,7 billion words or 6 billion tokens from 15 million text published 1584 - 2022 from 34 corpora.

In the metaFida corpus we keep only information that is common to most of the selected corpora. The structure is nested very shallowly (text and paragraph), as it is then easier to create subcorpora or limit the search to individual text types. All metaFida positional attributes (word, normalised form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space.

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

Spoken corpus Gos 2.0

Size: 1534 texts; 127,604 utterances; 2,462,368 words
Annotation: PoS-tagged, lemmatised, phonetically and orthographically transcribed
Licence: CC BY-SA 4.0

Slovenian

This corpus contains transcripts from radio and TV shows, school lessons, private conversations, business meetings. It is composed of three different sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 mllion words).

The corpus is available for download from CLARIN.SI as well as through a dedicated webconcordancer.

For the relevant publication, see Verdonik and Zwitter-Vitez (2011)

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

Written corpus ccGigafida 1.0

Size: 126.9 million tokens, 103.2 million words, 31,722 texts
Annotation: MSD-tagged, lemmatized
Licence: CC-BY-NC-SA 4.0

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar (2012)

Download

Written corpus ccKres 1.0

Size: 12.2 million tokens, 9.8 million words
Annotation: MSD-tagged, lemmatized
Licence: CC-BY

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar (2012)

Download

Written corpus Gigafida 2.0

Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts
Annotation: MSD-tagged, lemmatized
Licence: Individual terms of agreement

Slovenian

This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine.

For the relevant publication, see Krek et al. (2016)

noSketchEngine

Concordancer

Written corpus Kres 1.0

Size: 99 million words
Annotation: MSD-tagged, lemmatized
Licence: Individual terms of agreement

Slovenian

This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011.

This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, author.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Krek et al. (2016)

Concordancer

CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh

Size: 11 million words
Licence: CC BY-NC-SA 4.0

Welsh

This corpus contains spoken, written and digital (e-language) Welsh. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

The corpus is available for online browsing through a dedicated webpage and by request.

For the relevant publication, see Knight et al. (2020)

Request

Concordancer

Publications

[Ćavar and Brozović Rončević 2012] Damir Ćavar and Dunja Brozović Rončević. 2012. Riznica: the Croatian Language Corpus. Prace filologiczne, 63: 51–65.

[Knight et al. 2020] D. Knight, S. Morris, T. Fitzpatrick, P. Rayson, I. Spasić, and E-M. Thomas. 2020. Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh – A community driven approach to linguistic corpus construction: Project Report.

[Erjavec and Logar Berginc 2012] Tomaž Erjavec and Nataša Logar Berginc. 2012. Referenčni korpusi slovenskega jezika (cc)Gigafida in (cc)KRES. In Zbornik Osme konference Jezikovne tehnologije, 57–62.

[Gavrilidou 2002] Maria Gavrilidou. 2002. The Hellenic National Corpus on-line. Revue belge de Philologie et d'Histoire, 80 (3): 1003–1015.

[Goutsos 2010] Dionysis Goutsos. 2010. The Corpus of Greek Texts: a reference corpus for Modern Greek. Corpora, 5 (1): 29–44.

[Hnátková et al. 2014] Milena Hnátková, Michal Křen, Pavel Procházka, and Hana Skoumalová. 2014. The SYN-series corpora of written Czech. In Proceedings of LREC 2014, 160–164.

[Krek et al. 2016] Simon Krek, Polona Gantar, Špela Arhar Holdt, and Vojko Gorjanc. 2016. In Proceedings of the Conference on Language Technologies and Digital Humanities, 200–202.

[Kupietz et al. 2018] Marc Kupietz, Harald Lüngen, Pawel Kamocki, and Andreas Witt. 2018. The German Reference Corpus DeReKo: New Developments – New Opportunities. In Proceedings of LREC 2018, 4353–4360.

[Lain Knudsen and Vatvedt Fjeld 2013] Rune Lain Knudsen and Ruth Vatvedt Fjeld. 2013. LBK2013: A balanced, annotated national corpus for Norwegian Bokmål. In Proceedings of the workshop on lexical semantic resources for NLP at NODALIDA 2013, 12–20.

[Leech 2002] Geoffrey Leech. 2002. The Importance of Reference Corpora.

[Meurer 2017] Paul Meurer. 2017. The Morphosyntactic Analysis of Georgian.

[Meurer 2018] Paul Meurer. 2018. The Abkhaz National Corpus. In Proceedings of LREC 2018, 2456–2460.

[Przepiórkowski et al. 2012] Adam Przepiórkowski, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, editors. 2012. Narodowy Korpus Języka Polskiego.

[Simov et al. 2004] Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff. 2004. A Language Resources Infrastructure for Bulgarian. In Proceedings of LREC 2004, 1685–1688.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of LREC 2018, 4361–4366.

[Tadić 2002] Marko Tadić. 2002. Building the Croatian National Corpus. In Proceedings of LREC 2002, 441–446.

[Váradi 2002] Tamás Váradi. 2002. The Hungarian National Corpus. In Proceedings of LREC 2002, 385–389.