Manually Annotated Corpora

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses and named entities. These corpora can be used to train new language annotation tools, as well as to test the accuracy of existing annotation tools.

There are more than 70 manually annotated training corpora and corpus collections in the CLARIN infrastructure. Among the multilingual corpora, there are 4 collections in the CLARIN infrastructure that were annotated under the following umbrella initiatives: HamleDT 3.0, Treebanks of INESS, Universal Dependencies, and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1).

The corpora and corpus collections are classified into six categories based on the type of manual annotation:

PoS/MSD tagging
Lemmatisation
Syntactic parsing
Named Entity recognition
Sentiment analysis
Other

If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, Named Entities and sentiment, so it is listed under all three relevant sections.
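The listing rule above can be sketched in a few lines of Python. This is only an illustration, not how the page is actually generated: the names `CATEGORIES`, `CORPORA` and `build_sections` are hypothetical, and the two records simply restate annotation types given on this page.

```python
from collections import defaultdict

# The six categories named in the list above.
CATEGORIES = [
    "PoS/MSD tagging",
    "Lemmatisation",
    "Syntactic parsing",
    "Named Entity recognition",
    "Sentiment analysis",
    "Other",
]

# Hypothetical records: corpus name -> categories it is manually annotated for.
CORPORA = {
    "xLiMe Twitter Corpus XTC 1.0.1": [
        "PoS/MSD tagging",
        "Named Entity recognition",
        "Sentiment analysis",
    ],
    "Training corpus jos1M 1.1": ["PoS/MSD tagging", "Lemmatisation"],
}


def build_sections(corpora):
    """Return a mapping from category to the corpora listed under it."""
    sections = defaultdict(list)
    for name, annotations in corpora.items():
        for category in annotations:
            if category not in CATEGORIES:
                raise ValueError(f"unknown category: {category}")
            # A corpus is appended once per category it is annotated for.
            sections[category].append(name)
    return dict(sections)


sections = build_sections(CORPORA)
for category in CATEGORIES:
    for name in sections.get(category, []):
        print(f"{category}: {name}")
```

Under these assumptions the xLiMe corpus appears in three sections and jos1M in two, which mirrors how the tables below repeat the same corpus in several places.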

For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

The Manually Annotated Corpora

PoS/MSD tagging

Corpus	Language	Description	Availability
MULTEXT-East "1984" annotated corpus 4.0 Size: 80,000 sentences, 1 million words Annotation: morphosyntactic tagging, lemmatisation, sentence alignment Licence: CC BY-NC-SA 4.0	Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian	This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2012)	Download
The Morphologically Annotated Part of BulTreeBank Size: 214,000 tokens Annotation: morphosyntactic tagging Licence: MS-NC-NoReD	Bulgarian	This corpus is available through the concordancer Corpuscle.	Concordancer
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Croatian	This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
BNC Sampler Size: 2 million tokens Annotation: PoS tagging Licence: BNC Licence	English	The corpus was manually post-edited to correct the PoS tags automatically assigned by CLAWS. The corpus is available for online querying via CQPWeb (registration required) and for download from the Oxford Text Archive.	Concordancer Download
Corpus of morphologically disambiguated Estonian texts Size: 513,000 tokens Annotation: morphological disambiguation Licence: CLARIN_ACA-NC	Estonian	This corpus contains texts from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990.	Download
Austrian Baroque Corpus Size: 200,000 tokens Annotation: tokenised, PoS-tagged, lemmatised, named entities	German	This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatisation, a normalised basic word form was used for each token, and the Duden and the German dictionary by Jacob and Wilhelm Grimm were used as reference works. The part-of-speech tagging and lemmatisation were then manually checked. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016)	Concordancer
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens Annotation: PoS tagging, Named Entity recognition, sentiment analysis Licence: MIT License	German, Italian, Spanish	This corpus contains Tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016)	Download
Szeged Corpus 2.0 Size: 1.5 million tokens Annotation: morphosyntactic tagging Licence: Licence agreement	Hungarian	This corpus is available for download from a dedicated webpage. To download the versions of the Szeged Corpus and Szeged Treebank, you are required to fill in and send a Licence Agreement.	Download
Lithuanian morphologically annotated corpus - MATAS Size: 1.6 million words Annotation: morphosyntactic tagging Licence: CLARIN ACA	Lithuanian	The corpus contains texts from various domains (documents, fiction, periodicals, scientific texts, wordforms). This corpus is available for download from the CLARIN-LT repository.	Download
NKJP1M Size: 1 million tokens Annotation: morphosyntactic tagging Licence: GNU GPL 3	Polish	This corpus is a manually annotated subset of the National Corpus of Polish. The corpus is available for download from the Computational Linguistics in Poland website. For the relevant publication, see Przepiórkowski and Murzynowski (2011)	Download
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens Annotation: morphosyntactic tagging, tokenisation, sentence segmentation, word normalisation, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Serbian	This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY-SA 4.0	Slovenian	This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018)	KonText noSketch Download
Training corpus jos1M 1.1 Size: 1 million words Annotation: morphosyntactic tagging and lemmatisation Licence: CC BY-NC 4.0	Slovenian	This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec et al. (2010)	Download
Training corpus ssj500k 2.1 Size: 586,000 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles Licence: CC BY-NC-SA 4.0	Slovenian	This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, and named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. Licence: CC BY-SA 4.0	Croatian	This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016)	Download
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities Licence: CC BY-SA 4.0	Serbian	This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017)	Download

Lemmatisation

Corpus	Language	Description	Availability
MULTEXT-East "1984" annotated corpus 4.0 Size: 80,000 sentences, 1 million words Annotation: morphosyntactic tagging, lemmatisation, sentence alignment Licence: CC BY-NC-SA 4.0	Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian	This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2012)	Download
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Croatian	This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
Austrian Baroque Corpus Size: 200,000 tokens Annotation: tokenised, PoS-tagged, lemmatised, named entities	German	This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatisation, a normalised basic word form was used for each token, and the Duden and the German dictionary by Jacob and Wilhelm Grimm were used as reference works. The part-of-speech tagging and lemmatisation were then manually checked. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016)	Concordancer
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Serbian	This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016).	Download
Training corpus SETimes.SR 1.0 Size: 87,000 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition Licence: CC BY-SA 4.0	Serbian	This corpus contains posts from the Southeast European Times news portal, which is now defunct. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Batanović et al. (2018).	KonText noSketch Download
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY-SA 4.0	Slovenian	This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018)	KonText noSketch Download
Training corpus jos1M 1.1 Size: 1 million words Annotation: morphosyntactic tagging and lemmatisation Licence: CC BY-NC 4.0	Slovenian	This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec et al. (2010).	Download
Training corpus ssj500k 2.1 Size: 586,000 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles Licence: CC BY-NC-SA 4.0	Slovenian	This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, and named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. Licence: CC BY-SA 4.0	Croatian	This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016)	Download
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities Licence: CC BY-SA 4.0	Serbian	This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017)	Download

Syntactic parsing

Corpus	Language	Description	Availability
Prague Arabic Dependency Treebank 1.0 Annotation: syntactic parsing and morphosyntactic tagging Licence: CC BY-NC-SA 3.0	Arabic	This corpus is available for download from the LINDAT repository. For the relevant publication, see Hajič et al. (2004)	Download
Czech Legal Text Treebank 2.0 Size: 1121 sentences Annotation: syntactic parsing, labelling of semantic entities Licence: CC BY-NC-SA 4.0	Czech	This corpus contains legal texts. The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository. For the relevant publication, see Kríž and Hladká (2018)	KonText PML-TQ Download
FicTree 1.0 Size: 12760 sentences Annotation: syntactic parsing and morphosyntactic tagging Licence: CC BY-NC-SA 4.0	Czech	This corpus contains fictional texts. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Jelínek (2017)	KonText Download
Prague Dependency Treebank 3.5 Size: 2 million words Annotation: syntactic parsing and morphosyntactic tagging Licence: CC BY-NC-SA 4.0	Czech	This corpus is manually annotated at several levels: aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations. The corpus is available for download from the LINDAT repository.	Download
Prague Discourse Treebank 2.0 Size: 49,500 sentences Annotation: syntactic parsing, mark-up of discourse phenomena enriched by the annotation of secondary connectives Licence: CC-BY	Czech	This corpus is a subset of the Prague Dependency Treebank 3.5. The corpus is available through the PML-TQ tool.	PML-TQ
Slovak Dependency Treebank Size: 106,000 tokens, 10,600 sentences Annotation: syntactic parsing Licence: CC BY-SA 4.0	Slovak	The syntactic parsing is modelled after the Prague Dependency Treebank. The corpus is available for download from the LINDAT repository.	Download
Prague Czech-English Dependency Treebank 2.0 Coref Size: 49,000 sentences Annotation: syntactic parsing, mark-up of coreference Licence: CC-BY-NC-SA + LDC99T42 (restricted use)	Czech, English	This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style. The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only). For the relevant publication, see Hajič et al. (2012)	KonText PML-TQ Download
Artificial Treebank with Ellipsis Size: 106,000 tokens, 10,604 sentences Annotation: syntactic parsing, mark-up of elliptical constructions Licence: Licence Universal dependencies v2.1	Czech, English, Finnish, Russian, Slovak	The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the LINDAT repository.	Download
Lassy Klein-corpus Size: 1 million tokens Annotation: PoS tagging, syntactic parsing Licence: VAGUE	Dutch	This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL. For the relevant publication, see Noord (2009)	Download PaQu GrETEL
SoNaR-1 Size: 1 million words Annotation: PoS tagging, syntactic parsing, semantic role labelling	Dutch	This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus. The corpus is available for download from the Dutch Language Institute.	Download
Estonian Treebank Size: 1,000 sentences Annotation: syntactic parsing Licence: CLARIN_ACA	Estonian	The corpus contains fictional and newspaper texts. The corpus is available for download from META-SHARE (CELR distribution).	Download
UD Estonian ver.2.3 Size: 434,000 tokens Annotation: syntactic parsing Licence: CC-BY-SA	Estonian	This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from META-SHARE (CELR distribution). For the relevant publication, see Muischnek et al. (2014)	Download
TimeML annotated corpus of Estonian newspaper articles Size: 22,000 words Annotation: morphosyntactic tagging and syntactic parsing Licence: CC-BY-SA	Estonian	This corpus contains newspaper articles. The corpus is available for download from META-SHARE (CELR distribution). For the relevant publication, see Orasmaa (2014)	Download
Finnish TreeBank 1 Size: 160,000 tokens Annotation: syntactic parsing Licence: CC-BY 3.0	Finnish	This corpus contains 19,000 sentences from the Large Grammar of Finnish. The corpus is available for download from the Language Bank of Finland.	Download
Finnish TreeBank 2 Size: 160,000 tokens Annotation: syntactic parsing Licence: CC-BY 3.0	Finnish	This corpus contains 19,000 sentences from the Large Grammar of Finnish. The corpus is available for download from the Language Bank of Finland.	Download
Turku Dependency Treebank Size: 204,000 tokens Annotation: syntactic parsing Licence: CC-BY-SA	Finnish	The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the Turku BioNLP Group. For the relevant publication, see Haverinen et al. (2013)	Download
Syntactic Reference Corpus of Medieval French Size: 245,000 words Annotation: syntactic parsing Licence: CLARIN ACA	French	This corpus contains Old French texts. The corpus is available for download from the IMS CLARIN-D repository. For the relevant publication, see Stein and Prévost (2013)	Download
GRUG Parallel Treebank Size: 10,400 sentence pairs Annotation: syntactic parsing, PoS tagging Licence: CC-BY	Georgian, Ukrainian, Russian, German	The corpus is syntactically parsed following the TIGER guidelines. The corpus is available for download from a dedicated website provided by the CLARIN-D consortium.	Download
B4 Heliand Size: 3495 tokens Annotation: PoS tagging, syntactic parsing Licence: CC-BY	German	This corpus contains historical German texts. The corpus is available for download from the HZSK repository.	Download
Dependency-Annotated Subset of the CREG Corpus Size: 109 sentences Annotation: PoS tagging, syntactic parsing Licence: CLARIN RES	German	This corpus consists of answers to reading comprehension questions written by American college students learning German. The corpus is available for download from the Tübingen CLARIN Repository.	Download
Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z) Size: 1.9 million tokens Annotation: syntactic parsing Licence: CLARIN RES	German	This corpus contains newspaper articles. The corpus is available for download from the Tübingen CLARIN Repository.	Download
Szeged Treebank 2.0 Size: 82,000 sentences Annotation: syntactic parsing Licence: licence agreement	Hungarian	This corpus is available for download from a dedicated webpage. For the relevant publication, see Csendes et al. (2005)	Download
Icelandic Parsed Historical Corpus (IcePaHC) Size: 1 million tokens Annotation: morphosyntactic tagging, lemmatisation, syntactic parsing Licence: GNU LGPL	Icelandic	This corpus contains Icelandic texts from the 12th through the 21st centuries, approximately 100,000 words from each century. The corpus is syntactically parsed following the UPenn scheme for historical texts. The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage. For the relevant publication, see Rögnvaldsson et al. (2012)	Download Concordancer
LVTB - Latvian Treebank Size: 289,791 tokens; 17,127 sentences Annotation: syntactic parsing Licence: CC BY-SA 4.0	Latvian	This treebank is manually annotated according to a hybrid dependency-constituency grammar. The treebank is available for download from the CLARIN-LV repository. For the relevant publication, see Rituma et al. (2023)	Download
Lithuanian Treebank ALKSNIS Size: 2,355 sentences Annotation: syntactic parsing Licence: CLARIN PUB	Lithuanian	The syntactic parsing follows the rules of the Prague Dependency Treebank. This corpus is available for download from the CLARIN-LT repository. The second version is available upon request.	Download
Polish Dependency Bank in Universal Dependency format Size: 22,000 trees, 351,000 tokens Annotation: syntactic parsing Licence: CC BY-NC-SA 4.0	Polish	This corpus also contains sentences showing certain problematic syntactic phenomena – sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema. The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request. For the relevant publication, see Wróblewska (2018)	Download
CINTIL DependencyBank Size: 110,000 tokens Annotation: morphosyntactic tagging and syntactic parsing Licence: MS-NC-No ReD-ND	Portuguese	This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository.	Download
CINTIL TreeBank Size: 110,000 tokens Annotation: syntactic parsing Licence: MS-NC-No ReD-ND	Portuguese	This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository.	Download
CINTIL-DeepBank Size: 110,000 tokens Annotation: PoS-tagging, syntactic parsing, grammatical functions, logical forms Licence: MS-NC-No ReD-ND	Portuguese	This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository.	Download
CINTIL-PropBank Size: 110,000 tokens Annotation: syntactic parsing and phrase semantic roles Licence: MS-NC-No ReD-ND	Portuguese	This corpus contains literary and newspaper texts. The corpus is available for download from the ELRA catalogue.	Download
Training corpus SETimes.SR 1.0 Size: 87,000 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition Licence: CC BY-SA 4.0	Serbian	This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Tamil Dependency Treebank v0.1 Size: 600 sentences Annotation: syntactic parsing and morphosyntactic tagging Licence: CC BY-NC-SA 3.0	Tamil	The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/). The corpus is available for download from the LINDAT repository.	Download
HamleDT 3.0 Size: 19 treebanks Annotation: syntactic parsing and morphosyntactic tagging Licence: HamleDT 3.0 Licence Terms	19 languages	This treebank collection is available for download from LINDAT. The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language: Arabic (KonText, PML-TQ) Bengali (KonText) Catalan (KonText) Czech (KonText, PML-TQ) Dutch (KonText, PML-TQ) English (KonText) Estonian (KonText, PML-TQ) German (KonText) Greek (KonText) Hindi (KonText) Latin (KonText, PML-TQ) Persian (KonText, PML-TQ) Polish (KonText, PML-TQ) Portuguese (KonText, PML-TQ) Romanian (KonText, PML-TQ) Russian (KonText) Slovenian (KonText, PML-TQ) Spanish (KonText) Tamil (KonText, PML-TQ) For the relevant publication, see Zeman et al. (2012)	Download
Treebanks of INESS Size: 532 treebanks Annotation: syntactic parsing Licence: CC-BY	71 languages	This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS). The corpora are available for online querying through INESS. For the relevant publication, see Rosén et al. (2012)
Universal Dependencies 2.12 Size: 30 million tokens; 30.6 million words; 1.8 million sentences Annotation: syntactic parsing Licence: Licence Universal Dependencies v2.12	75 languages	This corpus collection contains treebanks following the Universal Dependencies framework. The corpus collection is available for download from the LINDAT repository. The individual treebanks in Universal Dependencies 2.3 can also be queried through the concordancer KonText and the treebank query tool PML-TQ. Below we provide links to these search environments for all the treebanks. For a detailed description of the treebanks, see the Universal Dependencies project page. UD_Akkadian-PISANDUB (KonText) UD_Amharic-ATT (KonText, PML-TQ) UD_Armenian-ArmTDP (KonText, PML-TQ) UD_Breton-KEB (KonText, PML-TQ) UD_Buryat-BDT (KonText, PML-TQ) UD_Cantonese-HK (KonText, PML-TQ) UD_Chinese-HK (KonText, PML-TQ) UD_Chinese-CFL (KonText, PML-TQ) UD_Coptic-Scriptorium (KonText, PML-TQ) UD_Croatian-SET (KonText, PML-TQ) UD_English-ESL (KonText, PML-TQ) UD_Faroese-OFT (KonText, PML-TQ) UD_Galician-TreeGal (KonText, PML-TQ) UD_Hindi_English-HIENCS (KonText) UD_Kazakh-KTB 2.2 (KonText, PML-TQ) UD_Komi_Zyrian-Lattice (KonText, PML-TQ) UD_Komi_Zyrian-IKDP (KonText, PML-TQ) UD_Kurmanji-MG (KonText, PML-TQ) UD_Lithuanian-HSE (KonText, PML-TQ) UD_Maltese-MUDT (KonText, PML-TQ) UD_Marathi-UFAL (KonText, PML-TQ) UD_Naija-NSC (KonText, PML-TQ) UD_Persian-Seraji (KonText, PML-TQ) UD_Russian-Taiga (KonText, PML-TQ) UD_Sanskrit-UFAL (KonText, PML-TQ) UD_Serbian-SET (KonText, PML-TQ) UD_Slovenian-SST (KonText, PML-TQ) UD_Tagalog-TRG (KonText, PML-TQ) UD_Telugu-MTG (KonText, PML-TQ) UD_Ukrainian-IU (KonText, PML-TQ) UD_Upper_Sorbian-UFAL (KonText, PML-TQ) UD_Uyghur-UDT (KonText, PML-TQ) UD_Warlpiri-UFAL (KonText, PML-TQ) UD_Yoruba-YTB (KonText, PML-TQ) UD_Afrikaans-AfriBooms (KonText) UD_Ancient_Greek-PROIEL (KonText) UD_Ancient_Greek-Perseus (KonText, PML-TQ) UD_Arabic-PADT (KonText, PML-TQ) UD_Arabic-PUD (KonText, PML-TQ) UD_Arabic-NYUAD (KonText) UD_Bambara-CRB (KonText, PML-TQ) UD_Basque-BDT (KonText, PML-TQ) UD_Belarusian-HSE (KonText, PML-TQ) UD_Bulgarian-BTB (KonText, PML-TQ) UD_Catalan-AnCora (KonText, PML-TQ) UD_Chinese-GSD (KonText, PML-TQ) UD_Chinese-PUD (KonText, PML-TQ) UD_Czech-PDT (KonText, PML-TQ) UD_Czech-CAC (KonText, PML-TQ) UD_Czech-FicTree (KonText, PML-TQ) UD_Czech-PUD (KonText, PML-TQ) UD_Czech-CLTT (KonText, PML-TQ) UD_Danish-DDT (KonText, PML-TQ) UD_Dutch-Alpino (KonText, PML-TQ) UD_Dutch-LassySmall (KonText, PML-TQ) UD_English-ParTUT (KonText, PML-TQ) UD_English-GUM (KonText, PML-TQ) UD_English-EWT (KonText, PML-TQ) UD_English-PUD (KonText, PML-TQ) UD_English-LinES (KonText, PML-TQ) UD_Erzya-JR (KonText, PML-TQ) UD_Finnish-FTB (KonText, PML-TQ) UD_Finnish-TDT (KonText, PML-TQ) UD_Finnish-PUD (KonText, PML-TQ) UD_French-ParTUT (KonText, PML-TQ) UD_French-GSD (KonText, PML-TQ) UD_French-Sequoia (KonText, PML-TQ) UD_French-Spoken (KonText, PML-TQ) UD_French-PUD (KonText, PML-TQ) UD_French-FTB (KonText) UD_Galician-CTG (KonText, PML-TQ) UD_German-GSD (KonText, PML-TQ) UD_German-PUD (KonText, PML-TQ) UD_Gothic-PROIEL (KonText, PML-TQ) UD_Greek-GDT (KonText, PML-TQ) UD_Hebrew-HTB (KonText, PML-TQ) UD_Hindi-HDTB (KonText, PML-TQ) UD_Hindi-PUD (KonText, PML-TQ) UD_Hungarian-Szeged (KonText, PML-TQ) UD_Indonesian-GSD (KonText, PML-TQ) UD_Indonesian-PUD (KonText, PML-TQ) UD_Irish-IDT (KonText, PML-TQ) UD_Italian-ISDT (KonText, PML-TQ) UD_Italian-ParTUT (KonText, PML-TQ) UD_Italian-PUD (KonText, PML-TQ) UD_Japanese-GSD (KonText, PML-TQ) UD_Japanese-PUD (KonText, PML-TQ) UD_Japanese-Modern (KonText, PML-TQ) UD_Korean-Kaist (KonText, PML-TQ) UD_Korean-GSD (KonText, PML-TQ) UD_Korean-PUD (KonText, PML-TQ) UD_Latin-PROIEL (KonText, PML-TQ) UD_Latin-ITTB (KonText, PML-TQ) UD_Latin-Perseus (KonText, PML-TQ) UD_Latvian-LVTB (KonText, PML-TQ) UD_North_Sami-Giella (KonText, PML-TQ) UD_Norwegian-Bokmaal (KonText, PML-TQ) UD_Norwegian-Nynorsk (KonText, PML-TQ) UD_Norwegian-NynorskLIA (KonText, PML-TQ) UD_Old_Church_Slavonic-PROIEL (KonText, PML-TQ) UD_Old_French-SRCMF (KonText, PML-TQ) UD_Polish-LFG (KonText, PML-TQ) UD_Polish-SZ (KonText, PML-TQ) UD_Portuguese-Bosque (KonText, PML-TQ) UD_Portuguese-GSD (KonText, PML-TQ) UD_Portuguese-PUD (KonText, PML-TQ) UD_Romanian-RRT (KonText, PML-TQ) UD_Romanian-Nonstandard (KonText, PML-TQ) UD_Russian-GSD (KonText, PML-TQ) UD_Russian-PUD (KonText, PML-TQ) UD_Russian-SynTagRus (KonText, PML-TQ) UD_Slovak-SNK (KonText, PML-TQ) UD_Slovenian-SSJ (KonText, PML-TQ) UD_Spanish-AnCora (KonText, PML-TQ) UD_Spanish-GSD (KonText, PML-TQ) UD_Spanish-PUD (KonText, PML-TQ) UD_Swedish-Talbanken (KonText, PML-TQ) UD_Swedish-LinES (KonText, PML-TQ) UD_Swedish-PUD (KonText, PML-TQ) UD_Swedish_Sign_Language-SSLC (KonText, PML-TQ) UD_Tamil-TTB (KonText, PML-TQ) UD_Thai-PUD (KonText, PML-TQ) UD_Turkish-IMST (KonText, PML-TQ) UD_Turkish-PUD (KonText, PML-TQ) UD_Urdu-UDTB (KonText, PML-TQ) UD_Vietnamese-VTB (KonText, PML-TQ) For the relevant publication, see de Marneffe et al. (2021)	Download
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens Annotation: half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. Full corpus – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. Licence: CC BY-SA 4.0	Croatian	This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016)	Download
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities Licence: CC BY-SA 4.0	Serbian	This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017)	Download
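The treebank entries in the Universal Dependencies row above follow a regular pattern: `UD_<Language>-<Code>`, an optional version number, and the search tools in parentheses. As a minimal sketch for readers who want to turn that list into structured data, the Python snippet below parses such entries. The function name `parse_ud_entries` and the regular expression are illustrative assumptions, not part of any CLARIN or Universal Dependencies tooling.

```python
import re

# Assumed entry shape, taken from the list above:
#   UD_<Language>-<Code> [optional version] (<tool>[, <tool>])
ENTRY = re.compile(
    r"UD_(?P<language>[A-Za-z_]+)-(?P<code>[A-Za-z]+)"
    r"(?:\s+(?P<version>\d+(?:\.\d+)*))?"
    r"\s*\((?P<tools>[^)]*)\)"
)

def parse_ud_entries(text):
    """Return one dict per treebank entry found in the given text."""
    entries = []
    for match in ENTRY.finditer(text):
        tools = [t.strip() for t in match.group("tools").split(",") if t.strip()]
        entries.append({
            # Underscores in the identifier separate words of the language name.
            "language": match.group("language").replace("_", " "),
            "code": match.group("code"),
            "version": match.group("version"),  # None when no version is given
            "tools": tools,
        })
    return entries

sample = (
    "UD_Akkadian-PISANDUB (KonText) "
    "UD_Kazakh-KTB 2.2 (KonText, PML-TQ) "
    "UD_Komi_Zyrian-Lattice (KonText, PML-TQ)"
)
for entry in parse_ud_entries(sample):
    print(entry)
```

Entries that deviate from the assumed shape are silently skipped, so comparing the number of parsed entries against the number of `UD_` prefixes in the text is a cheap sanity check.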

Corpus

Language

Description

Availability

Prague Arabic Dependency Treebank 1.0

Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 3.0

Arabic

This corpus is available for download from the LINDAT repository.

For the relevant publication, see Hajič et al. (2004)

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences
Annotation: syntactic parsing, labelling of semantic entities
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

For the relevant publication, see Kríž and Hladká (2018)

Size: 12760 sentences
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains fictional texts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Jelínek (2017)

KonText

Download

Prague Dependency Treebank 3.5

Size: 2 million words
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 4.0

Czech

This corpus is manually annotated at several levels – aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.

The corpus is available for download from the LINDAT repository.

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences
Annotation: syntactic parsing, mark-up of discourse phenomena enriched by the annotation of secondary connectives
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Slovak Dependency Treebank

Size: 106,000 tokens, 10,600 sentences
Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Slovak

The syntactic parsing is modelled after the Prague Dependency Treebank.

The corpus is available for download from the LINDAT repository.

Download

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences
Annotation: syntactic parsing, mark-up of coreference
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only).

For the relevant publication, see Hajič et al. (2012)

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences
Annotation: syntactic parsing, mark-up of elliptical constructions
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Lassy Klein-corpus

Size: 1 million tokens
Annotation: PoS tagging, syntactic parsing
Licence: VAGUE

Dutch

This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL.

For the relevant publication, see Noord (2009)

Size: 1 million words
Annotation: PoS tagging, syntactic parsing, semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

Estonian Treebank

Size: 1,000 sentences
Annotation: syntactic parsing
Licence: CLARIN_ACA

Estonian

The corpus contains fictional and newspaper texts.

The corpus is available for download from META-SHARE (CELR distribution).

Download

UD Estonian ver.2.3

Size: 434,000 tokens
Annotation: syntactic parsing
Licence: CC-BY-SA

Estonian

This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from (CELR distribution).

For the relevant publication, see Muischnek et al. (2014)

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words
Annotation: morphosyntactic tagging and syntactic parsing
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

For the relevant publication, see Orasmaa (2014)

Download

Finnish TreeBank 1

Size: 160,000 tokens
Annotation: syntactic parsing
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Finnish TreeBank 2

Size: 160,000 tokens
Annotation: syntactic parsing
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Turku Dependency Treebank

Size: 204,000 tokens
Annotation: syntactic parsing
Licence: CC-BY-SA

Finnish

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the Turku BioNLP Group.

For the relevant publication, see Haverinen et al. (2013)

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 words
Annotation: syntactic parsing
Licence: CLARIN ACA

French

This corpus contains Old French texts.

The corpus is available for download from the IMS CLARIN-D repository.

For the relevant publication, see Stein and Prévost (2013)

Download

GRUG Parallel Treebank

Size: 10,400 sentence pairs
Annotation: syntactic parsing, PoS tagging
Licence: CC-BY

Georgian, Ukrainian, Russian, German

The corpus is syntactically parsed following the TIGER guidelines.

The corpus is available for download from a dedicated website provided by the CLARIN-D consortium.

Download

B4 Heliand

Size: 3495 tokens
Annotation: PoS tagging, syntactic parsing
Licence: CC-BY

German

This corpus contains historical German texts.

The corpus is available for download from the HZSK repository.

Download

Dependency-Annotated Subset of the CREG Corpus

Size: 109 sentences
Annotation: PoS tagging, syntactic parsing
Licence: CLARIN RES

German

This corpus consists of answers to reading comprehension questions written by American college students learning German.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z)

Size: 1.9 million tokens
Annotation: syntactic parsing
Licence: CLARIN RES

German

This corpus contains newspaper articles.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Szeged Treebank 2.0

Size: 82,000 sentences
Annotation: syntactic parsing
Licence: licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

For the relevant publication, see Csendes et al. (2005)

Download

Icelandic Parsed Historical Corpus (IcePaHC)

Size: 1 million tokens
Annotation: morphosyntactic tagging, lemmatisation, syntactic parsing
Licence: GNU LGPL

Icelandic

This corpus contains Icelandic texts from the 12th through the 21st centuries – approximately 100,000 words from each century. The corpus is syntactically parsed following the UPenn scheme for historical texts.

The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage.

For the relevant publication, see Rögnvaldsson et al. (2012)

Download

Concordancer

LVTB - Latvian Treebank

Size: 289,791 tokens; 17,127 sentences
Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Latvian

This treebank is manually annotated according to a hybrid dependency-constituency grammar.

The treebank is available for download from the CLARIN-LV repository.

For the relevant publication, see Rituma et al. (2023)

Download

Lithuanian Treebank ALKSNIS

Size: 2,355 sentences
Annotation: syntactic parsing
Licence: CLARIN PUB

Lithuanian

The syntactic parsing follows the rules of the Prague Dependency Treebank.

This corpus is available for download from the CLARIN-LT repository. The second version is available upon request.

Download

Polish Dependency Bank in Universal Dependency format

Size: 22,000 trees, 351,000 tokens
Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Polish

This corpus also contains sentences showing certain problematic syntactic phenomena – sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema.

The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request.

For the relevant publication, see Wróblewska (2018)

Download

CINTIL DependencyBank

Size: 110,000 tokens
Annotation: morphosyntactic tagging and syntactic parsing
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL TreeBank

Size: 110,000 tokens
Annotation: syntactic parsing
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-DeepBank

Size: 110,000 tokens
Annotation: PoS tagging, syntactic parsing, grammatical functions, logical forms
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-PropBank

Size: 110,000 tokens
Annotation: syntactic parsing and phrase semantic roles
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Tamil Dependency Treebank v0.1

Size: 600 sentences
Annotation: syntactic parsing and morphosyntactic tagging
Licence: CC BY-NC-SA 3.0

Tamil

The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/).

The corpus is available for download from the LINDAT repository.

Download

HamleDT 3.0

Size: 19 treebanks
Annotation: syntactic parsing and morphosyntactic tagging
Licence: HamleDT 3.0 Licence Terms

19 languages

This treebank collection is available for download from LINDAT.

The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language:

Arabic (KonText, PML-TQ)
Bengali (KonText)
Catalan (KonText)
Czech (KonText, PML-TQ)
Dutch (KonText, PML-TQ)
English (KonText)
Estonian (KonText, PML-TQ)
German (KonText)
Greek (KonText)
Hindi (KonText)
Latin (KonText, PML-TQ)
Persian (KonText, PML-TQ)
Polish (KonText, PML-TQ)
Portuguese (KonText, PML-TQ)
Romanian (KonText, PML-TQ)
Russian (KonText)
Slovenian (KonText, PML-TQ)
Spanish (KonText)
Tamil (KonText, PML-TQ)

For the relevant publication, see Zeman et al. (2012)

Download

Treebanks of INESS

Size: 532 treebanks
Annotation: syntactic parsing
Licence: CC-BY

71 languages

This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS).

The corpora are available for online querying through INESS.

For the relevant publication, see Rosén et al. (2012)


Named Entity Recognition

Corpus	Language	Description	Availability
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Croatian	This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
Training corpus hr500k 1.0 Size: 500,000 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of corpus also syntactically parsed Licence: CC BY-SA 4.0	Croatian	This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Czech Named Entity Corpus 1.1 Size: 5868 sentences, 35220 NEs Annotation: Named Entity recognition Licence: CC BY-NC-SA 3.0	Czech	This corpus is available for download from LINDAT. For the relevant publication, see Kravalová and Žabokrtský (2009)	Download
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens Annotation: PoS tagging, Named Entity recognition, sentiment analysis Licence: MIT License	German, Italian, Spanish	This corpus contains tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016)	Download
KPWr (Polish Corpus of Wrocław University of Technology) 1.2 Size: 447,000 tokens Annotation: chunks and selected predicate-argument relations, Named Entity recognition, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases Licence: CC BY-SA 3.0	Polish	This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.). The corpus is available for download from the CLARIN-PL repository.	Download
Polish Spatial Texts 1.0 Size: 46,000 tokens Annotation: Named Entity recognition (spatial expressions) Licence: CC BY-SA 4.0	Polish	This corpus contains travel blogs. The corpus is available for download from the CLARIN-PL repository.	Download
CINTIL-Corpus Internacional do Português Size: 1 million tokens Annotation: morphosyntactic tagging, Named Entity recognition Licence: CLARIN RES	Portuguese	The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.). The corpus is available for download from the CLARIN PORTULAN repository.	Download
Training corpus SETimes.SR 1.0 Size: 87,000 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition Licence: CC BY-SA 4.0	Serbian	This corpus contains posts from the Southeast European Times news portal, which is no longer being updated. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Batanović et al. (2018)	KonText noSketch Download
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY 4.0	Serbian	This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition Licence: CC BY-SA 4.0	Slovenian	This corpus contains computer-mediated communication (CMC). The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018)	KonText noSketch Download
Training corpus ssj500k 2.1 Size: 586,000 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles Licence: CC BY-NC-SA 4.0	Slovenian	This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. Licence: CC BY-SA 4.0	Croatian	This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016)	Download
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities Licence: CC BY-SA 4.0	Serbian	This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017)	Download
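Each row in the tables above has four tab-separated columns (Corpus, Language, Description, Availability), and the Corpus cell packs the corpus name together with its Size, Annotation and Licence fields. A minimal Python sketch of how such a row could be split into a record follows. The function `parse_row` and the resulting key names are hypothetical choices made for illustration, and the sketch assumes the field labels appear exactly as `Size:`, `Annotation:` and `Licence:`.

```python
import re

COLUMNS = ["corpus", "language", "description", "availability"]

# Splitting on the field labels with a capturing group keeps the labels
# in the result, so names and values can be paired up afterwards.
FIELD = re.compile(r"\b(Size|Annotation|Licence):\s*")

def parse_row(row):
    """Split one tab-separated table row into a flat dictionary."""
    cells = [cell.strip() for cell in row.split("\t")]
    record = dict(zip(COLUMNS, cells))
    parts = FIELD.split(record["corpus"])
    # parts looks like [name, label, value, label, value, ...]
    record["name"] = parts[0].strip()
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label.lower()] = value.strip()
    return record

row = (
    "Polish Spatial Texts 1.0 Size: 46,000 tokens "
    "Annotation: Named Entity recognition (spatial expressions) "
    "Licence: CC BY-SA 4.0\tPolish\tThis corpus contains travel blogs. "
    "The corpus is available for download from the CLARIN-PL repository.\tDownload"
)
record = parse_row(row)
print(record["name"], "|", record["size"], "|", record["licence"])
```

Rows that omit a field simply lack the corresponding key, so callers may prefer `record.get("size")` over direct indexing.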

Corpus

Language

Description

Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Croatian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

Training corpus hr500k 1.0

Size: 500,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of corpus also syntactically parsed
Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Named Entity Corpus 1.1

Size: 5868 sentences, 35220 NEs
Annotation: Named Entity recognition
Licence: CC BY-NC-SA 3.0

Czech

This corpus is available for download from LINDAT.

For the relevant publication, see Kravalová and Žabokrtský (2009)

Download

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens
Annotation: PoS tagging, Named Entity recognition, sentiment analysis
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens
Annotation: chunks and selected predicate-argument relations, Named Entity recognition, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Spatial Texts 1.0

Size: 46,000 tokens
Annotation: Named Entity recognition (spatial expressions)
Licence: CC BY-SA 4.0

Polish

This corpus contains travel blogs.

The corpus is available for download from the CLARIN-PL repository.

Download

CINTIL-Corpus Internacional do Português

Size: 1 million tokens
Annotation: morphosyntactic tagging, Named Entity recognition
Licence: CLARIN RES

Portuguese

The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.).

The corpus is available for download from the CLARIN PORTULAN repository.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is no longer updated.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018)

KonText

noSketch

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY 4.0

Serbian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC).

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus ssj500k 2.1

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Croatian linguistic training corpus hr500k 2.0

Croatian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Serbian linguistic training corpus SETimes.SR 2.0

Size: 97,673 tokens
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities
Licence: CC BY-SA 4.0

Serbian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Samardžić et al. (2017)

Download

Sentiment analysis

Corpus	Language	Description	Availability
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens Annotation: PoS tagging, Named Entity recognition, sentiment analysis Licence: MIT License	German, Italian, Spanish	This corpus contains Tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016)	Download
Twitter sentiment for 15 European languages Size: 1.6 million tweets Annotation: sentiment analysis Licence: CC BY-SA 4.0	Albanian, Bosnian, Bulgarian, Croatian, English, German, Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish	This corpus contains Tweet IDs with sentiment annotations. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Mozetič et al. (2016)	Download
Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0 Size: 407.5 million words Annotation: sentiment analysis (socially unacceptable discourse) Licence: CC BY-SA 4.0	Croatian	This corpus contains news comments from the website 24sata.hr. The corpus is available for download from CLARIN.SI.	Download
Aspect-Term Annotated Customer Reviews in Czech Size: 2200 reviews Annotation: sentiment analysis Licence: CC BY-NC-SA 3.0	Czech	This corpus contains online user-product reviews. The corpus is available for download from LINDAT.	Download
Facebook Data for Sentiment Analysis Size: 10,000 Facebook posts Annotation: sentiment analysis Licence: CC BY-SA 3.0	Czech	This corpus contains Facebook posts. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Habernal et al. (2013)	KonText Download
FinnSentiment 1.1 Size: 27,000 sentences Annotation: sentiment analysis Licence: CC BY	Finnish	This corpus contains sentences from Finnish social media that have been manually annotated for sentiment polarity by three native annotators. The corpus is available for download from META-SHARE (the Finnish Language Bank). For the relevant publication, see Lindén et al. (2023)	Download
NoReC: The Norwegian Review Corpus Size: 14.8 million tokens Annotation: sentiment analysis Licence: CC BY-NC 3.0	Norwegian	This corpus contains reviews in different domains (e.g., literature and video games). The corpus is available for download from the CLARINO repository. For the relevant publication, see Velldal et al. (2018)	Download
Manually sentiment annotated Slovenian news corpus SentiNews 1.0 Size: 10,427 articles Annotation: sentiment analysis Licence: CC BY-SA 4.0	Slovenian	This corpus contains news articles. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Bučar et al. (2018)	Download

Other annotation layers

Corpus	Language	Description	Availability
PARSEME corpora annotated for verbal multiword expressions (version 1.3) Size: 5.8 million tokens Annotation: identification of verbal multi-word expressions (idioms, light-verb constructions, verb-particle constructions, inherently reflexive verbs, multi-verb constructions) Licence: PARSEME Shared Task Data (v. 1.1) Agreement	Arabic, Basque, Bulgarian, Chinese, Croatian, Czech, English, French, German, Hebrew, Hindi, Hungarian, Irish, Italian, Lithuanian, Maltese, Modern Greek (1453-), Persian, Polish, Portuguese, Romanian, Serbian, Slovenian, Spanish, Swedish, Turkish	This multilingual resource contains corpora in which verbal multi-word expressions (MWEs) have been manually annotated. Verbal MWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). The 1.0 versions of the PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus: Parseme VMWE 1.0 – Czech, Parseme VMWE 1.0 – German, Parseme VMWE 1.0 – Greek, Parseme VMWE 1.0 – Spanish, Parseme VMWE 1.0 – Persian, Parseme VMWE 1.0 – French, Parseme VMWE 1.0 – Hungarian, Parseme VMWE 1.0 – Italian, Parseme VMWE 1.0 – Maltese, Parseme VMWE 1.0 – Polish, Parseme VMWE 1.0 – Portuguese, Parseme VMWE 1.0 – Romanian, Parseme VMWE 1.0 – Slovenian, Parseme VMWE 1.0 – Swedish, Parseme VMWE 1.0 – Turkish. For the relevant publication, see Savary et al. (2023)	Download
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens Annotation: a subset tagged for multi-word expressions and semantic roles Licence: CC BY-SA 4.0	Croatian	This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016)	Download
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens Annotation: word normalisation Licence: CC BY 4.0	Croatian	This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
Czech Legal Text Treebank 2.0 Size: 1121 sentences Annotation: semantic role labelling Licence: CC BY-NC-SA 4.0	Czech	This corpus contains legal texts. The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.	KonText PML-TQ Download
Prague Discourse Treebank 2.0 Size: 49,500 sentences Annotation: mark-up of discourse phenomena enriched by the annotation of secondary connectives Licence: CC-BY	Czech	This corpus is a subset of the Prague Dependency Treebank 3.5. The corpus is available through the PML-TQ tool.	PML-TQ
Prague Czech-English Dependency Treebank 2.0 Coref Size: 49,000 sentences Annotation: mark-up of coreference Licence: CC-BY-NC-SA + LDC99T42 (restricted use)	Czech, English	This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style. The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool.	KonText PML-TQ Download
Artificial Treebank with Ellipsis Size: 106,000 tokens, 10,604 sentences Annotation: mark-up of elliptical constructions Licence: Licence Universal dependencies v2.1	Czech, English, Finnish, Russian, Slovak	The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the LINDAT repository.	Download
Grundtvig's Works Corpus Size: 11,417,194 words Annotation: linked data (places, persons, Bible citations, etc.) Licence: CC BY-NC 4.0	Danish	This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig. The corpus is available for download from the CLARIN-DK repository.	Download
SoNaR-1 Size: 1 million words Annotation: semantic role labelling	Dutch	This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus. The corpus is available for download from the Dutch Language Institute.	Download
Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0 Size: 6,851 tokens Annotation: semantic role labelling, coreference, tokenisation, PoS-tagging, lemmatisation, syntactic dependencies, named entities Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED	English	This corpus can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences used in this dataset are taken from the following sources: John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002; Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009; Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.	Download
Speech, Thought and Writing Presentation Corpus Size: 260,000 words Annotation: identification of reported speech Licence: CC BY-NC-SA 3.0	English	This corpus contains literary, newspaper and biography texts. The corpus is available for download from the Oxford Text Archive.	Download
The ACL RD-TEX 2.0 Size: 33216 tokens Annotation: terminology extraction/classification Licence: CC BY-NC-SA 4.0	English	This corpus contains 6818 terms extracted from abstracts of computational linguistics papers. The corpus is available for download from LINDAT and through KonText. For the relevant publication, see QasemiZadeh and Schumann (2016)	KonText Download
Estonian Treebank annotated with coreference relations Size: 107,000 words Annotation: anaphora relations Licence: GPL	Estonian	This corpus contains newspaper texts plus one scientific medical text. The corpus is available for download from META-SHARE (CELR distribution).	Download
Semantically disambiguated corpus of Estonian Size: 375,733 tokens Annotation: word sense disambiguation Licence: CLARIN ACA	Estonian	The corpus is available for download from META-SHARE (CELR distribution).	Download
TimeML annotated corpus of Estonian newspaper articles Size: 22,000 words Annotation: temporal semantic annotations Licence: CC-BY-SA	Estonian	This corpus contains newspaper articles. The corpus is available for download from META-SHARE (CELR distribution).	Download
Greek Coreference Corpus Size: 62,988 tokens Annotation: coreference Licence: CC-BY-NC-SA	Greek	In addition to coreference, the corpus is annotated for identity and bridging relations. For the relevant publication, see Ogrodniczuk et al. (2015)	Download
Greek Textual Entailment Corpus Size: 600 sentence-pairs Annotation: logical entailment Licence: CC-BY	Greek	This corpus contains texts from the domains of politics, law and travel. This corpus is available for download from the clarin:el repository.	Download
KPWr (Polish Corpus of Wrocław University of Technology) 1.2 Size: 447,000 tokens Annotation: selected predicate-argument relations, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases Licence: CC BY-SA 3.0	Polish	This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.). The corpus is available for download from the CLARIN-PL repository.	Download
Polish Coreference Corpus Size: 540,000 tokens Annotation: coreference Licence: CC BY 3	Polish	This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.). The corpus is available for download and online browsing.	Concordancer Download
Polish Summaries Corpus Size: 10845 summaries Annotation: summarization Licence: CC BY 3	Polish	This corpus is available for download from the ZIL IPI PAN repository. For the relevant publication, see Ogrodniczuk and Kopeć (2014)	Download
WUT Relations Between Sentences Corpus Size: 5654 sentences Annotation: relations between sentences - Cross-document Structure Theory (CST) Licence: CC BY-SA 3.0	Polish	This corpus contains news items. The corpus is available for download from the CLARIN-PL repository.	Download
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens Annotation: word normalisation Licence: CC BY 4.0	Serbian	This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016)	Download
ASR database ARTUR 1.0 Size: 884 hours Annotation: orthographically transcribed speech Licence: CC BY-SA 4.0	Slovenian	This corpus was designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. The audio files are available in a separate repository entry. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool which was used for making the transcriptions. All transcriptions were made manually or manually corrected. The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural or the actual distribution of triphones in the words. They were distributed between 1,000 speakers, so that we recorded approx. 30 min in read form from each speaker. The speakers were balanced according to gender, age, region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file. (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file. (1d) Artur-B-Izloceno, 27 hours: The recordings include different types of errors, typically, incorrect reading of sentences or a noisy environment. (2) Artur-J, public speech, 62 hours in total. 
It includes: (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, education videos, etc. (3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for a face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for a smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics. (4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.	Download (transcriptions) Download (audio files)
CMC training corpus Janes-Norm 1.2 Size: 184,755 tokens Annotation: normalisation Licence: CC BY-SA 4.0	Slovenian	Part of this corpus is also manually annotated with MSD tags and lemmatised. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens Annotation: word normalisation Licence: CC BY-SA 4.0	Slovenian	This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018)	KonText noSketch Download
Corpus of comma placement Vejica 1.3 Size: 104,000 sentences Annotation: comma placement Licence: CC BY-NC-SA 4.0	Slovenian	This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica, Wikipedia). The corpus is available for download from CLARIN.SI.	Download
Slovenian Definition Extraction evaluation datasets RSDO-def 1.0 Size: 2,216 sentences Annotation: term definition evaluation Licence: CC BY-SA 4.0	Slovenian	This corpus contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1, which contains texts with annotated terms from four different domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus. The labels added to the sentences included in the dataset denote: 0: Non-definition; 1: Weak definition; 2: Definition. The dataset consists of two parts: 1. RSDO-def-random employed a random sampling strategy, with 14 definitions, 98 weak-definitions and 849 non-definitions; and 2. RSDO-def-larger added sentences to the random one by the pattern-based definition extraction as presented in Pollak et al. (2014). It contains 169 definitions, 214 weak-definitions and 872 non-definitions. Both parts were manually annotated by five terminographers. In case of discrepancies between annotators, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts. The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". Weak definition labels were assigned if the extracted sentences contained a term and at least one delimiting feature without a superordinate concept, or sentences consisting of superordinate concepts without delimiting features but with some typical examples. Instances were labeled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features. The corpus is available for download from the CLARIN.SI repository. 
For the relevant publication, see Tran et al. (2023) and Pollak et al. (2014)	Download
Slovenian Word in Context dataset SloWiC 1.0 Size: 14,958 items Annotation: word sense disambiguation Licence: CC BY-SA 4.0	Slovenian	The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows if both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task. Each example contains the following data fields: word: the target word with multiple meanings; sentence1: the first sentence containing the target word; sentence2: the second sentence containing the target word; idx: the index of the example in the dataset; label: label showing if the sentences contain the same meaning of the target word; start1: start of the target word in the first sentence; start2: start of the target word in the second sentence; end1: end of the target word in the first sentence; end2: end of the target word in the second sentence; version: the version of the annotation; manual_annotation: Boolean showing if the label was manually annotated; group: the group of annotators that labelled the example.	Download
Terminology identification dataset KAS-term 1.0 Size: 22,950 term candidates Annotation: monolingual term extraction Licence: CC BY-SA 4.0	Slovenian	This corpus contains term candidates from PhD theses in chemistry, computer science and political science. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Holozan (2018)	Download
Training corpus ssj500k 2.1 Size: 586,000 tokens Annotation: verbal multiword expression tagging, semantic role labelling Licence: CC BY-NC-SA 4.0	Slovenian	This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.	KonText noSketch Download
Bilingual terminology extraction dataset KAS-biterm 1.0 Size: 1,950 sentences, 78,500 tokens, 3,700 terms Annotation: bilingual term extraction Licence: CC BY-SA 4.0	Slovenian, English	This corpus contains PhD theses. The corpus is available for download from the CLARIN.SI repository.	Download
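
The SloWiC entry in the table above lists the data fields stored for each example. As a rough illustration only, the sketch below checks a single record against that field list and slices the target word out of both sentences. The sample record is invented and not taken from the dataset, the helper names check_record and target_spans are hypothetical, and the reading of start/end as character offsets with an exclusive end is an assumption that the description does not confirm.

```python
import json

# Field names as listed in the SloWiC description.
EXPECTED_FIELDS = {
    "word", "sentence1", "sentence2", "idx", "label",
    "start1", "start2", "end1", "end2",
    "version", "manual_annotation", "group",
}


def check_record(record):
    """Return the expected fields that are missing from one record (hypothetical helper)."""
    return EXPECTED_FIELDS - set(record)


def target_spans(record):
    """Slice the target word out of both sentences.

    Assumes start/end are character offsets with an exclusive end;
    the dataset description does not state this explicitly.
    """
    first = record["sentence1"][record["start1"]:record["end1"]]
    second = record["sentence2"][record["start2"]:record["end2"]]
    return first, second


# Invented sample record, for illustration only (all values are assumptions).
sample_line = json.dumps({
    "word": "list",
    "sentence1": "Na drevesu je ostal en sam list.",
    "sentence2": "Vzel je list papirja.",
    "idx": 0,
    "label": False,
    "start1": 27, "end1": 31,
    "start2": 8, "end2": 12,
    "version": "1.0",
    "manual_annotation": True,
    "group": "A",
})

record = json.loads(sample_line)
missing = check_record(record)   # empty set when every listed field is present
spans = target_spans(record)     # the target word as it appears in each sentence
```

A check of this kind can be run over every record before training, so that offset conventions or missing fields are noticed early rather than surfacing as silent label noise.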

Corpus

Language

Description

Availability

PARSEME corpora annotated for verbal multiword expressions (version 1.3)

Size: 5.8 million tokens
Annotation: identification of verbal multi-word expressions (idioms, light-verb constructions, verb-particle constructions, inherently reflexive verbs, multi-verb constructions)
Licence: PARSEME Shared Task Data (v. 1.1) Agreement

Arabic, Basque, Bulgarian, Chinese, Croatian, Czech, English, French, German, Hebrew, Hindi, Hungarian, Irish, Italian, Lithuanian, Maltese, Modern Greek (1453-), Persian, Polish, Portuguese, Romanian, Serbian, Slovenian, Spanish, Swedish, Turkish

This multilingual resource contains corpora in which verbal multi-word expressions (MWEs) have been manually annotated. Verbal MWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).

The 1.0 versions of the PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus:

For the relevant publication, see Savary et al. (2023)

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens
Annotation: a subset tagged for multi-word expressions and semantic roles
Licence: CC BY-SA 4.0

Croatian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens
Annotation: word normalisation
Licence: CC BY 4.0

Croatian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences
Annotation: semantic role labelling
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

KonText

PML-TQ

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences
Annotation: mark-up of discourse phenomena enriched by the annotation of secondary connectives
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences
Annotation: mark-up of coreference
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool.

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences
Annotation: mark-up of elliptical constructions
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Grundtvig's Works Corpus

Size: 11,417,194 words
Annotation: linked data (places, persons, Bible citations, etc.)
Licence: CC BY-NC 4.0

Danish

This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR-1

Size: 1 million words
Annotation: semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0

Size: 6,851 tokens
Annotation: semantic role labelling, coreference, tokenisation, PoS-tagging, lemmatisation, syntactic dependencies, named entities
Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED

English

This corpus can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences used in this dataset are taken from the following sources:

John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002.
Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.

Download

Speech, Thought and Writing Presentation Corpus

Size: 260,000 words
Annotation: identification of reported speech
Licence: CC BY-NC-SA 3.0

English

This corpus contains literary, newspaper and biography texts.

The corpus is available for download from the Oxford Text Archive.

Download

The ACL RD-TEX 2.0

Size: 33216 tokens
Annotation: terminology extraction/classification
Licence: CC BY-NC-SA 4.0

English

This corpus contains 6818 terms extracted from abstracts of computational linguistics papers.

The corpus is available for download from LINDAT and through KonText.

For the relevant publication, see QasemiZadeh and Schumann (2016)

KonText

Download

Estonian Treebank annotated with coreference relations

Size: 107,000 words
Annotation: anaphora relations
Licence: GPL

Estonian

This corpus contains newspaper texts plus one scientific medical text.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Semantically disambiguated corpus of Estonian

Size: 375,733 tokens
Annotation: word sense disambiguation
Licence: CLARIN ACA

Estonian

The corpus is available for download from META-SHARE (CELR distribution).

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words
Annotation: temporal semantic annotations
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Greek Coreference Corpus

Size: 62,988 tokens
Annotation: coreference
Licence: CC-BY-NC-SA

Greek

In addition to coreference, the corpus is annotated for identity and bridging relations.

For the relevant publication, see Ogrodniczuk et al. (2015)

Download

Greek Textual Entailment Corpus

Size: 600 sentence-pairs
Annotation: logical entailment
Licence: CC-BY

Greek

This corpus contains texts from the domains of politics, law and travel.

This corpus is available for download from the clarin:el repository.

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens
Annotation: selected predicate-argument relations, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Coreference Corpus

Size: 540,000 tokens
Annotation: coreference
Licence: CC BY 3

Polish

This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.).

The corpus is available for download and online browsing.

Concordancer

Download

Polish Summaries Corpus

Size: 10845 summaries
Annotation: summarization
Licence: CC BY 3

Polish

This corpus is available for download from the ZIL IPI PAN repository.

For the relevant publication, see Ogrodniczuk and Kopeć (2014)

Download

WUT Relations Between Sentences Corpus

Size: 5654 sentences
Annotation: relations between sentences - Cross-document Structure Theory (CST)
Licence: CC BY-SA 3.0

Polish

This corpus contains news items.

The corpus is available for download from the CLARIN-PL repository.

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens
Annotation: word normalisation
Licence: CC BY 4.0

Serbian

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

ASR database ARTUR 1.0

Size: 884 hours
Annotation: orthographically transcribed speech
Licence: CC BY-SA 4.0

Slovenian

This corpus was designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only.

The audio files are available in a separate repository entry. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool which was used for making the transcriptions. All transcriptions were made manually or manually corrected.
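As a rough illustration of working with TRS transcriptions, the sketch below reads speaker turns from a Transcriber-style XML string. It assumes the usual Transcriber layout (`Turn` elements with `speaker`, `startTime` and `endTime` attributes, and `Sync` markers inside each turn); the sample content, the speaker IDs and the helper name `read_turns` are invented for this example and are not taken from the ARTUR files, whose exact structure should be checked against the repository entry.

```python
import xml.etree.ElementTree as ET

# Invented sample in a Transcriber-like layout; not real ARTUR data.
SAMPLE_TRS = """<Trans version="1" audio_filename="example">
  <Episode>
    <Section type="report" startTime="0" endTime="5.2">
      <Turn speaker="spk1" startTime="0" endTime="3.1">
        <Sync time="0"/>
        dober dan
        <Sync time="1.4"/>
        kako ste
      </Turn>
      <Turn speaker="spk2" startTime="3.1" endTime="5.2">
        <Sync time="3.1"/>
        hvala, dobro
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_turns(trs_text):
    """Return one dict per Turn: speaker, start, end (seconds) and text."""
    root = ET.fromstring(trs_text)
    turns = []
    for turn in root.iter("Turn"):
        # itertext() yields the text between Sync markers; collapse whitespace.
        text = " ".join("".join(turn.itertext()).split())
        turns.append({
            "speaker": turn.get("speaker"),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "text": text,
        })
    return turns


turns = read_turns(SAMPLE_TRS)
# Total transcribed duration in seconds, summed over turns.
total_seconds = sum(t["end"] - t["start"] for t in turns)
```

For a real file, `ET.parse(path).getroot()` could replace `ET.fromstring`, provided the file's declared encoding is one the parser supports.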

The data are structured as follows:

(1) Artur-B, read speech, 573 hours in total.

It includes:
(1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural distribution of triphones in the words. They were distributed among 1,000 speakers, so that approx. 30 min of read speech was recorded from each speaker. The speakers were balanced by gender, age and region, and a small proportion were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file.
(1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations, personal names and surnames, chosen so that all Slovene letters were covered, plus the most common foreign letters.
(1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file.
(1d) Artur-B-Izloceno, 27 hours: The recordings include different types of errors, typically incorrect reading of sentences or a noisy environment.

(2) Artur-J, public speech, 62 hours in total.

It includes:
(2a) Artur-J-Splosni, 62 hours: Media recordings, online recordings of conferences, workshops, educational videos, etc.

(3) Artur-N, private speech, 74 hours in total.

It includes:
(3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for domain-specific speech recognition of face descriptions.
(3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to freely phrase instructions for a potential smart-home system. Designed for domain-specific speech recognition for smart homes.
(3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of creating the Artur database. Speakers were asked to converse or talk freely about casual topics.

(4) Artur-P, parliamentary speech, 201 hours in total.

It includes:
(4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.

Download (transcriptions)

Download (audio files)

CMC training corpus Janes-Norm 1.2

Size: 184,755 tokens
Annotation: word normalisation
Licence: CC BY-SA 4.0

Slovenian

Part of this corpus is also manually annotated with MSD tags and lemmas.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens
Annotation: word normalisation
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Corpus of comma placement Vejica 1.3

Size: 104,000 sentences
Annotation: comma placement
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica, Wikipedia).

The corpus is available for download from the CLARIN.SI repository.

Download

Slovenian Definition Extraction evaluation datasets RSDO-def 1.0

Size: 2,216 sentences
Annotation: term definition evaluation
Licence: CC BY-SA 4.0

Slovenian

This corpus contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1, which contains texts with annotated terms from four different domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus. The labels added to the sentences included in the dataset denote: 0: Non-definition; 1: Weak definition; 2: Definition.

The dataset consists of two parts. (1) RSDO-def-random was built by random sampling and contains 14 definitions, 98 weak definitions and 849 non-definitions. (2) RSDO-def-larger adds to the random sample sentences found by the pattern-based definition extraction presented in Pollak et al. (2014), and contains 169 definitions, 214 weak definitions and 872 non-definitions. Both parts were manually annotated by five terminographers. Where annotators disagreed, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed from both parts.
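The label counts given above can be tallied in a few lines. The following sketch uses only the figures stated in this description; the dictionary layout and the helper names `part_size` and `label_share` are illustrative choices, not part of the dataset. Summing the two parts reproduces the stated size of 2,216 sentences.

```python
# Label scheme as described above: 0, 1, 2.
LABEL_NAMES = {0: "Non-definition", 1: "Weak definition", 2: "Definition"}

# Counts per label for each part, copied from the dataset description.
COUNTS = {
    "RSDO-def-random": {2: 14, 1: 98, 0: 849},
    "RSDO-def-larger": {2: 169, 1: 214, 0: 872},
}


def part_size(counts):
    """Number of sentences in one part."""
    return sum(counts.values())


def label_share(counts, label):
    """Fraction of sentences in one part that carry the given label."""
    return counts[label] / part_size(counts)


total = sum(part_size(c) for c in COUNTS.values())

for name, counts in COUNTS.items():
    shares = ", ".join(
        f"{LABEL_NAMES[label]}: {label_share(counts, label):.1%}"
        for label in (2, 1, 0)
    )
    print(f"{name} ({part_size(counts)} sentences): {shares}")
```

Full definitions make up roughly 1.5% of the random part and roughly 13.5% of the larger part, so the two parts differ considerably in class balance.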

The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". The Weak definition label was assigned if the extracted sentence contained a term and at least one delimiting feature without a superordinate concept, or if it contained a superordinate concept without delimiting features but with some typical examples. Instances were labelled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publications, see Tran et al. (2023) and Pollak et al. (2014)

Download

Slovenian Word in Context dataset SloWiC 1.0

Size: 14,958 items
Annotation: word sense disambiguation
Licence: CC BY-SA 4.0

Slovenian

The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the target word in the same meaning. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON, following the format used in the SuperGLUE version of the Word in Context task.

Each example contains the following data fields:

word: The target word with multiple meanings
sentence1: The first sentence containing the target word
sentence2: The second sentence containing the target word
idx: The index of the example in the dataset
label: Label showing if the sentences contain the same meaning of the target word
start1: Start of the target word in the first sentence
start2: Start of the target word in the second sentence
end1: End of the target word in the first sentence
end2: End of the target word in the second sentence
version: The version of the annotation
manual_annotation: Boolean showing if the label was manually annotated
group: The group of annotators that labelled the example
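The fields listed above could be used roughly as follows. This is a minimal sketch under several assumptions that the description does not confirm: that records are stored one JSON object per line (as in the SuperGLUE WiC files), that `start`/`end` are character offsets with an exclusive end, and that `label` and `manual_annotation` are JSON booleans. The two records, the `version` and `group` values, and the helper names are made up for illustration and are not real SloWiC data.

```python
import json

# Two invented records that mimic the documented fields.
SAMPLE_JSONL = """\
{"word": "list", "sentence1": "Na mizi je list papirja.", "sentence2": "Z drevesa je padel list.", "idx": 0, "label": false, "start1": 11, "end1": 15, "start2": 19, "end2": 23, "version": "1.0", "manual_annotation": true, "group": "A"}
{"word": "miza", "sentence1": "Miza je lesena.", "sentence2": "Miza stoji v kuhinji.", "idx": 1, "label": true, "start1": 0, "end1": 4, "start2": 0, "end2": 4, "version": "1.0", "manual_annotation": false, "group": "auto"}
"""


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line (assumed JSON Lines layout)."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def target_spans(record):
    """Slice the target word out of both sentences using the offset fields."""
    first = record["sentence1"][record["start1"]:record["end1"]]
    second = record["sentence2"][record["start2"]:record["end2"]]
    return first, second


records = load_records(SAMPLE_JSONL)

# Keep only the manually annotated pairs, e.g. for evaluation.
manual = [r for r in records if r["manual_annotation"]]
```

Separating the manually annotated pairs from the automatically annotated ones in this way would allow the former to serve as a test set while the latter are used only for training.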

Download

Terminology identification dataset KAS-term 1.0

Size: 22,950 term candidates
Annotation: monolingual term extraction
Licence: CC BY-SA 4.0

Slovenian

This corpus contains term candidates from PhD theses in chemistry, computer science and political science.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Holozan (2018)

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens
Annotation: verbal multiword expression tagging, semantic role labelling
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Bilingual terminology extraction dataset KAS-biterm 1.0

Size: 1,950 sentences, 78,500 tokens, 3,700 terms
Annotation: bilingual term extraction
Licence: CC BY-SA 4.0

Slovenian, English

This corpus contains PhD theses.

The corpus is available for download from the CLARIN.SI repository.

Download

Publications

[Batanović et al. 2018] Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić. 2018. SETimes.SR – A Reference Training Corpus of Serbian.

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Csendes et al. 2005] Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec et al. 2010] Tomaž Erjavec, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene.

[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.

[Habernal et al. 2013] Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013. Sentiment Analysis in Czech Social Media Using Supervised Machine Learning.

[Hajič et al. 2004] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools.

[Hajič et al. 2012] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0.

[Haverinen et al. 2014] Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank.

[Holozan 2018] Peter Holozan. 2018. Corpus of comma placement Vejica 1.3.

[Kravalová and Žabokrtský 2009] Jana Kravalová and Zdeněk Žabokrtský. 2009. Czech Named Entity Corpus and SVM-based Recognizer.

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0.

[Miličević and Ljubešić 2016] Maja Miličević and Nikola Ljubešić. 2016. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets.

[Mozetič et al. 2016] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

[Muischnek et al. 2014] Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme.

[van Noord 2009] Gertjan van Noord. 2009. Huge Parsed Corpora in LASSY.

[Jelínek 2017] Tomáš Jelínek. 2017. FicTree: a Manually Annotated Treebank of Czech Fiction.

[Ogrodniczuk and Kopeć 2014] Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus.

[Ogrodniczuk et al. 2015] Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference: Annotation, Resolution and Evaluation in Polish.

[Orasmaa 2014] Siim Orasmaa. 2014. Towards an Integration of Syntactic and Temporal Annotations in Estonian.

[Przepiórkowski and Murzynowski 2011] Adam Przepiórkowski and Grzegorz Murzynowski. 2011. Manual annotation of the National Corpus of Polish with Anotatornia.

[QasemiZadeh and Schumann 2016] Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods.

[Rei et al. 2016] Luis Rei, Dunja Mladenić, and Simon Krek. 2016. A Multilingual Social Media Linguistic Corpus.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, and Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Rögnvaldsson et al. 2012] Eiríkur Rögnvaldsson, Anton Karl Ingason, Einar Freyr Sigurðsson and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC).

[Rosén et al. 2012] Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik. 2012. An Open Infrastructure for Advanced Treebanking.

[Stein and Prévost 2013] Achim Stein and Sophie Prévost. 2013. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF).

[Velldal et al. 2018] Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, and Fredrik Jørgensen. 2018. NoReC: The Norwegian Review Corpus.

[Wróblewska 2018] Alina Wróblewska. 2018. Extended and enhanced Polish dependency bank in Universal Dependencies format.

[Zeman et al. 2012] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To Parse or Not to Parse?