Manually Annotated Corpora

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses and named entities. These corpora can be used to train new language annotation tools and to test the accuracy of existing ones.
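As a rough illustration of the testing use case, the sketch below scores a tagger's output against gold-standard tags token by token. The function name `tagging_accuracy` and the five-token example are hypothetical and not taken from any corpus listed here; real evaluations usually also report per-tag results.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented gold-standard tags and tagger output for a five-token sentence.
gold_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted_tags = ["DET", "NOUN", "VERB", "DET", "ADJ"]

accuracy = tagging_accuracy(gold_tags, predicted_tags)
print(f"Token-level accuracy: {accuracy:.2f}")
```

The comparison assumes that the tool and the manually annotated corpus share the same tokenisation and tagset; if they differ, the tags have to be aligned or mapped first.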

There are more than 70 manually annotated training corpora and corpus collections in the CLARIN infrastructure. Among the multilingual corpora, there are 4 collections in the CLARIN infrastructure that were annotated under the following umbrella initiatives: HamleDT 3.0, Treebanks of INESS, Universal Dependencies, and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1).  

The corpora and corpus collections are classified into six categories based on the type of manual annotation:

If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, Named Entities and sentiment, so it is listed under all three relevant sections.
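The listing rule can be sketched as a simple index from annotation type to corpus names. This is only an illustration: the section labels and the second entry are hypothetical, and only the three annotation layers of the xLiMe corpus come from the text above.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative catalogue: corpus name -> manually annotated layers.
corpora: Dict[str, List[str]] = {
    "xLiMe Twitter Corpus XTC 1.0.1": ["PoS tagging", "Named Entities", "Sentiment"],
    "Hypothetical single-layer corpus": ["PoS tagging"],
}


def sections_index(catalogue: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """List every corpus under each annotation type it carries."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name, layers in catalogue.items():
        for layer in layers:
            index[layer].append(name)
    return dict(index)


index = sections_index(corpora)
for section, names in index.items():
    print(section, "->", names)
```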

For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

The Manually Annotated Corpora

PoS MSD tagging

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words 
Annotation: morphosyntactic tagging, lemmatisation, sentence alignment 
Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download

The Morphologically Annotated Part of BulTreeBank

Size: 214,000 tokens 
Annotation: morphosyntactic tagging 
Licence: MS-NC-NoReD

Bulgarian

This corpus is available through the concordancer Corpuscle.

Concordancer

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Croatian

This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated with automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

BNC Sampler

Size: 2 million tokens 
Annotation: PoS tagging 
Licence: BNC Licence

English

The corpus was manually post-edited to correct the PoS tags automatically assigned by CLAWS.

The corpus is available for online querying via CQPWeb (registration required) and for download from the Oxford Text Archive.

Concordancer

Download

Corpus of morphologically disambiguated Estonian texts

Size: 513,000 tokens 
Annotation: morphological disambiguation 
Licence: CLARIN_ACA-NC

Estonian

This corpus contains texts from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990.

Download

Austrian Baroque Corpus

Size: 200,000 tokens 
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatization, a normalized basic word form was used for each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm serving as reference works. The part-of-speech tagging and lemmatization were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016)

Concordancer

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens 
Annotation: PoS tagging, Named Entity recognition, sentiment analysis 
Licence: MIT License

German, Italian, Spanish

This corpus contains tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Szeged Corpus 2.0

Size: 1.5 million tokens 
Annotation: morphosyntactic tagging 
Licence: Licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

To download the versions of the Szeged Corpus and Szeged Treebank, you must fill in and send a Licence Agreement.

Download

Lithuanian morphologically annotated corpus - MATAS

Size: 1.6 million words 
Annotation: morphosyntactic tagging 
Licence: CLARIN ACA

Lithuanian

The corpus contains texts from various domains (documents, fiction, periodicals, scientific texts, wordforms).

This corpus is available for download from the CLARIN-LT repository.

Download

NKJP1M

Size: 1 million tokens 
Annotation: morphosyntactic tagging 
Licence: GNU GPL 3

Polish

This corpus is a manually annotated subset of the National Corpus of Polish.

The corpus is available for download from the Computational Linguistics in Poland website.

For the relevant publication, see Przepiórkowski and Murzynowski (2011)

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens 
Annotation: morphosyntactic tagging, tokenisation, sentence segmentation, word normalisation, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Serbian

This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated with automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words 
Annotation: morphosyntactic tagging and lemmatisation 
Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010)

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles 
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. 
Licence: CC BY-SA 4.0

Croatian

This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Serbian linguistic training corpus SETimes.SR 2.0

Size: 97,673 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities 
Licence: CC BY-SA 4.0

Serbian

This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Samardžić et al. (2017)

Download

Lemmatisation

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 80,000 sentences, 1 million words 
Annotation: morphosyntactic tagging, lemmatisation, sentence alignment 
Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012)

Download

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Croatian

This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated with automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

Austrian Baroque Corpus

Size: 200,000 tokens 
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each individual token was automatically assigned to a morphosyntactic word class using the TreeTagger software. As a classification system, the 54-part Stuttgart-Tübingen TagSet (STTS) was used. For lemmatization, a normalized basic word form was used for each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm serving as reference works. The part-of-speech tagging and lemmatization were then manually checked.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016)

Concordancer

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Serbian

This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated with automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition 
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018).

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus jos1M 1.1

Size: 1 million words 
Annotation: morphosyntactic tagging and lemmatisation 
Licence: CC BY-NC 4.0

Slovenian

This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec et al. (2010).

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles 
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. 
Licence: CC BY-SA 4.0

Croatian

This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Serbian linguistic training corpus SETimes.SR 2.0

Size: 97,673 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities 
Licence: CC BY-SA 4.0

Serbian

This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Samardžić et al. (2017)

Download

Syntactic parsing

Corpus Language Description Availability

Prague Arabic Dependency Treebank 1.0

Annotation: syntactic parsing and morphosyntactic tagging 
Licence: CC BY-NC-SA 3.0

Arabic

This corpus is available for download from the LINDAT repository.

For the relevant publication, see Hajič et al. (2004)

Download

Czech Legal Text Treebank 2.0

Size: 1,121 sentences
Annotation: syntactic parsing, labelling of semantic entities 
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

For the relevant publication, see Kríž and Hladká (2018)

KonText

PML-TQ

Download

FicTree 1.0

Size: 12,760 sentences
Annotation: syntactic parsing and morphosyntactic tagging 
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains fictional texts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Jelínek (2017)

KonText

Download

Prague Dependency Treebank 3.5

Size: 2 million words 
Annotation: syntactic parsing and morphosyntactic tagging 
Licence: CC BY-NC-SA 4.0

Czech

This corpus is manually annotated at several levels – aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations.

The corpus is available for download from the LINDAT repository.

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences 
Annotation: syntactic parsing, mark-up of discourse phenomena enriched by the annotation of secondary connectives 
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Slovak Dependency Treebank

Size: 106,000 tokens, 10,600 sentences 
Annotation: syntactic parsing 
Licence: CC BY-SA 4.0

Slovak

The syntactic parsing is modelled after the Prague Dependency Treebank.

The corpus is available for download from the LINDAT repository.

Download

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences 
Annotation: syntactic parsing, mark-up of coreference 
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only).

For the relevant publication, see Hajič et al. (2012)

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences 
Annotation: syntactic parsing, mark-up of elliptical constructions 
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Lassy Klein-corpus

Size: 1 million tokens 
Annotation: PoS tagging, syntactic parsing 
Licence: VAGUE

Dutch

This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL.

For the relevant publication, see Noord (2009)

Download

PaQu

GrETEL

SoNaR-1

Size: 1 million words 
Annotation: PoS tagging, syntactic parsing, semantic role labelling

Dutch

This is a manually annotated subset of the much larger SoNaR corpus (approx. 500 million words).

The corpus is available for download from the Dutch Language Institute.

Download

Estonian Treebank

Size: 1,000 sentences 
Annotation: syntactic parsing 
Licence: CLARIN_ACA

Estonian

The corpus contains fictional and newspaper texts.

The corpus is available for download from META-SHARE (CELR distribution).

Download

UD Estonian ver.2.3

Size: 434,000 tokens 
Annotation: syntactic parsing 
Licence: CC-BY-SA

Estonian

This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from (CELR distribution).

For the relevant publication, see Muischnek et al. (2014)

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words 
Annotation: morphosyntactic tagging and syntactic parsing 
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

For the relevant publication, see Orasmaa (2014)

Download

Finnish TreeBank 1

Size: 160,000 tokens 
Annotation: syntactic parsing 
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Finnish TreeBank 2

Size: 160,000 tokens 
Annotation: syntactic parsing 
Licence: CC-BY 3.0

Finnish

This corpus contains 19,000 sentences from the Large Grammar of Finnish.

The corpus is available for download from the Language Bank of Finland.

Download

Turku Dependency Treebank

Size: 204,000 tokens 
Annotation: syntactic parsing 
Licence: CC-BY-SA

Finnish

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the Turku BioNLP Group.

For the relevant publication, see Haverinen et al. (2013)

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 words 
Annotation: syntactic parsing 
Licence: CLARIN ACA

French

This corpus contains Old French texts.

The corpus is available for download from the IMS CLARIN-D repository.

For the relevant publication, see Stein and Prévost (2013)

Download

GRUG Parallel Treebank

Size: 10,400 sentence pairs 
Annotation: syntactic parsing, PoS tagging 
Licence: CC-BY

Georgian, Ukrainian, Russian, German

The corpus is syntactically parsed following the TIGER guidelines.

The corpus is available for download from a dedicated website provided by the CLARIN-D consortium.

Download

B4 Heliand

Size: 3,495 tokens
Annotation: PoS tagging, syntactic parsing 
Licence: CC-BY

German

This corpus contains historical German texts.

The corpus is available for download from the HZSK repository.

Download

Dependency-Annotated Subset of the CREG Corpus

Size: 109 sentences 
Annotation: PoS tagging, syntactic parsing 
Licence: CLARIN RES

German

This corpus consists of answers to reading comprehension questions written by American college students learning German.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z)

Size: 1.9 million tokens 
Annotation: syntactic parsing 
Licence: CLARIN RES

German

This corpus contains newspaper articles.

The corpus is available for download from the Tübingen CLARIN Repository.

Download

Szeged Treebank 2.0

Size: 82,000 sentences 
Annotation: syntactic parsing 
Licence: licence agreement

Hungarian

This corpus is available for download from a dedicated webpage.

For the relevant publication, see Csendes et al. (2005)

Download

Icelandic Parsed Historical Corpus (IcePaHC)

Size: 1 million tokens 
Annotation: morphosyntactic tagging, lemmatisation, syntactic parsing 
Licence: GNU LGPL

Icelandic

This corpus contains Icelandic texts from the 12th through the 21st centuries – approximately 100,000 words from each century. The corpus is syntactically parsed following the UPenn scheme for historical texts.

The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage.

For the relevant publication, see Rögnvaldsson et al. (2012)

Download

Concordancer

LVTB - Latvian Treebank

Size: 289,791 tokens; 17,127 sentences 
Annotation: syntactic parsing 
Licence: CC BY-SA 4.0

Latvian

This treebank is manually annotated according to a hybrid dependency-constituency grammar.

The treebank is available for download from the CLARIN-LV repository.

For the relevant publication, see Rituma et al. (2023)

Download

Lithuanian Treebank ALKSNIS

Size: 2,355 sentences 
Annotation: syntactic parsing 
Licence: CLARIN PUB

Lithuanian

The syntactic parsing follows the rules of the Prague Dependency Treebank.

This corpus is available for download from the CLARIN-LT repository. The second version is available upon request.

Download

Polish Dependency Bank in Universal Dependency format

Size: 22,000 trees, 351,000 tokens 
Annotation: syntactic parsing 
Licence: CC BY-NC-SA 4.0

Polish

This corpus also contains sentences showing certain problematic syntactic phenomena – sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema.

The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request.

For the relevant publication, see Wróblewska (2018)

Download

CINTIL DependencyBank

Size: 110,000 tokens 
Annotation: morphosyntactic tagging and syntactic parsing 
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL TreeBank

Size: 110,000 tokens 
Annotation: syntactic parsing 
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-DeepBank

Size: 110,000 tokens 
Annotation: PoS-tagging, syntactic parsing, grammatical functions, logical forms 
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the PORTULAN CLARIN repository.

Download

CINTIL-PropBank

Size: 110,000 tokens 
Annotation: syntactic parsing and phrase semantic roles 
Licence: MS-NC-No ReD-ND

Portuguese

This corpus contains literary and newspaper texts.

The corpus is available for download from the ELRA catalogue.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition 
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Tamil Dependency Treebank v0.1

Size: 600 sentences 
Annotation: syntactic parsing and morphosyntactic tagging 
Licence: CC BY-NC-SA 3.0

Tamil

The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/).

The corpus is available for download from the LINDAT repository.

Download

HamleDT 3.0

Size: 19 treebanks 
Annotation: syntactic parsing and morphosyntactic tagging 
Licence: HamleDT 3.0 Licence Terms

19 languages

This treebank collection is available for download from LINDAT.

The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language:

 

  1. Arabic (KonText, PML-TQ)
  2. Bengali (KonText)
  3. Catalan (KonText)
  4. Czech (KonText, PML-TQ)
  5. Dutch (KonText, PML-TQ)
  6. English (KonText)
  7. Estonian (KonText, PML-TQ)
  8. German (KonText)
  9. Greek (KonText)
  10. Hindi (KonText)
  11. Latin (KonText, PML-TQ)
  12. Persian (KonText, PML-TQ)
  13. Polish (KonText, PML-TQ)
  14. Portuguese (KonText, PML-TQ)
  15. Romanian (KonText, PML-TQ)
  16. Russian (KonText)
  17. Slovenian (KonText, PML-TQ)
  18. Spanish (KonText)
  19. Tamil (KonText, PML-TQ)

 

For the relevant publication, see Zeman et al. (2012)

Download

Treebanks of INESS

Size: 532 treebanks 
Annotation: syntactic parsing 
Licence: CC-BY

71 languages

This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS).

The corpora are available for online querying through INESS.

For the relevant publication, see Rosén et al. (2012)

 

Universal Dependencies 2.12

Size: 30 million tokens; 30.6 million words; 1.8 million sentences 
Annotation: syntactic parsing 
Licence: Licence Universal Dependencies v2.12

75 languages

This corpus collection contains treebanks following the Universal Dependencies framework.

The corpus collection is available for download from the LINDAT repository.
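Universal Dependencies treebanks are distributed in the tab-separated CoNLL-U format, with one token per line and ten columns. The sketch below shows one way to read such data after download. The reader function `read_conllu` and the three-token sample sentence are illustrative assumptions, not part of the collection, and dedicated CoNLL-U libraries handle more edge cases.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, columns))
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample sentence, not taken from any treebank.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
edges = [(tok["form"], tok["deprel"], tok["head"]) for tok in sentences[0]]
print(edges)
```

Each tuple pairs a word form with its dependency relation and the ID of its head, where head `0` marks the root of the sentence.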

The individual treebanks in Universal Dependencies 2.3 can also be queried through the concordancer KonText and the treebank query tool PML-TQ. Below we provide links to these search environments for all the treebanks. For a detailed description of the treebanks, see the Universal Dependencies project page.

 

  1. UD_Akkadian-PISANDUB (KonText)
  2. UD_Amharic-ATT (KonText, PML-TQ)
  3. UD_Armenian-ArmTDP (KonText, PML-TQ)
  4. UD_Breton-KEB (KonText, PML-TQ)
  5. UD_Buryat-BDT (KonText, PML-TQ)
  6. UD_Cantonese-HK (KonText, PML-TQ)
  7. UD_Chinese-HK (KonText, PML-TQ)
  8. UD_Chinese-CFL (KonText, PML-TQ)
  9. UD_Coptic-Scriptorium (KonText, PML-TQ)
  10. UD_Croatian-SET (KonText, PML-TQ)
  11. UD_English-ESL (KonText, PML-TQ)
  12. UD_Faroese-OFT (KonText, PML-TQ)
  13. UD_Galician-TreeGal (KonText, PML-TQ)
  14. UD_Hindi_English-HIENCS (KonText)
  15. UD_Kazakh-KTB 2.2 (KonText, PML-TQ)
  16. UD_Komi_Zyrian-Lattice (KonText, PML-TQ)
  17. UD_Komi_Zyrian-IKDP (KonText, PML-TQ)
  18. UD_Kurmanji-MG (KonText, PML-TQ)
  19. UD_Lithuanian-HSE (KonText, PML-TQ)
  20. UD_Maltese-MUDT (KonText, PML-TQ)
  21. UD_Marathi-UFAL (KonText, PML-TQ)
  22. UD_Naija-NSC (KonText, PML-TQ)
  23. UD_Persian-Seraji (KonText, PML-TQ)
  24. UD_Russian-Taiga (KonText, PML-TQ)
  25. UD_Sanskrit-UFAL (KonText, PML-TQ)
  26. UD_Serbian-SET (KonText, PML-TQ)  
  27. UD_Slovenian-SST (KonText, PML-TQ)
  28. UD_Tagalog-TRG (KonText, PML-TQ)
  29. UD_Telugu-MTG (KonText, PML-TQ)
  30. UD_Ukrainian-IU (KonText, PML-TQ)
  31. UD_Upper_Sorbian-UFAL (KonText, PML-TQ)
  32. UD_Uyghur-UDT (KonText, PML-TQ)
  33. UD_Warlpiri-UFAL (KonText, PML-TQ)
  34. UD_Yoruba-YTB (KonText, PML-TQ)
  35. UD_Afrikaans-AfriBooms (KonText)
  36. UD_Ancient_Greek-PROIEL (KonText)
  37. UD_Ancient_Greek-Perseus (KonText, PML-TQ)
  38. UD_Arabic-PADT (KonText, PML-TQ)
  39. UD_Arabic-PUD (KonText, PML-TQ)
  40. UD_Arabic-NYUAD (KonText)
  41. UD_Bambara-CRB (KonText, PML-TQ)
  42. UD_Basque-BDT (KonText, PML-TQ)
  43. UD_Belarusian-HSE  (KonText, PML-TQ)
  44. UD_Bulgarian-BTB (KonText, PML-TQ)
  45. UD_Catalan-AnCora (KonText, PML-TQ)
  46. UD_Chinese-GSD (KonText, PML-TQ)
  47. UD_Chinese-PUD (KonText, PML-TQ)
  48. UD_Czech-PDT  (KonText, PML-TQ)
  49. UD_Czech-CAC  (KonText, PML-TQ)
  50. UD_Czech-FicTree (KonText, PML-TQ)
  51. UD_Czech-PUD (KonText, PML-TQ)
  52. UD_Czech-CLTT (KonText, PML-TQ)
  53. UD_Danish-DDT (KonText, PML-TQ)
  54. UD_Dutch-Alpino (KonText, PML-TQ)
  55. UD_Dutch-LassySmall (KonText, PML-TQ)
  56. UD_English-ParTUT (KonText, PML-TQ)
  57. UD_English-GUM (KonText, PML-TQ)
  58. UD_English-EWT (KonText, PML-TQ)
  59. UD_English-PUD (KonText, PML-TQ)
  60. UD_English-LinES (KonText, PML-TQ)
  61. UD_Erzya-JR (KonText, PML-TQ)
  62. UD_Finnish-FTB (KonText, PML-TQ)
  63. UD_Finnish-TDT (KonText, PML-TQ)
  64. UD_Finnish-PUD (KonText, PML-TQ)
  65. UD_French-ParTUT (KonText, PML-TQ)
  66. UD_French-GSD (KonText, PML-TQ)
  67. UD_French-Sequoia (KonText, PML-TQ)
  68. UD_French-Spoken (KonText, PML-TQ)
  69. UD_French-PUD (KonText, PML-TQ)
  70. UD_French-FTB (KonText)
  71. UD_Galician-CTG (KonText, PML-TQ)
  72. UD_German-GSD  (KonText, PML-TQ)
  73. UD_German-PUD (KonText, PML-TQ)
  74. UD_Gothic-PROIEL (KonText, PML-TQ)
  75. UD_Greek-GDT (KonText, PML-TQ)
  76. UD_Hebrew-HTB (KonText, PML-TQ)
  77. UD_Hindi-HDTB (KonText, PML-TQ)
  78. UD_Hindi-PUD (KonText, PML-TQ)
  79. UD_Hungarian-Szeged (KonText, PML-TQ)
  80. UD_Indonesian-GSD (KonText, PML-TQ)
  81. UD_Indonesian-PUD  (KonText, PML-TQ)
  82. UD_Irish-IDT  (KonText, PML-TQ)
  83. UD_Italian-ISDT (KonText, PML-TQ)
  84. UD_Italian-ParTUT (KonText, PML-TQ)
  85. UD_Italian-PUD (KonText, PML-TQ)
  86. UD_Japanese-GSD (KonText, PML-TQ)
  87. UD_Japanese-PUD (KonText, PML-TQ)
  88. UD_Japanese-Modern (KonText, PML-TQ)
  89. UD_Korean-Kaist (KonText, PML-TQ)
  90. UD_Korean-GSD (KonText, PML-TQ)
  91. UD_Korean-PUD (KonText, PML-TQ)
  92. UD_Latin-PROIEL (KonText, PML-TQ)
  93. UD_Latin-ITTB (KonText, PML-TQ)
  94. UD_Latin-Perseus (KonText, PML-TQ)
  95. UD_Latvian-LVTB (KonText, PML-TQ)
  96. UD_North_Sami-Giella (KonText, PML-TQ)
  97. UD_Norwegian-Bokmaal (KonText, PML-TQ)
  98. UD_Norwegian-Nynorsk (KonText, PML-TQ)
  99. UD_Norwegian-NynorskLIA (KonText, PML-TQ)
  100. UD_Old_Church_Slavonic-PROIEL (KonText, PML-TQ)
  101. UD_Old_French-SRCMF (KonText, PML-TQ)
  102. UD_Polish-LFG (KonText, PML-TQ)
  103. UD_Polish-SZ (KonText, PML-TQ)
  104. UD_Portuguese-Bosque (KonText, PML-TQ)
  105. UD_Portuguese-GSD (KonText, PML-TQ)
  106. UD_Portuguese-PUD (KonText, PML-TQ)
  107. UD_Romanian-RRT (KonText, PML-TQ)
  108. UD_Romanian-Nonstandard (KonText, PML-TQ)
  109. UD_Russian-GSD (KonText, PML-TQ)
  110. UD_Russian-PUD (KonText, PML-TQ)
  111. UD_Russian-SynTagRus (KonText, PML-TQ)
  112. UD_Slovak-SNK (KonText, PML-TQ)
  113. UD_Slovenian-SSJ (KonText, PML-TQ)
  114. UD_Spanish-AnCora (KonText, PML-TQ)
  115. UD_Spanish-GSD (KonText, PML-TQ)
  116. UD_Spanish-PUD (KonText, PML-TQ)
  117. UD_Swedish-Talbanken (KonText, PML-TQ)
  118. UD_Swedish-LinES (KonText, PML-TQ)
  119. UD_Swedish-PUD (KonText, PML-TQ)
  120. UD_Swedish_Sign_Language-SSLC (KonText, PML-TQ)
  121. UD_Tamil-TTB (KonText, PML-TQ)
  122. UD_Thai-PUD (KonText, PML-TQ)
  123. UD_Turkish-IMST (KonText, PML-TQ)
  124. UD_Turkish-PUD (KonText, PML-TQ)
  125. UD_Urdu-UDTB (KonText, PML-TQ)
  126. UD_Vietnamese-VTB (KonText, PML-TQ)

 

For the relevant publication, see de Marneffe et al. (2021)

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens 
Annotation: half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. Full corpus – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. 
Licence: CC BY-SA 4.0

Croatian

This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Serbian linguistic training corpus SETimes.SR 2.0

Size: 97,673 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities 
Licence: CC BY-SA 4.0

Serbian

This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Samardžić et al. (2017)

Download

Named Entity Recognition

Corpus Language Description Availability

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Croatian

This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

Training corpus hr500k 1.0

Size: 500,000 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and Named Entity recognition. Half of corpus also syntactically parsed 
Licence: CC BY-SA 4.0

Croatian

This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Czech Named Entity Corpus 1.1

Size: 5868 sentences, 35220 NEs 
Annotation: Named Entity recognition 
Licence: CC BY-NC-SA 3.0

Czech

This corpus is available for download from LINDAT.

For the relevant publication, see Kravalová and Žabokrtský (2009)

Download

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens 
Annotation: PoS tagging, Named Entity recognition, sentiment analysis 
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens 
Annotation: chunks and selected predicate-argument relations, Named Entity recognition, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases 
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Spatial Texts 1.0

Size: 46,000 tokens 
Annotation: Named Entity recognition (spatial expressions) 
Licence: CC BY-SA 4.0

Polish

This corpus contains travel blogs.

The corpus is available for download from the CLARIN-PL repository.

Download

CINTIL-Corpus Internacional do Português

Size: 1 million tokens 
Annotation: morphosyntactic tagging, Named Entity recognition 
Licence: CLARIN RES

Portuguese

The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.).

The corpus is available for download from the CLARIN PORTULAN repository.

Download

Training corpus SETimes.SR 1.0

Size: 87,000 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic parsing, and Named Entity recognition 
Licence: CC BY-SA 4.0

Serbian

This corpus contains posts from the Southeast European Times news portal, which is now no longer being updated.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Batanović et al. (2018)

KonText

noSketch

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY 4.0

Serbian

This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016).

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens 
Annotation: tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and Named Entity recognition 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC).

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. Half of the corpus – syntactic parsing, Named Entity recognition, and verbal multiword expression tagging. Quarter of corpus: semantic roles 
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens 
Annotation: fully – tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, named entities. Half of the corpus – syntactic parsing, a subset also for multi-word expressions. Fifth of the corpus: semantic roles. 
Licence: CC BY-SA 4.0

Croatian

This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Serbian linguistic training corpus SETimes.SR 2.0

Size: 97,673 tokens 
Annotation: tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, named entities 
Licence: CC BY-SA 4.0

Serbian

This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Samardžić et al. (2017)

Download

Sentiment analysis

Corpus Language Description Availability

xLiMe Twitter Corpus XTC 1.0.1

Size: 364,000 tokens 
Annotation: PoS tagging, Named Entity recognition, sentiment analysis 
Licence: MIT License

German, Italian, Spanish

This corpus contains Tweets.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Rei et al. (2016)

Download

Twitter sentiment for 15 European languages

Size: 1.6 million tweets 
Annotation: sentiment analysis 
Licence: CC BY-SA 4.0

Albanian, Bosnian, Bulgarian, Croatian, English, German, Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish

This corpus contains Tweet IDs with sentiment annotations.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Mozetič et al. (2016)

Download

Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0

Size: 407.5 million words 
Annotation: sentiment analysis (socially unacceptable discourse) 
Licence: CC BY-SA 4.0

Croatian

This corpus contains news comments from the website 24sata.hr.

The corpus is available for download from CLARIN.SI.

Download

Aspect-Term Annotated Customer Reviews in Czech

Size: 2200 reviews 
Annotation: sentiment analysis 
Licence: CC BY-NC-SA 3.0

Czech

This corpus contains online user-product reviews.

The corpus is available for download from LINDAT.

Download

Facebook Data for Sentiment Analysis

Size: 10,000 Facebook posts 
Annotation: sentiment analysis 
Licence: CC BY-SA 3.0

Czech

This corpus contains Facebook posts.

The corpus is available for download from LINDAT and through the concordancer KonText.

For the relevant publication, see Habernal et al. (2013)

KonText

Download

FinnSentiment 1.1

Size: 27,000 sentences 
Annotation: sentiment analysis 
Licence: CC BY

Finnish

This corpus contains sentences from Finnish social media that have been manually annotated for sentiment polarity by three native annotators.

The corpus is available for download from META-SHARE (the Finnish Language Bank).

For the relevant publication, see Lindén et al. (2023)

Download

NoReC: The Norwegian Review Corpus

Size: 14.8 million tokens 
Annotation: sentiment analysis 
Licence: CC BY-NC 3.0

Norwegian

This corpus contains reviews in different domains (e.g., literature, videogames, etc.).

The corpus is available for download from the CLARINO repository.

For the relevant publication, see Velldal et al. (2018)

Download

Manually sentiment annotated Slovenian news corpus SentiNews 1.0

Size: 10,427 articles 
Annotation: sentiment analysis 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains news articles.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Bučar et al. (2018)

Download

Other annotation layers

Corpus Language Description Availability

PARSEME corpora annotated for verbal multiword expressions (version 1.3)

Size: 5.8 million tokens 
Annotation: identification of verbal multi-word expressions (idioms, light-verb constructions, verb-particle constructions, inherently reflexive verbs, multi-verb constructions) 
Licence: PARSEME Shared Task Data (v. 1.1) Agreement

Arabic, Basque, Bulgarian, Chinese, Croatian, Czech, English, French, German, Hebrew, Hindi, Hungarian, Irish, Italian, Lithuanian, Maltese, Modern Greek (1453-), Persian, Polish, Portuguese, Romanian, Serbian, Slovenian, Spanish, Swedish, Turkish

This multilingual resource contains corpora in which verbal multi-word expressions (MWEs) have been manually annotated. Verbal MWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).

The 1.0 versions of the PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus:

 

  1. Parseme VMWE 1.0 – Czech
  2. Parseme VMWE 1.0 – German
  3. Parseme VMWE 1.0 – Greek
  4. Parseme VMWE 1.0 – Spanish
  5. Parseme VMWE 1.0 – Persian
  6. Parseme VMWE 1.0 – French
  7. Parseme VMWE 1.0 – Hungarian
  8. Parseme VMWE 1.0 – Italian
  9. Parseme VMWE 1.0 – Maltese
  10. Parseme VMWE 1.0 – Polish
  11. Parseme VMWE 1.0 – Portuguese
  12. Parseme VMWE 1.0 – Romanian
  13. Parseme VMWE 1.0 – Slovenian
  14. Parseme VMWE 1.0 – Swedish
  15. Parseme VMWE 1.0 – Turkish

 

For the relevant publication, see Savary et al. (2023)

Download

Croatian linguistic training corpus hr500k 2.0

Size: 499,635 tokens 
Annotation: a subset tagged for multi-word expressions and semantic roles 
Licence: CC BY-SA 4.0

Croatian

This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels.

The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2016)

Download

Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Size: 89,855 tokens 
Annotation: word normalisation 
Licence: CC BY 4.0

Croatian

This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

Czech Legal Text Treebank 2.0

Size: 1121 sentences 
Annotation: semantic role labelling 
Licence: CC BY-NC-SA 4.0

Czech

This corpus contains legal texts.

The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository.

KonText

PML-TQ

Download

Prague Discourse Treebank 2.0

Size: 49,500 sentences 
Annotation: mark-up of discourse phenomena enriched by the annotation of secondary connectives 
Licence: CC-BY

Czech

This corpus is a subset of the Prague Dependency Treebank 3.5.

The corpus is available through the PML-TQ tool.

PML-TQ

Prague Czech-English Dependency Treebank 2.0 Coref

Size: 49,000 sentences 
Annotation: mark-up of coreference 
Licence: CC-BY-NC-SA + LDC99T42 (restricted use)

Czech, English

This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style.

The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool.

KonText

PML-TQ

Download

Artificial Treebank with Ellipsis

Size: 106,000 tokens, 10,604 sentences 
Annotation: mark-up of elliptical constructions 
Licence: Licence Universal dependencies v2.1

Czech, English, Finnish, Russian, Slovak

The syntactic parsing follows the Universal Dependencies schema.

The corpus is available for download from the LINDAT repository.

Download

Grundtvig's Works Corpus 

Size: 11,417,194 words 
Annotation: linked data (places, persons, bible citations, etc.) 
Licence: CC BY-NC 4.0

Danish

This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig.

The corpus is available for download from the CLARIN-DK repository.

Download

SoNaR-1

Size: 1 million words 
Annotation: semantic role labelling

Dutch

This is a manually annotated subset of the much larger (approx. 500-million-word) SoNaR corpus.

The corpus is available for download from the Dutch Language Institute.

Download

Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0

Size: 6,851 tokens 
Annotation: semantic role labelling, coreference, tokenisation, PoS-tagging, lemmatisation, syntactic dependencies, named entities 
Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED

English

This corpus can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences used in this dataset are taken from the following sources:

  • John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002.
  • Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
  • Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.

Download

Speech, Thought and Writing Presentation Corpus

Size: 260,000 words 
Annotation: identification of reported speech 
Licence: CC BY-NC-SA 3.0

English

This corpus contains literary, newspaper and biography texts.

The corpus is available for download from the Oxford Text Archive.

Download

The ACL RD-TEX 2.0

Size: 33216 tokens 
Annotation: terminology extraction/classification 
Licence: CC BY-NC-SA 4.0

English

This corpus contains 6818 terms extracted from abstracts of computational linguistics papers.

The corpus is available for download from LINDAT and through KonText.

For the relevant publication, see QasemiZadeh and Schumann (2016)

KonText

Download

Estonian Treebank annotated with coreference relations

Size: 107,000 words 
Annotation: anaphora relations 
Licence: GPL

Estonian

This corpus contains newspaper texts plus one scientific medical text.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Semantically disambiguated corpus of Estonian

Size: 375,733 tokens 
Annotation: word sense disambiguation 
Licence: CLARIN ACA

Estonian

The corpus is available for download from META-SHARE (CELR distribution).

Download

TimeML annotated corpus of Estonian newspaper articles

Size: 22,000 words 
Annotation: temporal semantic annotations 
Licence: CC-BY-SA

Estonian

This corpus contains newspaper articles.

The corpus is available for download from META-SHARE (CELR distribution).

Download

Greek Coreference Corpus

Size: 62,988 tokens 
Annotation: coreference 
Licence: CC-BY-NC-SA

Greek

In addition to coreference, the corpus is annotated for identity and bridging relations.

For the relevant publication, see Ogrodniczuk et al. (2015)

Download

Greek Textual Entailment Corpus

Size: 600 sentence-pairs 
Annotation: logical entailment 
Licence: CC-BY

Greek

This corpus contains texts from the domains of politics, law and travel.

This corpus is available for download from the clarin:el repository.

Download

KPWr (Polish Corpus of Wrocław University of Technology) 1.2

Size: 447,000 tokens 
Annotation: selected predicate-argument relations, relations between named entities, anaphora relations, word senses, events, temporal expressions, spatial relations between entities, keywords and semantic roles within nominal and adjective phrases 
Licence: CC BY-SA 3.0

Polish

This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.).

The corpus is available for download from the CLARIN-PL repository.

Download

Polish Coreference Corpus

Size: 540,000 tokens 
Annotation: coreference 
Licence: CC BY 3

Polish

This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.).

The corpus is available for download and online browsing.

Concordancer

Download

Polish Summaries Corpus

Size: 10845 summaries 
Annotation: summarization 
Licence: CC BY 3

Polish

This corpus is available for download from the ZIL IPI PAN repository.

For the relevant publication, see Ogrodniczuk and Kopeć (2014)

Download

WUT Relations Between Sentences Corpus

Size: 5654 sentences 
Annotation: relations between sentences - Cross-document Structure Theory (CST) 
Licence: CC BY-SA 3.0

Polish

This corpus contains news items.

The corpus is available for download from the CLARIN-PL repository.

Download

Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Size: 92,271 tokens 
Annotation: word normalisation 
Licence: CC BY 4.0

Serbian

This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Miličević and Ljubešić (2016)

Download

ASR database ARTUR 1.0

Size: 884 hours 
Annotation: orthographically transcribed speech 
Licence: CC BY-SA 4.0

Slovenian

This corpus was designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only.

The audio files are available in a separate repository entry. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool which was used for making the transcriptions. All transcriptions were made manually or manually corrected.
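
TRS files are XML, so the transcriptions can be read with a standard XML parser. The sketch below assumes the usual Transcriber layout, in which Turn elements carry speaker, startTime and endTime attributes and the transcribed text follows Sync markers. The sample document, its Slovenian snippets and the helper read_turns are invented for illustration; the actual ARTUR files may contain further elements (for example event or background tags) that this sketch does not interpret.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TRS document (not taken from ARTUR).
SAMPLE_TRS = """<Trans audio_filename="example" version="1">
  <Episode>
    <Section type="report" startTime="0" endTime="7.5">
      <Turn speaker="spk1" startTime="0" endTime="4.0">
        <Sync time="0"/>
        dober dan
        <Sync time="2.0"/>
        kako ste
      </Turn>
      <Turn speaker="spk2" startTime="4.0" endTime="7.5">
        <Sync time="4.0"/>
        hvala dobro
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_turns(trs_xml):
    """Extract speaker, timing and text for every Turn element."""
    root = ET.fromstring(trs_xml)
    turns = []
    for turn in root.iter("Turn"):
        # Transcribed text sits in the Turn's own text and in the tail
        # of its child elements such as <Sync/>.
        pieces = [turn.text or ""] + [child.tail or "" for child in turn]
        text = " ".join(" ".join(pieces).split())  # normalise whitespace
        turns.append({
            "speaker": turn.get("speaker"),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "text": text,
        })
    return turns


turns = read_turns(SAMPLE_TRS)
total_seconds = sum(t["end"] - t["start"] for t in turns)
for t in turns:
    print(t["speaker"], t["start"], t["end"], t["text"])
print("transcribed seconds:", total_seconds)
```

Summing turn durations in this way is one possible route to checking how much transcribed speech a set of files contains.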

The data are structured as follows:

  1. Artur-B, read speech, 573 hours in total.

     

    It includes: (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural or the actual distribution of triphones in the words. They were distributed between 1,000 speakers, so that we recorded approx. 30 min in read form from each speaker. The speakers were balanced according to gender, age, region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file. (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file. (1d) Artur-B-Izloceno, 27 hours: The recordings include different types of errors, typically, incorrect reading of sentences or a noisy environment.

  2. Artur-J, public speech, 62 hours in total.

     

    It includes: (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, education videos, etc.

  3. Artur-N, private speech, 74 hours in total.

     

    It includes: (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to freely formulate instructions for a potential smart-home system. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics.

  4. Artur-P, parliamentary speech, 201 hours in total.

     

    It includes: (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.

Download (transcriptions)

Download (audio files)

CMC training corpus Janes-Norm 1.2

Size: 184,755 tokens 
Annotation: normalization 
Licence: CC BY-SA 4.0

Slovenian

This corpus is partially also manually annotated with MSD tags and lemmatized.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

CMC training corpus Janes-Tag 2.0

Size: 75,000 tokens 
Annotation: word normalisation 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

For the relevant publication, see Fišer et al. (2018)

KonText

noSketch

Download

Corpus of comma placement Vejica 1.3

Size: 104,000 sentences 
Annotation: comma placement 
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica, Wikipedia).

The corpus is available for download from CLARIN.SI.

Download

Slovenian Definition Extraction evaluation datasets RSDO-def 1.0

Size: 2,216 sentences 
Annotation: term definition evaluation 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1, which contains texts with annotated terms from four different domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus. The labels added to the sentences included in the dataset denote: 0: Non-definition; 1: Weak definition; 2: Definition.

The dataset consists of two parts. (1) RSDO-def-random was built by random sampling and contains 14 definitions, 98 weak definitions and 849 non-definitions. (2) RSDO-def-larger extends the random sample with sentences found by the pattern-based definition extraction presented in Pollak et al. (2014); it contains 169 definitions, 214 weak definitions and 872 non-definitions. Both parts were manually annotated by five terminographers. Where annotators disagreed, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts.
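The label scheme and the per-part counts given above can be cross-checked with a few lines of Python. This is only a sketch built from the numbers in this description: the dictionary layout and variable names are illustrative and do not reflect the file format of the dataset. Whether the larger part repeats sentences from the random part is not stated here, so the total is a count of rows across both parts.

```python
# Label ids and names as given in the corpus description.
LABELS = {0: "Non-definition", 1: "Weak definition", 2: "Definition"}

# Per-part label counts as stated in the description (layout is illustrative).
PARTS = {
    "RSDO-def-random": {2: 14, 1: 98, 0: 849},
    "RSDO-def-larger": {2: 169, 1: 214, 0: 872},
}

# Number of sentences in each part, and across both parts.
part_sizes = {name: sum(counts.values()) for name, counts in PARTS.items()}
total = sum(part_sizes.values())

for name, counts in PARTS.items():
    for label_id in sorted(counts):
        share = counts[label_id] / part_sizes[name]
        print(f"{name}: {LABELS[label_id]} = {counts[label_id]} ({share:.1%})")

print(part_sizes)  # {'RSDO-def-random': 961, 'RSDO-def-larger': 1255}
print(total)       # 2216, the size listed for the corpus
```

The shares make the class imbalance visible: full definitions are a small minority in the randomly sampled part and noticeably more frequent in the pattern-based part.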

The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". Weak definition labels were assigned if the extracted sentences contained a term and at least one delimiting feature without a superordinate concept, or sentences consisting of superordinate concepts without delimiting features but with some typical examples. Instances were labeled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publications, see Tran et al. (2023) and Pollak et al. (2014)

Download

Slovenian Word in Context dataset SloWiC 1.0

Size: 14,958 items 
Annotation: word sense disambiguation 
Licence: CC BY-SA 4.0

Slovenian

SloWiC is a Slovenian dataset for the Word in Context task. Each example contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON, following the format used in the SuperGLUE version of the Word in Context task.

Each example contains the following data fields:

  • word: The target word with multiple meanings
  • sentence1: The first sentence containing the target word
  • sentence2: The second sentence containing the target word
  • idx: The index of the example in the dataset
  • label: Label showing if the sentences contain the same meaning of the target word
  • start1: Start of the target word in the first sentence
  • start2: Start of the target word in the second sentence
  • end1: End of the target word in the first sentence
  • end2: End of the target word in the second sentence
  • version: The version of the annotation
  • manual_annotation: Boolean showing if the label was manually annotated
  • group: The group of annotators that labelled the example
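A minimal Python sketch of how records with these fields might be read. The field names come from the list above; everything else is an assumption made for illustration: that the file holds one JSON object per line (as in the SuperGLUE release), that end offsets are exclusive character positions, and the value types chosen for `label`, `version` and `group`. The sample record is invented and is not taken from the dataset.

```python
import json
from dataclasses import dataclass


@dataclass
class WiCExample:
    """One Word in Context example; value types are assumptions."""
    word: str
    sentence1: str
    sentence2: str
    idx: int
    label: bool
    start1: int
    start2: int
    end1: int
    end2: int
    version: str
    manual_annotation: bool
    group: str

    def spans(self):
        """Cut the target word out of both sentences (end offsets exclusive)."""
        return (self.sentence1[self.start1:self.end1],
                self.sentence2[self.start2:self.end2])


def parse_examples(lines):
    """Parse an iterable of JSON lines into WiCExample objects."""
    return [WiCExample(**json.loads(line)) for line in lines if line.strip()]


# Invented record for illustration only; the values are hypothetical.
sample_line = json.dumps({
    "word": "list",
    "sentence1": "Z drevesa je padel list.",
    "sentence2": "Na mizi je bil list papirja.",
    "idx": 0,
    "label": False,
    "start1": 19, "end1": 23,
    "start2": 15, "end2": 19,
    "version": "1.0",
    "manual_annotation": True,
    "group": "A",
})

examples = parse_examples([sample_line])

# Keep only the manually annotated pairs, e.g. for evaluation.
manual = [ex for ex in examples if ex.manual_annotation]

print(examples[0].spans())  # ('list', 'list')
print(len(manual))          # 1
```

Filtering on `manual_annotation` is one way to separate the manually labelled pairs from the automatically labelled ones, for instance to evaluate only on the former.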

 

Download

Terminology identification dataset KAS-term 1.0

Size: 22,950 term candidates 
Annotation: monolingual term extraction 
Licence: CC BY-SA 4.0

Slovenian

This corpus contains term candidates from PhD theses in chemistry, computer science and political science.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Holozan (2018)

Download

Training corpus ssj500k 2.1

Size: 586,000 tokens 
Annotation: verbal multiword expression tagging, semantic role labelling 
Licence: CC BY-NC-SA 4.0

Slovenian

This corpus contains texts in standard Slovenian.

The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository.

KonText

noSketch

Download

Bilingual terminology extraction dataset KAS-biterm 1.0

Size: 1,950 sentences, 78,500 tokens, 3,700 terms 
Annotation: bilingual term extraction 
Licence: CC BY-SA 4.0

Slovenian, English

This corpus contains PhD theses.

The corpus is available for download from the CLARIN.SI repository.

Download

Publications

[Batanović et al. 2018] Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić. 2018. SETimes.SR – A Reference Training Corpus of Serbian.

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Csendes et al. 2005] Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec et al. 2010] Tomaž Erjavec, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene.

[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.

[Habernal et al. 2013] Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013. Sentiment Analysis in Czech Social Media Using Supervised Machine Learning. 

[Hajič et al. 2004] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools.

[Hajič et al. 2012] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0.

[Haverinen et al. 2014] Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank.

[Holozan 2018] Peter Holozan. 2018. Corpus of comma placement Vejica 1.3.

[Kravalová and Žabokrtský 2009] Jana Kravalová and Zdeněk Žabokrtský. 2009. Czech Named Entity Corpus and SVM-based Recognizer.

[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0.

[Miličević and Ljubešić 2016] Maja Miličević and Nikola Ljubešić. 2016. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets.

[Mozetič et al. 2016] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.

[Muischnek et al. 2014] Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme.

[van Noord 2009] Gertjan van Noord. 2009. Huge Parsed Corpora in LASSY. 

[Jelínek 2017] Tomáš Jelínek. 2017. FicTree: a Manually Annotated Treebank of Czech Fiction.

[Ogrodniczuk and Kopeć 2014] Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus.

[Ogrodniczuk et al. 2015] Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference: Annotation, Resolution and Evaluation in Polish.

[Orasmaa 2014] Siim Orasmaa. 2014. Towards an Integration of Syntactic and Temporal Annotations in Estonian.

[Przepiórkowski and Murzynowski 2011] Adam Przepiórkowski and Grzegorz Murzynowski. 2011. Manual annotation of the National Corpus of Polish with Anotatornia.

[QasemiZadeh and Schumann 2016] Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods.

[Rei et al. 2016] Luis Rei, Dunja Mladenić, and Simon Krek. 2016. A Multilingual Social Media Linguistic Corpus.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, and Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Rögnvaldsson et al. 2012] Eiríkur Rögnvaldsson, Anton Karl Ingason, Einar Freyr Sigurðsson and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC).

[Rosén et al. 2012] Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik. 2012. An Open Infrastructure for Advanced Treebanking.

[Stein and Prévost 2013]  Achim Stein and Sophie Prévost. 2013. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF).

[Velldal et al. 2018] Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, and Fredrik Jørgensen. 2018. NoReC: The Norwegian Review Corpus.

[Wróblewska 2018] Alina Wróblewska. 2018. Extended and enhanced Polish dependency bank in Universal Dependencies format.

[Zeman et al. 2012] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To Parse or Not to Parse?