Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses and named entities. These corpora can be used to train new language annotation tools and to test the accuracy of existing ones.
There are more than 70 manually annotated training corpora and corpus collections in the CLARIN infrastructure. Among the multilingual corpora, there are 4 collections in the CLARIN infrastructure that were annotated under the following umbrella initiatives: HamleDT 3.0, Treebanks of INESS, Universal Dependencies, and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1).
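The corpora and corpus collections are classified into six categories based on the type of manual annotation: PoS/MSD tagging, lemmatisation, syntactic parsing, named entity recognition, sentiment analysis, and other annotation layers.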
If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, named entities and sentiment, so it is listed under all three relevant sections.
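The listing rule above can be pictured as a simple grouping step. The following is only a minimal sketch, assuming a hypothetical in-memory record of which layers each corpus carries; the `CORPORA` dictionary and the `group_by_section` helper are illustrative names and not part of any CLARIN tool.

```python
from collections import defaultdict

# Hypothetical records: corpus name -> manually annotated layers.
# The two entries mirror how these corpora are listed on this page.
CORPORA = {
    "xLiMe Twitter Corpus XTC 1.0.1": [
        "PoS/MSD tagging",
        "Named Entity Recognition",
        "Sentiment analysis",
    ],
    "NoReC: The Norwegian Review Corpus": ["Sentiment analysis"],
}


def group_by_section(corpora):
    """Return a mapping from section name to the corpora listed under it.

    A corpus with several annotation layers appears once per layer.
    """
    sections = defaultdict(list)
    for name, layers in corpora.items():
        for layer in layers:
            sections[layer].append(name)
    return dict(sections)


sections = group_by_section(CORPORA)
for section, names in sections.items():
    print(section, "->", names)
```

In this sketch the xLiMe corpus ends up in three sections, while NoReC appears only under sentiment analysis.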
For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
The Manually Annotated Corpora
PoS/MSD tagging
Corpus | Language | Description | Availability |
---|---|---|---|
MULTEXT-East "1984" annotated corpus 4.0 Size: 80,000 sentences, 1 million words |
Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian |
This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2012) |
Download |
The Morphologically Annotated Part of BulTreeBank Size: 214,000 tokens |
Bulgarian | This corpus is available for download through the concordancer Corpuscle. | Concordancer |
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens |
Croatian |
This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
Size: 2 million tokens |
English |
The corpus was manually post-edited to correct the PoS tags automatically assigned by CLAWS. The corpus is available for online querying via CQPWeb (registration required) and for download from the Oxford Text Archive |
|
Corpus of morphologically disambiguated Estonian texts Size: 513,000 tokens |
Estonian | This corpus contains texts from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990. | Download |
Size: 200,000 tokens |
German |
This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each token was automatically assigned to a morphosyntactic word class using the TreeTagger software, with the 54-tag Stuttgart-Tübingen TagSet (STTS) as the classification system. For lemmatization, a normalized basic word form was assigned to each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm used as reference works. The part-of-speech tagging and lemmatization were then manually checked. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016) |
Concordancer |
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens |
German, Italian, Spanish |
This corpus contains Tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016) |
Download |
Size: 1.5 million tokens |
Hungarian |
This corpus is available for download from a dedicated webpage. To download the versions of the Szeged Corpus and Szeged Treebank, you must fill in and send a Licence Agreement. |
Download |
Lithuanian morphologically annotated corpus - MATAS Size: 1.6 million words |
Lithuanian |
The corpus contains texts from various domains (documents, fiction, periodicals, scientific texts, wordforms). This corpus is available for download from the CLARIN-LT repository. |
Download |
Size: 1 million tokens |
Polish |
This corpus is a manually annotated subset of the National Corpus of Polish. The corpus is available for download from the Computational Linguistics in Poland website. For the relevant publication, see Przepiórkowski and Murzynowski (2011) |
Download |
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens |
Serbian |
This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens |
Slovenian |
This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018) |
|
Size: 1 million words |
Slovenian |
This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec et al. (2010) |
Download |
Size: 586,000 tokens |
Slovenian |
This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens |
Croatian |
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens |
Serbian |
This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017) |
Download |
Lemmatisation
Corpus | Language | Description | Availability |
---|---|---|---|
MULTEXT-East "1984" annotated corpus 4.0 Size: 80,000 sentences, 1 million words |
Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian |
This corpus contains 11 human translations of George Orwell’s Nineteen Eighty-Four, as well as the original text. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2012) |
Download |
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens |
Croatian |
This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
Size: 200,000 tokens |
German |
This historical corpus contains sermons from 1650 to 1750. For linguistic annotation, each token was automatically assigned to a morphosyntactic word class using the TreeTagger software, with the 54-tag Stuttgart-Tübingen TagSet (STTS) as the classification system. For lemmatization, a normalized basic word form was assigned to each token, with the Duden and the German dictionary by Jacob and Wilhelm Grimm used as reference works. The part-of-speech tagging and lemmatization were then manually checked. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016) |
Concordancer |
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens |
Serbian |
This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016). |
Download |
Training corpus SETimes.SR 1.0 Size: 87,000 tokens |
Serbian |
This corpus contains posts from the Southeast European Times news portal, which is now defunct. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Batanović et al. (2018). |
|
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens |
Slovenian |
This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018) |
|
Size: 1 million words |
Slovenian |
This corpus contains sampled paragraphs from the Slovenian national corpus FidaPLUS. The corpus is morphosyntactically tagged following the MULTEXT-East Version 4 tagset. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec et al. (2010). |
Download |
Size: 586,000 tokens |
Slovenian |
This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens |
Croatian |
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens |
Serbian |
This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017) |
Download |
Syntactic parsing
Corpus | Language | Description | Availability |
---|---|---|---|
Prague Arabic Dependency Treebank 1.0 Annotation: syntactic parsing and morphosyntactic tagging |
Arabic |
This corpus is available for download from the LINDAT repository. For the relevant publication, see Hajič et al. (2004) |
Download |
Size: 1,121 sentences |
Czech |
This corpus contains legal texts. The corpus is available through the concordancer KonText and the PML-TQ tool, and for download from the LINDAT repository. For the relevant publication, see Kríž and Hladká (2018) |
|
Size: 12,760 sentences |
Czech |
This corpus contains fictional texts. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Jelínek (2017) |
|
Prague Dependency Treebank 3.5 Size: 2 million words |
Czech |
This corpus is manually annotated at several levels: aside from syntactic parsing and morphological information, it is annotated for sentence information structure, multiword expressions, coreference, bridging relations and discourse relations. The corpus is available for download from the LINDAT repository. |
Download |
Size: 49,500 sentences |
Czech |
This corpus is a subset of the Prague Dependency Treebank 3.5. The corpus is available through the PML-TQ tool. |
PML-TQ |
Size: 106,000 tokens, 10,600 sentences |
Czech |
The syntactic parsing is modelled after the Prague Dependency Treebank. The corpus is available for download from the LINDAT repository. |
Download |
Prague Czech-English Dependency Treebank 2.0 Coref Size: 49,000 sentences |
Czech, English |
This corpus is an extended version of Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style. The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool (Czech part only). For the relevant publication, see Hajič et al. (2012) |
|
Artificial Treebank with Ellipsis Size: 106,000 tokens, 10,604 sentences |
Czech, English, Finnish, Russian, Slovak |
The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the LINDAT repository. |
Download |
Size: 1 million tokens |
Dutch |
This corpus is available for download from the Dutch Language Institute and through the online environments PaQu and GrETEL. For the relevant publication, see Noord (2009) |
|
Size: 1 million words |
Dutch |
This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus. The corpus is available for download from the Dutch Language Institute. |
Download |
Size: 1,000 sentences |
Estonian |
The corpus contains fictional and newspaper texts. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Size: 434,000 tokens |
Estonian |
This corpus contains fictional, newspaper and scientific texts. The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from (CELR distribution). For the relevant publication, see Muischnek et al. (2014) |
Download |
TimeML annotated corpus of Estonian newspaper articles Size: 22,000 words |
Estonian |
This corpus contains newspaper articles. The corpus is available for download from META-SHARE (CELR distribution). For the relevant publication, see Orasmaa (2014) |
Download |
Size: 160,000 tokens |
Finnish |
This corpus contains 19,000 sentences from the Large Grammar of Finnish. The corpus is available for download from the Language Bank of Finland. |
Download |
Size: 160,000 tokens |
Finnish |
This corpus contains 19,000 sentences from the Large Grammar of Finnish. The corpus is available for download from the Language Bank of Finland. |
Download |
Size: 204,000 tokens |
Finnish |
The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the Turku BioNLP Group. For the relevant publication, see Haverinen et al. (2013) |
Download |
Syntactic Reference Corpus of Medieval French Size: 245,000 words |
French |
This corpus contains Old French texts. The corpus is available for download from the IMS CLARIN-D repository. For the relevant publication, see Stein and Prévost (2013) |
Download |
Size: 10,400 sentence pairs |
Georgian, Ukrainian, Russian, German |
The corpus is syntactically parsed following the TIGER guidelines. The corpus is available for download from a dedicated website provided by the CLARIN-D consortium. |
Download |
Size: 3,495 tokens |
German |
This corpus contains historical German texts. The corpus is available for download from the HZSK repository. |
Download |
Dependency-Annotated Subset of the CREG Corpus Size: 109 sentences |
German |
This corpus consists of answers to reading comprehension questions written by American college students learning German. The corpus is available for download from the Tübingen CLARIN Repository. |
Download |
Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z) Size: 1.9 million tokens |
German |
This corpus contains newspaper articles. The corpus is available for download from the Tübingen CLARIN Repository. |
Download |
Size: 82,000 sentences |
Hungarian |
This corpus is available for download from a dedicated webpage. For the relevant publication, see Csendes et al. (2005) |
Download |
Icelandic Parsed Historical Corpus (IcePaHC) Size: 1 million tokens |
Icelandic |
This corpus contains Icelandic texts from the 12th through the 21st centuries, approximately 100,000 words from each century. The corpus is syntactically parsed following the UPenn scheme for historical texts. The corpus is available for online search through treebankstudio.org and for download in different formats from a dedicated webpage. For the relevant publication, see Rögnvaldsson et al. (2012) |
|
Size: 289,791 tokens; 17,127 sentences |
Latvian |
This treebank is manually annotated according to a hybrid dependency-constituency grammar. The treebank is available for download from the CLARIN-LV repository. For the relevant publication, see Rituma et al. (2023) |
Download |
Size: 2,355 sentences |
Lithuanian |
The syntactic parsing follows the rules of the Prague Dependency Treebank. The corpus is available for download from the CLARIN-LT repository. The second version is available upon request. |
Download |
Polish Dependency Bank in Universal Dependency format Size: 22,000 trees, 351,000 tokens |
Polish |
This corpus also contains sentences showing certain problematic syntactic phenomena – sentences with ellipsis, comparative constructions, constructions with the bi-functional subordinating conjunction jako, etc. The syntactic parsing follows the Universal Dependencies schema. The first version of the corpus is available for download from the Computational Linguistics in Poland website. The second version is available upon request. For the relevant publication, see Wróblewska (2018) |
Download |
Size: 110,000 tokens |
Portuguese |
This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository. |
Download |
Size: 110,000 tokens |
Portuguese |
This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository. |
Download |
Size: 110,000 tokens |
Portuguese |
This corpus contains literary and newspaper texts. The corpus is available for download from the PORTULAN CLARIN repository. |
Download |
Size: 110,000 tokens |
Portuguese |
This corpus contains literary and newspaper texts. The corpus is available for download from the ELRA catalogue. |
Download |
Training corpus SETimes.SR 1.0 Size: 87,000 tokens |
Serbian |
This corpus contains posts from the Southeast European Times news portal, which is now defunct. The syntactic parsing follows the Universal Dependencies framework. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
Tamil Dependency Treebank v0.1 Size: 600 sentences |
Tamil |
The syntactic parsing follows the rules of the Prague Dependency Treebank (https://ufal.mff.cuni.cz/pdt/). The corpus is available for download from the LINDAT repository. |
Download |
Size: 19 treebanks |
19 languages |
This treebank collection is available for download from LINDAT. The treebanks can be individually queried through KonText and the treebank tool PML-TQ. We list them here by language:
For the relevant publication, see Zeman et al. (2012) |
Download |
Size: 532 treebanks |
71 languages |
This is a collection of treebanks made available through the Infrastructure for the Exploration of Syntax and Semantics (INESS). The corpora are available for online querying through INESS. For the relevant publication, see Rosén et al. (2012) |
|
Size: 30 million tokens; 30.6 million words; 1.8 million sentences |
75 languages |
This corpus collection contains treebanks following the Universal Dependencies framework. The corpus collection is available for download from the LINDAT repository. The individual treebanks in Universal Dependencies 2.3 can also be queried through the concordancer KonText and the treebank query tool PML-TQ. Below we provide links to these search environments for all the treebanks. For a detailed description of the treebanks, see the Universal Dependencies project page.
For the relevant publication, see de Marneffe et al. (2021) |
Download |
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens |
Croatian |
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens |
Serbian |
This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017) |
Download |
Named Entity Recognition
Corpus | Language | Description | Availability |
---|---|---|---|
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens |
Croatian |
This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
Size: 500,000 tokens |
Croatian | This corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. | |
Size: 5,868 sentences, 35,220 NEs |
Czech |
This corpus is available for download from LINDAT. For the relevant publication, see Kravalová and Žabokrtský (2009) |
Download |
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens |
German, Italian, Spanish |
This corpus contains Tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016) |
Download |
KPWr (Polish Corpus of Wrocław University of Technology) 1.2 Size: 447,000 tokens |
Polish |
This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.). The corpus is available for download from the CLARIN-PL repository. |
Download |
Size: 46,000 tokens |
Polish |
This corpus contains travel blogs. The corpus is available for download from the CLARIN-PL repository. |
Download |
CINTIL-Corpus Internacional do Português Size: 1 million tokens |
Portuguese |
The corpus contains transcriptions of spoken communication as well as written texts from several genres (news, literature, magazines, etc.). The corpus is available for download from the CLARIN PORTULAN repository. |
Download |
Training corpus SETimes.SR 1.0 Size: 87,000 tokens |
Serbian |
This corpus contains posts from the Southeast European Times news portal, which is no longer being updated. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Batanović et al. (2018) |
|
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens |
Serbian |
This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016). |
Download |
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens |
Slovenian |
This corpus contains computer-mediated communication (CMC). The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018) |
|
Size: 586,000 tokens |
Slovenian |
This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens |
Croatian |
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Serbian linguistic training corpus SETimes.SR 2.0 Size: 97,673 tokens |
Serbian |
This training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications, (2) the UDv2 Guidelines, and (3) the Janes annotation guidelines for named entities. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Samardžić et al. (2017) |
Download |
Sentiment analysis
Corpus | Language | Description | Availability |
---|---|---|---|
xLiMe Twitter Corpus XTC 1.0.1 Size: 364,000 tokens |
German, Italian, Spanish |
This corpus contains Tweets. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Rei et al. (2016) |
Download |
Twitter sentiment for 15 European languages Size: 1.6 million tweets |
Albanian, Bosnian, Bulgarian, Croatian, English, German, Hungarian, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish |
This corpus contains Tweet IDs with sentiment annotations. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Mozetič et al. (2016) |
Download |
Dataset and baseline model of moderated content FRENK-STYRIA-24sata 1.0 Size: 407.5 million words |
Croatian |
This corpus contains news comments from the website 24sata.hr. The corpus is available for download from CLARIN.SI. |
Download |
Aspect-Term Annotated Customer Reviews in Czech Size: 2,200 reviews |
Czech |
This corpus contains online user-product reviews. The corpus is available for download from LINDAT. |
Download |
Facebook Data for Sentiment Analysis Size: 10,000 Facebook posts |
Czech |
This corpus contains Facebook posts. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Habernal et al. (2013) |
|
Size: 27,000 sentences |
Finnish |
This corpus contains sentences from Finnish social media that have been manually annotated for sentiment polarity by three native annotators. The corpus is available for download from META-SHARE (the Finnish Language Bank). For the relevant publication, see Lindén et al. (2023) |
Download |
NoReC: The Norwegian Review Corpus Size: 14.8 million tokens |
Norwegian |
This corpus contains reviews from different domains (e.g., literature and video games). The corpus is available for download from the CLARINO repository. For the relevant publication, see Velldal et al. (2018) |
Download |
Manually sentiment annotated Slovenian news corpus SentiNews 1.0 Size: 10,427 articles |
Slovenian |
This corpus contains news articles. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Bučar et al. (2018) |
Download |
Other annotation layers
Corpus | Language | Description | Availability |
---|---|---|---|
PARSEME corpora annotated for verbal multiword expressions (version 1.3) Size: 5.8 million tokens |
Arabic, Basque, Bulgarian, Chinese, Croatian, Czech, English, French, German, Hebrew, Hindi, Hungarian, Irish, Italian, Lithuanian, Maltese, Modern Greek (1453-), Persian, Polish, Portuguese, Romanian, Serbian, Slovenian, Spanish, Swedish, Turkish |
This multilingual resource contains corpora in which verbal multi-word expressions (MWEs) have been manually annotated. Verbal MWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). The 1.0 versions of the PARSEME corpora can be queried individually through KonText. We provide the individual links to each corpus:
For the relevant publication, see Savary et al. (2023) |
Download |
Croatian linguistic training corpus hr500k 2.0 Size: 499,635 tokens |
Croatian |
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the UDv2 Guidelines, (3) the Janes annotation guidelines for named entities, (4) the PARSEME guidelines for annotating multi-word expressions and (5) the semantic role labelling annotation protocol for Slovenian and Croatian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 Size: 89,855 tokens |
Croatian |
This corpus contains manually annotated Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
Size: 1121 sentences |
Czech |
This corpus contains legal texts. The corpus is available through the concordancer KonText, the PML-TQ tool and for download from the LINDAT repository. |
|
Size: 49,500 sentences |
Czech |
This corpus is a subset of the Prague Dependency Treebank 3.5. The corpus is available through the PML-TQ tool. |
PML-TQ |
Prague Czech-English Dependency Treebank 2.0 Coref Size: 49,000 sentences |
Czech, English |
This corpus is an extended version of the Prague Czech-English Dependency Treebank 2.0, with added mark-up of coreference. The syntactic parsing follows the PDT 2.0 style. The corpus is available for download from the LINDAT repository. The version without coreference annotation is available through the concordancer KonText and the PML-TQ tool. |
|
Artificial Treebank with Ellipsis Size: 106,000 tokens, 10,604 sentences |
Czech, English, Finnish, Russian, Slovak |
The syntactic parsing follows the Universal Dependencies schema. The corpus is available for download from the LINDAT repository. |
Download |
Size: 11,417,194 words |
Danish |
This corpus contains the literary works of the Danish bishop N.F.S. Grundtvig. The corpus is available for download from the CLARIN-DK repository. |
Download |
Size: 1 million words |
Dutch |
This is a manually annotated subset of the much larger (approx. 500 million words) SoNaR corpus. The corpus is available for download from the Dutch Language Institute. |
Download |
Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0 Size: 6,851 tokens |
English |
This corpus can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences used in this dataset are taken from the following sources:
|
Download |
Speech, Thought and Writing Presentation Corpus Size: 260,000 words |
English |
This corpus contains literary, newspaper and biography texts. The corpus is available for download from the Oxford Text Archive. |
Download |
Size: 33216 tokens |
English |
This corpus contains 6818 terms extracted from abstracts of computational linguistics papers. The corpus is available for download from LINDAT and through KonText. For the relevant publication, see QasemiZadeh and Schumann (2016) |
|
Estonian Treebank annotated with coreference relations Size: 107,000 words |
Estonian |
This corpus contains newspaper texts plus one scientific medical text. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Semantically disambiguated corpus of Estonian Size: 375,733 tokens |
Estonian |
The corpus is available for download from META-SHARE (CELR distribution). |
Download |
TimeML annotated corpus of Estonian newspaper articles Size: 22,000 words |
Estonian |
This corpus contains newspaper articles. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Size: 62,988 tokens |
Greek |
In addition to coreference, the corpus is annotated for identity and bridging relations. For the relevant publication, see Ogrodniczuk et al. (2015) |
Download |
Greek Textual Entailment Corpus Size: 600 sentence-pairs |
Greek |
This corpus contains texts from the domains of politics, law and travel. This corpus is available for download from the clarin:el repository. |
Download |
KPWr (Polish Corpus of Wrocław University of Technology) 1.2 Size: 447,000 tokens |
Polish |
This corpus contains texts in a variety of domains (blogs, science, stenographic recordings, etc.). The corpus is available for download from the CLARIN-PL repository. |
Download |
Size: 540,000 tokens |
Polish |
This corpus contains texts in a variety of domains (magazines, fiction literature, non-fiction literature, computer-mediated communication, academic writing, etc.). The corpus is available for download and online browsing. |
|
Size: 10845 summaries |
Polish |
This corpus is available for download from the ZIL IPI PAN repository. For the relevant publication, see Ogrodniczuk and Kopeć (2014) |
Download |
WUT Relations Between Sentences Corpus Size: 5654 sentences |
Polish |
This corpus contains news items. The corpus is available for download from the CLARIN-PL repository. |
Download |
Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 Size: 92,271 tokens |
Serbian |
This corpus contains manually annotated Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Miličević and Ljubešić (2016) |
Download |
Size: 884 hours |
Slovenian |
This corpus was designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. The audio files are available in a separate repository entry. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool which was used for making the transcriptions. All transcriptions were made manually or manually corrected. The data are structured as follows:
|
|
CMC training corpus Janes-Norm 1.2 Size: 184,755 tokens |
Slovenian |
Part of this corpus is also manually annotated with MSD tags and lemmas. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
CMC training corpus Janes-Tag 2.0 Size: 75,000 tokens |
Slovenian |
This corpus contains computer-mediated communication (CMC). The corpus is morphosyntactically tagged following the MULTEXT-East Version 5 tagset. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. For the relevant publication, see Fišer et al. (2018) |
|
Corpus of comma placement Vejica 1.3 Size: 104,000 sentences |
Slovenian |
This corpus contains texts from various Slovenian corpora (KUST, Šolar, Lektor, JANES-Vejica and Wikipedia). The corpus is available for download from CLARIN.SI. |
Download |
Slovenian Definition Extraction evaluation datasets RSDO-def 1.0 Size: 2,216 sentences |
Slovenian |
This corpus contains sentences extracted from the Corpus of term-annotated texts RSDO5 1.1, which contains texts with annotated terms from four different domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO corpus. The labels added to the sentences included in the dataset denote: 0: Non-definition; 1: Weak definition; 2: Definition. The dataset consists of two parts: 1. RSDO-def-random, built with a random sampling strategy, with 14 definitions, 98 weak definitions and 849 non-definitions; and 2. RSDO-def-larger, which added sentences to the random one using the pattern-based definition extraction presented in Pollak et al. (2014). It contains 169 definitions, 214 weak definitions and 872 non-definitions. Both parts were manually annotated by five terminographers. In case of discrepancies between annotators, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts. The criteria for annotation are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which explains a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". The Weak definition label was assigned to extracted sentences that contained a term and at least one delimiting feature without a superordinate concept, or that consisted of superordinate concepts without delimiting features but with some typical examples. Instances were labelled as Non-definition if the sentence with the extracted concept did not contain any information about the concept or its delimiting features. The corpus is available for download from the CLARIN.SI repository. For the relevant publications, see Tran et al. (2023) and Pollak et al. (2014) |
Download |
Slovenian Word in Context dataset SloWiC 1.0 Size: 14,958 items |
Slovenian |
SloWiC is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in the JSON format following the format used in the SuperGLUE version of the Word in Context task. Each example contains the following data fields:
|
Download |
Terminology identification dataset KAS-term 1.0 Size: 22,950 term candidates |
Slovenian |
This corpus contains term candidates from PhD theses in chemistry, computer science and political science. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Holozan (2018) |
Download |
Size: 586,000 tokens |
Slovenian |
This corpus contains standard Slovenian. The corpus is available through the concordancers KonText and noSketchEngine and for download from the CLARIN.SI repository. |
|
Bilingual terminology extraction dataset KAS-biterm 1.0 Size: 1,950 sentences, 78,500 tokens, 3,700 terms |
Slovenian, English |
This corpus contains PhD theses. The corpus is available for download from the CLARIN.SI repository. |
Download |
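
The sizes listed in the table above for RSDO-def 1.0 and SloWiC 1.0 can be cross-checked against the per-part counts given in their descriptions. The following is a minimal sketch of that arithmetic in Python. The counts are copied from the descriptions, while the variable and function names are illustrative only and are not part of either dataset. It also assumes that the listed RSDO-def size counts the two parts separately, which the description does not state explicitly.

```python
# Counts copied from the RSDO-def 1.0 description above.
# Label scheme: 0 = non-definition, 1 = weak definition, 2 = definition.
RSDO_DEF_PARTS = {
    "RSDO-def-random": {2: 14, 1: 98, 0: 849},
    "RSDO-def-larger": {2: 169, 1: 214, 0: 872},
}

# Pair counts copied from the SloWiC 1.0 description above.
SLOWIC_PAIRS = {"manual": 1808, "automatic": 13150}

# Sizes as listed in the "Size:" field of each table entry.
LISTED_SIZES = {"RSDO-def": 2216, "SloWiC": 14958}


def part_totals(parts):
    """Return the number of sentences in each part (sum over its labels)."""
    return {name: sum(counts.values()) for name, counts in parts.items()}


# Assumption: the listed size is the sum of both parts counted separately.
rsdo_total = sum(part_totals(RSDO_DEF_PARTS).values())
slowic_total = sum(SLOWIC_PAIRS.values())

print(part_totals(RSDO_DEF_PARTS))
print("RSDO-def matches listed size:", rsdo_total == LISTED_SIZES["RSDO-def"])
print("SloWiC matches listed size:", slowic_total == LISTED_SIZES["SloWiC"])
```

Under these assumptions the two RSDO-def parts contain 961 and 1,255 sentences, and both totals agree with the sizes listed in the table.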
Publications
[Batanović et al. 2018] Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić. 2018. SETimes.SR – A Reference Training Corpus of Serbian.
[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.
[Csendes et al. 2005] Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged Treebank.
[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.
[Erjavec et al. 2010] Tomaž Erjavec, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene.
[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.
[Habernal et al. 2013] Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013. Sentiment Analysis in Czech Social Media Using Supervised Machine Learning.
[Hajič et al. 2004] Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools.
[Hajič et al. 2012] Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0.
[Haverinen et al. 2014] Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2014. Building the essential resources for Finnish: the Turku Dependency Treebank.
[Holozan 2018] Peter Holozan. 2018. Corpus of comma placement Vejica 1.3.
[Kravalová and Žabokrtský 2009] Jana Kravalová and Zdeněk Žabokrtský. 2009. Czech Named Entity Corpus and SVM-based Recognizer.
[Kríž and Hladká 2018] Vincent Kríž and Barbora Hladká. 2018. Czech Legal Text Treebank 2.0.
[Miličević and Ljubešić 2016] Maja Miličević and Nikola Ljubešić. 2016. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets.
[Mozetič et al. 2016] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter Sentiment Classification: The Role of Human Annotators.
[Muischnek et al. 2014] Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme.
[van Noord 2009] Gertjan van Noord. 2009. Huge Parsed Corpora in LASSY.
[Jelínek 2017] Tomáš Jelínek. 2017. FicTree: a Manually Annotated Treebank of Czech Fiction.
[Ogrodniczuk and Kopeć 2014] Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus.
[Ogrodniczuk et al. 2015] Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, and Magdalena Zawisławska. 2015. Coreference: Annotation, Resolution and Evaluation in Polish.
[Orasmaa 2014] Siim Orasmaa. 2014. Towards an Integration of Syntactic and Temporal Annotations in Estonian.
[Przepiórkowski and Murzynowski 2011] Adam Przepiórkowski and Grzegorz Murzynowski. 2011. Manual annotation of the National Corpus of Polish with Anotatornia.
[QasemiZadeh and Schumann 2016] Behrang QasemiZadeh and Anne-Kathrin Schumann. 2016. The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods.
[Rei et al. 2016] Luis Rei, Dunja Mladenić, and Simon Krek. 2016. A Multilingual Social Media Linguistic Corpus.
[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, and Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.
[Rögnvaldsson et al. 2012] Eiríkur Rögnvaldsson, Anton Karl Ingason, Einar Freyr Sigurðsson and Joel Wallenberg. 2012. The Icelandic Parsed Historical Corpus (IcePaHC).
[Rosén et al. 2012] Victoria Rosén, Koenraad De Smedt, Paul Meurer, and Helge Dyvik. 2012. An Open Infrastructure for Advanced Treebanking.
[Stein and Prévost 2013] Achim Stein and Sophie Prévost. 2013. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF).
[Velldal et al. 2018] Erik Velldal, Lilja Øvrelid, Eivind Alexander Bergem, Cathrine Stadsnes, Samia Touileb, and Fredrik Jørgensen. 2018. NoReC: The Norwegian Review Corpus.
[Wróblewska 2018] Alina Wróblewska. 2018. Extended and enhanced Polish dependency bank in Universal Dependencies format.
[Zeman et al. 2012] Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To Parse or Not to Parse?