Language models are pretrained probabilistic models of word sequences. They can determine that a string such as briefed reporters on is, in English, more probable than the alternative briefed to reporters, which is grammatically well-formed but far less idiomatic (Jurafsky and Martin 2021: 2). A tool built on such a model can therefore select the more natural sequence. While language models can assign probabilities to simple sequences of words, there are also models that assign probabilities to more complex structures.
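The idea above can be sketched with a minimal bigram model: the probability of a phrase is the product of the conditional probabilities of each word given the previous one, estimated from corpus counts. The three-sentence corpus below is invented for illustration and is not taken from any CLARIN resource.

```python
from collections import Counter

# Toy training corpus (hypothetical, for illustration only).
corpus = [
    "the spokesman briefed reporters on the plan",
    "she briefed reporters on the results",
    "he spoke to reporters after the briefing",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def sequence_probability(phrase: str) -> float:
    """Product of add-one-smoothed bigram probabilities P(w_i | w_{i-1})."""
    words = phrase.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return p

print(sequence_probability("briefed reporters on"))   # higher: seen in the corpus
print(sequence_probability("briefed to reporters"))   # lower: "briefed to" is unseen
```

Even on this tiny corpus, the idiomatic sequence receives a higher probability, which is exactly the signal a downstream tool exploits.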
There are 98 language models in the CLARIN infrastructure, covering the following tool functionalities:
- Morphosyntax
- Machine Translation
- Syntactic Parsing
- Named Entity Recognition
- Lemmatisation
- Baseline Models
- Other
- Contextual Word Embeddings
For comments, changes to the existing content, or the inclusion of new resources, send us an email at resource-families [at] clarin.eu.
Language Models in the CLARIN Infrastructure
Morphosyntax
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian 2.1 (Annotation: morphosyntax) | Bulgarian | The model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.83. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1 (Annotation: morphosyntax) | Croatian | The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croatian 2.1 (Annotation: morphosyntax) | Croatian | The model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.hr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.49. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115 (Annotation: morphosyntax) | Czech | These models were developed for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family). The morphological dictionary is created from the 161115 version of the MorfFlex CZ lexicon and the 1.2 version of the DeriNet lexical network. The PoS tagger is trained on the Prague Dependency Treebank 3.0. The models are available for download from the LINDAT repository. | Download |
POS Tagging and Lemmatization (Czech model) (Annotation: morphosyntax and lemmatisation) | Czech | This model is trained using RobeCzech, the Czech version of BERT, on the Prague Dependency Treebank 3.5. The model is available for download from the LINDAT repository. For the relevant publication, see Vysušilová (2021). | Download |
English Models (Morphium + WSJ) for MorphoDiTa (Annotation: morphosyntax) | English | These models are for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family). The morphological dictionary is created from Morphium and SCOWL (Spell Checker Oriented Word Lists); the PoS tagger is trained on the Wall Street Journal. | Download |
Annotation: morphosyntax | Finnish | This BERT model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. The model is available for download from the Language Bank of Finland. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonian 2.1 (Annotation: morphosyntax) | Macedonian | The model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published) and using the Macedonian CLARIN.SI word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.14. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: morphosyntax | Polish | This is a model for the Liner2.5 tool for the recognition of verbs without explicit subjects. The model is available for download from the CLARIN-PL repository. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1 (Annotation: morphosyntax) | Serbian | The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Croatian hr500k training dataset (to ensure sufficient representation of certain labels) and using the CLARIN.SI-embed.sr word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbian 2.1 (Annotation: morphosyntax) | Serbian (non-standard) | The model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and the hr500k training corpus, using the CLARIN.SI-embed.sr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.64. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Slovak MorphoDiTa Models 170914 (Annotation: morphosyntax) | Slovak | These are Slovak models for MorphoDiTa, a tool which provides morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex (SK 170914) and the PoS tagger is trained on automatic translations in the Prague Dependency Treebank 3.0. The models are available for download from the LINDAT repository. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Slovenian 2.0 (Annotation: morphosyntax) | Slovenian | The model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~98.27. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slovenian 2.1 (Annotation: morphosyntax) | Slovenian (non-standard) | The model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and the Janes-Tag corpus, using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.17. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. flair_eval is trained on SUC3 with Talbanken_SBX_dev as the dev set; its advantage is that it can be evaluated using Talbanken_SBX_test or SIC2. flair_full is trained on SUC3, Talbanken_SBX_test and SIC2, with Talbanken_SBX_dev as the dev set. The models are available for download from the Swedish Language Bank. | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. marmot_eval is trained on SUC3 and the Talbanken_SBX_dev treebank, using Saldo as the dictionary. marmot_full is trained on SUC3, the Talbanken_SBX_dev treebank, and SIC2 (with Saldo as the dictionary). The models are available for download from the Swedish Language Bank. | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. stanza_eval is trained on SUC3 and the Talbanken_SBX_dev treebank. stanza_full is trained on the SUC3, Talbanken_SBX_test and SIC2 sets, with Talbanken_SBX_dev as the dev set. The models are available for download from the Swedish Language Bank. | Download |
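The F1 scores quoted in the table above are the harmonic mean of precision and recall over the predicted labels. A minimal sketch of the computation follows; the counts are invented for illustration and do not come from any of the listed models.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, as reported for e.g. XPOS tagging."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 9683 correct labels, 317 spurious, 317 missed.
print(round(f1_score(9683, 317, 317), 4))  # 0.9683, i.e. an F1 of ~96.83
```

When precision and recall are equal, as in this symmetric example, the F1 score coincides with both.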
Machine Translation
Model | Language | Description | Availability |
---|---|---|---|
WMT21 Marian translation models (ca-ro,it,oc) (Annotation: machine translation) | Catalan, Italian, Occitan, Romanian | This is a translation model from Catalan into Romanian, Italian and Occitan that was part of the submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
WMT21 Marian translation model (ca-oc) (Annotation: machine translation) | Catalan, Occitan | This is a translation model from Catalan to Occitan that was part of the submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. | |
WMT21 Marian translation model (ca-oc) (Annotation: machine translation) | Catalan, Occitan | This is a neural machine translation model for Catalan to Occitan translation and constitutes the primary CUNI submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
WMT21 Marian translation model (ca-oc multi-task) (Annotation: machine translation) | Catalan, Occitan | This is a neural machine translation model for Catalan to Occitan translation. It is a multi-task model that also produces a phonemic transcription of the Catalan source. The model was submitted to the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task as a CUNI-Contrastive system for Catalan to Occitan. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
Annotation: machine translation | Czech, English | These models are for the Neural Monkey toolkit for Czech and English, solving four tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in these tasks. The same models can also be invoked via an online demo. This entry also includes models for automatic news summarization for Czech and English. The Czech models were trained using the SumeCzech dataset, while the English models were trained using the CNN-Daily Mail corpus, using the standard recurrent sequence-to-sequence architecture. The models are available for download from the LINDAT repository. For the relevant publication, see Libovicky et al. (2018). | Download |
CUBBITT Translation Models (en-cs) (v1.0) (Annotation: machine translation) | Czech, English | These English-Czech translation models are used by the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
WMT16 Tuning Shared Task Models (Czech-to-English) (Annotation: machine translation) | Czech, English | These Czech-to-English translation models are trained on the parallel CzEng 1.6 corpus. The data is tokenized with Moses. Alignment is done using fast_align, and the standard Moses pipeline is used for training. The models are available for download from the LINDAT repository. | Download |
CUBBITT Translation Models (en-fr) (v1.0) (Annotation: machine translation) | English, French | These are CUBBITT English-French translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. For the relevant publication, see Popel et al. (2020). | Download |
Translation Models (English-German) (Annotation: machine translation) | English, German | These English-German translation models are used by the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
MCSQ Translation Models (en-de) (v1.0) (Annotation: machine translation) | English, German | These are English-German translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset. The models are available for download from the LINDAT repository. | Download |
GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models (1.0) (Annotation: machine translation) | English, Icelandic | This CLARIN-IS repository entry includes the code and models required to run the GreynirT2T Transformer NMT system for translation between English and Icelandic. The models, along with the code, are available for download from the CLARIN-IS repository. | Download |
CUBBITT Translation Models (en-pl) (v1.0) (Annotation: machine translation) | English, Polish | These are CUBBITT English-Polish translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. For the relevant publication, see Popel et al. (2020). | Download |
Translation Models (en-ru) (v1.0) (Annotation: machine translation) | English, Russian | These are CUBBITT English-Russian translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
MCSQ Translation Models (en-ru) (v1.0) (Annotation: machine translation) | English, Russian | These are English-Russian translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset. The models are available for download from the LINDAT repository. | Download |
Annotation: machine translation | Icelandic, English | These are a variant of the GreynirTranslate mBART25 NMT models for translation between Icelandic and English (1.0), trained with a 40% layer drop. They support inference using every other layer, trading some translation quality for speed. The models are available for download from the CLARIN-IS repository. For the relevant publication, see Simonarson et al. (2021). | Download |
Syntactic Parsing
Model | Language | Description | Availability |
---|---|---|---|
Annotation: syntactic parsing | Afrikaans, Akkadian, Amharic, Ancient Greek (until 1453), Arabic, Armenian, Bambara, Basque, Belarusian, Breton, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Komi-Zyrian, Korean, Latin, Latvian, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Nigerian Pidgin, Northern Kurdish, Northern Sami, Norwegian, Old French (842-ca. 1400), Persian, Polish, Portuguese, Romanian, Buryat, Russian, Sanskrit, Serbian, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Tagalog, Tamil, Telugu, Thai, Turkish, Uighur, Ukrainian, Upper Sorbian, Urdu, Vietnamese, Warlpiri, Yoruba, Yue Chinese | UDify is a single model that jointly predicts Universal Dependencies annotations (UPOS, UFeats, lemmas and dependencies), accepting any of 75 supported languages as input (trained on UD v2.3 with 124 treebanks). For the relevant publication, see Kondratyuk and Straka (2019). | Download |
Universal Dependencies 2.5 Models for UDPipe (Annotation: syntactic parsing) | Afrikaans, Ancient Greek (until 1453), Arabic, Armenian, Basque, Belarusian, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Gambian Wolof, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Literary Chinese, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Northern Sami, Norwegian Bokmål, Norwegian Nynorsk, Old French (842-ca. 1400), Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, Wolof | These models are trained on the Universal Dependencies 2.5 treebanks (94 treebanks of 61 languages). In addition to dependency parsing, the models also perform tokenisation, part-of-speech tagging and lemmatisation. The models are available for download from the LINDAT repository. | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1 (Annotation: syntactic parsing) | Bulgarian | The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The estimated LAS of the parser is ~91.18. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Croatian 2.1 (Annotation: syntactic parsing) | Croatian | The model for UD dependency parsing of standard Croatian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The estimated LAS of the parser is ~87.46. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Slavic Forest, Norwegian Wood (models) (Annotation: syntactic parsing) | Croatian, Norwegian, Slovak | These are models for the dependency parser UDPipe used to produce the authors' final submission to the VarDial 2017 CLP shared task. The scripts and commands used to create the models are part of a separate LINDAT repository entry. The models were trained with UDPipe version 3e65d69 from 3 January 2017; their functionality with newer or older versions of UDPipe is not guaranteed. The models are available for download from the LINDAT repository. For the relevant publication, see Rosa et al. (2017). | Download |
Universal Dependencies 1.2 Models for Parsito (Annotation: syntactic parsing) | English | These are models for the dependency parser Parsito, trained on the Universal Dependencies 1.2 treebanks. | Download |
CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials (Annotation: syntactic parsing) | Multiple languages | These are baseline models for UDPipe (version 1.2 and up), created for the CoNLL 2018 Shared Task in UD Parsing. The models were trained using a custom data split for treebanks where no development data is provided. The models are available for download from the LINDAT repository. | Download |
CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials (Annotation: syntactic parsing) | Multiple languages | These are models for the dependency parser UDPipe, developed as part of the CoNLL 2017 Shared Task in UD Parsing. The models are available for download from the LINDAT repository. | Download |
Dependency parsing models for Polish (Annotation: syntactic parsing) | Polish | These models are trained on version 3.5 of the Polish Dependency Treebank with the publicly available parsing systems MaltParser, MateParser and UDPipe. The models are available for download from the CLARIN-PL repository. For the relevant publication, see Wroblewska and Rybak (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Serbian 2.1 (Annotation: syntactic parsing) | Serbian | The model for UD dependency parsing of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings. The estimated LAS of the parser is ~89.83. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0 (Annotation: syntactic parsing) | Slovenian | The model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~93.89. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0 (Annotation: syntactic parsing) | Slovenian | The model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~91.11. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Dependency parsing model: Stanza (Annotation: syntactic parsing) | Swedish | This is a set of two models that enable dependency parsing of Swedish (in the Mamba-Dep format, the format of TalbankenSBX). The models are available for download from the Swedish Language Bank. | Download |
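The LAS figures reported above are labelled attachment scores: the percentage of tokens whose predicted syntactic head and dependency label both match the gold annotation. A minimal sketch, with invented gold and predicted analyses for a four-token sentence:

```python
def las(gold: list[tuple[int, str]], predicted: list[tuple[int, str]]) -> float:
    """Labelled attachment score: percentage of tokens whose (head, deprel)
    pair matches the gold annotation exactly."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100 * correct / len(gold)

# Hypothetical analyses: each token is (head index, dependency label).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "amod")]
print(las(gold, pred))  # 75.0: three of four tokens match head and label
```

The unlabelled variant (UAS) counts only the head index, so the mislabelled third token above would still score as correct there.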
Named Entity Recognition
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-StanfordNLP model for named entity recognition of standard Bulgarian 1.0 (Annotation: named entity recognition) | Bulgarian | This model for named entity recognition of standard Bulgarian was built with the CLASSLA-StanfordNLP tool by training on the BulTreeBank training corpus and using the CoNLL2017 word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0 (Annotation: named entity recognition) | Croatian | This model for named entity recognition of standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0 (Annotation: named entity recognition) | Croatian (non-standard) | This model for named entity recognition of non-standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus, the ReLDI-NormTagNER-hr corpus and the ReLDI-NormTagNER-sr corpus, using the CLARIN.SI-embed.hr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Czech Models (CNEC) for NameTag (Annotation: named entity recognition) | Czech | These are models for the named entity recognizer NameTag. The models are available for download from the LINDAT repository. | Download |
Annotation: named entity recognition | Czech, Dutch, English, German, Spanish | These models are for NameTag 2, a named entity recognition tool (see also the Named Entity Recognizers Resource Family). The documentation is available separately on the project webpage. The models are available for download from the LINDAT repository. For the relevant publication, see Straková et al. (2019). | Download |
English Model (CoNLL-2003) for NameTag (Annotation: named entity recognition) | English | This is an English model for NameTag, a named entity recognition tool. The model is trained on the CoNLL-2003 training data and recognizes PER, ORG, LOC and MISC named entities. It achieves an F-measure of 84.73 on the CoNLL-2003 test data. The model is available for download from the LINDAT repository. | Download |
Annotation: named entity recognition | Polish | This is a model for the Liner 2.5 tool. The model is available for download from the CLARIN-PL repository. | Download |
Annotation: named entity recognition | Polish | This is a Liner2 model for the recognition of named entities. The model was trained on the NKJP corpus and evaluated in the PolEval 2018 Task 2. The model is available for download from the CLARIN-PL repository. | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Serbian 1.0 (Annotation: named entity recognition) | Serbian | This model for named entity recognition of standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Serbian 1.0 (Annotation: named entity recognition) | Serbian (non-standard) | This model for named entity recognition of non-standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus, the hr500k training corpus, the ReLDI-NormTagNER-sr corpus, and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.sr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0 (Annotation: named entity recognition) | Slovenian | This model for named entity recognition of standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and using the CLARIN.SI-embed.sl word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0 (Annotation: named entity recognition) | Slovenian (non-standard) | This model for named entity recognition of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and the Janes-Tag training corpus, using the CLARIN.SI-embed.sl word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 (Annotation: named entity recognition) | Slovenian | This is a model for Slovenian named entity recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library, and is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0. The model was trained on the SUK 1.0 training corpus. The source code of the model is available in a GitHub repository. The model is available for download from the CLARIN.SI repository. | Download |
Lemmatisation
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1 (Annotation: lemmatisation) | Bulgarian | The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1 (Annotation: lemmatisation) | Croatian | The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the hrLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1 (Annotation: lemmatisation) | Croatian (non-standard) | The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the hrLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.23. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1 (Annotation: lemmatisation) | Macedonian | The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published). The estimated F1 of the lemma annotations is ~98.81. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Serbian 2.1 (Annotation: lemmatisation) | Serbian | The model for lemmatisation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Serbian 2.1 (Annotation: lemmatisation) | Serbian (non-standard) | The model for lemmatisation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.92. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 2.0 (Annotation: lemmatisation) | Slovenian | The model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated F1 of the lemma annotations is ~99.7. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Slovenian 2.1 (Annotation: lemmatisation) | Slovenian (non-standard) | The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and the Janes-Tag corpus, using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~91.45. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: lemmatisation |
Swedish |
This model enables lemmatisation of Swedish text following the SUC3 standard. The model is available for download from the Swedish Language Bank. |
Download |
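Several of the non-standard models above augment their training data by repeating parts of the corpora with diacritics removed. As a rough illustration of such de-diacritisation (a minimal sketch; the actual CLASSLA-Stanza preprocessing may differ, and stroked letters such as đ that have no combining mark are not affected):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks, recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def augment_with_dediacritised(sentences: list[str]) -> list[str]:
    # Keep each original sentence and add a de-diacritised copy when it differs.
    augmented = []
    for sent in sentences:
        augmented.append(sent)
        bare = strip_diacritics(sent)
        if bare != sent:
            augmented.append(bare)
    return augmented
```

For example, `strip_diacritics("šećer")` yields `"secer"`, so a tagger or lemmatiser trained on the augmented data also sees the diacritic-free spellings common in non-standard text.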
Baseline
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: Baseline |
Czech |
These models are trained on CERED, a dataset created by distant supervision on Czech Wikipedia and Wikidata, and recognize a subset of Wikidata relations. The model is available for download from the LINDAT repository. |
Download |
Annotation: Baseline |
Latvian |
This model was trained with the original BERT implementation on the TensorFlow machine-learning platform, using the whole-word masking and next sentence prediction objectives. It uses the BERT configuration with 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128 and a 32,000-token vocabulary. The model is available for download from the CLARIN-LV repository. For the relevant publication, see Znotiņš and Barzdiņš (2020) |
Download |
Annotation: Baseline |
Lithuanian, Latvian, English |
This BERT-like model represents tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. The corpora used for training the model have 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian and 0.53 billion are Latvian. The model is available for download from the CLARIN-LT repository. |
Download |
Annotation: Baseline |
Portuguese |
This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for the American variant of Portuguese spoken in Brazil, is trained on the brWaC dataset, and is a larger version of the Albertina PT-BR base model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese | This model is for Portuguese spoken in Brazil. It is based on the Transformer neural architecture and is developed over the DeBERTa model. | Download |
Annotation: Baseline |
Portuguese |
This is a model for Portuguese spoken in Brazil, trained on data sets other than brWaC. It is developed over the DeBERTa model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for European Portuguese and is a larger version of the Albertina PT-PT base model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model is for European Portuguese. It is based on the Transformer neural architecture and is developed over the DeBERTa model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model, which is for Portuguese spoken in Brazil, is a decoder of the GPT family, based on the Transformer neural architecture and developed over the Pythia model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model, which is for European Portuguese, is a decoder of the GPT family, based on the Transformer neural architecture and developed over the Pythia model. The model is available for download from Hugging Face. |
Download |
BERTimbau - Portuguese BERT-Base language model Annotation: Baseline |
Portuguese |
This is a BERT model trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. The model is available for download from the PORTULAN repository. |
Download |
BERTimbau - Portuguese BERT-Large language model Annotation: Baseline |
Portuguese |
This is a BERT model trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. The model is available for download from the PORTULAN repository. |
Download |
Portuguese RoBERTa language model Annotation: Baseline |
Portuguese |
This is a pre-trained RoBERTa model for Portuguese, with 6 layers and 12 attention heads, totalling 68M parameters. Pre-training was done on 10 million Portuguese sentences and 10 million English sentences from the OSCAR corpus. The model is available for download from the PORTULAN repository. |
Download |
Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0 Annotation: Baseline |
Slovenian |
FRENK-MMC-RTV is a dataset of moderated newspaper comments from the website rtvslo.si, with metadata on the time of publishing, user identifier, thread identifier and whether the comment was deleted by the moderators or not. The full text of each comment is encrypted via a character-replacement method so that the comments are not readable by humans; basic punctuation is not encrypted, in order to enable tokenization. The main use of this dataset is experiments on automating comment moderation. For real-world usage, a fastText classification model trained on non-encrypted data is made available as well. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2018) |
Download |
ccGigafida ARPA language model 1.0 Annotation: Baseline |
Slovenian |
This model was created from the ccGigafida written corpus of Slovenian using the KenLM algorithm in the Moses machine translation framework. It is a general language model of contemporary standard Slovenian language that can be used as a language model in statistical machine translation systems. The model is available for download from the CLARIN.SI repository. |
Download |
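ARPA-style n-gram models such as the ccGigafida model estimate sequence probabilities from corpus counts. A toy maximum-likelihood bigram sketch of the idea (production toolkits like KenLM add smoothing and back-off, which this deliberately omits):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        context_counts.update(tokens[:-1])          # every token acts as a context once
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {pair: bigram_counts[pair] / context_counts[pair[0]] for pair in bigram_counts}

def sentence_prob(model, sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for pair in zip(tokens, tokens[1:]):
        prob *= model.get(pair, 0.0)  # unseen bigrams get probability 0 (no smoothing)
    return prob
```

Trained on a corpus where "briefed reporters on" is more frequent than "briefed to reporters", such a model assigns the first string the higher probability, which is exactly the selection behaviour described in the introduction above.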
Other
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: normalization |
Czech |
These models are for the statistical spellchecker Korektor 2. The models can either perform spellchecking and grammar-checking, or only generate diacritical marks. The models are available for download from the LINDAT repository. |
Download |
Sentiment Analysis (Czech Model) Annotation: sentiment analysis |
Czech |
These models are trained on data from the following sources: Mall (product reviews), CSFD (movie reviews), Facebook, and joint data from all three datasets (data available here), using RobeCzech, the Czech version of BERT. For the relevant publication, see Vysušilová (2021) |
Download |
Model weights for a study of commonsense reasoning Annotation: commonsense reasoning |
English |
This resource contains model weights for five Transformer-based models: RoBERTa, GPT-2, T5, BART and COMET. These models were implemented using HuggingFace and fine-tuned on the following four commonsense reasoning tasks: Argument Reasoning Comprehension Task (ARCT), AI2 Reasoning Challenge (ARC), Physical Interaction Question Answering (PIQA) and CommonsenseQA (CSQA). The models are available for download from the PORTULAN repository. |
Download |
RÚV-DI Speaker Diarization v5 models (21.05) Annotation: diarization |
Icelandic |
These models are trained on the Althingi Parliamentary Speech corpus hosted by CLARIN-IS. The models use MFCCs, x-vectors, PLDA and AHC. The models are available for download from the CLARIN-IS repository. |
Download |
Models for automatic g2p for Icelandic (20.10) Annotation: phonemic transcription |
Icelandic |
These are grapheme-to-phoneme models for Icelandic, trained on an encoder-decoder LSTM neural network. The models are delivered with scripts for automatic transcription of Icelandic in the standard pronunciation variant, as well as the northern, north-eastern, and southern variants. To run the scripts, the user needs to install Fairseq. For the relevant publication, see Gorman et al. (2020) |
Download |
Annotation: temporal expressions |
Polish |
This is a model for the Liner2.5 tool for the recognition and normalization of temporal expressions. The model is available for download from the CLARIN-PL repository. |
Download |
Annotation: event mentions |
Polish |
This is a model for the Liner2.5 tool for the recognition of event mentions. The model is available for download from the CLARIN-PL repository. |
Download |
PyTorch model for Slovenian Coreference Resolution Annotation: coreference resolution |
Slovenian |
This is a Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with this code. The model is based on the Slovenian CroSloEngual BERT 1.1 model. It was trained on the SUK 1.0 training corpus, specifically the SentiCoref subcorpus. This resource is available for download from the CLARIN.SI repository. For the relevant publication, see Klemen & Žitnik (2022) |
Download |
Face-domain-specific automatic speech recognition models Annotation: face-domain-specific automatic speech recognition |
Slovenian |
This resource contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit, including the acoustic model, language model, and other relevant files, along with the scripts and configuration files needed to use them. The acoustic model was trained using the relevant Kaldi ASR tools and the Artur speech corpus (audio, transcriptions). The language model was trained on domain-specific text data involving face descriptions, obtained by translating the Face2Text English dataset into Slovenian. These models, combined with other necessary files such as the HCLG.fst and decoding scripts, enable the implementation of face-domain-specific ASR applications. This resource is available for download from the CLARIN.SI repository. |
Download |
The CLASSLA-Stanza model for semantic role labeling of standard Slovenian 2.0 Annotation: semantic role labeling |
Slovenian |
The model for semantic role labeling of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings extended with the MaCoCu-sl Slovene web corpus. The estimated F1 of the semantic role annotations is ~76.24. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić & Dobrovoljc (2019) |
Download |
Contextual Word Embeddings
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: word embeddings |
Croatian, English, Slovenian |
This is a trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool representing words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ulčar and Robnik-Šikonja (2020) |
Download |
ELMo embeddings models for seven languages Annotation: word embeddings |
Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, Swedish |
This model produces contextual word embeddings. It was trained on large monolingual corpora for 7 languages, for approximately 10 epochs per language. The corpora used in training range from over 270 million tokens for Latvian to almost 2 billion tokens for Croatian. The approximately 1 million most common tokens were provided as the vocabulary during training for each language model. The model can also handle out-of-vocabulary words, since the neural network input is at the character level. The model is available for download from the CLARIN.SI repository. |
Download |
Annotation: Word embeddings |
Portuguese |
This model represents tokens as contextual word embeddings for Portuguese. It was trained on a corpus of 2 billion tokens and achieved state-of-the-art results on multiple lexical semantic tasks. The model is available for download from the PORTULAN repository. |
Download |
Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0 Annotation: word embeddings |
Slovenian | The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. | |
Word Embeddings trained on English Wikipedia Annotation: word embeddings |
Swedish |
This is a set of contextual word embeddings. The models are available for download from the Swedish Language Bank. |
Download |
Word embeddings CLARIN.SI-embed Annotation: word embeddings |
Bulgarian, Croatian, Macedonian, Serbian, Slovenian |
This is a set of word embeddings for 5 languages. The models are available for download from the CLARIN.SI repository. |
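Whether static (such as CLARIN.SI-embed) or contextual (BERT, ELMo, RoBERTa), word embeddings are typically compared via cosine similarity, so that semantically related words receive similar vectors. A minimal sketch with hypothetical toy vectors:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal directions score 0.0; downstream tasks such as synonym detection or nearest-neighbour lookup rank candidate words by this score.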
References
[Gorman et al. 2020] Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In: Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 40–50.
[Jon et al. 2021] Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, and Ondřej Bojar. 2021. CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task, arXiv pre-print.
[Jurafsky and Martin 2021] Daniel Jurafsky and James H. Martin. 2021. Speech and Language Processing.
[Kondratyuk and Straka 2019] Dan Kondratyuk and Milan Straka. 2019. 75 Languages, 1 Model: Parsing Universal Dependencies Universally. arXiv pre-print.
[Libovicky et al. 2018] Jindřich Libovický, Rudolf Rosa, Jindřich Helcl, and Martin Popel. 2018. Solving Three Czech NLP Tasks End-to-End with Neural Models. In: CEUR Workshop Proceedings, volume 2203, 138–143.
[Ljubešić and Dobrovoljc 2019] Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 29–34.
[Ljubešić et al. 2018] Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2018. Datasets of Slovene and Croatian Moderated News Comments. In: Proceedings of the 2nd Workshop on Abusive Language Online, 124–131.
[Popel et al. 2020] Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications 11.
[Rosa et al. 2017] Rudolf Rosa, Daniel Zeman, David Mareček, and Zdeněk Žabokrtský. 2017. Slavic Forest, Norwegian Wood. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, 210–219.
[Símonarson et al. 2021] Haukur Barri Símonarson, Vésteinn Snæbjarnarson, Pétur Orri Ragnarsson, Haukur Páll Jónsson, and Vilhjálmur Þorsteinsson. 2021. Miðeind's WMT 2021 submission. arXiv pre-print.
[Straková et al. 2019] Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5326–5331.
[Ulčar and Robnik-Šikonja 2020] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv pre-print.
[Vysušilová 2021] Petra Vysušilová. 2021. Czech NLP with Contextualized Embeddings. Diploma thesis.
[Wróblewska and Rybak 2019] Alina Wróblewska and Piotr Rybak. 2019. Dependency parsing of Polish. Poznan Studies in Contemporary Linguistics.
[Znotiņš and Barzdiņš 2020] Artūrs Znotiņš and Guntis Barzdiņš. 2020. LVBERT: Transformer-Based Model for Latvian Language Understanding. Frontiers in Artificial Intelligence and Applications 328.