Language models are pretrained probabilistic models of word sequences. They can determine that a string such as briefed reporters on is, in English, more probable than the alternative briefed to reporters, which is grammatically well-formed but far less idiomatic (Jurafsky and Martin 2021: 2). A tool built on such a model can therefore select the more natural sequence. While language models can assign probabilities to simple sequences of words, there are also models that assign probabilities to more complex structures.
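The idea above can be sketched with a minimal bigram model: the probability of a phrase is the product of the conditional probabilities of each word given the previous one, estimated from corpus counts. The three-sentence corpus below is invented for illustration and is not taken from any CLARIN resource.

```python
from collections import Counter

# Toy training corpus (hypothetical, for illustration only).
corpus = [
    "the spokesman briefed reporters on the plan",
    "she briefed reporters on the results",
    "he spoke to reporters after the briefing",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def sequence_probability(phrase: str) -> float:
    """Product of add-one-smoothed bigram probabilities P(w_i | w_{i-1})."""
    words = phrase.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return p

print(sequence_probability("briefed reporters on"))   # higher: seen in the corpus
print(sequence_probability("briefed to reporters"))   # lower: "briefed to" is unseen
```

Even on this tiny corpus, the idiomatic sequence receives a higher probability, which is exactly the signal a downstream tool exploits.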
There are 98 language models in the CLARIN infrastructure, covering the following tool functionalities:
- Morphosyntax
- Machine Translation
- Syntactic Parsing
- Named Entity Recognition
- Lemmatisation
- Baseline Models
- Other
- Contextual Word Embeddings
For comments, changes to the existing content, or the inclusion of new resources, send us an email at resource-families [at] clarin.eu.
Language Models in the CLARIN Infrastructure
Morphosyntax
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian 2.1 (Annotation: morphosyntax) | Bulgarian | The model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.83. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1 (Annotation: morphosyntax) | Croatian | The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croatian 2.1 (Annotation: morphosyntax) | Croatian | The model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.hr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.49. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115 (Annotation: morphosyntax) | Czech | These models were developed for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family). The morphological dictionary is created from the 161115 version of the MorfFlex CZ lexicon and the 1.2 version of the DeriNet lexical network. The PoS tagger is trained on the Prague Dependency Treebank 3.0. The models are available for download from the LINDAT repository. | Download |
POS Tagging and Lemmatization (Czech model) (Annotation: morphosyntax and lemmatisation) | Czech | This model is trained using RobeCzech, the Czech version of BERT, on the Prague Dependency Treebank 3.5. The model is available for download from the LINDAT repository. For the relevant publication, see Vysušilová (2021). | Download |
English Models (Morphium + WSJ) for MorphoDiTa (Annotation: morphosyntax) | English | These models are for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family). The morphological dictionary is created from Morphium and SCOWL (Spell Checker Oriented Word Lists); the PoS tagger is trained on the Wall Street Journal. | Download |
Annotation: morphosyntax | Finnish | This BERT model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. The model is available for download from the Language Bank of Finland. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonian 2.1 (Annotation: morphosyntax) | Macedonian | The model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published) and using the Macedonian CLARIN.SI word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.14. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: morphosyntax | Polish | This is a model for the Liner2.5 tool for the recognition of verbs without explicit subjects. The model is available for download from the CLARIN-PL repository. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1 (Annotation: morphosyntax) | Serbian | The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Croatian hr500k training dataset (to ensure sufficient representation of certain labels) and using the CLARIN.SI-embed.sr word embeddings. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbian 2.1 (Annotation: morphosyntax) | Serbian (non-standard) | The model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and the hr500k training corpus, using the CLARIN.SI-embed.sr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.64. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Slovak MorphoDiTa Models 170914 (Annotation: morphosyntax) | Slovak | These are Slovak models for MorphoDiTa, a tool which provides morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex (SK 170914) and the PoS tagger is trained on automatic translations in the Prague Dependency Treebank 3.0. The models are available for download from the LINDAT repository. | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of standard Slovenian 2.0 (Annotation: morphosyntax) | Slovenian | The model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~98.27. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slovenian 2.1 (Annotation: morphosyntax) | Slovenian (non-standard) | The model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and the Janes-Tag corpus, using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.17. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. flair_eval is trained on SUC3 with Talbanken_SBX_dev as the dev set; its advantage is that it can be evaluated using Talbanken_SBX_test or SIC2. flair_full is trained on SUC3, Talbanken_SBX_test and SIC2, with Talbanken_SBX_dev as the dev set. The models are available for download from the Swedish Language Bank. | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. marmot_eval is trained on SUC3 and the Talbanken_SBX_dev treebank, using Saldo as the dictionary. marmot_full is trained on SUC3, the Talbanken_SBX_dev treebank, and SIC2 (with Saldo as the dictionary). The models are available for download from the Swedish Language Bank. | Download |
Annotation: morphosyntax | Swedish | This is a set of two models. stanza_eval is trained on SUC3 and the Talbanken_SBX_dev treebank. stanza_full is trained on the SUC3, Talbanken_SBX_test and SIC2 sets, with Talbanken_SBX_dev as the dev set. The models are available for download from the Swedish Language Bank. | Download |
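The F1 scores quoted in the table above are the harmonic mean of precision and recall over the predicted labels. A minimal sketch of the computation follows; the counts are invented for illustration and do not come from any of the listed models.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, as reported for e.g. XPOS tagging."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 9683 correct labels, 317 spurious, 317 missed.
print(round(f1_score(9683, 317, 317), 4))  # 0.9683, i.e. an F1 of ~96.83
```

When precision and recall are equal, as in this symmetric example, the F1 score coincides with both.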
Machine Translation
Model | Language | Description | Availability |
---|---|---|---|
WMT21 Marian translation models (ca-ro,it,oc) (Annotation: machine translation) | Catalan, Italian, Occitan, Romanian | This is a translation model from Catalan into Romanian, Italian and Occitan that was part of the submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
WMT21 Marian translation model (ca-oc) (Annotation: machine translation) | Catalan, Occitan | This is a translation model from Catalan to Occitan that was part of the submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. | |
WMT21 Marian translation model (ca-oc) (Annotation: machine translation) | Catalan, Occitan | This is a neural machine translation model for Catalan to Occitan translation and constitutes the primary CUNI submission for the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
WMT21 Marian translation model (ca-oc multi-task) (Annotation: machine translation) | Catalan, Occitan | This is a neural machine translation model for Catalan to Occitan translation. It is a multi-task model that also produces a phonemic transcription of the Catalan source. The model was submitted to the WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task as a CUNI-Contrastive system for Catalan to Occitan. The model is available for download from the LINDAT repository. For the relevant publication, see Jon et al. (2021). | Download |
Annotation: machine translation | Czech, English | These models are for the Neural Monkey toolkit for Czech and English, solving four tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in these tasks. The same models can also be invoked via an online demo. This entry also includes models for automatic news summarization for Czech and English. The Czech models were trained using the SumeCzech dataset, while the English models were trained using the CNN-Daily Mail corpus, using the standard recurrent sequence-to-sequence architecture. The models are available for download from the LINDAT repository. For the relevant publication, see Libovicky et al. (2018). | Download |
CUBBITT Translation Models (en-cs) (v1.0) (Annotation: machine translation) | Czech, English | These English-Czech translation models are used by the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
WMT16 Tuning Shared Task Models (Czech-to-English) (Annotation: machine translation) | Czech, English | These Czech-to-English translation models are trained on the parallel CzEng 1.6 corpus. The data is tokenized with Moses. Alignment is done using fast_align, and the standard Moses pipeline is used for training. The models are available for download from the LINDAT repository. | Download |
CUBBITT Translation Models (en-fr) (v1.0) (Annotation: machine translation) | English, French | These are CUBBITT English-French translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. For the relevant publication, see Popel et al. (2020). | Download |
Translation Models (English-German) (Annotation: machine translation) | English, German | These English-German translation models are used by the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
MCSQ Translation Models (en-de) (v1.0) (Annotation: machine translation) | English, German | These are English-German translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset. The models are available for download from the LINDAT repository. | Download |
GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models (1.0) (Annotation: machine translation) | English, Icelandic | This CLARIN-IS repository entry includes the code and models required to run the GreynirT2T Transformer NMT system for translation between English and Icelandic. The models, along with the code, are available for download from the CLARIN-IS repository. | Download |
CUBBITT Translation Models (en-pl) (v1.0) (Annotation: machine translation) | English, Polish | These are CUBBITT English-Polish translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. For the relevant publication, see Popel et al. (2020). | Download |
Translation Models (en-ru) (v1.0) (Annotation: machine translation) | English, Russian | These are CUBBITT English-Russian translation models available in the LINDAT translation service. The models are available for download from the LINDAT repository. | Download |
MCSQ Translation Models (en-ru) (v1.0) (Annotation: machine translation) | English, Russian | These are English-Russian translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset. The models are available for download from the LINDAT repository. | Download |
Annotation: machine translation | Icelandic, English | These are a variant of the GreynirTranslate mBART25 NMT models for translation between Icelandic and English (1.0), trained with a 40% layer drop. They support inference using every other layer, trading some translation quality for speed. The models are available for download from the CLARIN-IS repository. For the relevant publication, see Simonarson et al. (2021). | Download |
Syntactic Parsing
Model | Language | Description | Availability |
---|---|---|---|
Annotation: syntactic parsing | Afrikaans, Akkadian, Amharic, Ancient Greek (until 1453), Arabic, Armenian, Bambara, Basque, Belarusian, Breton, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Komi-Zyrian, Korean, Latin, Latvian, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Nigerian Pidgin, Northern Kurdish, Northern Sami, Norwegian, Old French (842-ca. 1400), Persian, Polish, Portuguese, Romanian, Buryat, Russian, Sanskrit, Serbian, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Tagalog, Tamil, Telugu, Thai, Turkish, Uighur, Ukrainian, Upper Sorbian, Urdu, Vietnamese, Warlpiri, Yoruba, Yue Chinese | UDify is a single model that jointly predicts Universal Dependencies annotations (UPOS, UFeats, lemmas and dependencies), accepting any of 75 supported languages as input (trained on UD v2.3 with 124 treebanks). For the relevant publication, see Kondratyuk and Straka (2019). | Download |
Universal Dependencies 2.5 Models for UDPipe (Annotation: syntactic parsing) | Afrikaans, Ancient Greek (until 1453), Arabic, Armenian, Basque, Belarusian, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Gambian Wolof, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Literary Chinese, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Northern Sami, Norwegian Bokmål, Norwegian Nynorsk, Old French (842-ca. 1400), Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, Wolof | These models are trained on the Universal Dependencies 2.5 treebanks (94 treebanks of 61 languages). In addition to dependency parsing, the models also perform tokenisation, part-of-speech tagging and lemmatisation. The models are available for download from the LINDAT repository. | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1 (Annotation: syntactic parsing) | Bulgarian | The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The estimated LAS of the parser is ~91.18. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Croatian 2.1 (Annotation: syntactic parsing) | Croatian | The model for UD dependency parsing of standard Croatian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The estimated LAS of the parser is ~87.46. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Slavic Forest, Norwegian Wood (models) (Annotation: syntactic parsing) | Croatian, Norwegian, Slovak | These are models for the dependency parser UDPipe used to produce the authors' final submission to the VarDial 2017 CLP shared task. The scripts and commands used to create the models are part of a separate LINDAT repository entry. The models were trained with UDPipe version 3e65d69 from 3 January 2017; their functionality with newer or older versions of UDPipe is not guaranteed. The models are available for download from the LINDAT repository. For the relevant publication, see Rosa et al. (2017). | Download |
Universal Dependencies 1.2 Models for Parsito (Annotation: syntactic parsing) | English | These are models for the dependency parser Parsito, trained on the Universal Dependencies 1.2 treebanks. | Download |
CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials (Annotation: syntactic parsing) | Multiple languages | These are baseline models for UDPipe (version 1.2 and up), created for the CoNLL 2018 Shared Task in UD Parsing. The models were trained using a custom data split for treebanks where no development data is provided. The models are available for download from the LINDAT repository. | Download |
CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials (Annotation: syntactic parsing) | Multiple languages | These are models for the dependency parser UDPipe, developed as part of the CoNLL 2017 Shared Task in UD Parsing. The models are available for download from the LINDAT repository. | Download |
Dependency parsing models for Polish (Annotation: syntactic parsing) | Polish | These models are trained on version 3.5 of the Polish Dependency Treebank with the publicly available parsing systems MaltParser, MateParser and UDPipe. The models are available for download from the CLARIN-PL repository. For the relevant publication, see Wroblewska and Rybak (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Serbian 2.1 (Annotation: syntactic parsing) | Serbian | The model for UD dependency parsing of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings. The estimated LAS of the parser is ~89.83. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0 (Annotation: syntactic parsing) | Slovenian | The model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~93.89. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0 (Annotation: syntactic parsing) | Slovenian | The model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~91.11. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Dependency parsing model: Stanza (Annotation: syntactic parsing) | Swedish | This is a set of two models that enable dependency parsing of Swedish (in the Mamba-Dep format, the format of TalbankenSBX). The models are available for download from the Swedish Language Bank. | Download |
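The LAS figures reported above are labelled attachment scores: the percentage of tokens whose predicted syntactic head and dependency label both match the gold annotation. A minimal sketch, with invented gold and predicted analyses for a four-token sentence:

```python
def las(gold: list[tuple[int, str]], predicted: list[tuple[int, str]]) -> float:
    """Labelled attachment score: percentage of tokens whose (head, deprel)
    pair matches the gold annotation exactly."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100 * correct / len(gold)

# Hypothetical analyses: each token is (head index, dependency label).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "amod")]
print(las(gold, pred))  # 75.0: three of four tokens match head and label
```

The unlabelled variant (UAS) counts only the head index, so the mislabelled third token above would still score as correct there.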
Named Entity Recognition
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-StanfordNLP model for named entity recognition of standard Bulgarian 1.0 (Annotation: named entity recognition) | Bulgarian | This model for named entity recognition of standard Bulgarian was built with the CLASSLA-StanfordNLP tool by training on the BulTreeBank training corpus and using the CoNLL2017 word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0 (Annotation: named entity recognition) | Croatian | This model for named entity recognition of standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0 (Annotation: named entity recognition) | Croatian (non-standard) | This model for named entity recognition of non-standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus, the ReLDI-NormTagNER-hr corpus and the ReLDI-NormTagNER-sr corpus, using the CLARIN.SI-embed.hr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Czech Models (CNEC) for NameTag (Annotation: named entity recognition) | Czech | These are models for the named entity recognizer NameTag. The models are available for download from the LINDAT repository. | Download |
Annotation: named entity recognition | Czech, Dutch, English, German, Spanish | These models are for NameTag 2, a named entity recognition tool (see also the Named Entity Recognizers Resource Family). The documentation is available separately on the project webpage. The models are available for download from the LINDAT repository. For the relevant publication, see Straková et al. (2019). | Download |
English Model (CoNLL-2003) for NameTag (Annotation: named entity recognition) | English | This is an English model for NameTag, a named entity recognition tool. The model is trained on the CoNLL-2003 training data and recognizes PER, ORG, LOC and MISC named entities. It achieves an F-measure of 84.73 on the CoNLL-2003 test data. The model is available for download from the LINDAT repository. | Download |
Annotation: named entity recognition | Polish | This is a model for the Liner 2.5 tool. The model is available for download from the CLARIN-PL repository. | Download |
Annotation: named entity recognition | Polish | This is a Liner2 model for the recognition of named entities. The model was trained on the NKJP corpus and evaluated in the PolEval 2018 Task 2. The model is available for download from the CLARIN-PL repository. | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Serbian 1.0 (Annotation: named entity recognition) | Serbian | This model for named entity recognition of standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Serbian 1.0 (Annotation: named entity recognition) | Serbian (non-standard) | This model for named entity recognition of non-standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus, the hr500k training corpus, the ReLDI-NormTagNER-sr corpus, and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.sr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0 (Annotation: named entity recognition) | Slovenian | This model for named entity recognition of standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and using the CLARIN.SI-embed.sl word embeddings. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0 (Annotation: named entity recognition) | Slovenian (non-standard) | This model for named entity recognition of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and the Janes-Tag training corpus, using the CLARIN.SI-embed.sl word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 (Annotation: named entity recognition) | Slovenian | This is a model for Slovenian named entity recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library, and is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0. The model was trained on the SUK 1.0 training corpus. The source code of the model is available in a GitHub repository. The model is available for download from the CLARIN.SI repository. | Download |
Lemmatisation
Model | Language | Description | Availability |
---|---|---|---|
The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1 (Annotation: lemmatisation) | Bulgarian | The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1 (Annotation: lemmatisation) | Croatian | The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the hrLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1 (Annotation: lemmatisation) | Croatian (non-standard) | The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the hrLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.23. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1 (Annotation: lemmatisation) | Macedonian | The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published). The estimated F1 of the lemma annotations is ~98.81. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of standard Serbian 2.1 (Annotation: lemmatisation) | Serbian | The model for lemmatisation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Serbian 2.1 (Annotation: lemmatisation) | Serbian (non-standard) | The model for lemmatisation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.92. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 2.0 (Annotation: lemmatisation) | Slovenian | The model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated F1 of the lemma annotations is ~99.7. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
The CLASSLA-Stanza model for lemmatisation of non-standard Slovenian 2.1 (Annotation: lemmatisation) | Slovenian (non-standard) | The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and the Janes-Tag corpus, using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~91.45. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić and Dobrovoljc (2019). | Download |
Annotation: lemmatisation |
Swedish |
This model enables lemmatisation of Swedish text following the SUC3 standard. The model is available for download from the Swedish Language Bank. |
Download |
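Several of the non-standard models above augment their training data by repeating parts of the corpora with diacritics removed. As a rough illustration of such de-diacritisation (a minimal sketch; the actual CLASSLA-Stanza preprocessing may differ, and stroked letters such as đ that have no combining mark are not affected):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks, recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def augment_with_dediacritised(sentences: list[str]) -> list[str]:
    # Keep each original sentence and add a de-diacritised copy when it differs.
    augmented = []
    for sent in sentences:
        augmented.append(sent)
        bare = strip_diacritics(sent)
        if bare != sent:
            augmented.append(bare)
    return augmented
```

For example, `strip_diacritics("šećer")` yields `"secer"`, so a tagger or lemmatiser trained on the augmented data also sees the diacritic-free spellings common in non-standard text.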
Baseline
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: Baseline |
Czech |
These models are trained on CERED, a dataset created by distant supervision on Czech Wikipedia and Wikidata, and recognize a subset of Wikidata relations. The model is available for download from the LINDAT repository. |
Download |
Annotation: Baseline |
Latvian |
This model was trained with the original BERT implementation on the TensorFlow machine-learning platform, using the whole-word masking and next sentence prediction objectives. It uses the BERT configuration with 12 layers, 768 hidden units, 12 attention heads, a sequence length of 128, a mini-batch size of 128 and a 32,000-token vocabulary. The model is available for download from the CLARIN-LV repository. For the relevant publication, see Znotiņš and Barzdiņš (2020) |
Download |
Annotation: Baseline |
Lithuanian, Latvian, English |
This BERT-like model represents tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. The corpora used for training the model have 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian and 0.53 billion are Latvian. The model is available for download from the CLARIN-LT repository. |
Download |
Annotation: Baseline |
Portuguese |
This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for the American variant of Portuguese spoken in Brazil, is trained on the brWaC dataset, and is a larger version of the Albertina PT-BR base model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese | This model is for Portuguese spoken in Brazil. It is based on the Transformer neural architecture and is developed over the DeBERTa model. | Download |
Annotation: Baseline |
Portuguese |
This is a model for Portuguese spoken in Brazil, trained on data sets other than brWaC. It is developed over the DeBERTa model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for European Portuguese and is a larger version of the Albertina PT-PT base model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model is for European Portuguese. It is based on the Transformer neural architecture and is developed over the DeBERTa model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model, which is for Portuguese spoken in Brazil, is a decoder of the GPT family, based on the Transformer neural architecture and developed over the Pythia model. The model is available for download from Hugging Face. |
Download |
Annotation: Baseline |
Portuguese |
This model, which is for European Portuguese, is a decoder of the GPT family, based on the Transformer neural architecture and developed over the Pythia model. The model is available for download from Hugging Face. |
Download |
BERTimbau - Portuguese BERT-Base language model Annotation: Baseline |
Portuguese |
This is a BERT model trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. The model is available for download from the PORTULAN repository. |
Download |
BERTimbau - Portuguese BERT-Large language model Annotation: Baseline |
Portuguese |
This is a BERT model trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. The model is available for download from the PORTULAN repository. |
Download |
Portuguese RoBERTa language model Annotation: Baseline |
Portuguese |
This is a pre-trained RoBERTa model for Portuguese, with 6 layers and 12 attention heads, totalling 68M parameters. Pre-training was done on 10 million Portuguese sentences and 10 million English sentences from the OSCAR corpus. The model is available for download from the PORTULAN repository. |
Download |
Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0 Annotation: Baseline |
Slovenian |
FRENK-MMC-RTV is a dataset of moderated newspaper comments from the website rtvslo.si, with metadata on the time of publishing, user identifier, thread identifier and whether the comment was deleted by the moderators or not. The full text of each comment is encrypted via a character-replacement method so that the comments are not readable by humans; basic punctuation is not encrypted, in order to enable tokenization. The main use of this dataset is experiments on automating comment moderation. For real-world usage, a fastText classification model trained on non-encrypted data is made available as well. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić et al. (2018) |
Download |
ccGigafida ARPA language model 1.0 Annotation: Baseline |
Slovenian |
This model was created from the ccGigafida written corpus of Slovenian using the KenLM algorithm in the Moses machine translation framework. It is a general language model of contemporary standard Slovenian language that can be used as a language model in statistical machine translation systems. The model is available for download from the CLARIN.SI repository. |
Download |
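ARPA-style n-gram models such as the ccGigafida model estimate sequence probabilities from corpus counts. A toy maximum-likelihood bigram sketch of the idea (production toolkits like KenLM add smoothing and back-off, which this deliberately omits):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        context_counts.update(tokens[:-1])          # every token acts as a context once
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {pair: bigram_counts[pair] / context_counts[pair[0]] for pair in bigram_counts}

def sentence_prob(model, sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for pair in zip(tokens, tokens[1:]):
        prob *= model.get(pair, 0.0)  # unseen bigrams get probability 0 (no smoothing)
    return prob
```

Trained on a corpus where "briefed reporters on" is more frequent than "briefed to reporters", such a model assigns the first string the higher probability, which is exactly the selection behaviour described in the introduction above.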
Other
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: normalization |
Czech |
These models are for the statistical spellchecker Korektor 2. The models can either perform spellchecking and grammar-checking, or only generate diacritical marks. The models are available for download from the LINDAT repository. |
Download |
Sentiment Analysis (Czech Model) Annotation: sentiment analysis |
Czech |
These models are trained on data from the following sources: Mall (product reviews), CSFD (movie reviews), Facebook, and joint data from all three datasets (data available here), using RobeCzech, the Czech version of BERT. For the relevant publication, see Vysušilová (2021) |
Download |
Model weights for a study of commonsense reasoning Annotation: commonsense reasoning |
English |
This resource contains model weights for five Transformer-based models: RoBERTa, GPT-2, T5, BART and COMET. These models were implemented using HuggingFace and fine-tuned on the following four commonsense reasoning tasks: Argument Reasoning Comprehension Task (ARCT), AI2 Reasoning Challenge (ARC), Physical Interaction Question Answering (PIQA) and CommonsenseQA (CSQA). The models are available for download from the PORTULAN repository. |
Download |
RÚV-DI Speaker Diarization v5 models (21.05) Annotation: diarization |
Icelandic |
These models are trained on the Althingi Parliamentary Speech corpus hosted by CLARIN-IS. The models use MFCCs, x-vectors, PLDA and AHC. The models are available for download from the CLARIN-IS repository. |
Download |
Models for automatic g2p for Icelandic (20.10) Annotation: phonemic transcription |
Icelandic |
These are grapheme-to-phoneme models for Icelandic, trained on an encoder-decoder LSTM neural network. The models are delivered with scripts for automatic transcription of Icelandic in the standard pronunciation variant, as well as the northern, north-eastern, and southern variants. To run the scripts, the user needs to install Fairseq. For the relevant publication, see Gorman et al. (2020) |
Download |
Annotation: temporal expressions |
Polish |
This is a model for the Liner2.5 tool for the recognition and normalization of temporal expressions. The model is available for download from the CLARIN-PL repository. |
Download |
Annotation: event mentions |
Polish |
This is a model for the Liner2.5 tool for the recognition of event mentions. The model is available for download from the CLARIN-PL repository. |
Download |
PyTorch model for Slovenian Coreference Resolution Annotation: coreference resolution |
Slovenian |
This is a Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with this code. The model is based on the Slovenian CroSloEngual BERT 1.1 model. It was trained on the SUK 1.0 training corpus, specifically the SentiCoref subcorpus. This resource is available for download from the CLARIN.SI repository. For the relevant publication, see Klemen & Žitnik (2022) |
Download |
Face-domain-specific automatic speech recognition models Annotation: face-domain-specific automatic speech recognition |
Slovenian |
This resource contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit, including the acoustic model, language model, and other relevant files, along with the scripts and configuration files needed to use them. The acoustic model was trained using the relevant Kaldi ASR tools and the Artur speech corpus (audio, transcriptions). The language model was trained on domain-specific text data involving face descriptions, obtained by translating the Face2Text English dataset into Slovenian. These models, combined with other necessary files such as the HCLG.fst and decoding scripts, enable the implementation of face-domain-specific ASR applications. This resource is available for download from the CLARIN.SI repository. |
Download |
The CLASSLA-Stanza model for semantic role labeling of standard Slovenian 2.0 Annotation: semantic role labeling |
Slovenian |
The model for semantic role labeling of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings extended with the MaCoCu-sl Slovene web corpus. The estimated F1 of the semantic role annotations is ~76.24. The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ljubešić & Dobrovoljc (2019) |
Download |
Contextual Word Embeddings
Corpus | Language | Description | Availability |
---|---|---|---|
Annotation: word embeddings |
Croatian, English, Slovenian |
This is a trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool representing words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). The model is available for download from the CLARIN.SI repository. For the relevant publication, see Ulčar and Robnik-Šikonja (2020) |
Download |
ELMo embeddings models for seven languages Annotation: word embeddings |
Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, Swedish |
This model produces contextual word embeddings. It was trained on large monolingual corpora for 7 languages, for approximately 10 epochs per language. The corpora used in training range from over 270 million tokens for Latvian to almost 2 billion tokens for Croatian. The approximately 1 million most common tokens were provided as the vocabulary during training for each language model. The model can also handle out-of-vocabulary words, since the neural network input is at the character level. The model is available for download from the CLARIN.SI repository. |
Download |
Annotation: Word embeddings |
Portuguese |
This model represents tokens as contextual word embeddings for Portuguese. It was trained on a corpus of 2 billion tokens and achieved state-of-the-art results on multiple lexical semantic tasks. The model is available for download from the PORTULAN repository. |
Download |
Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0 Annotation: word embeddings |
Slovenian | The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. | |
Word Embeddings trained on English Wikipedia Annotation: word embeddings |
Swedish |
This is a set of contextual word embeddings. The models are available for download from the Swedish Language Bank. |
Download |
Word embeddings CLARIN.SI-embed Annotation: word embeddings |
Bulgarian, Croatian, Macedonian, Serbian, Slovenian |
This is a set of word embeddings for 5 languages. The models are available for download from the CLARIN.SI repository. |
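Whether static (such as CLARIN.SI-embed) or contextual (BERT, ELMo, RoBERTa), word embeddings are typically compared via cosine similarity, so that semantically related words receive similar vectors. A minimal sketch with hypothetical toy vectors:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal directions score 0.0; downstream tasks such as synonym detection or nearest-neighbour lookup rank candidate words by this score.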
References
[Gorman et al. 2020] Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In: Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 40–50.
[Jon et al. 2021] Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, and Ondřej Bojar. 2021. CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task, arXiv pre-print.
[Jurafsky and Martin 2021] Daniel Jurafsky and James H. Martin. 2021. Speech and Language Processing.
[Kondratyuk and Straka 2019] Dan Kondratyuk and Milan Straka. 2019. 75 Languages, 1 Model: Parsing Universal Dependencies Universally. arXiv pre-print.
[Libovicky et al. 2018] Jindřich Libovický, Rudolf Rosa, Jindřich Helcl, and Martin Popel. 2018. Solving Three Czech NLP Tasks End-to-End with Neural Models. In: CEUR Workshop Proceedings, volume 2203, 138–143.
[Ljubešić and Dobrovoljc 2019] Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 29–34.
[Ljubešić et al. 2018] Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2018. Datasets of Slovene and Croatian Moderated News Comments. In: Proceedings of the 2nd Workshop on Abusive Language Online, 124–131.
[Popel et al. 2020] Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications 11.
[Rosa et al. 2017] Rudolf Rosa, Daniel Zeman, David Mareček, and Zdeněk Žabokrtský. 2017. Slavic Forest, Norwegian Wood. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, 210–219.
[Símonarson et al. 2021] Haukur Barri Símonarson, Vésteinn Snæbjarnarson, Pétur Orri Ragnarsson, Haukur Páll Jónsson, and Vilhjálmur Þorsteinsson. 2021. Miðeind's WMT 2021 submission. arXiv pre-print.
[Straková et al. 2019] Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5326–5331.
[Ulčar and Robnik-Šikonja 2020] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv pre-print.
[Vysušilová 2021] Petra Vysušilová. 2021. Czech NLP with Contextualized Embeddings. Diploma thesis.
[Wróblewska and Rybak 2019] Alina Wróblewska and Piotr Rybak. 2019. Dependency parsing of Polish. Poznan Studies in Contemporary Linguistics.
[Znotiņš and Barzdiņš 2020] Artūrs Znotiņš and Guntis Barzdiņš. 2020. LVBERT: Transformer-Based Model for Latvian Language Understanding. Frontiers in Artificial Intelligence and Applications 328.