
Language Models

Language models are pretrained probabilistic models of word sequences. They are used to determine that a language string such as "briefed reporters on" is, in the case of English, more probable than the alternative string "briefed to reporters", which is grammatically well-formed but significantly less idiomatic (Jurafsky and Martin 2021: 2). This allows a tool trained on the model to select the correct sequence. While language models can assign probabilities over simple sequences of words, there also exist models in which probabilities are assigned to more complex structures.
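As a rough illustration of how a model can prefer one word sequence over another, the following is a minimal sketch of a bigram model with add-one smoothing. The tiny corpus, the function names and the choice of smoothing are all assumptions made for this example; none of them come from Jurafsky and Martin or from any model listed on this page.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sequence, unigrams, bigrams):
    """Add-one smoothed log-probability of a word sequence (no boundary markers)."""
    vocab_size = len(unigrams)
    tokens = sequence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # P(word | prev) with add-one smoothing; unseen pairs get a small non-zero mass.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Toy training data, invented for this example.
corpus = [
    "the minister briefed reporters on the plan",
    "officials briefed reporters on the budget",
    "she talked to reporters yesterday",
]
unigrams, bigrams = train_bigram(corpus)
idiomatic = log_prob("briefed reporters on", unigrams, bigrams)
alternative = log_prob("briefed to reporters", unigrams, bigrams)
print(idiomatic > alternative)
```

On this toy corpus the idiomatic string receives the higher score because both of its bigrams were observed in training, while "briefed to" was not.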

There are 98 language models in the CLARIN infrastructure for the training of the following tool functionalities:

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Language Models in the CLARIN Infrastructure

Morphosyntax

Corpus Language Description Availability

The CLASSLA-Stanza model for morphosyntactic annotation of standard Bulgarian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Bulgarian

The model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.83.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Croatian

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Croatian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Croatian

The model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.hr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.49.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download
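The diacritic augmentation mentioned for the non-standard models can be pictured with a short sketch. It assumes that "repeating parts of the corpora with diacritics removed" means appending diacritic-free copies of some sentences; which parts were repeated and in what proportion is not stated here, so the `every` parameter, the helper names and the sample sentences are hypothetical.

```python
import unicodedata

# Letters that do not decompose under NFD and need an explicit mapping (assumed list).
EXTRA_MAP = str.maketrans({"đ": "d", "Đ": "D"})

def strip_diacritics(text):
    """Remove combining marks (e.g. 'š' becomes 's') and map 'đ' to 'd'."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA_MAP))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def augment(sentences, every=2):
    """Return the corpus plus diacritic-free copies of every `every`-th sentence."""
    copies = [strip_diacritics(s) for s in sentences[::every]]
    # Keep only copies that actually differ from a sentence already in the corpus.
    originals = set(sentences)
    return sentences + [c for c in copies if c not in originals]

# Invented sample sentences, not taken from hr500k or ReLDI-NormTagNER-hr.
corpus = ["Išao sam kući.", "Vidimo se sutra.", "Đak čita knjigu."]
augmented = augment(corpus)
print(augmented)
```

In a real annotated corpus the labels of each copied sentence would be carried over unchanged, so that the tagger sees the same annotation for both spellings.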

Czech Models (MorfFlex CZ 161115 + PDT 3.0) for MorphoDiTa 161115

Annotation: morphosyntax
Licence: CC BY-NC-SA 4.0

Czech

These models were developed for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family). The morphological dictionary is created from the 161115 version of the MorfFlex CZ lexicon and the 1.2 version of the DeriNet lexical network. The PoS tagger is trained on Prague Dependency Treebank 3.0.

The models are available for download from the LINDAT repository.

Download

POS Tagging and Lemmatization (Czech model)

Annotation: morphosyntax and lemmatisation
Licence: CC BY-NC-SA 4.0

Czech

This model is built on RobeCzech, a Czech RoBERTa model (a BERT-style language model), and is trained on the Prague Dependency Treebank 3.5.

The model is available for download from the LINDAT repository.

For the relevant publication, see Vysušilová (2021)

Download

English Models (Morphium + WSJ) for MorphoDiTa

Annotation: morphosyntax
Licence: CC BY-NC-SA 3.0

English

These models are for MorphoDiTa, which performs morphological analysis, morphological generation and part-of-speech tagging (see also the PoS-taggers and lemmatizers Resource Family).

The morphological dictionary is created from Morphium and SCOWL (Spell Checker Oriented Word Lists), and the PoS tagger is trained on the Wall Street Journal.

Download

FinBERT

Annotation: morphosyntax
Licence: CC BY 4.0

Finnish

This BERT model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.

The model is available for download from the Language Bank of Finland.

Download

The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Macedonian

The model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published) and using the Macedonian CLARIN.SI word embeddings. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.14.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Liner2.5 model Minos

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Polish

This is a model for the Liner2.5 tool for the recognition of verbs without explicit subjects.

The model is available for download from the CLARIN-PL repository.

Download

The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Serbian

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Croatian hr500k training dataset to ensure sufficient representation of certain labels, and using the CLARIN.SI-embed.sr word embeddings. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Serbian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Serbian (non-standard)

The model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and the hr500k training corpus and using the CLARIN.SI-embed.sr word embeddings. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.64.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Slovak MorphoDiTa Models 170914

Annotation: morphosyntax
Licence: CC BY-NC-SA 4.0

Slovak

These are Slovak models for MorphoDiTa, a tool which provides morphological analysis, morphological generation and part-of-speech tagging. The morphological dictionary is created from MorfFlex (SK 170914) and the PoS tagger is trained on automatic translations in Prague Dependency Treebank 3.0.

The models are available for download from the LINDAT repository.

Download

The CLASSLA-Stanza model for morphosyntactic annotation of standard Slovenian 2.0

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Slovenian

The model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~98.27.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slovenian 2.1

Annotation: morphosyntax
Licence: CC BY-SA 4.0

Slovenian (non-standard)

The model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and on the Janes-Tag corpus using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.17.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

POS-tagging model: Flair

Annotation: morphosyntax
Licence: CC BY 4.0

Swedish

This is a set of 2 models. flair_eval is trained on SUC3 with Talbanken_SBX_dev as dev set. The advantage of this model is that it can be evaluated using Talbanken_SBX_test or SIC2. flair_full is trained on SUC3, Talbanken_SBX_test, and SIC2, with Talbanken_SBX_dev as dev set.

The models are available for download from the Swedish Language Bank.

Download

POS-tagging model: Marmot

Annotation: morphosyntax
Licence: CC BY 4.0

Swedish

This is a set of 2 models. marmot_eval is trained on SUC3 and the Talbanken_SBX_dev treebank, using Saldo as dictionary. marmot_full is trained on SUC3, the Talbanken_SBX_dev treebank, and SIC2 (with Saldo as dictionary).

The models are available for download from the Swedish Language Bank.

Download

POS-tagging model: Stanza

Annotation: morphosyntax
Licence: CC BY 4.0

Swedish

This is a set of 2 models. stanza_eval is trained on SUC3 and the Talbanken_SBX_dev treebank. stanza_full is trained on the SUC3, Talbanken_SBX_test, and SIC2 sets, with Talbanken_SBX_dev as dev set.

The models are available for download from the Swedish Language Bank.

Download

Machine Translation

Corpus Language Description Availability

WMT21 Marian translation models (ca-ro,it,oc)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Catalan, Italian, Occitan, Romanian

This is a translation model from Catalan into Romanian, Italian, and Occitan that was part of the submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

The model is available for download from the LINDAT repository.

For the relevant publication, see Jon et al. (2021)

Download

WMT21 Marian translation model (ca-oc)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Catalan, Occitan

This is a translation model from Catalan into Occitan that was part of the submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

The model is available for download from the LINDAT repository.

WMT21 Marian translation model (ca-oc)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Catalan, Occitan

This is a neural machine translation model for Catalan to Occitan translation and constitutes the primary CUNI submission for WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task.

The model is available for download from the LINDAT repository.

For the relevant publication, see Jon et al. (2021)

Download

WMT21 Marian translation model (ca-oc multi-task)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Catalan, Occitan

This is a neural machine translation model for Catalan to Occitan translation. It is a multi-task model, also producing phonemic transcription of the Catalan source. The model was submitted to WMT21 Multilingual Low-Resource Translation for Indo-European Languages Shared Task as a CUNI-Contrastive system for Catalan to Occitan.

The model is available for download from the LINDAT repository.

For the relevant publication, see Jon et al. (2021)

Download

Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Czech, English

These models are for the Neural Monkey toolkit for Czech and English, solving four tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The same models can also be invoked via an online demo.

This entry also includes models for automatic news summarization for Czech and English. The Czech models were trained using the SumeCzech dataset, while the English models were trained using the CNN-Daily Mail corpus, using the standard recurrent sequence-to-sequence architecture.

The models are available for download from the LINDAT repository.

For the relevant publication, see Libovicky et al. (2018)

Download

CUBBITT Translation Models (en-cs) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Czech, English

These English-Czech translation models are used by the LINDAT translation service.

The models are available for download from the LINDAT repository.

Download

WMT16 Tuning Shared Task Models (Czech-to-English)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

Czech, English

These Czech-to-English translation models are trained on the parallel CzEng 1.6 corpus. The data is tokenized with Moses. Alignment is done using fast_align, and the standard Moses pipeline is used for training.

The models are available for download from the LINDAT repository.

Download

CUBBITT Translation Models (en-fr) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, French

These are CUBBITT English-French translation models available in the LINDAT translation service.

The models are available for download from the LINDAT repository.

For the relevant publication, see Popel et al. (2020)

Download

Translation Models (English-German)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, German

These English-German translation models are used by the LINDAT translation service.

The models are available for download from the LINDAT repository.

Download

MCSQ Translation Models (en-de) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, German

These are English-German translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset.

The models are available for download from the LINDAT repository.

Download

GreynirT2T Serving - En--Is NMT Inference and Pre-trained Models (1.0)

Annotation: machine translation
Licence: The MIT License

English, Icelandic

This CLARIN-IS repository entry includes code and models required to run the GreynirT2T Transformer NMT system for translation between English and Icelandic.

The models along with the code are available for download from the CLARIN-IS repository.

Download

CUBBITT Translation Models (en-pl) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, Polish

These are CUBBITT English-Polish translation models available in the LINDAT translation service.

The models are available for download from the LINDAT repository.

For the relevant publication, see Popel et al. (2020)

Download

Translation Models (en-ru) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, Russian

These are CUBBITT English-Russian translation models available in the LINDAT translation service.

The models are available for download from the LINDAT repository.

Download

MCSQ Translation Models (en-ru) (v1.0)

Annotation: machine translation
Licence: CC BY-NC-SA 4.0

English, Russian

These are English-Russian translation models available in the LINDAT translation service. The models are trained using the MCSQ social surveys dataset.

The models are available for download from the LINDAT repository.

Download

GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English (1.0)

Annotation: machine translation
Licence: CC BY 4.0

Icelandic, English

These are a variant of GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English (1.0), trained with a 40% layer drop. They are suitable for inference using only every other layer, which improves inference speed at the cost of lower translation performance.

These models are available for download from the repository of CLARIN-IS.

For the relevant publication, see Simonarson et al. (2021)

Download

Syntactic Parsing

Corpus Language Description Availability

UDify Pretrained Model

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Afrikaans, Akkadian, Amharic, Ancient Greek (until 1453), Arabic, Armenian, Bambara, Basque, Belarusian, Breton, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish, French, Galician, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Komi-Zyrian, Korean, Latin, Latvian, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Nigerian Pidgin, Northern Kurdish, Northern Sami, Norwegian, Old French (842-ca. 1400), Persian, Polish, Portuguese, Romanian, Buryat, Russian, Sanskrit, Serbian, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language, Tagalog, Tamil, Telugu, Thai, Turkish, Uighur, Ukrainian, Upper Sorbian, Urdu, Vietnamese, Warlpiri, Yoruba, Yue Chinese

UDify is a single model that parses Universal Dependencies (UPOS, UFeats, Lemmas, Deps) jointly, accepting any of 75 supported languages as input (trained on UD v2.3 with 124 treebanks).

For the relevant publication, see Kondratyuk and Straka (2019)

Download

Universal Dependencies 2.5 Models for UDPipe

Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Afrikaans, Ancient Greek (until 1453), Arabic, Armenian, Basque, Belarusian, Bulgarian, Catalan, Chinese, Church Slavonic, Coptic, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Gambian Wolof, German, Gothic, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Literary Chinese, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Northern Sami, Norwegian Bokmål, Norwegian Nynorsk, Old French (842-ca. 1400), Old Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, Wolof

These models are for the Universal Dependencies 2.5 treebanks (94 treebanks of 61 languages). In addition to dependency parsing, the models also perform tokenisation, part-of-speech tagging and lemmatisation.

The models are available for download from the LINDAT repository.

Download

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Bulgarian

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the BulTreeBank training corpus and using the CLARIN.SI-embed.bg word embeddings. The estimated LAS of the parser is ~91.18.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download
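The parsing models in this section report LAS, the labelled attachment score: the percentage of tokens for which both the predicted head and the predicted dependency relation match the gold annotation. The following is a minimal sketch of that calculation, assuming gold tokenisation and a simple list-of-pairs representation; the function name and the example sentence are invented for illustration and this is not the evaluation script used for the reported scores.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for aligned lists of (head, deprel) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n

# Four-token toy sentence: the last token has the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Here all four heads are correct, so the unlabelled score is 100, while the one wrong label brings LAS down to 75.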

The CLASSLA-Stanza model for UD dependency parsing of standard Croatian 2.1

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Croatian

The model for UD dependency parsing of standard Croatian was built with the CLASSLA-Stanza tool by training on the UD-parsed portion of the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings. The estimated LAS of the parser is ~87.46.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Slavic Forest, Norwegian Wood (models)

Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Croatian, Norwegian, Slovak

These are models for the dependency parser UDPipe used to produce the authors' final submission to the Vardial 2017 CLP shared task. The scripts and commands used to create the models are part of a separate LINDAT repository entry. The models were trained with UDPipe version 3e65d69 from 3 January 2017; their functionality with newer or older versions of UDPipe is not guaranteed.

The models are available for download from the LINDAT repository.

For the relevant publication, see Rosa et al. (2017)

Download

Universal Dependencies 1.2 Models for Parsito

Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

English

These are models for the dependency parser Parsito. They are trained on Universal Dependencies 1.2 Treebanks.

Download

CoNLL 2018 Shared Task - UDPipe Baseline Models and Supplementary Materials

Annotation: syntactic parsing
Licence: Universal Dependencies v2.2

Multiple languages

These are baseline models for UDPipe (version 1.2 and up), created for the CoNLL 2018 Shared Task in UD Parsing. The models were trained using a custom data split for treebanks where no development data is provided.

The models are available for download from the LINDAT repository.

Download

CoNLL 2017 Shared Task - UDPipe Baseline Models and Supplementary Materials

Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Multiple languages

These are models for the dependency parser UDPipe, developed as part of the CoNLL 2017 Shared Task in UD Parsing.

The models are available for download from the LINDAT repository.

Download

Dependency parsing models for Polish

Annotation: syntactic parsing
Licence: CC BY-NC-SA 4.0

Polish

These models are trained on the 3.5 version of the Polish Dependency Treebank with the publicly available parsing systems: MaltParser, MateParser, and UDPipe.

The models are available for download from the CLARIN-PL repository.

For the relevant publication, see Wroblewska and Rybak (2019)

Download

The CLASSLA-Stanza model for UD dependency parsing of standard Serbian 2.1

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Serbian

The model for UD dependency parsing of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings. The estimated LAS of the parser is ~89.83.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Slovenian

The model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~93.89.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.0

Annotation: syntactic parsing
Licence: CC BY-SA 4.0

Slovenian

The model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated LAS of the parser is ~91.11.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Dependency parsing model: Stanza

Annotation: syntactic parsing
Licence: CC BY 4.0

Swedish

This is a set of 2 models that enable the dependency parsing of Swedish (in the Mamba-Dep format, the format of TalbankenSBX).

The models are available for download from the Swedish Language Bank.

Download

Named Entity Recognition

Corpus Language Description Availability

The CLASSLA-StanfordNLP model for named entity recognition of standard Bulgarian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Bulgarian

This model for named entity recognition of standard Bulgarian was built with the CLASSLA-StanfordNLP tool by training on the BulTreeBank training corpus and using the CoNLL2017 word embeddings.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Croatian

This model for named entity recognition of standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus and using the CLARIN.SI-embed.hr word embeddings.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Croatian (non-standard)

This model for named entity recognition of non-standard Croatian was built with the CLASSLA-StanfordNLP tool by training on the hr500k training corpus, the ReLDI-NormTagNER-hr corpus and the ReLDI-NormTagNER-sr corpus, using the CLARIN.SI-embed.hr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Czech Models (CNEC) for NameTag

Annotation: named entity recognition
Licence: CC BY-NC-SA 3.0

Czech

These are models for the named entity recognizer NameTag.

The models are available for download from the LINDAT repository.

Download

NameTag 2 Models

Annotation: named entity recognition
Licence: CC BY-NC-SA 4.0

Czech, Dutch, English, German, Spanish

These models are for NameTag 2, a named entity recognition tool (see also the Named Entity Recognizers Resource Family). The documentation is available separately on the project webpage.

The models are available for download from the LINDAT repository.

For the relevant publication, see Straková et al. (2019)

Download

English Model (CoNLL-2003) for NameTag

Annotation: named entity recognition
Licence: CC BY-NC-SA 4.0

English

This is an English model for NameTag, a named entity recognition tool. The model is trained on CoNLL-2003 training data and recognizes PER, ORG, LOC and MISC named entities. It achieves an F-measure of 84.73 on the CoNLL-2003 test data.

The model is available for download from the LINDAT repository.

Download
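F-measures for named entity recognition are usually computed over whole entity spans, so a prediction counts only if both its boundaries and its type match the gold annotation. The following is a minimal sketch of such a calculation, assuming BIO-encoded tags and exact span matching; the function names and the example tags are invented here, and this is not the official CoNLL-2003 scorer.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                spans.append((start, i, etype))
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return set(spans)

def f_measure(gold_tags, pred_tags):
    """Entity-level F-measure (as a percentage) with exact span and type matching."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

# The prediction finds the PER and ORG entities but misses the LOC entity.
gold = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "O"]
score = f_measure(gold, pred)
print(round(score, 2))
```

In this example precision is perfect and recall is two out of three, which gives an F-measure of about 80.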

Liner2.5 model NER

Annotation: named entity recognition
Licence: GNU LGPL 3.0

Polish

This is a model for the Liner2.5 tool.

The model is available for download from the CLARIN-PL repository.

Download

Liner2.6 model NER NKJP

Annotation: named entity recognition
Licence: GNU GPL3

Polish

This is a Liner2 model for the recognition of named entities. The model was trained on the NKJP corpus and evaluated in the PolEval 2018 Task 2.

The model is available for download from the CLARIN-PL repository.

Download

The CLASSLA-StanfordNLP model for named entity recognition of standard Serbian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Serbian

This model for named entity recognition of standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus and using the CLARIN.SI-embed.sr word embeddings.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for named entity recognition of non-standard Serbian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Serbian (non-standard)

This model for named entity recognition of non-standard Serbian was built with the CLASSLA-StanfordNLP tool by training on the SETimes.SR training corpus, the hr500k training corpus, the ReLDI-NormTagNER-sr corpus, and the ReLDI-NormTagNER-hr corpus, using the CLARIN.SI-embed.sr word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Slovenian

This model for named entity recognition of standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and using the CLARIN.SI-embed.sl word embeddings.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Slovenian (non-standard)

This model for named entity recognition of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool by training on the ssj500k training corpus and the Janes-Tag training corpus, using the CLARIN.SI-embed.sl word embeddings. The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

Annotation: named entity recognition
Licence: CC BY-SA 4.0

Slovenian

This is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for usage with the HuggingFace transformers library.

The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0. The model was trained on the SUK 1.0 training corpus. The source code of the model is available in a GitHub repository.

The model is available for download from the CLARIN.SI repository.

Download

Lemmatisation

Corpus Language Description Availability

The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Bulgarian

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool by training on the BulTreeBank training corpus and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for lemmatisation of standard Croatian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Croatian

The model for lemmatisation of standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and using the hrLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Croatian (non-standard)

The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool by training on the hr500k training corpus and the ReLDI-NormTagNER-hr corpus, using the hrLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.23.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download
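The diacritic augmentation mentioned for the non-standard models can be illustrated with a minimal sketch. It assumes that augmentation simply appends diacritic-free copies of a random share of the sentences. The function names, the `fraction` parameter, and the mapping of `đ` to `d` are illustrative assumptions, not the CLASSLA-Stanza training code.

```python
import random
import unicodedata

# "đ" has no Unicode decomposition, so it needs an explicit mapping.
# (Assumption: it is reduced to a plain "d" when diacritics are dropped.)
_EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'š' -> 's', 'č' -> 'c', 'ž' -> 'z'."""
    decomposed = unicodedata.normalize("NFD", text.translate(_EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def augment(sentences, fraction=0.5, seed=0):
    """Return the corpus plus diacritic-free copies of a random subset."""
    rng = random.Random(seed)
    k = int(len(sentences) * fraction)
    sampled = rng.sample(sentences, k)
    return list(sentences) + [strip_diacritics(s) for s in sampled]


# Toy corpus, invented for illustration only.
corpus = ["Išao sam u školu.", "Čitam knjigu.", "Đak uči.", "Žena pjeva."]
augmented = augment(corpus, fraction=0.5)
print(len(corpus), len(augmented))
```

A model trained on such a corpus sees both "školu" and "skolu" with the same lemma, which is the intended effect for text written without diacritics.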

The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Macedonian

The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published). The estimated F1 of the lemma annotations is ~98.81.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for lemmatisation of standard Serbian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Serbian

The model for lemmatisation of standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. The estimated F1 of the lemma annotations is ~98.02.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for lemmatisation of non-standard Serbian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Serbian (non-standard)

The model for lemmatisation of non-standard Serbian was built with the CLASSLA-Stanza tool by training on the SETimes.SR training corpus combined with the Serbian non-standard training corpus ReLDI-NormTagNER-sr and using the srLex inflectional lexicon. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.92.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 2.0

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Slovenian

The model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. The estimated F1 of the lemma annotations is ~99.7.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

The CLASSLA-Stanza model for lemmatisation of non-standard Slovenian 2.1

Annotation: lemmatisation
Licence: CC BY-SA 4.0

Slovenian (non-standard)

The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and on the Janes-Tag corpus using the CLARIN.SI-embed.sl word embeddings expanded with the MaCoCu-sl Slovene web corpus. These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~91.45.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Lemmatization model: Stanza

Annotation: lemmatisation
Licence: CC BY 4.0

Swedish

This model enables lemmatisation of Swedish text following the SUC3 standard.

The models are available for download from the Swedish Language Bank.

Download

Baseline

Corpus Language Description Availability

CERED baseline models

Annotation: Baseline
Licence: CC BY-NC-SA 4.0

Czech

These models are trained on CERED, a dataset created by distant supervision on Czech Wikipedia and Wikidata, and recognize a subset of Wikidata relations.

The model is available for download from the LINDAT repository.

Download

LVBERT - Latvian BERT

Annotation: Baseline
Licence: GNU GPL3

Latvian

This model is trained on the original implementation of BERT on the TensorFlow machine-learning platform with the whole-word masking and the next sentence prediction objectives. This uses the BERT configuration with 12 layers, 768 hidden units, 12 heads, 128 sequence length, 128 mini-batch size and a 32,000 token vocabulary.

The model is available for download from the CLARIN-LV repository.

For the relevant publication, see Znotiņš and Barzdiņš (2020)

Download
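The LVBERT configuration figures listed above imply a few derived quantities, which the following sketch computes. The class and property names are hypothetical and chosen for illustration. Only the six numbers come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSetup:
    # Values as stated in the LVBERT description.
    layers: int = 12
    hidden_size: int = 768
    attention_heads: int = 12
    sequence_length: int = 128
    batch_size: int = 128
    vocab_size: int = 32_000

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.attention_heads

    @property
    def token_embedding_params(self) -> int:
        # One hidden-size vector per vocabulary entry.
        return self.vocab_size * self.hidden_size

    @property
    def tokens_per_batch(self) -> int:
        # Upper bound: every sequence in the mini-batch has full length.
        return self.batch_size * self.sequence_length


cfg = BertSetup()
print(cfg.head_dim, cfg.token_embedding_params, cfg.tokens_per_batch)
```

With these settings each head operates on 64 dimensions, the token embedding table alone holds about 24.6 million weights, and one mini-batch covers at most 16,384 token positions.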

LitLat BERT

Annotation: Baseline
Licence: PUB CLARIN-LT

Lithuanian, Latvian, English

This BERT-like model represents tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. The corpora used for training the model have 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian and 0.53 billion are Latvian.

The model is available for download from the CLARIN-LT repository.

Download

Albertina PT-BR

Annotation: Baseline
Licence: MIT

Portuguese

This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for the American variant of Portuguese spoken in Brazil, is trained on the brWaC dataset, and is a larger version of the Albertina PT-BR base model.

This model is available for download through Hugging Face.

Download

Albertina PT-BR base

Annotation: Baseline
Licence: MIT

Portuguese

This model is for Portuguese spoken in Brazil. It is based on the Transformer neural architecture and is developed over the DeBERTa model.

Download

Albertina PT-BR No-brWaC

Annotation: Baseline
Licence: MIT

Portuguese

This is a model for Portuguese spoken in Brazil, trained on data sets other than brWaC. It is developed over the DeBERTa model.

The model is available for download from Hugging Face.

Download

Albertina PT-PT

Annotation: Baseline
Licence: MIT

Portuguese

This model is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model. It is for European Portuguese, is trained on the brWaC dataset, and is a larger version of the Albertina PT-PT base model.

This model is available for download through Hugging Face.

Download

Albertina PT-PT base

Annotation: Baseline
Licence: MIT

Portuguese

This model is for European Portuguese. It is based on the Transformer neural architecture and is developed over the DeBERTa model.

This model is available for download through Hugging Face.

Download

Gervásio PT-BR base

Annotation: Baseline
Licence: MIT

Portuguese

This model, which is for Portuguese spoken in Brazil, is a decoder of the GPT family that is based on the neural architecture Transformer and developed over the Pythia model.

The model is available for download from Hugging Face.

Download

Gervásio PT-PT base

Annotation: Baseline
Licence: MIT

Portuguese

This model, which is for European Portuguese, is a decoder of the GPT family that is based on the neural architecture Transformer and developed over the Pythia model.

The model is available for download from Hugging Face.

Download

BERTimbau - Portuguese BERT-Base language model

Annotation: Baseline
Licence: Under negotiation

Portuguese

This is a BERT model, trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.

The model is available for download from the PORTULAN repository.

Download

BERTimbau - Portuguese BERT-Large language model

Annotation: Baseline
Licence: Under negotiation

Portuguese

This is a BERT model, trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.

The model is available for download from the PORTULAN repository.

Download

Portuguese RoBERTa language model

Annotation: Baseline
Licence: CC-BY

Portuguese

This is a pre-trained RoBERTa model for Portuguese, with 6 layers and 12 attention heads, totaling 68M parameters. Pre-training was done on 10 million Portuguese sentences and 10 million English sentences from the OSCAR corpus.

The model is available for download from the PORTULAN repository.

Download

Dataset and baseline model of moderated content FRENK-MMC-RTV 1.0

Annotation: Baseline
Licence: CC BY-SA 4.0

Slovenian

FRENK-MMC-RTV is a dataset of moderated newspaper comments from the website rtvslo.si with metadata on the time of publishing, user identifier, thread identifier and whether the comment was deleted by the moderators or not. The full text of each comment is encrypted via a character-replacement method so that the comments are not readable by humans. Basic punctuation is not encrypted in order to enable tokenization. The main use of this dataset is for experiments on automating comment moderation. For real-world usage, a fastText classification model trained on non-encrypted data is made available as well.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić et al. (2018)

Download
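The description of FRENK-MMC-RTV does not specify the character-replacement method used. As a rough sketch of the general idea, the following assumes a simple one-to-one letter substitution with a seeded random key, leaving punctuation, digits and spaces untouched so that tokenization still works. All names (`LETTERS`, `make_key`, `encrypt`, `decrypt`) are hypothetical.

```python
import random
import string

# Assumption: the key covers ASCII letters plus the Slovenian č, š, ž.
LETTERS = string.ascii_lowercase + "čšž"


def make_key(seed: int) -> dict:
    """Build a one-to-one letter substitution table from a seed."""
    rng = random.Random(seed)
    shuffled = list(LETTERS)
    rng.shuffle(shuffled)
    key = dict(zip(LETTERS, shuffled))
    # Mirror the mapping for uppercase so that casing is preserved.
    key.update({a.upper(): b.upper() for a, b in zip(LETTERS, shuffled)})
    return key


def encrypt(text: str, key: dict) -> str:
    # Characters outside the key (punctuation, digits, spaces) pass through.
    return "".join(key.get(ch, ch) for ch in text)


def decrypt(text: str, key: dict) -> str:
    inverse = {v: k for k, v in key.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


key = make_key(42)
msg = "To je komentar, ki ga moderator lahko izbriše!"
enc = encrypt(msg, key)
print(enc)
```

Because token boundaries and punctuation survive, a classifier can still be trained on the encrypted text, while the comments themselves are not readable without the key.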

ccGigafida ARPA language model 1.0

Annotation: Baseline
Licence: CC BY 4.0

Slovenian

This model was created from the ccGigafida written corpus of Slovenian using the KenLM algorithm in the Moses machine translation framework. It is a general language model of contemporary standard Slovenian that can be used in statistical machine translation systems.

The model is available for download from the CLARIN.SI repository.

Download
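To show how an ARPA-format n-gram model assigns probabilities to word sequences, here is a minimal sketch that parses a tiny ARPA file and scores sentences with backoff. It uses only the Python standard library rather than KenLM itself. The toy model and its numbers are invented for illustration, and sentence-boundary tokens are omitted for brevity.

```python
# Toy bigram model in ARPA layout: log10 prob, n-gram, optional backoff weight.
TOY_ARPA = r"""\data\
ngram 1=5
ngram 2=2

\1-grams:
-2.0 <unk>
-1.0 briefed -0.4
-1.0 reporters -0.4
-0.9 on -0.3
-0.8 to -0.3

\2-grams:
-0.3 briefed reporters
-0.4 reporters on

\end\
"""


def parse_arpa(text):
    """Return (probs, backoffs), both keyed by n-gram tuples."""
    probs, backoffs, order = {}, {}, 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("ngram ") or line in ("\\data\\", "\\end\\"):
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        parts = line.split()
        ngram = tuple(parts[1:1 + order])
        probs[ngram] = float(parts[0])
        if len(parts) > 1 + order:
            backoffs[ngram] = float(parts[1 + order])
    return probs, backoffs


def log10_prob(word, context, probs, backoffs):
    """log10 P(word | context), backing off to shorter contexts."""
    ngram = context + (word,)
    if ngram in probs:
        return probs[ngram]
    if not context:
        return probs[("<unk>",)]  # unseen word
    return backoffs.get(context, 0.0) + log10_prob(word, context[1:], probs, backoffs)


def score(words, probs, backoffs, order=2):
    """Sum of log10 probabilities over the sequence."""
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - order + 1):i])
        total += log10_prob(word, context, probs, backoffs)
    return total


probs, backoffs = parse_arpa(TOY_ARPA)
idiomatic = score("briefed reporters on".split(), probs, backoffs)
unidiomatic = score("briefed to reporters".split(), probs, backoffs)
print(idiomatic > unidiomatic)
```

In this toy model the sequence containing attested bigrams receives a higher score than the one that has to back off to unigrams twice, which mirrors how a language model prefers the more idiomatic string.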

Other

Corpus Language Description Availability

Czech Models for Korektor 2

Annotation: normalization
Licence: CC BY-NC-SA 3.0

Czech

These models are for the statistical spellchecker Korektor 2. The models can either perform spellchecking and grammar-checking, or only generate diacritical marks.

The models are available for download from the LINDAT repository.

Download

Sentiment Analysis (Czech Model)

Annotation: sentiment analysis
Licence: CC BY-NC-SA 4.0

Czech

These models are trained on data from the following sources: Mall (product reviews), CSFD (movie reviews), Facebook, and joint data from all three datasets (data available here). They use RobeCzech, which is the Czech version of BERT.

For the relevant publication, see Vysušilová (2021)

Download

Model weights for a study of commonsense reasoning

Annotation: commonsense reasoning
Licence: MIT

English

This resource contains model weights for five Transformer-based models: RoBERTa, GPT-2, T5, BART and COMET. These models were implemented using HuggingFace, and fine-tuned on the following four commonsense reasoning tasks: Argument Reasoning Comprehension Task (ARCT), AI2 Reasoning Challenge (ARC), Physical Interaction Question Answering (PIQA) and CommonsenseQA (CSQA).

The models are available for download from the PORTULAN repository.

Download

RÚV-DI Speaker Diarization v5 models (21.05)

Annotation: diarization
Licence: CC BY 4.0

Icelandic

These models are trained on the Althingi Parliamentary Speech corpus hosted by CLARIN-IS. The models use MFCCs, x-vectors, PLDA and AHC.

The models are available for download from the CLARIN-IS repository.

Download

Models for automatic g2p for Icelandic (20.10)

Annotation: phonemic transcription
Licence: Apache License 2.0

Icelandic

These are grapheme-to-phoneme models for Icelandic, trained using an encoder-decoder LSTM neural network. The models are delivered with scripts for automatic transcription of Icelandic in the standard pronunciation variation, the northern variation, the north-east variation, and the south variation. To run the scripts, the user needs to install Fairseq.

For the relevant publication, see Gorman et al. (2020)

Download

Liner2.5 model Timex

Annotation: temporal expressions
Licence: CC BY-SA 4.0

Polish

This is a model for the Liner2.5 tool for the recognition and normalization of temporal expressions.

The model is available for download from the CLARIN-PL repository.

Download

Liner2.5 model Events

Annotation: event mentions
Licence: CC BY-SA 4.0

Polish

This is a model for the Liner2.5 tool for the recognition of event mentions.

The model is available for download from the CLARIN-PL repository.

Download

PyTorch model for Slovenian Coreference Resolution

Annotation: coreference resolution
Licence: CC BY 4.0

Slovenian

This is a Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with this code. The model is based on the Slovenian CroSloEngual BERT 1.1 model. It was trained on the SUK 1.0 training corpus, specifically the SentiCoref subcorpus.

This resource is available for download from the CLARIN.SI repository.

For the relevant publication, see Klemen & Žitnik (2022)

Download

Face-domain-specific automatic speech recognition models

Annotation: face-domain-specific automatic speech recognition
Licence: Apache License 2.0

Slovenian

This model contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit, including the acoustic model, language model, and other relevant files. It also includes all the scripts and configuration files needed to use these models for implementing face-domain-specific automatic speech recognition.

The acoustic model was trained using the relevant Kaldi ASR tools and the Artur speech corpus (audio, transcriptions). The language model was trained using domain-specific text data involving face descriptions, obtained by translating the Face2Text English dataset into Slovenian. These models, combined with other necessary files like the HCLG.fst and decoding scripts, enable the implementation of face-domain-specific ASR applications.

This resource is available for download from the CLARIN.SI repository.

Download

The CLASSLA-Stanza model for semantic role labeling of standard Slovenian 2.0

Annotation: semantic role labeling
Licence: CC BY-SA 4.0

Slovenian

The model for semantic role labeling of standard Slovenian was built with the CLASSLA-Stanza tool by training on the SUK training corpus and using the CLARIN.SI-embed.sl word embeddings extended with the MaCoCu-sl Slovene web corpus. The estimated F1 of the semantic role annotations is ~76.24.

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ljubešić and Dobrovoljc (2019)

Download

Contextual Word Embeddings

Corpus Language Description Availability

CroSloEngual BERT 1.1

Annotation: word embeddings
Licence: CC BY-SA 4.0

Croatian, English, Slovenian

This is a trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library).

The model is available for download from the CLARIN.SI repository.

For the relevant publication, see Ulčar and Robnik-Šikonja (2020)

Download

ELMo embeddings models for seven languages

Annotation: word embeddings
Licence: Apache License 2.0

Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, Swedish

This model is used to produce contextual word embeddings. It is trained on large monolingual corpora for 7 languages. Each language's model was trained for approximately 10 epochs. Corpora sizes used in training range from over 270 M tokens in Latvian to almost 2 B tokens in Croatian. About 1 million most common tokens were provided as vocabulary during the training for each language model. The model can also infer OOV words, since the neural network input is on the character level.

The model is available for download from the CLARIN.SI repository.

Download

LX-DSemVectors

Annotation: Word embeddings
Licence: CC-BY

Portuguese

This model represents tokens as contextual word embeddings for Portuguese. It was trained on a corpus of 2 billion tokens and achieved state-of-the-art results on multiple lexical semantic tasks.

The model is available for download from the PORTULAN repository.

Download

Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

Annotation: word embeddings
Licence: CC BY-SA 4.0

Slovenian

The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end.

Word Embeddings trained on English Wikipedia

Annotation: word embeddings
Licence: CC BY 4.0

Swedish

This is a set of contextual word embeddings.

The models are available for download from the Swedish Language Bank.

Download

Word embeddings CLARIN.SI-embed

Annotation: word embeddings
Licence: CC BY-SA 4.0

Bulgarian, Croatian, Macedonian, Serbian, Slovenian

This is a set of word embeddings for 5 languages.

  • CLARIN.SI-embed.bg contains word embeddings for Bulgarian induced from the MaCoCu-bg web crawl corpus. The embeddings are based on the skip-gram model of fastText trained on 4,120,343,820 tokens of running text for 2,746,640 lowercased surface forms.
  • CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts composed of the Croatian web corpus hrWaC, a 400-million-token collection of newspaper texts, and MaCoCu-hr. The embeddings are based on the skip-gram model of fastText trained on 4,586,769,197 tokens of running text for 3,406,574 lowercased surface forms.
  • CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText trained on 933,231,582 tokens of running text for 986,670 lowercased surface forms.
  • CLARIN.SI-embed.sr contains word embeddings induced from the srWaC and MaCoCu-sr web corpora. The embeddings are based on the skip-gram model of fastText trained on 3,434,602,575 tokens of running text for 2,676,036 lowercased surface forms.
  • CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC and MaCoCu-sl. The embeddings are based on the skip-gram model of fastText trained on 5,791,405,942 tokens of running text for 3,471,054 lowercased surface forms.

 

The models are available for download from the CLARIN.SI repository.

Download (Bulgarian)

Download (Croatian)

Download (Macedonian)

Download (Serbian)

Download (Slovenian)
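The CLARIN.SI-embed entries are described as skip-gram fastText embeddings over lowercased surface forms. The sketch below illustrates two ingredients of that family of models: the (center, context) pairs a skip-gram objective is trained on, and the character n-grams with boundary markers that fastText uses as subword units. It is a simplification with hypothetical function names and a fixed window. It is not the fastText implementation, which among other things samples the window size and uses several n-gram lengths.

```python
def skipgram_pairs(tokens, window=2):
    """List every (center, context) pair within a fixed window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Lowercasing mirrors the "lowercased surface forms" in the descriptions.
tokens = "Jezikovni modeli so uporabni".lower().split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
print(char_ngrams("where"))
```

The subword n-grams are what allow such embeddings to produce vectors for word forms that never occurred in the training text.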

References

[Gorman et al. 2020] Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In: Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 40–50.

[Jon et al. 2021] Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, and Ondřej Bojar. 2021. CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task, arXiv pre-print.

[Jurafsky and Martin 2021] Daniel Jurafsky and James H. Martin. 2021. Speech and Language Processing.

[Kondratyuk and Straka 2019] Dan Kondratyuk and Milan Straka. 2019. 75 Languages, 1 Model: Parsing Universal Dependencies Universally. arXiv pre-print.

[Libovicky et al. 2018] Jindrich Libovicky, Rudolf Rosa, Jindrich Helcl, and Martin Popel. 2018. Solving Three Czech NLP Tasks End-to-End with Neural Models. In: CEUR Workshop Proceedings, volume 2203, 138–143.

[Ljubešić and Dobrovoljc 2019] Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 29–34.

[Ljubešić et al. 2018] Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2018. Datasets of Slovene and Croatian Moderated News Comments. In: Proceedings of the 2nd Workshop on Abusive Language Online, 124–131.

[Popel et al. 2020] Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications 11.

[Rosa et al. 2017] Rudolf Rosa, Daniel Zeman, David Mareček, and Zdeněk Žabokrtský. 2017. Slavic Forest, Norwegian Wood. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, 210–219.

[Símonarson et al. 2021] Haukur Barri Símonarson, Vésteinn Snæbjarnarson, Pétur Orri Ragnarsson, Haukur Páll Jónsson, and Vilhjálmur Þorsteinsson. 2021. Miðeind's WMT 2021 submission. arXiv pre-print.

[Straková et al. 2019] Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5326–5331.

[Ulčar and Robnik-Šikonja 2020] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv pre-print.

[Vysušilová 2021] Petra Vysušilová. 2021. Czech NLP with Contextualized Embeddings. Diploma thesis.

[Wróblewska and Rybak 2019] Alina Wróblewska and Piotr Rybak. 2019. Dependency parsing of Polish. Poznan Studies in Contemporary Linguistics.

[Znotiņš and Barzdiņš 2020] Artūrs Znotiņš and Guntis Barzdiņš. 2020. LVBERT: Transformer-Based Model for Latvian Language Understanding. Frontiers in Artificial Intelligence and Applications 328.