Part-of-speech tagging is the automatic text annotation process in which words or tokens are assigned part of speech tags, which typically correspond to the main syntactic categories in a language (e.g., noun, verb) and often to subtypes of a particular syntactic category which are distinguished by morphosyntactic features (e.g., number, tense). Lemmatisation is the process by which inflected forms of a lexeme are grouped together under a base dictionary form. Part-of-speech tagging and lemmatisation are crucial steps of linguistic pre-processing. On this website, the acronym PoS is used for part-of-speech tagging, while MSD stands for morphosyntactic descriptors. MSD tags denote fine-grained feature-structure based PoS tags which are used to account for rich inflectional paradigms like those in Slavic languages.
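To make the three annotation layers concrete, the following is a minimal sketch of how a token might carry a coarse PoS tag, a set of morphosyntactic features (the MSD layer) and a lemma. The lexicon, the tag names and the feature names are invented for illustration and do not correspond to the tagset or output of any tool listed on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    token: str
    pos: str                                   # coarse part-of-speech tag
    lemma: str                                 # base dictionary form
    msd: dict = field(default_factory=dict)    # morphosyntactic features


# Hypothetical toy lexicon: lowercase token -> (PoS, lemma, features).
TOY_LEXICON = {
    "cat": ("NOUN", "cat", {"Number": "Sing"}),
    "cats": ("NOUN", "cat", {"Number": "Plur"}),
    "sleep": ("VERB", "sleep", {"Tense": "Pres"}),
    "slept": ("VERB", "sleep", {"Tense": "Past"}),
}


def annotate(tokens):
    """Look each token up in the toy lexicon; unknown tokens get the tag 'X'."""
    annotations = []
    for token in tokens:
        key = token.lower()
        pos, lemma, msd = TOY_LEXICON.get(key, ("X", key, {}))
        annotations.append(Annotation(token, pos, lemma, dict(msd)))
    return annotations


def group_by_lemma(annotations):
    """Group inflected forms under their lemma, as lemmatisation does."""
    groups = {}
    for ann in annotations:
        groups.setdefault(ann.lemma, []).append(ann.token)
    return groups


if __name__ == "__main__":
    for ann in annotate(["Cats", "slept"]):
        print(ann.token, ann.pos, ann.msd, ann.lemma, sep="\t")
```

A real tagger resolves ambiguity from context rather than by dictionary lookup; the sketch only shows what the resulting annotation layers look like.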
The CLARIN infrastructure offers 68 tools for part-of-speech tagging or lemmatisation. Most of the tools work for a single language (2 Afrikaans, 1 Assamese, 10 Bantu languages, 1 Belarusian, 1 Bulgarian, 1 Czech, 3 Dutch, 4 English, 2 Estonian, 1 Finnish, 5 German, 1 Greek, 1 Hungarian, 3 Icelandic, 1 Latvian, 1 Maltese, 1 Norwegian, 7 Polish, 4 Portuguese, 2 Slovenian), while the rest have a multilingual scope. Half of the tools provide additional functionalities such as syntactic parsing or named entity recognition.
For comments, changes to the existing content or the inclusion of new tools, send us an email at resource-families [at] clarin.eu.
Part-of-Speech Taggers and Lemmatisers in the CLARIN Infrastructure
For a Single Language
Tool | Language | Description |
---|---|---|
Functionality: PoS Licence: research only |
Afrikaans |
This tool is based on the TnT tagger (Brants 2000). The tagset used by the tool was especially designed for Afrikaans and consists of 139 PoS-tags. Input: plain text |
Functionality: lemma |
Afrikaans |
This tool is a lemmatiser for Afrikaans developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: PoS |
Assamese |
This tool is a CRF++ based PoS-tagger. CLARIN Centre: CLARIN-PL |
Functionality: lemma |
Belarusian |
This tool is part of the corpus.by platform. Availability: web service |
Functionality: sentence splitting, PoS, lemma, syntactic parsing |
Bulgarian |
This tool is an XML-based software system for corpora development implemented in Java. The main aim behind the design of the system is the minimisation of human intervention during the creation of language resources. CLaRK includes BTB-Pipe, which is a language pipeline for Bulgarian that comprises the following modules: sentence splitting, MSD-tagging, lemmatisation, dependency parsing. Availability: download |
Functionality: MSD |
Czech |
This tool uses Hidden Markov Models and is an implementation of the UFAL tagger. Availability: download |
Functionality: PoS, MSD, lemma, NE, phrase chunks, dependency relations with head words |
Dutch |
This tool is an integration of memory-based modules developed for Dutch. All NLP modules are based on TiMBL, the Tilburg memory-based learning software package. Where possible, Frog makes use of multi-processor support to run subtasks in parallel. Availability: download |
INL Labs tagger/lemmatizer tools Functionality: PoS, lemma |
Dutch |
This tool employs a PoS tagger that is trained on the "Letters as loot" historical corpus and a lemmatiser that is trained on the INL historical lexicon. Availability: web application |
Functionality: PoS/MSD, lemma, syntactic parsing |
Dutch |
An integrated tokenizer, tagger-lemmatiser, morphological analyzer, and dependency parser for Dutch. CLARIN Centre: LINDAT/CLARIAH-NL |
Functionality: PoS/MSD |
English |
CLAWS (the Constituent Likelihood Automatic Word-tagging System) has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to PoS tag approx. 100 million words of the British National Corpus (BNC), and all the English corpora in Mark Davies' BYU corpus server. Users can choose to have output in either the smaller C5 tagset or the larger C7 tagset. Availability: web application |
Functionality: lemma |
English |
This tool is implemented in WebLicht and is derived from the MorphAdorner morphological analyser. Availability: WebLicht |
OpenNLP Part-of-Speech Tagger (English) Functionality: PoS |
English |
This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application |
Functionality: PoS, syntactic parsing |
English |
This tool is a WebLicht implementation of the Stanford Parser. Availability: WebLicht |
Functionality: MSD, NER |
Estonian |
This tool provides common natural language processing functionality such as morphological analysis and named entity recognition for the Estonian language. Web documentation is available here. Availability: download |
Vabamorf open source morphology tagger for Estonian Functionality: PoS, MSD, lemma |
Estonian |
This tool performs various tasks of morphological analysis, including morphological disambiguation and synthesis. Availability: download, web application |
Functionality: PoS, lemma, NER |
Finnish |
This toolchain provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish. Both tools take running text from standard input and produce tabular output (one token per line) to standard output. Availability: download, web application |
OpenNLP Part-of-Speech Tagger (German) Functionality: PoS |
German |
This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application |
Weblicht Part-of-Speech Tagger Functionality: PoS, lemma |
German |
This tool is a PoS tagger and lemmatiser implemented in WebLicht. Availability: web application, WebLicht |
Functionality: lemma |
German |
This tool is based on the Mate toolkit. Availability: WebLicht |
Functionality: PoS, lemma |
German |
This tool is implemented in WebLicht. Input: TCF, XML |
Functionality: PoS, syntactic parsing |
German |
This tool is a WebLicht implementation of the Stuttgart parser. Availability: WebLicht |
ILSP Feature-based multi-tiered POS Tagger Functionality: PoS |
Greek |
This tool is an FBT-based multi-tiered tagger. FBT is a variant of the well-known transformation-based learning paradigm that aims to improve tagging quality for highly inflective languages such as Greek. Availability: web application |
Functionality: PoS |
Hungarian |
This tool is an open source reimplementation of the TnT tagger (Brants 2000). Availability: download |
Functionality: lemma
Licence: The MIT License |
Icelandic |
The lemmatiser achieves an accuracy of 98.3% on MIM-Gold (21.05, cross-validation).
|
Functionality: PoS
Licence: Apache License 2.0 |
Icelandic |
This tool is a part-of-speech tagger for Icelandic. This entry contains pretrained models for ABLTagger v3.0.0. Two PoS tagger models, small and large, are provided. Both work with the revised tagset and achieve an accuracy of ~96.7% and ~97.8%, respectively, on MIM-Gold (cross-validation, excluding "x" and "e" tags).
Availability: download
Input: tokenised plain or pre-tagged text
CLARIN Centre: CLARIN-IS
Related publication: Steingrímsson et al. (2019)
|
IceNLP Natural Language Processing toolkit Functionality: PoS, lemma, shallow syntactic parsing |
Icelandic |
This tool is an open source NLP toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java. Availability: download, web application |
Functionality: PoS, lemma |
Italian |
This toolchain was developed in the PANACEA project and implements Freeling 2.1 libraries. Availability: web application |
Functionality: MSD, syntactic parsing, NER |
Latvian |
This tool is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It provides the glue code needed to combine tools even if they are written in different programming languages and rely on conflicting library versions. It was created to make NLP technology more accessible to linguists, and to make the creation and integration of new tools easier for researchers and software developers. Availability: download |
Functionality: PoS |
Maltese |
This tool is an implementation of the TnT tagger (Brants 2000). The model for Maltese was trained on manually tagged texts and has reached an accuracy of 96%. The tagset tailored to Maltese is available here. Availability: web application |
Functionality: lemma |
Ndebele |
This tool is a lemmatiser for the Ndebele Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: MSD, syntactic parsing |
Norwegian (Bokmål and Nynorsk) |
This tool consists of three main modules: a pre-processor with a composition analyzer and multitagger, a grammar module for morphological and syntactic disambiguation (based on the constraint grammar paradigm) and a statistical module that removes the last residual morphological ambiguity (only for Bokmål). The tool is trained on the Norwegian wordbank. Availability: download |
Functionality: MSD |
Polish |
This tool is a dictionary-based morphological analyser and generator for Polish. This version of the program is decoupled from the dictionary. Two dictionaries of Polish developed within other projects are distributed with Morfeusz 2, namely SGJP and Polimorf. Availability: download, web application |
MorphoDiTa-based tagger for Polish language Functionality: MSD |
Polish |
This tool is based on the MorphoDiTa tagger, adapted to Polish. The tool employs the NKJP tagset. Availability: download |
Functionality: MSD |
Polish |
This tool is the second version of the tagger developed in the sentione project, adapted to UGC-processing. The tool has been enriched with a tokenizer and with heuristics that improve its accuracy. Availability: download |
Functionality: MSD, lemma |
Polish |
This tool uses the NKJP tagset and implements the Morfeusz SGJP dictionary. The service is based on WCRFT. Availability: web application |
Functionality: MSD |
Polish |
This tool assumes the morpho-syntactic description of the IPI PAN corpus tagset (Przepiórkowski 2005). CLARIN Centre: CLARIN-PL |
Functionality: MSD |
Polish |
This tool combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages written in WCCL. The algorithm and code are inspired by Wrocław Memory-Based Tagger (WMBT). Availability: download |
WMBT (Wrocław Memory-Based Tagger) Functionality: MSD |
Polish |
This tool uses the TiMBL API as the underlying memory-based learning implementation. The features for classification are generated by using the WCCL formalism. The tool uses a tiered tagging approach. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute may be supplied a different set of features. The software package comes with default configurations for KIPI/IPIC and NKJP tagsets. Availability: download |
Functionality: lemma |
Portuguese |
This tool is based on the MXPOST part of speech tagger and is trained on UNITEX dictionaries for Portuguese. Availability: download |
Functionality: MSD |
Portuguese |
This tool is based on the TnT tagger (Brants 2000). Availability: download |
Functionality: lemma (verbs) |
Portuguese |
This tool performs fully-fledged lemmatisation of Portuguese verbs, including the full range of pronominal conjugation forms. Availability: web application |
OpenNLP Part-of-Speech Tagger (Portuguese) Functionality: PoS |
Portuguese |
This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application |
Functionality: lemma |
Sepedi |
This tool is a lemmatiser for the Sepedi (Northern Sotho) Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: PoS |
Sepedi |
This tool is based on Helmut Schmid's stochastic tagger (see Schmid 1994), supported by additional noun and verb guessing modules and a tokenizer. CLARIN Centre: SADiLaR |
Functionality: lemma |
Sesotho |
This tool is a lemmatiser for the Sesotho Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma
|
Character-level part-of-speech tagger of Slovene language Functionality: PoS |
Slovenian |
This tool uses convolutional and LSTM neural networks. The tool has been trained on the ssj500k 2.1 corpus. Availability: download |
Functionality: PoS, lemma |
Slovenian |
This tool, which was developed in the context of the JANES project, tags non-standard Slovenian, with Croatian and Serbian to follow. Availability: download |
Functionality: lemma |
Swazi |
This tool is a lemmatiser for the Swazi Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: lemma |
Tsonga |
This tool is a lemmatiser for the Tsonga Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: lemma |
Tswana |
This tool is a lemmatiser for the Tswana Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
Functionality: lemma |
Venda |
This tool is a lemmatiser for the Venda Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma |
Functionality: lemma |
Zulu |
This tool is a lemmatiser for the Zulu Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR |
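Several of the lemmatisers above state the same input and output conventions: UTF-8 text without a BOM, one lowercase token per line, and output lines that pair a token with its lemma. The sketch below shows one way to prepare such input and read such output in Python. The function names are hypothetical, the Afrikaans word pairs are illustrative only, and reading "Token tab, lemma" as "token, tab character, lemma" on each line is an assumption rather than something the tool documentation quoted here confirms.

```python
def to_input_bytes(tokens):
    """Serialise tokens as one lowercase token per line, UTF-8 without BOM."""
    lines = [tok.strip().lower() for tok in tokens if tok.strip()]
    text = "".join(line + "\n" for line in lines)
    # The "utf-8" codec never writes a BOM (unlike "utf-8-sig").
    return text.encode("utf-8")


def parse_output(text):
    """Parse lines of the assumed form 'token<TAB>lemma' into (token, lemma) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        token, sep, lemma = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        pairs.append((token, lemma.strip()))
    return pairs


if __name__ == "__main__":
    print(to_input_bytes(["Die", "Katte"]))
    print(parse_output("katte\tkat\nhonde\thond\n"))
```

Writing the bytes returned by `to_input_bytes` to a file opened in binary mode keeps the encoding under explicit control regardless of the platform default.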
For Multiple Languages
Tool | Language | Description |
---|---|---|
Functionality: PoS, phrase chunks, NE |
Afrikaans, English, Ndebele, Xhosa, Zulu, Sesotho sa Leboa, Setswana, Sesotho, Siswati, Tshivenda, Xitsonga |
This tool is used to annotate texts in Afrikaans and a variety of Bantu languages. Availability: download Input: UTF-8 text file containing running text Output: tab-delimited text file containing each token followed by its assigned class. |
Functionality: lemma |
Bulgarian, Czech, Danish, Dutch, English, Estonian, Farsi, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Ukrainian |
This tool uses affix rules (affix: prefix, infix, suffix, circumfix). Availability: download |
Functionality: PoS, MSD, lemma, compound analysis, dictionary lookup |
Bulgarian, English, Estonian, Finnish, French, Galician, Italian, Catalan, Latin, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, German |
This tool is Språkbanken's corpus annotation pipeline infrastructure. The pipeline uses in-house and external tools on the text to segment it into sentences and paragraphs, tokenise, tag parts-of-speech, look up in dictionaries and analyse compounds. The pipeline can also be run using a web API with XML results, and it is run locally to prepare the documents in Korp, which is SWE-LANG’s corpus search tool. While the most sophisticated support is for modern Swedish, the pipeline supports an additional 19 languages. Availability: web application, web API |
Functionality: PoS, lemma, NER, syntactic parsing |
Croatian, Serbian, Slovenian |
This tool, which was developed in the context of the ReLDI project, employs the MULTEXT tagset for part of speech tagging and Universal Dependencies for syntactic parsing. Availability: download, web application |
Functionality: PoS, lemma, frequency lists |
Danish, English |
This tool is an NLP toolchain that is part of the core CLARIN-DK structure. Availability: web application |
Functionality: PoS, lemma, chunks, named entities |
English, Czech, Slovak |
This tool is used for annotating biomedical texts such as MEDLINE abstracts. Availability: download |
MorphoDiTa: Morphological Dictionary and Tagger Functionality: MSD, lemma Licence: Mozilla Public Licence 2.0 (software); CC BY-NC-SA models |
English, Czech, Slovak |
This tool performs morphological analysis, morphological generation, tagging and tokenisation and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, the tool achieves state-of-the-art results with a throughput around 10,000-200,000 words per second. The tool is versioned using Semantic Versioning. The following language models are available through LINDAT under the CC BY licence: Czech and English. Availability: download, web application, API |
Functionality: PoS |
English, Czech, Slovak |
This tool is used for annotating biomedical texts such as MEDLINE abstracts. Input: plain text |
Stanford Phrase Structure Parser Functionality: PoS, syntactic parsing |
English, German |
This tool is a WebLicht implementation of the Stanford Parser. Availability: WebLicht |
Functionality: PoS |
German, Czech, Slovene, Hungarian |
This tool is a PoS tagger implemented in WebLicht. Availability: download, WebLicht |
Sticker part-of-speech tagger UD Functionality: PoS, syntactic parsing, NER |
German, Dutch |
This tool is a PoS tagger, syntactic parser and named entity recognizer implemented in WebLicht. The PoS tagger uses the Universal Dependencies tagset. Availability: download, WebLicht |
Functionality: PoS, lemma |
German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian and Old French |
This tool is a PoS tagger and lemmatiser implemented in WebLicht. Availability: download, WebLicht |
Functionality: PoS |
German, English, Italian |
This tool is a PoS tagger implemented in WebLicht. Availability: WebLicht |
Functionality: PoS, lemma, syntactic parsing |
Language independent |
This tool is a trainable pipeline for annotating CoNLL-U files. UDPipe is language-agnostic and can be trained given annotated data in the CoNLL-U format. Trained models are provided for nearly all Universal Dependency treebanks. Availability: download, web application |
Functionality: |
More than 50 languages |
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatisation with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task. Availability: download, web application |
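Tools that work with CoNLL-U files, such as the trainable pipeline described above, exchange annotations as ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with blank lines between sentences and `#` comment lines. The following is a minimal sketch of reading the word form, lemma and universal PoS tag from such text. The function name and the sample sentence are made up for illustration, and the sketch does not call any of the listed tools.

```python
def read_conllu(text):
    """Return a list of sentences, each a list of (form, lemma, upos) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        current.append((form, lemma, upos))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, built with explicit tabs so copy-paste cannot break it.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for form, lemma, upos in sentence:
            print(form, lemma, upos, sep="\t")
```

For production use, an established CoNLL-U reader is usually a better choice than a hand-written parser; the sketch is only meant to show which columns carry the PoS and lemma layers.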
Publications
[Barnard et al. 2014] Etienne Barnard, Marelie H. Davel, Charl van Heerden, Febe de Wet, and Jaco Badenhorst. 2014. The NCHLT Speech Corpus of the South African languages. In SLTU-2014, 194–200.
[Belej 2018] Primož Belej. 2018. Oblikoskladenjsko označevanje slovenskega jezika z globokimi nevronskimi mrežami. Master’s Thesis. University of Ljubljana.
[Borin et al. 2016] Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, Anne Schumacher. 2016. Sparv: Språkbanken’s corpus annotation pipeline infrastructure. In Proceedings of SLTC 2016.
[van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius, and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, edited by F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste, 99–114.
[Brants 2000] Thorsten Brants. 2000. TnT – A Statistical Part-of-Speech Tagger.
[Halácsy et al. 2007] Péter Halácsy, Andras Kornai, and Csaba Oravecz. 2007. HunPos: an open source trigram tagger.
[Garside and Smith 1997] Roger Garside and Nicholas Smith. 1997. A hybrid grammatical tagger: CLAWS4. In Corpus Annotation: Linguistic Information from Computer Text Corpora, edited by R.G. Garside, Geoffrey Leech, and Anthony Mark McEnery, 112–131.
[Hinrichs et al. 2010] Hinrichs, Erhard, Marie Hinrichs, and Thomas Zastrow. 2010. WebLicht: Web-Based Services for German. In Proceedings of the ACL 2010 System Demonstrations, 25–29.
[Johannessen et al. 2012] Janne Bondi Johannessen, Kristin Hagen, André Lynum, and Anders Nøklestad. 2012. A combined rule-based and statistical tagger. In Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen, 51–66.
[Jongejan and Dalianis 2009] Bart Jongejan and Hercules Dalianis. 2009. Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, 145–153.
[Kaalep 2015] Kaalep, Heiki-Jaan. 2015. Vabamorf, a set of open-source morphological tools for Estonian.
[Kanerva et al. 2018] Kanerva, Jenna, Filip Ginter, Niko Miekka, Akseli Leino, and Tapio Salakoski. 2018. Turku neural parser pipeline: An end-to-end system for the CoNLL 2018 shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual parsing from raw text to universal dependencies, 133–142.
[Ling et al. 2015] Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramón Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530.
[Ljubešić et al. 2016] Nikola Ljubešić, Filip Klubička, Željko Agić, and Ivo-Pavao Jazbec. 2016. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. In Proceedings of LREC 2016, edited by Nicoletta Calzolari, 4264–4270.
[Loftsson and Rögnvaldsson 2007] Hrafn Loftsson and Eiríkur Rögnvaldsson. 2007. IceNLP: A natural language processing toolkit for Icelandic. In Proceedings of the Eighth Annual Conference of the International Speech Communication Association.
[Orasmaa et al. 2016] Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, and Heiki-Jaan Kaalep. 2016. EstNLTK - NLP Toolkit for Estonian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2460–2466.
[Papageorgiou et al. 2000] Harris Papageorgiou, Prokopis Prokopidis, Voula Giouli, and Stelios Piperidis. 2000. A Unified POS Tagging Architecture and its Application to Greek. In Proceedings of LREC2000.
[Piasecki 2007] Maciej Piasecki. 2007. Polish tagger TaKIPI: Rule based construction and optimisation. Task quarterly, 11 (1–2): 151–167.
[Padró et al. 2010] Lluís Padró, Miquel Collado, Samuel Reese, Marina Lloberes, and Irene Castellón. 2010. FreeLing 2.1: Five Years of Open-Source Language Processing Tools. In Proceedings of LREC2010, 931–936.
[Prokopidis et al. 2011] Prokopis Prokopidis, Byron Georgantopoulos, and Haris Papageorgiou. 2011. A Suite of Natural Language Processing Tools for Greek. In The 10th International Conference of Greek Linguistics.
[Przepiórkowski 2005] Adam Przepiórkowski. 2005. The IPI PAN Corpus in numbers. In Proceedings of the 2nd Language & Technology Conference, 27–31.
[Radziszewski 2013] Adam Radziszewski. 2013. A Tiered CRF Tagger for Polish. In Intelligent Tools for Building a Scientific Information Platform, 215–230.
[Schmid and Laws 2008] Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 777–784.
[Schmid 1994] Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In New methods in language processing.
[Schmid 1999] Helmut Schmid. 1999. Improvements in part-of-speech tagging with an application to German. In Natural language processing using very large corpora, 13–25. Springer, Dordrecht.
[Silva 2007] João Silva. 2007. Shallow Processing of Portuguese: From Sentence Chunking to Nominal Lemmatization. Master’s Thesis.
[Simov et al. 2017] Simov, Kiril, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, and Atanas Kiryakov. 2017. ClaRK – an XML-based System for Corpora Development. In Proc. of the Corpus Linguistics 2001 Conference, 558–560.
[Steingrímsson et al. 2019] Steinþór Steingrímsson, Örvar Kárason, and Hrafn Loftsson. 2019. Augmenting a BiLSTM tagger with a morphological lexicon and a lexical category identification step. arXiv preprint arXiv:1907.09038.
[Straka and Straková 2017] Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe.
[Straková et al. 2014] Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 13–18.
[Tsuruoka et al. 2005] Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text. In Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, edited by P. Bozanis and E.N. Houstis.
[Woliński 2014] Marcin Woliński. 2014. Morfeusz reloaded. In Proceedings of LREC2014, 1106–1111.
[Znotiņš and Cīrule 2018] Artūrs Znotiņš and Elita Cīrule. 2018. NLP-PIPE: Latvian NLP Tool Pipeline. In Human Language Technologies – The Baltic Perspective, 183–189.