Part-of-Speech Taggers and Lemmatisers

Part-of-speech tagging is the automatic text annotation process in which words or tokens are assigned part-of-speech tags. These tags typically correspond to the main syntactic categories in a language (e.g., noun, verb) and often to subtypes of a syntactic category that are distinguished by morphosyntactic features (e.g., number, tense). Lemmatisation is the process by which inflected forms of a lexeme are grouped together under a base dictionary form. Part-of-speech tagging and lemmatisation are crucial steps of linguistic pre-processing. On this website, the acronym PoS is used for part-of-speech tagging, while MSD stands for morphosyntactic descriptors. MSD tags are fine-grained, feature-structure-based PoS tags that are used to account for rich inflectional paradigms like those in Slavic languages.
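As a rough illustration of the three notions defined above (PoS tag, lemma, MSD), the following minimal sketch annotates tokens by looking them up in a toy lexicon. The lexicon entries, the feature names, the `Token` class and the `annotate` function are all hypothetical and do not reflect how any tool listed on this page works; real taggers disambiguate tags in context rather than by simple lookup.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical toy lexicon: surface form -> (lemma, PoS, morphosyntactic features).
# Entries and feature names are illustrative only.
LEXICON = {
    "the": ("the", "DET", {}),
    "cat": ("cat", "NOUN", {"Number": "Sing"}),
    "cats": ("cat", "NOUN", {"Number": "Plur"}),
    "sleeps": ("sleep", "VERB", {"Tense": "Pres", "Number": "Sing"}),
    "slept": ("sleep", "VERB", {"Tense": "Past"}),
}


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # base dictionary form shared by all inflected forms
    pos: str    # coarse part-of-speech tag
    feats: dict = field(default_factory=dict)  # morphosyntactic features

    @property
    def msd(self) -> str:
        """Fine-grained tag: the PoS followed by sorted feature=value pairs."""
        if not self.feats:
            return self.pos
        pairs = "|".join(f"{k}={v}" for k, v in sorted(self.feats.items()))
        return f"{self.pos}:{pairs}"


def annotate(text: str) -> list[Token]:
    """Split on whitespace and look each token up in the toy lexicon."""
    tokens = []
    for form in text.split():
        # Unknown words keep their lowercased form as lemma and get the tag "X".
        lemma, pos, feats = LEXICON.get(form.lower(), (form.lower(), "X", {}))
        tokens.append(Token(form, lemma, pos, dict(feats)))
    return tokens


if __name__ == "__main__":
    for tok in annotate("The cats slept"):
        print(tok.form, tok.lemma, tok.pos, tok.msd, sep="\t")
```

In this sketch "cats" and "cat" share the lemma "cat" and the PoS tag NOUN, and only the MSD string tells them apart by number. That is the distinction between PoS and MSD drawn above.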

The CLARIN infrastructure offers 68 tools for part-of-speech tagging or lemmatisation. Most of the tools work for a single language (2 Afrikaans, 1 Assamese, 10 Bantu languages, 1 Belarusian, 1 Bulgarian, 1 Czech, 3 Dutch, 4 English, 2 Estonian, 1 Finnish, 5 German, 1 Greek, 1 Hungarian, 3 Icelandic, 1 Latvian, 1 Maltese, 1 Norwegian, 7 Polish, 4 Portuguese, 2 Slovenian), while the rest have a multilingual scope. Half of the tools provide additional functionalities such as syntactic parsing or named entity recognition.

For comments, changes to the existing content, or the inclusion of new tools, send us an email at resource-families [at] clarin.eu.

Part-of-Speech Taggers and Lemmatisers in the CLARIN Infrastructure

For a Single Language

Tool	Language	Description
Afrikaans TnT-Tagger Functionality: PoS Licence: research only	Afrikaans	This tool is based on the TnT tagger (Brants 2000). The tagset used by the tool was especially designed for Afrikaans and consists of 139 PoS-tags. Input: plain text Output: plain text CLARIN Centre: SADiLaR
NCHLT Afrikaans Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Afrikaans	This tool is a lemmatiser for Afrikaans developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
Assamese POS Tagger Functionality: PoS	Assamese	This tool is a CRF++ based PoS-tagger. CLARIN Centre: CLARIN-PL
Corpus.by Lemmatizer Functionality: lemma	Belarusian	This tool is part of the corpus.by platform. Availability: web service CLARIN Centre: CLARIN Knowledge Centre for Belarusian text and speech processing Input: plain text Output: plain text
CLaRK Functionality: sentence splitting, PoS, lemma, syntactic parsing	Bulgarian	This tool is an XML-based software system for corpora development implemented in Java. The main aim behind the design of the system is the minimisation of human intervention during the creation of language resources. CLaRK includes BTB-Pipe, which is a language pipeline for Bulgarian that comprises the following modules: sentence splitting, MSD-tagging, lemmatisation, dependency parsing. Availability: download Input: XML Output: XML CLARIN Centre: ClaDA-BG Related publication: Simov et al. (2001)
HMM tagger Functionality: MSD Licence: GNU General Public Licence, version 2	Czech	This tool uses Hidden Markov Models and is an implementation of the UFAL tagger. Availability: download CLARIN Centre: LINDAT
Frog Functionality: PoS, MSD, lemma, NE, phrase chunks, dependency relations with head words Licence: GNU General Public Licence	Dutch	This tool is an integration of memory-based modules developed for Dutch. All NLP modules are based on TiMBL, the Tilburg memory-based learning software package. Where possible, Frog makes use of multi-processor support to run subtasks in parallel. Availability: download Output: FoLiA XML CLARIN Centre: CLARIAH-NL Related publication: van den Bosch et al. (2007)
INL Labs tagger/lemmatizer tools Functionality: PoS, lemma Licence: CLARIN PUB	Dutch	This tool employs a PoS tagger that is trained on the "Letters as loot" historical corpus and a lemmatiser that is trained on the INL historical lexicon. Availability: web application CLARIN Centre: CLARIAH-NL Input: plain text, , epub, html, docx, alto Output: styled, XML
Tadpole Functionality: PoS/MSD, lemma, syntactic parsing	Dutch	An integrated tokenizer, tagger-lemmatiser, morphological analyzer, and dependency parser for Dutch. CLARIN Centre: LINDAT/CLARIAH-NL
CLAWS Functionality: PoS/MSD Licence: see here	English	CLAWS (the Constituent Likelihood Automatic Word-tagging System) has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to PoS tag approx. 100 million words of the British National Corpus (BNC), and all the English corpora in Mark Davies' BYU corpus server. Users can choose to have output in either the smaller C5 tagset or the larger C7 tagset. Availability: web application CLARIN Centre: CLARIN UK Input: plain text Output: horizontal, vertical, pseudo-XML Related publication: Garside and Smith (1997)
MorphAdorner Lemmatizer Functionality: lemma	English	This tool is implemented in WebLicht and is derived from the MorphAdorner morphological analyser. Availability: WebLicht Input: , XML CLARIN Centre: CLARIN-D
OpenNLP Part-of-Speech Tagger (English) Functionality: PoS Licence: Apache Licence 2.0 (restricted)	English	This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application Input: application/xml Output: application/xml CLARIN Centre: CLARIN:EL
Stanford Dependency Parser Functionality: PoS, syntactic parsing	English	This tool is a WebLicht implementation of the Stanford Parser. Availability: WebLicht Input: plain text, pdf, rtf, XML Output: plain text, pdf, rtf, XML CLARIN Centre: CLARIN-D Related publication: Hinrichs et al. (2010)
EstNLTK Functionality: MSD, NER Licence: Available - Unrestricted Use	Estonian	This tool provides common natural language processing functionality such as morphological analysis and named entity recognition for the Estonian language. Web documentation is available here. Availability: download Input: plain text Output: plain text CLARIN Centre: CELR Related publication: Orasmaa et al. (2016)
Vabamorf open source morphology tagger for Estonian Functionality: PoS, MSD, lemma Licence: Available - Unrestricted Use	Estonian	This tool performs various tasks of morphological analysis, including morphological disambiguation and synthesis. Availability: download, web application Input: plain text Output: plain text CLARIN Centre: CELR Related publication: Kaalep (2015)
FinTag Functionality: PoS, lemma, NER Licence: GPL	Finnish	This toolchain provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish. Both tools take running text from standard input and produce tabular output (one token per line) to standard output. Availability: download, web application Input: plain text, pdf, doc, csv, epub, html, odt, xls Output: TSV CLARIN Centre: FIN-CLARIN
OpenNLP Part-of-Speech Tagger (German) Functionality: PoS Licence: Apache Licence 2.0 (restricted)	German	This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application Input: application/xml Output: application/xml CLARIN Centre: CLARIN:EL
Weblicht Part-of-Speech Tagger Functionality: PoS, lemma	German	This tool is a PoS tagger and lemmatiser implemented in WebLicht. Availability: web application, WebLicht Input: TCF, XML CLARIN Centre: CLARIN-D
SepVerb Lemmatizer Functionality: lemma	German	This tool is based on the Mate toolkit. Availability: WebLicht Input: TCF, XML CLARIN Centre: CLARIN-D
SMOR lemmatizer Functionality: PoS, lemma	German	This tool is implemented in WebLicht. Input: TCF, XML CLARIN Centre: CLARIN-D
Stuttgart Dependency Parser Functionality: PoS, syntactic parsing	German	This tool is a WebLicht implementation of the Stuttgart parser. Availability: WebLicht Input: plain text, pdf, rtf, XML Output: plain text, pdf, rtf, XML CLARIN Centre: CLARIN-D Related publication: Hinrichs et al. (2010)
ILSP Feature-based multi-tiered POS Tagger Functionality: PoS Licence: terms of service (Restrictions: Academic - Non Commercial Use)	Greek	This tool is an FBT-based multi-tiered tagger. FBT is a variant of the well-known transformation-based learning paradigm that aims to improve the quality of tagging for highly inflective languages such as Greek. Availability: web application Input: Application/vnd.xmi+xml Output: Application/vnd.xmi+xml CLARIN Centre: CLARIN:EL Related publication: Papageorgiou et al. (2000)
hunpos Functionality: PoS Licence: New BSD License	Hungarian	This tool is an open source reimplementation of the TnT tagger (Brants 2000). Availability: download CLARIN Centre: LINDAT Related publication: Halácsy et al. (2007)
ABLTagger (Lemmatizer) Functionality: lemma Licence: The MIT License	Icelandic	The lemmatiser achieves an accuracy of 98.3% on MIM-Gold (21.05, cross-validation). Availability: download Input: tokenised plain text CLARIN Centre: CLARIN-IS Related publication: Steingrímsson et al. (2019)
ABLTagger (PoS) Functionality: PoS Licence: Apache License 2.0	Icelandic	This tool is a part-of-speech tagger for Icelandic. This entry contains pretrained models for ABLTagger v3.0.0. Two models, small and large, work with the revised tagset and achieve accuracies of ~96.7% and ~97.8%, respectively, on MIM-Gold (cross-validation, excluding "x" and "e" tags). Availability: download Input: tokenised plain or pre-tagged text CLARIN Centre: CLARIN-IS Related publication: Steingrímsson et al. (2019)
IceNLP Natural Language Processing toolkit Functionality: PoS, lemma, shallow syntactic parsing Licence: GNU General Public License, version 2	Icelandic	This tool is an open source NLP toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java. Availability: download, web application Input: plain text Output: plain text CLARIN Centre: CLARIN-IS Related publication: Loftsson and Rögnvaldsson (2007)
Freeling Functionality: PoS, lemma	Italian	This toolchain was developed in the PANACEA project and implements Freeling 2.1 libraries. Availability: web application CLARIN Centre: CLARIN-IT Publication: Padró et al. (2010)
NLP-PIPE Functionality: MSD, syntactic parsing, NER Licence: GNU General Public Licence 3	Latvian	This tool is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It provides the glue code needed to combine tools even if they are written in different programming languages and rely on conflicting library versions. It was created to make NLP technology more accessible to linguists, and to make the creation and integration of new tools easier for researchers and software developers. Availability: download, CLARIN Centre: CLARIN-LV Related publication: Znotins and Cirule (2018)
MLSS Tagger Web Service Functionality: PoS Licence: CLARIN ACA	Maltese	This tool is an implementation of the TnT tagger (Brants 2000). The model for Maltese was trained on manually tagged texts and has reached an accuracy of 96%. The tagset tailored to Maltese is available here. Availability: web application CLARIN Centre: PORTULAN
NCHLT isiNdebele Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Ndebele	This tool is a lemmatiser for the Ndebele Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
The Oslo-Bergen tagger Functionality: MSD, syntactic parsing Licence: GNU General Public Licence	Norwegian (Bokmål and Nynorsk)	This tool consists of three main modules: a pre-processor with a composition analyzer and multitagger, a grammar module for morphological and syntactic disambiguation (based on the constraint grammar paradigm) and a statistical module that removes the last residual morphological ambiguity (only for Bokmål). The tool is trained on the Norwegian wordbank. Availability: download CLARIN Centre: CLARINO Related publication: Johannessen et al. (2012)
Morfeusz 2 Functionality: MSD Licence: BSD 2 (public)	Polish	This tool is a dictionary-based morphological analyser and generator for Polish. This version of the program is decoupled from the dictionary. Two dictionaries of Polish developed within other projects are distributed with Morfeusz 2, namely SGJP and Polimorf. Availability: download, web application Input: various Output: various CLARIN Centre: CLARIN-PL Related publication: Woliński (2014)
MorphoDiTa-based tagger for Polish language Functionality: MSD Licence: GNU LGPL 3.0	Polish	This tool is based on the MorphoDiTa tagger, adapted to Polish. The tool employs the NKJP tagset. Availability: download CLARIN Centre: CLARIN-PL
Tagger SentiOne - version 2 Functionality: MSD Licence: GNU GPL3	Polish	This tool is the second version of the tagger developed in the SentiOne project, adapted to UGC processing. It has been extended with a tokenizer and with heuristics that improve its accuracy. Availability: download CLARIN Centre: CLARIN-PL
Tagger WS Functionality: MSD, lemma	Polish	This tool uses the NKJP tagset and implements the Morfeusz SGJP dictionary. The service is based on WCRFT. Availability: web application Input: plain text, XML Output: plain text, XML CLARIN Centre: CLARIN-PL
TaKIPI Functionality: MSD	Polish	This tool assumes the morpho-syntactic description of the IPI PAN corpus tagset (Przepiórkowski 2005). CLARIN Centre: CLARIN-PL Related publication: Piasecki (2007)
WCRFT (Wrocław CRF Tagger) Functionality: MSD Licence: GNU LGPL 3.0	Polish	This tool combines tiered tagging, conditional random fields (CRF) and features tailored for inflective languages written in WCCL. The algorithm and code are inspired by Wrocław Memory-Based Tagger (WMBT). Availability: download CLARIN Centre: CLARIN-PL Related publication: Radziszewski (2013)
WMBT (Wrocław Memory-Based Tagger) Functionality: MSD Licence: GNU LGPL 3.0	Polish	This tool uses the TiMBL API as the underlying memory-based learning implementation. The features for classification are generated by using the WCCL formalism. The tool uses a tiered tagging approach. Grammatical class is disambiguated first, then subsequent attributes (as defined in a config file) are taken care of. Each attribute may be supplied a different set of features. The software package comes with default configurations for KIPI/IPIC and NKJP tagsets. Availability: download Input: various, default is XML Output: various, default is XCES XML CLARIN Centre: CLARIN-PL
Lemmatizer for Portuguese Functionality: lemma Licence: Apache Licence 2.0 (academic)	Portuguese	This tool is based on the MXPOST part of speech tagger and is trained on UNITEX dictionaries for Portuguese. Availability: download Input: plain text Output: plain text CLARIN Centre: PORTULAN
LX-Tagger Functionality: MSD Licence: Academic - Non-Commercial use	Portuguese	This tool is based on the TnT tagger (Brants 2000). Availability: download CLARIN Centre: PORTULAN Related publication: Silva (2007)
LX-Verbal Lemmatizer Functionality: lemma (verbs) Licence: Terms of Service	Portuguese	This tool performs fully-fledged lemmatisation of Portuguese verbs, including the full range of pronominal conjugation forms. Availability: web application CLARIN Centre: PORTULAN
OpenNLP Part-of-Speech Tagger (Portuguese) Functionality: PoS Licence: Apache Licence 2.0 (restricted)	Portuguese	This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text. Availability: web application Input: application/xml Output: application/xml CLARIN Centre: CLARIN:EL
NCHLT Sepedi Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Sepedi	This tool is a lemmatiser for the Sepedi (Northern Sotho) Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
Sepedi Part of Speech Tagger Functionality: PoS	Sepedi	This tool is based on Helmut Schmid's stochastic tagger (see Schmid 1994), supported by additional noun and verb guessing modules and a tokenizer. CLARIN Centre: SADiLaR
NCHLT Sesotho Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Sesotho	This tool is a lemmatiser for the Sesotho Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
Character-level part-of-speech tagger of Slovene language Functionality: PoS Licence: GNU General Public Licence, version 3	Slovenian	This tool uses convolutional and LSTM neural networks. The tool has been trained on the ssj500k 2.1 corpus. Availability: download Input: XML, TEI, plain text CLARIN Centre: CLARIN.SI Related publication: Belej (2018)
janes-tagger Functionality: PoS, lemma	Slovenian	This tool, which was developed in the context of the JANES project, tags non-standard Slovenian, with Croatian and Serbian to follow. Availability: download Input: plain text CLARIN Centre: CLARIN.SI
NCHLT Siswati Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Swazi	This tool is a lemmatiser for the Swazi Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
NCHLT Xitsonga Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Tsonga	This tool is a lemmatiser for the Tsonga Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
NCHLT Setswana Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Tswana	This tool is a lemmatiser for the Tswana Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
NCHLT Tshivenda Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Venda	This tool is a lemmatiser for the Venda Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
NCHLT isiZulu Lemmatiser Functionality: lemma Licence: CC-BY 2.5 South Africa Licence	Zulu	This tool is a lemmatiser for the Zulu Bantu language developed during the NCHLT Text project (Barnard et al. 2014). Availability: download Input: Text data (encoding: UTF8 without BOM), one lowercase token per line Output: Token tab, lemma CLARIN Centre: SADiLaR
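
Several lemmatisers in the table above state the same exchange format: UTF-8 input without a BOM with one lowercase token per line, and output consisting of a token, a tab, and a lemma. The sketch below shows one way such files might be prepared and read back. It assumes that "Token tab, lemma" means one `token<TAB>lemma` pair per line; the function names and the sample words are hypothetical, and no tool is actually invoked.

```python
def to_lemmatiser_input(tokens):
    """Lowercase each token and put it on its own line."""
    return "".join(f"{tok.lower()}\n" for tok in tokens)


def parse_lemmatiser_output(text):
    """Read 'token<TAB>lemma' lines into a list of (token, lemma) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, lemma = line.split("\t", 1)
        pairs.append((token, lemma))
    return pairs


if __name__ == "__main__":
    payload = to_lemmatiser_input(["Die", "Katte"])
    # Python's "utf-8" codec writes no byte order mark ("utf-8-sig" would add one).
    data = payload.encode("utf-8")
    print(data)

    # Made-up output in the assumed format, for illustration only.
    sample_output = "katte\tkat\nhonde\thond\n"
    print(parse_lemmatiser_output(sample_output))
```

Keeping the conversion in two small functions makes it easy to check the format before handing a file to a downloaded tool, whose actual command-line interface is not described on this page.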

Tool

Language

Description

Afrikaans TnT-Tagger

Functionality: PoS

Licence: research only

Afrikaans

This tool is based on the TnT tagger (Brants 2000). The tagset used by the tool was especially designed for Afrikaans and consists of 139 PoS-tags.

Input: plain text
Output: plain text
CLARIN Centre: SADiLaR

NCHLT Afrikaans Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Afrikaans

This tool is a lemmatiser for Afrikaans developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download

Input: Text data (encoding: UTF8 without BOM), one lowercase token per line

Output: Token tab, lemma

CLARIN Centre: SADiLaR

Assamese POS Tagger

Functionality: PoS

Assamese

This tool is a CRF++ based PoS-tagger.

CLARIN Centre: CLARIN-PL

Corpus.by Lemmatizer

Functionality: lemma

Belarusian

This tool is part of the corpus.by platform.

Availability: web service
CLARIN Centre: CLARIN Knowledge Centre for Belarusian text and speech processing
Input: plain text
Output: plain text

CLaRK

Functionality: sentence splitting, PoS, lemma, syntactic parsing

Bulgarian

This tool is an XML-based software system for corpora development implemented in JAVA. The main aim behind the design of the system is the minimisation of human intervention during the creation of language resources. CLaRK includes BTB-Pipe, which is a language pipeline for Bulgarian that comprises the following modules: sentence splitting, MSD-tagging, lemmatisation, dependency parsing.

Availability: dowload
Input: XML
Output: XML
CLARIN Centre: ClaDA-BG
Related publication: Simov et al. (2001)

HMM tagger

Functionality: MSD
Licence: GNU General Public Licence, version 2

Czech

This tool uses Hidden Markov Models and is an implementation of the UFAL tagger.

Availability: download
CLARIN Centre: LINDAT

Frog

Functionality: PoS, MSD, lemma, NE, phrase chunks, dependency relations with head words
Licence: GNU General Public Licence

Dutch

This tool is an integration of memory-based modules developed for Dutch. All NLP modules are based on TiMBL, the Tilburg memory-based learning software package. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.

Availability: download
Output: FoLiA XML
CLARIN Centre: CLARIAH-NL
Related publication: van den Bosch et al. (2007)

INL Labs tagger/lemmatizer tools

Functionality: PoS, lemma
Licence: CLARIN PUB

Dutch

This tool employs a PoS tagger that is trained on the "Letters as loot" historical corpus and a lemmatiser that is trained on the INL historical lexicon.

Availability: web application
CLARIN Centre: CLARIAH-NL
Input: plain text, , epub, html, docx, alto
Output: styled, XML

Tadpole

Functionality: PoS/MSD, lemma, syntactic parsing

Dutch

An integrated tokenizer, tagger-lemmatiser, morphological analyzer, and dependency parser for Dutch.

CLARIN Centre: LINDAT/CLARIAH-NL

CLAWS

Functionality: PoS/MSD
Licence: see here

English

CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to PoS tag approx. 100 million words of the British National Corpus (BNC), and all the English corpora in Mark Davies' BYU corpus server. Users can choose to have output in either the smaller C5 tagset or the larger C7 tagset.

Availability: web application
CLARIN Centre: CLARIN UK
Input: plain text
Output: horizontal, vertical, pseudo-XML
Publication: Garside and Smith (1997)

MorphAdorner Lemmatizer

Functionality: lemma

English

This tool is implemented in WebLicht and is derived from the MorphAdorner morphological analyser.

Availability: WebLicht
Input: , XML
CLARIN Centre: CLARIN-D

OpenNLP Part-of-Speech Tagger (English)

Functionality: PoS
Licence: Apache Licence 2.0 (restricted)

English

This tool is based on the Apache OpenNLP library, which is a perception and maximum entropy-based machine learning toolkit for the processing of natural language text.

Availability: web application
Input: application/xml
Output: application/xml
CLARIN Centre: CLARIN:EL

Stanford Dependency Parser

Functionality: PoS, syntactic parsing

English

This tool is a WebLicht implementation of the Stanford Parser.

Availability: WebLicht
Input: plain text, pdf, rtf, XML
Output: plain text, pdf, rtf, XML
CLARIN Centre: CLARIN-D
Related publication: Hinrichs et al. (2010)

EstNLTK

Functionality: MSD, NER
Licence: Available - Unrestricted Use

Estonian

This tool provides common natural language processing functionality such as morphological analysis and named entity recognition for the Estonian language.

Web documentation is available here.

Availability: download
Input: plain text
Output: plain text
CLARIN Centre: CELR
Related publication: Orasmaa et al. (2016)

Vabamorf open source morphology tagger for Estonian

Functionality: PoS, MSD, lemma
Licence: Available - Unrestricted Use

Estonian

This tool performs various tasks of morphological analysis, including morphological disambiguation and synthesis.

Availability: download, web application
Input: plain text
Output: plain text
CLARIN Centre: CELR
Related publication: Kaalep (2015)

FinTag

Functionality: PoS, lemma, NER
Licence: GPL

Finnish

This toolchain provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish. Both tools take running text from standard input and produce tabular output (one token per line) to standard output.

Availability: download, web application
Input: plain text, pdf, doc, scv, epub, html, odt, xls
Output: TSV
CLARIN Centre: FIN-CLARIN

OpenNLP Part-of-Speech Tagger (German)

Functionality: PoS
Licence: Apache Licence 2.0 (restricted)

German

This tool is based on the Apache OpenNLP library, which is a perception and maximum entropy–based machine learning toolkit for the processing of natural language text.

Availability: web application
Input: application/xml
Output: application/xml
CLARIN Centre: CLARIN:EL

Weblicht Part-of-Speech Tagger

Functionality: PoS, lemma

German

This tool is a PoS tagger and lemmatiser implemented in WebLicht.

Availability: web application, WebLicht
Input: TCF, XML
CLARIN Centre: CLARIN-D

SepVerb Lemmatizer

Functionality: lemma

German

This tool is based on the Mate toolkit.

Availability: WebLicht
Input: TCF, XML
CLARIN Centre: CLARIN-D

SMOR lemmatizer

Functionality: PoS, lemma

German

This tool is implemented in WebLicht.

Input: TCF, XML
CLARIN Centre: CLARIN-D

Stuttgart Dependency Parser

Functionality: PoS, syntactic parsing

German

This tool is a Weblicht implementation of the Stuttgart parser.

Availability: WebLicht
Input: plain text, pdf, rtf, XML
Output: plain text, pdf, rtf, XML
CLARIN Centre: CLARIN-D
Related publication: Hinrichs et al. (2010)

ILSP Feature-based multi-tiered POS Tagger

Functionality: PoS
Licence: terms of service (Restrictions: Academic - Non Commercial Use)

Greek

This tool is a FBT-based multitiered tagger. FBT is a variant of the well-known transformation based learning paradigm aiming at improving the quality of tagging highly inflective languages such as Greek.

Availability: web application
Input: Application/vnd.xmi+xml
Output: Application/vnd.xmi+xml
CLARIN Centre: CLARIN:EL
Related publication: Papageorgiou et al. (2000)

hunpos

Functionality: PoS
Licence: New BSD License

Hungarian

This tool is an open source reimplementation of the TnT tagger (Brants 2000).

Availability: download
CLARIN Centre: LINDAT
Related publication: Halácsy et al. (2007)

ABLTagger (Lemmatizer)

Functionality: lemma
Licence: The MIT License

Icelandic

The lemmatiser achieves an accuracy of 98.3% on MIM-Gold (21.05, cross-validation).

Availability: download

Input: tokenised plain text

CLARIN Centre: CLARIN-IS

Related publication: Steingrímsson et al. (2019)

ABLTagger (PoS)

Functionality: PoS
Licence: Apache License 2.0

Icelandic

This tool is a part of speech tagger for Icelandic. This entry contains pretrained models for ABLTagger v3.0.0. There are two versions, small and large, of PoS taggers that work with the revised tagset that achieve an accuracy of ~96.7% and ~97.8% on MIM-Gold (cross-validation, excluding "x" and "e" tags), respectively.

Availability: download

Input: tokenised plain or pre-tagged text

CLARIN Centre: CLARIN-IS

Related publication: Steingrímsson et al. (2019)

IceNLP Natural Language Processing toolkit

Functionality: PoS, lemma, shallow syntactic parsing
Licence: GNU General Public License, version 2

Icelandic

This tool is an open source NLP toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.

Availability: download, web application
Input: plain text
Output: plain text
CLARIN Centre: CLARIN-IS
Related publication: Loftsson and Rögnvaldsson (2007)

Freeling

Functionality: PoS, lemma

Italian

This toolchain was developed in the PANACEA project and implements Freeling 2.1 libraries.

Availability: web application
CLARIN Centre: CLARIN-IT
Publication: Padró et al. (2010)

NLP-PIPE

Functionality: MSD, syntactic parsing, NER
Licence: GNU General Public Licence 3

Latvian

This tool is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It provides the gluing code that is used to combine tools even if they are written in different programming languages and rely on conflicting library versions. It was created to make NLP technology more accessible to linguists, and to make new tool creation and integration easier to researchers and software developers.

Availability: download,
CLARIN Centre: CLARIN-LV
Related publication: Znotins and Cirule (2018)

MLSS Tagger Web Service

Functionality: PoS
Licence: CLARIN ACA

Maltese

This tool is an implementation of the TnT tagger (Brants 2000). The model for Maltese was trained on manually tagged texts and has reached an accuracy of 96%. The tagset tailored to Maltese is available here.

Availability: web application
CLARIN Centre: PORTULAN

NCHLT isiNdebele Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Ndebele

This tool is a lemmatiser for Ndebele Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download

Input: Text data (encoding: UTF8 without BOM), one lowercase token per line

Output: Token tab, lemma

CLARIN Centre: SADiLaR

The Oslo-Bergen tagger

Functionality: MSD, syntactic parsing
Licence: GNU General public licence

Norwegian (Bokmål and Nynorsk)

This tool consists of three main modules: a pre-processor with a composition analyzer and multitagger, a grammar module for morphological and syntactic disambiguation (based on the constraint grammar paradigm) and a statistical module that removes the last residual morphological ambiguity (only for Bokmål). The tool is trained on the Norwegian wordbank.

Availability: download
CLARIN Centre: CLARINO
Related publication: Johannessen et al. (2012)

Morfeusz 2

Functionality: MSD
Licence: BSD 2 (public)

Polish

This tool is a dictionary-based morphological analyser and generator for Polish. This version of the program is decoupled from the dictionary. Two dictionaries of Polish developed within other projects are distributed with Morfeusz 2, namely SGJP and Polimorf.

Availability: download, web application
Input: various
Output: various
CLARIN Centre: CLARIN-PL
Related publication: Woliński (2014)

MorphoDiTa-based tagger for Polish language

Functionality: MSD
Licence: GNU LGPL 3.0

Polish

This tool is based on the MorphoDiTa tagger, adapted to Polish. The tool employs the NKJP tagset.

Availability: download
CLARIN Centre: CLARIN-PL

Tagger SentiOne - version 2

Functionality: MSD
Licence: GNU GPL3

Polish

This tool is the second version of tagger developed in the sentione project, adapted to UGC-processing. The tool has been enriched with some heuristics to improve its accuracy and a tokenizer.

Availability: download
CLARIN Centre: CLARIN-PL

Tagger WS

Functionality: MSD, lemma

Polish

This tool uses the NKJP tagset and implements the Morfeusz SGJP dictionary. The service is based on WCRFT.

Availability: web application
Input: plain text, XML
Output: plain text, XML
CLARIN Centre: CLARIN-PL

TaKIPI

Functionality: MSD

Polish

This tool assumes the morpho-syntactic description of the IPI PAN corpus tagset (Przepiórkowski 2005).

CLARIN Centre: CLARIN-PL
Related publication: Piasecki (2007)

WCRFT (Wrocław CRF Tagger)

Functionality: MSD
Licence: GNU LGPL 3.0

Polish

This tool combines tiered tagging, conditional random fields (CRF) and features tailored for inflectional languages, which are written in the WCCL formalism. The algorithm and code are inspired by the Wrocław Memory-Based Tagger (WMBT).

Availability: download
CLARIN Centre: CLARIN-PL
Related publication: Radziszewski (2013)

WMBT (Wrocław Memory-Based Tagger)

Functionality: MSD
Licence: GNU LGPL 3.0

Polish

This tool uses the TiMBL API as the underlying memory-based learning implementation. The features for classification are generated using the WCCL formalism. The tool uses a tiered tagging approach: grammatical class is disambiguated first, then the subsequent attributes (as defined in a config file) are disambiguated in turn. Each attribute may be supplied with a different set of features. The software package comes with default configurations for the KIPI/IPIC and NKJP tagsets.

Availability: download
Input: various, default is XML
Output: various, default is XCES XML
CLARIN Centre: CLARIN-PL
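
The tiered approach described for WMBT can be illustrated with a minimal Python sketch. This is not WMBT's implementation: the candidate analyses, the attribute names and the `choose` function are hypothetical stand-ins for the morphological analysis and the trained classifiers. The sketch only shows the control flow, in which the grammatical class is fixed first and each later attribute is resolved among the candidates that survived the earlier tiers.

```python
from typing import Callable, Dict, List

# A candidate analysis maps attribute names to values, e.g.
# {"class": "subst", "number": "sg", "case": "nom"} (names are illustrative).
Candidate = Dict[str, str]

def tiered_disambiguate(
    candidates: List[Candidate],
    tiers: List[str],
    choose: Callable[[str, List[str]], str],
) -> List[Candidate]:
    """Resolve one attribute per tier, keeping only consistent candidates."""
    remaining = list(candidates)
    for attribute in tiers:
        # Values still possible for this attribute among surviving candidates.
        values = sorted({c[attribute] for c in remaining if attribute in c})
        if len(values) <= 1:
            continue  # nothing to disambiguate at this tier
        picked = choose(attribute, values)
        # Drop candidates that contradict the decision made at this tier;
        # candidates lacking the attribute are kept.
        remaining = [c for c in remaining if c.get(attribute, picked) == picked]
    return remaining

def first_alphabetically(attribute: str, values: List[str]) -> str:
    """Toy decision function (assumption): a real tagger would use a trained model."""
    return values[0]

candidates = [
    {"class": "subst", "number": "sg", "case": "nom"},
    {"class": "subst", "number": "sg", "case": "acc"},
    {"class": "adj", "number": "pl", "case": "nom"},
]
result = tiered_disambiguate(
    candidates, ["class", "number", "case"], first_alphabetically
)
print(result)
```

Because each tier only sees candidates compatible with earlier decisions, a different feature set (here, a different `choose` function) could be plugged in per attribute.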

Lemmatizer for Portuguese

Functionality: lemma
Licence: Apache Licence 2.0 (academic)

Portuguese

This tool is based on the MXPOST part of speech tagger and is trained on UNITEX dictionaries for Portuguese.

Availability: download
Input: plain text
Output: plain text
CLARIN Centre: PORTULAN

LX-Tagger

Functionality: MSD
Licence: Academic - Non-Commercial use

Portuguese

This tool is based on the TnT tagger (Brants 2000).

Availability: download
CLARIN Centre: PORTULAN
Related publication: Silva (2007)

LX-Verbal Lemmatizer

Functionality: lemma (verbs)
Licence: Terms of Service

Portuguese

This tool performs fully-fledged lemmatisation of Portuguese verbs, including the full range of pronominal conjugation forms.

Availability: web application
CLARIN Centre: PORTULAN

OpenNLP Part-of-Speech Tagger (Portuguese)

Functionality: PoS
Licence: Apache Licence 2.0 (restricted)

Portuguese

This tool is based on the Apache OpenNLP library, which is a perceptron and maximum entropy-based machine learning toolkit for the processing of natural language text.

Availability: web application
Input: application/xml
Output: application/xml
CLARIN Centre: CLARIN:EL

NCHLT Sepedi Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Sepedi

This tool is a lemmatiser for the Sepedi (Northern Sotho) Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR
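
As a rough illustration of the input and output conventions listed for the NCHLT lemmatisers, the following Python sketch prepares text as one lowercase token per line and reads token and lemma pairs back. Whitespace tokenisation, the reading of the output as a token and its lemma separated by a tab, and all function names are assumptions made for this example; the documentation shipped with the tool is authoritative.

```python
from typing import List, Tuple

def to_lemmatiser_input(text: str) -> str:
    """Return one lowercase token per line (whitespace tokenisation is an assumption)."""
    return "\n".join(token.lower() for token in text.split()) + "\n"

def parse_lemmatiser_output(output: str) -> List[Tuple[str, str]]:
    """Read lines assumed to hold a token and its lemma separated by a tab."""
    pairs = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, lemma = line.split("\t", 1)
        pairs.append((token, lemma))
    return pairs

def write_input_file(path: str, text: str) -> None:
    """Write the prepared input as UTF-8 without a byte order mark."""
    # Python's "utf-8" codec writes no BOM ("utf-8-sig" would add one).
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(to_lemmatiser_input(text))
```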

Sepedi Part of Speech Tagger

Functionality: PoS

Sepedi

This tool is based on Helmut Schmid's stochastic tagger (Schmid 1994), supported by additional noun and verb guessing modules and a tokenizer.

CLARIN Centre: SADiLaR

NCHLT Sesotho Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Sesotho

This tool is a lemmatiser for the Sesotho Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

Character-level part-of-speech tagger of Slovene language

Functionality: PoS
Licence: GNU General Public Licence, version 3

Slovenian

This tool uses convolutional and LSTM neural networks. The tool has been trained on the ssj500k 2.1 corpus.

Availability: download
Input: XML, TEI, plain text
CLARIN Centre: CLARIN.SI
Related publication: Belej (2018)

janes-tagger

Functionality: PoS, lemma

Slovenian

This tool, which was developed in the context of the JANES project, tags non-standard Slovenian, with Croatian and Serbian to follow.

Availability: download
Input: plain text
CLARIN Centre: CLARIN.SI

NCHLT Siswati Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Swazi

This tool is a lemmatiser for the Swazi Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

NCHLT Xitsonga Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Tsonga

This tool is a lemmatiser for the Tsonga Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

NCHLT Setswana Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Tswana

This tool is a lemmatiser for the Tswana Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

NCHLT Tshivenda Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Venda

This tool is a lemmatiser for the Venda Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

NCHLT isiZulu Lemmatiser

Functionality: lemma
Licence: CC-BY 2.5 South Africa Licence

Zulu

This tool is a lemmatiser for the Zulu Bantu language developed during the NCHLT Text project (Barnard et al. 2014).

Availability: download
Input: Text data (encoding: UTF8 without BOM), one lowercase token per line
Output: Token tab, lemma
CLARIN Centre: SADiLaR

For Multiple Languages

Tool

Language

Description

NCHLT Tagger

Functionality: PoS, phrase chunks, NE
Licence: CC-BY 2.5 South Africa Licence

Afrikaans, English, Ndebele, Xhosa, Zulu, Sesotho sa Leboa, Setswana, Sesotho, Siswati, Tshivenda, Xitsonga

This tool is used to annotate texts in Afrikaans and a variety of Bantu languages.

Availability: download
Input: UTF-8 text file containing running text
Output: tab-delimited text file containing each token followed by its assigned class
CLARIN Centre: SADiLaR

CST’s lemmatizer

Functionality: lemma

Bulgarian, Czech, Danish, Dutch, English, Estonian, Farsi, French, German, Greek, Hungarian, Icelandic, Italian, Latin, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Ukrainian

This tool uses affix rules (affix: prefix, infix, suffix, circumfix).

Availability: download
CLARIN Centre: LINDAT/CLARIN-DK
Related publication: Jongejan and Dalianis (2009)
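
To make the idea of affix rules concrete, here is a small Python sketch of rule-based lemmatisation with suffix and circumfix rules. The rules, the `AffixRule` structure and the longest-match strategy are invented for illustration and are not taken from the CST lemmatizer, which learns its own rules from training data (Jongejan and Dalianis 2009); infix rules are left out for brevity.

```python
from typing import List, NamedTuple, Optional

class AffixRule(NamedTuple):
    prefix: str      # prefix to strip from the word form ("" if none)
    suffix: str      # suffix to strip from the word form ("" if none)
    new_prefix: str  # prefix of the lemma that replaces it
    new_suffix: str  # suffix of the lemma that replaces it

# Toy rules invented for this example.
RULES: List[AffixRule] = [
    AffixRule("", "ies", "", "y"),   # suffix rule: "studies" -> "study"
    AffixRule("", "s", "", ""),      # suffix rule: "cats" -> "cat"
    AffixRule("ge", "t", "", "en"),  # circumfix rule: German "gesagt" -> "sagen"
]

def lemmatise(word: str, rules: List[AffixRule] = RULES) -> str:
    """Apply the matching rule covering the most characters; else return the word."""
    best: Optional[AffixRule] = None
    for rule in rules:
        affix_length = len(rule.prefix) + len(rule.suffix)
        matches = (
            len(word) > affix_length
            and word.startswith(rule.prefix)
            and word.endswith(rule.suffix)
        )
        if matches and (
            best is None or affix_length > len(best.prefix) + len(best.suffix)
        ):
            best = rule
    if best is None:
        return word
    # Keep the part of the word between the stripped prefix and suffix.
    stem = word[len(best.prefix): len(word) - len(best.suffix)]
    return best.new_prefix + stem + best.new_suffix

print(lemmatise("studies"), lemmatise("cats"), lemmatise("gesagt"))
```

Preferring the rule that matches the most characters lets a specific rule such as "ies" override the more general "s" rule.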

Sparv

Functionality: PoS, MSD, lemma, compound analysis, dictionary lookup

Bulgarian, English, Estonian, Finnish, French, Galician, Italian, Catalan, Latin, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, German

This tool is Språkbanken's corpus annotation pipeline infrastructure. The pipeline uses in-house and external tools on the text to segment it into sentences and paragraphs, tokenise, tag parts-of-speech, look up in dictionaries and analyse compounds. The pipeline can also be run using a web API with XML results, and it is run locally to prepare the documents in Korp, which is SWE-LANG’s corpus search tool. While the most sophisticated support is for modern Swedish, the pipeline supports 19 additional languages.

Availability: web application, web API
Input: plain text, XML
Output: plain text, XML
CLARIN Centre: SWE-CLARIN
Related publication: Borin et al. (2016)

ReLDIanno

Functionality: PoS, lemma, NER, syntactic parsing
Licence: CC-BY (for webservice); Apache 2 for library

Croatian, Serbian, Slovenian

This tool, which was developed in the context of the ReLDI project, employs the MULTEXT tagset for part of speech tagging and Universal Dependencies for syntactic parsing.

Availability: download, web application
Input: plain text, TCF
Output: vertical, plain text
CLARIN Centre: CLARIN.SI
Related publication: Ljubešić et al. (2016)

CLARIN DK NLP Toolbox

Functionality: PoS, lemma, frequency lists

Danish, English

This tool is an NLP toolchain that is part of the core CLARIN-DK structure.

Availability: web application
Input: plain text, rtf, pdf
Output: plain text, rtf
CLARIN Centre: CLARIN-DK

GENIA Tagger

Functionality: PoS, lemma, chunks, named entities
Licence: proprietary - commercial

English, Czech, Slovak

This tool is used for annotating biomedical texts such as MEDLINE abstracts.

Availability: download
CLARIN Centre: PORTULAN
Related publication: Tsuruoka et al. (2005)

MorphoDiTa: Morphological Dictionary and Tagger

Functionality: MSD, lemma
Licence: Mozilla Public Licence 2.0 (software); CC BY-NC-SA models

English, Czech, Slovak

This tool performs morphological analysis, morphological generation, tagging and tokenisation and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, the tool achieves state-of-the-art results with a throughput around 10,000-200,000 words per second. The tool is versioned using Semantic Versioning.

The following language models are available through LINDAT under the CC BY licence: Czech and English.

Availability: download, web application, API
Input: plain text, vertical
Output: vertical, XML
CLARIN Centre: LINDAT
Related publication: Straková et al. (2014)

STEPP Tagger

Functionality: PoS
Licence: proprietary - commercial

English, Czech, Slovak

This tool is used for annotating biomedical texts such as MEDLINE abstracts.

Input: plain text
Output: plain text
CLARIN Centre: PORTULAN

Stanford Phrase Structure Parser

Functionality: PoS, syntactic parsing

English, German

This tool is a WebLicht implementation of the Stanford Parser.

Availability: WebLicht
Input: plain text, pdf, rtf, XML
Output: plain text, pdf, rtf, XML
CLARIN Centre: CLARIN-D
Related publication: Hinrichs et al. (2010)

RFTagger

Functionality: PoS

German, Czech, Slovene, Hungarian

This tool is a PoS tagger implemented in WebLicht.

Availability: download, WebLicht
CLARIN Centre: CLARIN-D
Related publication: Schmid and Laws (2008)

Sticker part-of-speech tagger UD

Functionality: PoS, syntactic parsing, NER
Licence: Blue Oak Model Licence version 1.0.0

German, Dutch

This tool is a PoS tagger, syntactic parser and named entity recognizer implemented in WebLicht. The PoS tagger uses the Universal Dependencies tagset.

Availability: download, WebLicht
CLARIN Centre: CLARIN-D
Related publication: Ling et al. (2015)

TreeTagger

Functionality: PoS, lemma
Licence: free but unspecified

German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian and Old French

This tool is a PoS tagger and lemmatiser implemented in WebLicht.

Availability: download, WebLicht
Output: plain text
CLARIN Centre: CLARIN-D
Related publication: Schmid (1999)

PoS Tagger OpenNLP Project

Functionality: PoS

German, English, Italian

This tool is a PoS tagger implemented in WebLicht. The model for Italian is trained on the MIDT corpus.

Availability: WebLicht
Input: TCF, XML
CLARIN Centre: CLARIN-D

UDPipe

Functionality: PoS, lemma, syntactic parsing
Licence: Mozilla Public Licence 2.0 (software); CC BY-NC-SA UD models

Language independent

This tool is a trainable pipeline for annotating CoNLL-U files. UDPipe is language-agnostic and can be trained given annotated data in the CoNLL-U format. Trained models are provided for nearly all Universal Dependency treebanks.

Availability: download, web application
Input: plain text
Output: CoNLL-U
CLARIN Centre: LINDAT
Related publication: Straka and Straková (2017)
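
Since UDPipe's output format is CoNLL-U, a short Python sketch of reading that format may be helpful. It relies only on the general CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, blank lines between sentences); the function name and the hand-written sample are illustrative and are not actual UDPipe output.

```python
from typing import List, Tuple

def read_conllu(text: str) -> List[List[Tuple[str, str, str]]]:
    """Return sentences as lists of (form, lemma, upos) triples."""
    sentences: List[List[Tuple[str, str, str]]] = []
    current: List[Tuple[str, str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        token_id = columns[0]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
        current.append((columns[1], columns[2], columns[3]))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample for illustration only.
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
print(read_conllu(sample))
```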

Turku-neural-parser-pipeline

Functionality: segmentation, MSD, syntactic parsing, lemma
Licence: Apache License 2.0

More than 50 languages

This tool is a neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatisation, with pre-trained models for more than 50 languages. It was a top-ranking system in the CoNLL 2018 Shared Task.

Availability: download, web application
Input: UTF-8 encoded plain text
Output: CoNLL-U
CLARIN Centre: FIN-CLARIN
Related publication: Kanerva et al. (2018)

Publications

[Barnard et al. 2014] Etienne Barnard, Marelie H. Davel, Charl van Heerden, Febe de Wet, and Jaco Badenhorst. 2014. The NCHLT Speech Corpus of the South African languages. In SLTU-2014, 194–200.

[Belej 2018] Primož Belej. 2018. Oblikoskladenjsko označevanje slovenskega jezika z globokimi nevronskimi mrežami. Master’s Thesis. University of Ljubljana.

[Borin et al. 2016] Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer, and Anne Schumacher. 2016. Sparv: Språkbanken’s corpus annotation pipeline infrastructure. In Proceedings of SLTC 2016.

[van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius, and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, edited by F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste, 99–114.

[Brants 2000] Thorsten Brants. 2000. TnT – A Statistical Part-of-Speech Tagger.

[Halácsy et al. 2007] Péter Halácsy, Andras Kornai, and Csaba Oravecz. 2007. HunPos: an open source trigram tagger.

[Garside and Smith 1997] Roger Garside and Nicholas Smith. 1997. A hybrid grammatical tagger: CLAWS4. In Corpus Annotation: Linguistic Information from Computer Text Corpora, edited by R.G. Garside, Geoffrey Leech, and Anthony Mark McEnery, 112–131.

[Hinrichs et al. 2010] Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow. 2010. WebLicht: Web-Based Services for German. In Proceedings of the ACL 2010 System Demonstrations, 25–29.

[Johannessen et al. 2012] Janne Bondi Johannessen, Kristin Hagen, André Lynum, and Anders Nøklestad. 2012. A combined rule-based and statistical tagger. In Exploring Newspaper Language: Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen, 51–66.

[Jongejan and Dalianis 2009] Bart Jongejan and Hercules Dalianis. 2009. Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, 145–153.

[Kaalep 2015] Heiki-Jaan Kaalep. 2015. Vabamorf, a set of open-source morphological tools for Estonian.

[Kanerva et al. 2018] Jenna Kanerva, Filip Ginter, Niko Miekka, Akseli Leino, and Tapio Salakoski. 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 133–142.

[Ling et al. 2015] Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramón Fernandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1520–1530.

[Ljubešić et al. 2016] Nikola Ljubešić, Filip Klubička, Željko Agić, and Ivo-Pavao Jazbec. 2016. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. In Proceedings of LREC 2016, edited by Nicoletta Calzolari, 4264–4270.

[Loftsson and Rögnvaldsson 2007] Hrafn Loftsson and Eiríkur Rögnvaldsson. 2007. IceNLP: A natural language processing toolkit for Icelandic. In Proceedings of the Eighth Annual Conference of the International Speech Communication Association.

[Orasmaa et al. 2016] Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur, and Heiki-Jaan Kaalep. 2016. EstNLTK - NLP toolkit for Estonian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 2460–2466.

[Papageorgiou et al. 2000] Harris Papageorgiou, Prokopis Prokopidis, Voula Giouli, and Stelios Piperidis. 2000. A Unified POS Tagging Architecture and its Application to Greek. In Proceedings of LREC2000.

[Piasecki 2007] Maciej Piasecki. 2007. Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, 11 (1–2): 151–167.

[Padró et al. 2010] Lluís Padró, Miquel Collado, Samuel Reese, Marina Lloberes, and Irene Castellón. 2010. FreeLing 2.1: Five Years of Open-Source Language Processing Tools. In Proceedings of LREC2010, 931–936.

[Prokopidis et al. 2011] Prokopis Prokopidis, Byron Georgantopoulos, and Haris Papageorgiou. 2011. A Suite of Natural Language Processing Tools for Greek. In The 10th International Conference of Greek Linguistics.

[Przepiórkowski 2005] Adam Przepiórkowski. 2005. The IPI PAN Corpus in numbers. In Proceedings of the 2nd Language & Technology Conference, 27–31.

[Radziszewski 2013] Adam Radziszewski. 2013. A Tiered CRF Tagger for Polish. In Intelligent Tools for Building a Scientific Information Platform, 215–230.

[Schmid and Laws 2008] Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 777–784.

[Schmid 1994] Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In New methods in language processing.

[Schmid 1999] Helmut Schmid. 1999. Improvements in part-of-speech tagging with an application to German. In Natural language processing using very large corpora, 13–25. Springer, Dordrecht.

[Silva 2007] João Silva. 2007. Shallow Processing of Portuguese: From Sentence Chunking to Nominal Lemmatization. Master’s Thesis.

[Simov et al. 2017] Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, and Atanas Kiryakov. 2017. ClaRK – an XML-based System for Corpora Development. In Proc. of the Corpus Linguistics 2001 Conference, 558–560.

[Steingrímsson et al. 2019] Steinþór Steingrímsson, Örvar Kárason, and Hrafn Loftsson. 2019. Augmenting a BiLSTM tagger with a morphological lexicon and a lexical category identification step. arXiv preprint arXiv:1907.09038.

[Straka and Straková 2017] Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe.

[Straková et al. 2014] Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 13–18.

[Tsuruoka et al. 2005] Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun’ichi Tsujii. 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text. In Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, edited by P. Bozanis and E.N. Houstis.

[Woliński 2014] Marcin Woliński. 2014. Morfeusz reloaded. In Proceedings of LREC2014, 1106–1111.

[Znotiņš and Cīrule 2018] Artūrs Znotiņš and Elita Cīrule. 2018. NLP-PIPE: Latvian NLP Tool Pipeline. In Human Language Technologies – The Baltic Perspective, 183–189.