Named entity recognition (NER) is an information extraction task that identifies mentions of named entities in unstructured text and classifies them into predetermined categories, such as person names, organisations, locations, dates and times, and monetary values. NER tools can, for example, support the classification of news content, content recommendations, and search algorithms.
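
To make the task concrete, the following minimal sketch shows the kind of output an NER tool produces: entity mentions with a category and character offsets. It is a toy dictionary lookup, not the method of any tool listed on this page; the `GAZETTEER` entries, the `Entity` type, the `find_entities` function and the example sentence are all made up for illustration. Real tools typically rely on statistical or neural models rather than a fixed name list.

```python
import re
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    category: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


# Hypothetical toy gazetteer (assumption): names mapped to categories.
GAZETTEER = {
    "Ada Lovelace": "person",
    "CLARIN": "organisation",
    "Utrecht": "location",
}


def find_entities(text: str) -> list[Entity]:
    """Return all gazetteer matches in order of appearance."""
    entities = []
    for name, category in GAZETTEER.items():
        # Match whole words only, so a name is not found inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append(
                Entity(match.group(), category, match.start(), match.end())
            )
    return sorted(entities, key=lambda entity: entity.start)


# Made-up example sentence.
sentence = "Ada Lovelace wrote to CLARIN from Utrecht."
for entity in find_entities(sentence):
    print(entity)
```

A lookup like this cannot handle unseen names or ambiguous ones (a surname that is also a place name, for instance), which is why the tools below are trained on annotated corpora or built on hand-written rules.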
The CLARIN infrastructure offers 25 tools for NER. Seventeen tools are aimed at a single language (4 Dutch, 2 English, 1 Finnish, 2 German, 1 Icelandic, 1 Greek, 1 Hungarian, 1 Latvian, 3 Polish, 1 Portuguese), while the rest have a broad multilingual scope. While 16 tools are dedicated exclusively to NER, 10 are part of tool pipelines that also provide functionalities such as PoS-tagging, lemmatisation and syntactic parsing.
For comments, changes to the existing content or the inclusion of new tools, send us an email at resource-families [at] clarin.eu.
Tools for Named Entity Recognition in the CLARIN Infrastructure
Tool | Language | Description |
---|---|---|
Functionality: tokenization, sentence segmentation, PoS-tagging, phrase chunking, NER | Afrikaans, English, South Ndebele, Xhosa, Zulu, Sesotho, Pedi, Setswana, Swazi, Venda, Tsonga | This is a corpus query and manipulation tool primarily for the official South African languages. The tool supports the creation of frequency and word lists, collocation searches and statistical analysis of corpus data. Availability: download. NER categories: organisation, person, location, miscellaneous, outside |
Functionality: PoS-tagging, phrase chunking, NER | Afrikaans, English, Ndebele, Xhosa, Zulu, Pedi, Setswana, Sesotho, Swazi, Venda, Tsonga | This is a graphical user interface and command-line tool for automatic text processing. Availability: download. NER categories: organisation, person, location, miscellaneous, outside |
Functionality: tokenisation, MSD-tagging, syntactic parsing, lemmatization, NER | Catalan, English, Galician, Italian, Portuguese, Welsh | This is an open source language analysis tool suite that provides several processing components. Availability: download |
Functionality: tokenisation, MSD-tagging, lemmatisation, morphological segmentation, phrase chunking, NER | Dutch | Frog is a memory-based pipeline based on Timbl, the Tilburg memory-based learning software package. Frog produces FoLiA XML. Availability: download |
Functionality: tokenisation, sentence segmentation, PoS-tagging, lemmatisation, NER | Dutch | This toolchain currently provides two annotation tools: the Stanford named entity recognizer, which was trained on the historical Dutch newspapers corpus Letters as loot in the context of the IMPACT project (Landsbergen 2012), and a tagger that consists of a tokenizer/sentence boundary detector, a statistical part-of-speech tagger and a lemmatizer. The toolchain produces linguistically annotated output from a number of input formats (TEI, plain text, Alto, .doc files). Availability: online service |
NameScape: Named Entity Recognition. Functionality: NER | Dutch | This NER was developed in the Namescape project. Availability: online service |
The NERD named entity recognizer. Functionality: NER | Dutch | This NER is now integrated into the PICLL workflow. Availability: online service |
Functionality: NER | Czech, English | NameTag is an open-source tool that recognizes different NER categories depending on the language model. For Czech, it recognizes a complex hierarchy of categories. The English model, which is trained on CoNLL-2003 NER annotations (Sang and De Meulder 2003), distinguishes four NER classes: person, organisation, location and miscellaneous. The trained model for Czech is available through LINDAT: Czech Models (CNEC) for NameTag. A user manual is also available. Availability: download, online service, web API |
Illinois Named Entity Recognizer. Functionality: NER | English | This NER annotates plain text. Availability: WebLicht |
Functionality: NER | English | This NER can be applied to existing corpora available through the CLARIN:EL infrastructure and to independently uploaded corpora that are compatible with the tool’s requirements. Availability: online service |
Functionality: tokenization, PoS-tagging, NER, semantic and orthographic coreference, pronominal coreference | English, French, German, Romanian, Russian, Welsh, Danish, Chinese, Arabic | This is a complete NLP platform with modules for named entity recognition. Availability: download, online service |
OpenNLP Named Entity Recognizer. Functionality: NER | English, Spanish | This NER is based on the OpenNLP NER tool. Availability: WebLicht |
Functionality: PoS/MSD-tagging, NER | Finnish | This software package provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish. Availability: download, online service |
German Named Entity Recognizer. Functionality: NER | German | This NER is based on the maximum entropy approach using the OpenNLP maxent library. Two models are available: one trained on the CoNLL-2003 training set (conll) and one trained on the TuebaDZ corpus release 8 (tuebadz). Availability: WebLicht |
Functionality: NER | German | This NER is tailored to historical German (optimized for journals and high precision) and is based on weighted finite state transducers. Availability: WebLicht |
Sticker Named Entity Recognizer. Functionality: NER | German, Dutch | This NER is built on a neural-network-based sequence labeller that can label named entities for German and Dutch. Availability: download, WebLicht |
Functionality: NER | Greek (modern) | This NER operates on a rule-based engine. It was developed and is maintained by the Institute for Language and Speech Processing / Athena Research Center. The recognizer can be applied to existing corpora available through the CLARIN:EL infrastructure and to independently uploaded corpora that are compatible with the tool’s requirements. Availability: online service |
hunner - named entity recognizer for Hungarian. Functionality: NER | Hungarian | This NER employs a maximum entropy approach. Availability: unavailable |
Functionality: NER. Licence: Apache 2.0 (models) | Icelandic | This is a dockerized NER for Icelandic. The code for the API is available at GitHub. Two models for this NER are available for download through the CLARIN-IS repository: the ELECTRA-base model, which achieves an F1-score of ~91.9 on the MIM-GOLD-NER test set, and the Ensemble model, which uses the IceBERT language model from Miðeind as its primary model but can also be used with three other transformer language models (ELECTRA-base, convbert-small, and multilingual-BERT), whose outputs it combines with CombiTagger. Availability: online service. NER categories: person, location, organization, miscellaneous, date, money, time, percent. CLARIN Centre: CLARIN-IS |
Functionality: tokenisation, NER, MSD-tagging, lemmatisation, syntactic parsing (universal dependencies) | Latvian | NLP-PIPE is a modular toolchain that allows researchers to combine multiple natural language processing tools in a unified framework. It supports a wide range of annotation services for Latvian, including tokenization, morphological tagging, lemmatisation, universal dependency parsing, and named entity recognition. In the web-based interface, a user selects the required processing tools and enters the text to be annotated. The results can then be viewed directly on the website or exported in several formats (JSON, CONLL). Availability: online service. CLARIN Centre: CLARIN-LV. Publication: Znotiņš and Cīrule (2018) |
Functionality: NER | Polish | This NER uses conditional random fields and a rich set of token features. The tool took third place in the PolEval 2018 Task 2 on named entity recognition. It contains pre-trained models trained on the National Corpus of Polish (NKJP) and the KPWr corpus (Broda et al. 2012). The KPWr model distinguishes the following categories: person, location, facility, organization, product, event, adjective. The NKJP model distinguishes the following NER categories: person, organization, location, date, time. Availability: download, online service, web API |
Functionality: NER | Polish | This statistical NER is based on linear-chain conditional random fields. Availability: download |
Functionality: NER | Polish | This NER uses deep learning methods. The tool took second place in the PolEval 2018 Task 2 on NER. It contains a model pre-trained on the NKJP corpus. Availability: download |
Functionality: NER | Portuguese | This NER annotates plain text by identifying and classifying the named entity expressions it contains. The module is integrated into the full LX-Suite pipeline (tokenization, POS tagging, parsing). Availability: online service |
Functionality: NER | Slovenian, Croatian, Serbian | This named entity recognizer is a slight modification of the CRF-based reldi-tagger with Brown cluster information added. Input data need to be pre-processed by the reldi-tokeniser and the reldi-tagger for morphosyntactic annotation. Availability: download, online service, web API |
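
Several tools in the table report categories such as person, organisation, location and miscellaneous, plus an "outside" label for tokens that belong to no entity. The sketch below shows one common way such token-level labels are turned into entity mentions, assuming the BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside) that is widely used for CoNLL-2003-style data. The function name `bio_to_spans`, the short category codes and the example tokens are illustrative assumptions; individual tools may use a different labelling scheme or output format.

```python
def bio_to_spans(tokens, labels):
    """Group BIO token labels into (category, text) entity spans."""
    spans = []
    current_category = None
    current_tokens = []
    for token, label in zip(tokens, labels):
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, category = label.partition("-")
        continues = prefix == "I" and category == current_category
        if not continues and current_tokens:
            # The running entity ends here: store it and reset.
            spans.append((current_category, " ".join(current_tokens)))
            current_category, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new entity; a stray "I-" label is treated like "B-".
            current_category, current_tokens = category, [token]
        elif continues:
            current_tokens.append(token)
        # Tokens labelled "O" are outside any entity and are skipped.
    if current_tokens:
        spans.append((current_category, " ".join(current_tokens)))
    return spans


# Made-up example sentence with hand-assigned labels.
tokens = ["Anna", "Nowak", "joined", "Acme", "Corp", "in", "Riga", "."]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, labels))
```

Treating a stray `I-` label as the start of a new entity is a lenient design choice; a stricter decoder could reject such sequences instead.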
Publications
[Broda et al. 2012] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Proceedings of LREC2012.
[Carreras, Màrquez and Padró 2003] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003. A simple named entity extractor using AdaBoost. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 – Volume 4, 152–155.
[Chrupała and Klakow 2010] Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeller for German: exploiting Wikipedia and distributional clusters. In Proceedings of LREC2010.
[Cunningham et al. 2019] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, and Leon Derczynski. 2019. Developing Language Processing Components with GATE Version 8 (a User Guide).
[Derczynski et al. 2015] Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management 51 (2): 32–49.
[Didakowski and Drotschmann 2009] Jörg Didakowski and Marko Drotschmann. 2009. In Finite-State Methods and Natural Language Processing, 50–61.
[Fišer, Ljubešić and Erjavec 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation.
[Landsbergen 2012] Frank Landsbergen. 2012. Evaluation of named entity work in IMPACT: NE Recognition and matching. Technical report.
[Marcińczuk, Kocoń and Gawor 2018] Michał Marcińczuk, Jan Kocoń, and Michał Jacek Gawor. 2018. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches. In Proceedings of the PolEval 2018 Workshop, 71–86.
[Ruokolainen et al. 2019] Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. 2019. A Finnish news corpus for named entity recognition. Language Resources and Evaluation.
[Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 – Volume 4, 142–147.
[Simon 2013] Eszter Simon. 2013. Approaches to Hungarian Named Entity Recognition. PhD Thesis.
[Straková, Straka and Hajič 2013] Jana Straková, Milan Straka, and Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In TSD 2013: Text, Speech, and Dialogue, edited by I. Habernal and V. Matoušek, 68–75.
[Van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Computational Linguistics in the Netherlands 2006: selected papers from the Seventeenth CLIN meeting, edited by Peter Dirix, 191–206.
[Znotiņš and Cīrule 2018] Artūrs Znotiņš and Elita Cīrule. 2018. NLP-PIPE: Latvian NLP Tool Pipeline. Frontiers in Artificial Intelligence and Applications 307: 183–189.