Tools for named entity recognition

 

Introduction

Named entity recognition (NER) is an information extraction task which identifies mentions of various named entities in unstructured text and classifies them into predetermined categories, such as person names, organisations, locations, date/time, monetary values, and so forth. NER can, for example, help with the classification of news content, content recommendations and search algorithms.
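To make the task concrete, here is a minimal sketch that finds mentions by looking them up in a tiny hand-written name list. The lexicon, the category labels and the function name `find_entities` are assumptions made purely for illustration; the tools listed on this page rely on trained or rule-based models rather than a fixed list.

```python
import re

# Hypothetical toy lexicon: surface form -> category (illustration only).
LEXICON = {
    "Ada Lovelace": "person",
    "London": "location",
    "Royal Society": "organisation",
}

def find_entities(text):
    """Return (start, end, mention, category) tuples sorted by position."""
    found = []
    for name, category in LEXICON.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.end(), name, category))
    return sorted(found)

sentence = "Ada Lovelace addressed the Royal Society in London."
entities = find_entities(sentence)
for start, end, mention, category in entities:
    print(start, end, mention, category)
```

Each result pairs a character span in the input with a category, which is the general shape of output that NER tools produce, whatever method they use internally.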

The CLARIN infrastructure offers 24 tools for NER. Of these, 15 are aimed at a single language (4 Dutch, 2 English, 1 Finnish, 2 German, 1 Greek, 1 Hungarian, 3 Polish, 1 Portuguese), while the remaining 9 cover two or more languages. In terms of functionality, 16 tools are dedicated exclusively to NER, while 8 are part of tool pipelines that also provide functionalities such as PoS-tagging, lemmatisation and syntactic parsing.

To comment, suggest changes to the existing content, or propose new tools for inclusion, send us an email.

This website was last updated on 30 March 2020.

Tools for named entity recognition in the CLARIN infrastructure

Tool Language Description

CTexTools 2

Functionality: tokenization, sentence segmentation, PoS-tagging, phrase chunking, NER

Licence: CC 4.0

Afrikaans, English, South Ndebele, Xhosa, Zulu, Sesotho, Pedi, Setswana, Swazi, Venda, Tsonga

This is a corpus query and manipulation tool primarily for the official South African languages. The tool supports the creation of frequency and word lists, collocation searches and statistical analysis of corpus data.

Availability: download

CLARIN Centre: SADiLaR

NCHLT Tagger

Functionality: PoS-tagging, phrase chunking, NER

Platform: cross-platform

Licence: CC-A 2.5 South Africa Licence

Afrikaans, English, Ndebele, Xhosa, Zulu, Pedi, Setswana, Sesotho, Swazi, Venda, Tsonga

This is a graphical user interface and command-line tool for automatic text processing.

Availability: download

CLARIN Centre: SADiLaR

FreeLing

Functionality: tokenisation, MSD-tagging, syntactic parsing, lemmatization, NER

Platform: cross-platform

Licence: Affero GPL

Catalan, English, Galician, Italian, Portuguese, Welsh

This is an open source language analysis tool suite that provides several processing components.

Availability: download

CLARIN Centre: LINDAT

NER categories: person, location, organisation, miscellaneous

Publication: Carreras, Màrquez, and Padró (2003)

Frog

Functionality: tokenisation, MSD-tagging, lemmatisation, morphological segmentation, phrase chunking, NER

Platform: Linux, Mac OS X

Licence: GNU General Public Licence

Dutch

Frog is a memory-based pipeline based on Timbl, the Tilburg memory-based learning software package. Frog produces FoLiA XML. 

Availability: download

CLARIN Centre: CLARIAH-NL

NER categories: person, organisation, location, product, event, miscellaneous

Publication: Van den Bosch et al. (2007)

INL labs

Functionality: tokenisation, sentence segmentation, PoS-tagging, lemmatisation, NER

Platform: cross-platform

Dutch

This toolchain currently provides two annotation tools: the Stanford named entity recognizer, which was trained on the historical Dutch newspapers corpus Letters as loot in the context of the IMPACT project (Landsbergen 2012), and a tagger that consists of a tokenizer/sentence boundary detector, a statistical part-of-speech tagger and a lemmatizer. 

This toolchain produces linguistically annotated output from a number of input formats (TEI, plain text, ALTO, .doc files).

Availability: online service

CLARIN Centre: CLARIAH-NL

NER categories: person, organization, location, miscellaneous

 

NameScape: Named Entity Recognition

Functionality: NER

Platform: cross-platform

Dutch

This NER was developed in the Namescape project.

Availability: online service

CLARIN Centre: CLARIAH-NL

The NERD named entity recognizer

Functionality: NER

Platform: cross-platform

Dutch

This NER is now integrated into the PICCL workflow.

Availability: online service

CLARIN Centre: CLARIAH-NL

NameTag

Functionality: NER

Platform: Linux, Windows, OS X

Licence: MPL 2.0

Czech, English

NameTag is an open-source tool that recognizes different NER categories per language model. For Czech, it recognizes a complex hierarchy of categories. The English model, which is trained on CoNLL-2003 NER annotations (Sang and De Meulder 2003), distinguishes the following four NER classes: person, organisation, location and miscellaneous.

The trained model for Czech is available through LINDAT: Czech Models (CNEC) for NameTag.

A user manual is also available.

Availability: download, online service, web API

CLARIN Centre: LINDAT

NER categories: per model, see above

Publication: Straková, Straka and Hajič (2013)
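Token-level annotations of the CoNLL-2003 kind mentioned for NameTag's English model are commonly written as one B-/I-/O tag per token, where B- opens a mention, I- continues it and O marks tokens outside any mention. The sketch below groups such tags back into mentions. The function name `bio_to_spans`, the example sentence and its tags are assumptions for illustration, and the code is not taken from NameTag.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I-/O tags into (category, mention) pairs."""
    spans = []
    current_tokens, current_cat = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, category = tag.partition("-")
        continues = prefix == "I" and category == current_cat
        # Close the open mention unless this token continues it.
        if current_cat is not None and not continues:
            spans.append((current_cat, " ".join(current_tokens)))
            current_tokens, current_cat = [], None
        if prefix == "B" or (prefix == "I" and current_cat is None):
            # Start a new mention (a stray I- tag is treated leniently as B-).
            current_tokens, current_cat = [token], category
        elif continues:
            current_tokens.append(token)
    if current_cat is not None:
        spans.append((current_cat, " ".join(current_tokens)))
    return spans

tokens = ["Anna", "Novak", "visited", "Prague", "with", "Charles", "University", "staff"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "O"]
mentions = bio_to_spans(tokens, tags)
print(mentions)
```

Treating a stray I- tag as the start of a mention is a design choice of this sketch; a stricter reader could reject such input instead.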

Illinois Named Entity Recognizer

Functionality: NER

Platform: cross-platform

Licence: underlying software is open source

English

This NER annotates plain text.

Availability: WebLicht

CLARIN Centre: CLARIN-D

NER categories: person, location, organization, miscellaneous

OpenNLP Name Finder (English)

Functionality: NER

Platform: Linux, Windows

Licence: Apache Licence 2.0

English

This NER can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service

CLARIN Centre: CLARIN:EL

NER categories: person, location, organization

GATE

Functionality: tokenization, PoS-tagging, NER, semantic and orthographic coreference, pronominal coreference

Platform: cross-platform

Licence: LGPL

English, French, German, Romanian, Russian, Welsh, Danish, Chinese, Arabic

This is a complete NLP platform with modules for named entity recognition.

Availability: download, online service

CLARIN Centre: CLARIN-UK

NER categories: person, location, organisation, date, percent, money

Publication: Cunningham et al. (2019)

OpenNLP Named Entity Recognizer

Functionality: NER

Platform: cross-platform

Licence: Apache License version 2.0 (underlying software)

English, Spanish

This NER is based on the OpenNLP NER tool.

Availability: WebLicht

CLARIN Centre: CLARIN-D

NER categories: person, location, organization

Finnish Tagtools 1.4

Functionality: PoS/MSD-tagging, NER

Platform: Linux, Unix

Licence: GPL 3

Finnish

This software package provides finnish-postag, a part-of-speech and morphology tagger for Finnish, and finnish-nertag, a named entity recogniser for Finnish.

Availability: download, online service

CLARIN Centre: FIN-CLARIN

NER categories: person (human, mythological, animal, other); location (political, geographical, street, infrastructure, mythological, astronomical, other); organisation (corporation, political, media, financial, educational, cultural, athletic, other, miscellaneous); product; event; time (dates, times); numerical expressions (measurements, money)

Publication: Ruokolainen et al. (2019)

German Named Entity Recognizer

Functionality: NER

Platform: cross-platform

Licence: Apache License, Version 2.0 (underlying software)

German

This NER is based on the maximum entropy approach and uses the OpenNLP maxent library. Two models are available: one trained on the CoNLL-2003 training set (conll) and one trained on release 8 of the TuebaDZ corpus (tuebadz).

Availability: WebLicht

CLARIN Centre: CLARIN-D

NER categories: person, location, organization
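In a maximum entropy classifier of the kind named for the German Named Entity Recognizer, each label receives a score equal to the sum of the weights of the features that are active for a token, and the scores are turned into probabilities with a normalised exponential. The following is a minimal sketch of that scoring step only. The feature names, the weights and the function name `maxent_probabilities` are invented for illustration and do not come from the OpenNLP maxent library or from the trained models.

```python
import math

def maxent_probabilities(active_features, weights, labels):
    """Return label -> probability for one token."""
    scores = {
        label: sum(weights.get((feature, label), 0.0) for feature in active_features)
        for label in labels
    }
    normaliser = sum(math.exp(score) for score in scores.values())
    return {label: math.exp(score) / normaliser for label, score in scores.items()}

labels = ["person", "location", "organization", "none"]
# Hypothetical weights: (feature, label) -> weight. Missing pairs count as 0.
weights = {
    ("is_capitalised", "person"): 0.5,
    ("is_capitalised", "location"): 0.5,
    ("is_capitalised", "none"): -0.5,
    ("previous_word=Herr", "person"): 2.0,
}
probabilities = maxent_probabilities(
    ["is_capitalised", "previous_word=Herr"], weights, labels
)
best_label = max(probabilities, key=probabilities.get)
print(best_label)
```

In a real system the weights are estimated from annotated training data; here they are fixed by hand so that the arithmetic is easy to follow.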

Person Name Recognizer

Functionality: NER

Platform: cross-platform

Licence: terms of service

German

This NER is tailored to historical German (optimized for journals and high precision) and is based on weighted finite state transducers.

Availability: WebLicht

CLARIN Centre: CLARIN-D

NER categories: person

Publication: Didakowski and Drotschmann (2009)

SemiNER

Functionality: PoS-tagging, syntactic chunking, NER

Platform: cross-platform

Licence: see here

German, English

SemiNER is part of a sequence labeller called Sequor, which is based on Collins’s (2002) perceptron. Sequor has a flexible feature template language and is meant mainly for NLP applications such as named entity recognition, part-of-speech tagging and syntactic chunking. It includes pre-trained models for German and English.

Availability: download

CLARIN Centre: CLARIN-D

Trained models: available

NER categories: person, organisation, location, miscellaneous

Publication: Chrupała and Klakow (2010)

Sticker Named Entity Recognizer

Functionality: NER

Platform: cross-platform

Licence: Blue Oak Model Licence 1.0.0 (underlying software)

German, Dutch

This NER is built on a neural-network-based sequence labeller that can label named entities for German and Dutch.

Availability: download, WebLicht

CLARIN Centre: CLARIN-D

NER categories: person, location, organization, geopolitical entity, other

GrNE-Tagger

Functionality: NER

Platform: cross-platform

Licence: terms of service (academic non-commercial use)

Greek (modern)

This NER operates on a rule-based engine. It was developed and is maintained by the Institute for Language and Speech Processing / Athena Research Center. This recognizer can be applied to existing corpora available through the CLARIN:EL infrastructure and to those independently uploaded corpora that are compatible with the tool’s requirements.

Availability: online service

CLARIN Centre: CLARIN:EL

NER categories: person, location, organization, facility, gpe (geo-political entity)

hunner - named entity recognizer for Hungarian

Functionality: NER

Hungarian

This NER employs a maximum entropy approach.

Availability: unavailable

CLARIN Centre: HUN-CLARIN

Publication: Simon (2013)

Liner2

Functionality: NER

Platform: cross-platform

Licence: GNU General Public License

Polish

This NER uses conditional random fields and a rich set of token features. The tool placed third in PolEval 2018 Task 2 on named entity recognition. It includes pre-trained models for the National Corpus of Polish (NKJP) and the KPWr corpus (Broda et al. 2012).

The KPWr model distinguishes the following categories: person, location, facility, organization, product, event, adjective.

The NKJP model distinguishes the following NER categories: person, organization, location, date, time.

Availability: download, online service, web API

CLARIN Centre: CLARIN-PL

NER categories: per model, see above.

Publication: Marcińczuk, Kocoń, and Gawor (2018)

Nerf

Functionality: NER

Platform: Haskell Platform

Licence: GPL v.3

Polish

This statistical NER is based on linear-chain conditional random fields.

Availability: download

CLARIN Centre: CLARIN-PL

Trained models: download
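A linear-chain conditional random field, the model family named for Nerf, scores a whole label sequence as the sum of per-token label scores and scores for moving from one label to the next, and then picks the highest-scoring sequence. The sketch below shows only that decoding step (the Viterbi algorithm). The labels, the score tables and the function name `viterbi` are hand-made assumptions; they are not values or code from Nerf, where such scores would be learned from annotated data.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    """
    # best[i][y] is the best score of any path that ends in label y at token i.
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            column[y] = best[-1][prev] + transitions[(prev, y)] + scores[y]
            pointers[y] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "B-PER", "I-PER"]
transitions = {(p, y): 0.0 for p in labels for y in labels}
transitions[("O", "I-PER")] = -5.0      # a mention cannot continue after O
transitions[("B-PER", "I-PER")] = 1.0   # names often span two tokens
transitions[("I-PER", "I-PER")] = 0.5
transitions[("B-PER", "B-PER")] = -1.0

# Hypothetical per-token scores for the tokens "Adam", "Mickiewicz", "pisał".
emissions = [
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.2, "B-PER": 1.0, "I-PER": 0.8},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.1},
]
decoded = viterbi(emissions, transitions, labels)
print(decoded)
```

Taken on its own, the second token scores higher as B-PER than as I-PER; the transition scores are what make the decoder continue the mention instead of opening a new one.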

PolDeepNer

Functionality: NER

Platform: cross-platform

Licence: GNU General Public Licence

Polish

This NER uses deep learning methods. The tool placed second in PolEval 2018 Task 2 on NER. It contains a model pre-trained on the NKJP corpus.

Availability: download

CLARIN Centre: CLARIN-PL

NER categories: nested annotations of the following types: personal names (forenames, surnames, additional names), organizational names, geographic names, place names (district, settlement, region, country, bloc), date, and time

Publication: Marcińczuk, Kocoń, and Gawor (2018)
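Nested annotation means that one entity span may lie inside another, for example a forename and a surname inside a personal name. A minimal sketch of such a structure follows. The `Span` class, the label strings and the example sentence are assumptions for illustration and do not reproduce PolDeepNer's actual output format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # index of the first token
    end: int     # index one past the last token
    label: str

    def contains(self, other):
        """True if `other` lies inside this span and is a different span."""
        return (
            self.start <= other.start
            and other.end <= self.end
            and self != other
        )

def children(parent, all_spans):
    """Return every span nested inside `parent`."""
    return [span for span in all_spans if parent.contains(span)]

tokens = ["Maria", "Skłodowska", "mieszkała", "w", "Warszawie"]
spans = [
    Span(0, 2, "person"),
    Span(0, 1, "forename"),
    Span(1, 2, "surname"),
    Span(4, 5, "settlement"),
]

for span in spans:
    text = " ".join(tokens[span.start:span.end])
    inner = [child.label for child in children(span, spans)]
    print(text, span.label, inner)
```

Storing token offsets rather than strings keeps the inner and outer spans aligned with the same tokens, which is what makes the containment check a simple comparison of indices.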

LX-NER

Functionality: NER

Platform: cross-platform

Portuguese

This NER annotates plain text by identifying and classifying the expressions for named entities it contains.

The name-based module is integrated into the full LX-Suite pipeline (tokenization, POS tagging, parsing).

Availability: online service

CLARIN Centre: PORTULAN

NER categories: name-based: person, organization, location, events, works; number-based: numbers, measures, time

janes-ner

Functionality: NER

Platform: cross-platform

Licence: Apache License 2.0

Slovenian, Croatian, Serbian

This named entity recognizer is a slight modification of the CRF-based reldi-tagger with Brown cluster information added. Input data needs to be pre-processed by the reldi-tokeniser and then by the reldi-tagger for morphosyntactic annotation.

Availability: download, online service, web API

CLARIN Centre: CLARIN.SI

NER categories: person, person derivative, location, organization and miscellaneous

Publication: Fišer, Ljubešić and Erjavec (2018)

Publications

[Broda et al. 2012] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish. In Proceedings of LREC2012.

[Carreras, Màrquez and Padró 2003] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2003. A simple named entity extractor using AdaBoost. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, 152–155.

[Chrupała and Klakow 2010] Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeller for German: exploiting Wikipedia and distributional clusters. In Proceedings of LREC2010.

[Cunningham et al. 2019] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, and Leon Derczynski. 2019. Developing Language Processing Components with GATE Version 8 (a User Guide).

[Derczynski et al. 2015] Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, and Kalina Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management 51 (2): 32–49.

[Didakowski and Drotschmann 2009] Jörg Didakowski and Marko Drotschmann. 2009. In Finite-State Methods and Natural Language Processing, 50–61. 

[Fišer, Ljubešić and Erjavec 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation.

[Landsbergen 2012] Frank Landsbergen. 2012. Evaluation of named entity work in IMPACT: NE Recognition and matching. Technical report.

[Marcińczuk, Kocoń and Gawor 2018] Michał Marcińczuk, Jan Kocoń, and Michał Jacek Gawor. 2018. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches. In Proceedings of the PolEval 2018 Workshop, 71–86.

[Ruokolainen et al. 2019] Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. 2019. A Finnish news corpus for named entity recognition. Language Resources and Evaluation.

[Sang and De Meulder 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, 142–147.

[Simon 2013] Eszter Simon. 2013. Approaches to Hungarian Named Entity Recognition. PhD Thesis.

[Straková, Straka and Hajič 2013] Jana Straková, Milan Straka, and Jan Hajič. 2013. A New State-of-The-Art Czech Named Entity Recognizer. In TSD 2013: Text, Speech, and Dialogue, edited by I. Habernal and V. Matoušek, 68–75. 

[Van den Bosch et al. 2007] Antal van den Bosch, Bertjan Busser, Sander Canisius and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Computational Linguistics in the Netherlands 2006: selected papers from the Seventeenth CLIN meeting, edited by Peter Dirix, 191–206.