Text normalisation is the process of transforming parts of a text into a single canonical form. It is one of the key stages of linguistic processing for texts in which spelling variation abounds or deviates from the contemporary norm, such as texts in historical documents or on social media. After text normalisation, standard tools can be used for all further stages of text processing. Another important advantage of text normalisation is improved search: a query for a single, standard variant can take into account all of its spelling variants, be they historical, dialectal, colloquial or slang.
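To make the core idea concrete, here is a minimal sketch in Python (with an invented variant-to-canonical mapping and example documents) of how a normalisation lookup can power variant-aware search: tokens are mapped to a canonical form before indexing, so a query for the standard spelling also retrieves its variants.

```python
# Minimal sketch: dictionary-based normalisation for variant-aware search.
# The variant-to-canonical mapping below is invented for illustration;
# real normalisers derive it from lexicons, rules or statistical models.

NORMALISATION_TABLE = {
    "regeering": "regering",   # historical Dutch spelling
    "vnto": "unto",            # Early Modern English spelling
    "gonna": "going to",       # colloquial English
}

def normalise(token: str) -> str:
    """Map a token to its canonical form; unknown tokens pass through."""
    return NORMALISATION_TABLE.get(token.lower(), token.lower())

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index over normalised tokens."""
    inverted: dict[str, set[str]] = {}
    for doc_id, text in documents.items():
        for token in text.split():
            inverted.setdefault(normalise(token), set()).add(doc_id)
    return inverted

docs = {"d1": "de regeering besloot", "d2": "de regering besloot"}
idx = build_index(docs)
# A single query for the modern form finds both spellings:
print(idx[normalise("regering")])   # {'d1', 'd2'}
```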
The CLARIN infrastructure offers 15 tools for text normalisation. Most of the tools are aimed at normalising texts within a single language (1 Czech, 3 Dutch, 1 English, 3 German, 1 Hungarian, 1 Icelandic, 1 Slovenian, 1 Turkish), while the remaining three have a very broad multilingual scope. Roughly half of the tools are dedicated normalisers, while the others provide additional functionalities such as PoS-tagging, lemmatisation and named entity recognition.
For comments, changes to the existing content or inclusion of new tools, send us an email at resource-families [at] clarin.eu.
Tools for Normalisation in the CLARIN Infrastructure
Tool | Language | Description |
---|---|---|
Functionality: tokenisation, segmentation, lemmatisation, PoS-tagging, normalisation, syntax analysis, NER, format transformations | Afrikaans, Albanian, Armenian, Basque, Bosnian, Breton, Bulgarian, Catalan, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Middle Low German, Haitian, Hindi, Hungarian, Icelandic, Indonesian, Inuktitut, Irish, Italian, Javanese, Kannada, Kurdish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, Yiddish | Automatic construction and execution of several workflows, which include normalisation. |
Korektor. Functionality: normalization. Licence: CC BY-NC-SA (models) | Czech | Korektor is a statistical spellchecker and (occasional) grammar checker released under the 2-Clause BSD licence and versioned using Semantic Versioning. It started with Michal Richter's diploma thesis Advanced Czech Spellchecker, but is being developed further. There are two versions: a command-line utility (tested on Linux, Windows and OS X) and a REST service with a publicly available API and an HTML front end (a hedged sketch of calling the API appears below the table). |
Functionality: normalisation | Dutch | This tool does word-by-word lookups in a bilingual lexicon and applies some transformation rules; in addition, it may consult the INT Historical Lexicon (to be obtained separately due to licensing restrictions). The aim is to use the modernisation layer for further linguistic enrichment with contemporary models. The tool is part of the FoLiA-Utils collection, as it operates on documents in the FoLiA format; as a standalone tool it is of only very limited interest to others. |
Functionality: modernisation, normalisation, tokenisation, conversion, PoS-tagging, lemmatisation, NER with entity linking (all functionality is provided by the individual components rather than being an inherent part of the workflow) | Dutch | This is a linguistic enrichment pipeline for historical Dutch, developed for and used in the Nederlab project. The workflow, powered by Nextflow, invokes various tools, including Frog and FoLiA-wordtranslate, as well as ucto (a tokeniser), folialangid (language identification) and tei2folia (conversion from a subset of TEI to FoLiA, which serves as the exchange format for all the tooling as well as the final corpus format for Nederlab). Due to the high complexity of the tooling, this workflow and all its dependencies are distributed as part of the LaMachine distribution. |
Functionality: corpus processing, normalisation | Dutch | This tool is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists, recording for each word type how often it occurs in the corpus; the frequencies of the normalised word forms are the sums of the frequencies of the actual word forms found in the corpus (a sketch of this aggregation appears below the table). TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper and the scans, i.e. images, are turned into digital text files, errors occur: for instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly rendered as 'regeermg'. |
Functionality: corpus processing, normalisation | Dutch, English, Finnish, French, German, German (Fraktur), Classical Greek, Modern Greek, Icelandic, Italian, Latin, Polish, Portuguese, Russian, Spanish, Swedish | This tool uses a combination of a Tesseract web service for text layout analysis and OCR, and a multilingual version of TICCL for normalisation. |
PICCL: Philosophical Integrator of Computational and Corpus Libraries. Functionality: OCR, normalisation, tokenisation, dependency parsing, shallow parsing, lemmatisation, morphological analysis, NER, PoS-tagging | Dutch, Swedish, Russian, Spanish, Portuguese, English, German, French, Italian, Finnish, Modern Greek, Classical Greek, Icelandic, German (Fraktur), Latin, Romanian | This is a set of workflows for corpus building through OCR, post-correction, modernisation of historical language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL and Frog functionality in a single pipeline. |
Functionality: normalisation | English | This tool performs manual and automatic spelling normalisation based on letter-replacement rules, phonetic matching (extended Soundex), edit distance and variant mappings (a candidate-ranking sketch appears below the table). |
Functionality: normalisation, PoS-tagging, lemmatisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation, PoS-tagging and lemmatisation for historical German. |
CAB orthographic canonicalizer. Functionality: normalisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation for historical German. |
Functionality: lemmatization, PoS-tagging, normalization | German | This is an abstract framework for robust linguistic annotation, with a public web service that includes normalization and lemmatization for historical German. |
Normo. Functionality: normalization | Hungarian | This tool is an automatic pre-normalizer for Middle Hungarian Bible translations. It employs a memory-based module and a rule-based module, the latter consisting of character- and token-level rewrite rules. The tool was used for building the Old Hungarian Corpus. |
Functionality: OCR, normalization | Icelandic | This tool is a spell-checking application based on a noisy channel model, which can be used to obtain a faithful copy of the original spelling of historical OCR texts and to produce a parallel text with modern spelling (a noisy channel sketch appears below the table). |
Functionality: normalization | Slovenian | This is a trainable tool for text normalisation, based on Moses. |
Turkish Natural Language Processing Pipeline. Functionality: tokenisation, sentence splitting, normalisation, de-asciification, vowelisation, spelling correction, morphological analysis/disambiguation, named entity recognition, dependency parsing | Turkish | This is a pipeline of state-of-the-art Turkish NLP tools. |
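Illustrative Code Sketches

Korektor REST service: the Korektor entry above mentions a REST service with a publicly available API. The sketch below shows one way such a service could be called from Python; the endpoint URL and query parameter are assumptions modelled on the publicly hosted LINDAT/CLARIN instance, so check the service documentation for the exact contract before relying on it.

```python
# Hedged sketch of calling the Korektor REST service. The endpoint URL and the
# "data" query parameter are assumptions modelled on the publicly hosted
# LINDAT/CLARIN instance; consult the API documentation for the exact contract.
import requests

KOREKTOR_URL = "https://lindat.mff.cuni.cz/services/korektor/api/correct"  # assumed endpoint

def correct_czech(text: str) -> dict:
    """Send text to the (assumed) correction endpoint and return the raw JSON reply."""
    response = requests.get(KOREKTOR_URL, params={"data": text})  # assumed parameter name
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Deliberately misspelled Czech input; the service should return a corrected form.
    print(correct_czech("Přílyš žluťoučky kůň"))
```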
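TICCL frequency aggregation: the TICCL entry describes how the frequency of a normalised word form is the sum of the frequencies of the variant forms attested in the corpus. The sketch below illustrates only that aggregation step, with an invented variant mapping; TICCL itself derives the mapping from character-confusion and edit-distance evidence.

```python
# Sketch of the frequency-aggregation step described for TICCL:
# the frequency of a normalised form is the sum of the frequencies of the
# corpus variants mapped onto it. The variant mapping is given here;
# TICCL derives it automatically from the corpus.
from collections import Counter

corpus_tokens = ["regeering", "regeermg", "regeering", "regering", "besluit"]
variant_to_canonical = {"regeering": "regering", "regeermg": "regering"}  # illustrative

raw_freq = Counter(corpus_tokens)
normalised_freq = Counter()
for word, freq in raw_freq.items():
    normalised_freq[variant_to_canonical.get(word, word)] += freq

print(raw_freq)         # Counter({'regeering': 2, 'regeermg': 1, 'regering': 1, 'besluit': 1})
print(normalised_freq)  # Counter({'regering': 4, 'besluit': 1})
```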
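Edit distance and phonetic matching: the English spelling normaliser combines letter-replacement rules, phonetic matching (extended Soundex), edit distance and variant mappings. As a rough, simplified illustration of how such evidence can be combined, the sketch below ranks invented candidate modern forms by Levenshtein distance and breaks ties with a basic (non-extended) Soundex code; the real tool's rules and mappings are considerably richer.

```python
# Rough sketch of ranking normalisation candidates with edit distance plus a
# basic Soundex tie-breaker. The candidate lexicon is invented; the real tool
# uses letter-replacement rules, an *extended* Soundex and curated mappings.

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def soundex(word: str) -> str:
    """Very basic Soundex: first letter plus up to three consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        last = code
    return (encoded + "000")[:4]

def best_candidate(variant: str, lexicon: list[str]) -> str:
    """Prefer the smallest edit distance, then a matching Soundex code."""
    return min(lexicon, key=lambda w: (levenshtein(variant, w),
                                       soundex(variant) != soundex(w)))

print(best_candidate("moste", ["must", "most", "mote"]))  # 'most'
```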
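Noisy channel scoring: the Icelandic spell-checker is based on a noisy channel model, in which a candidate form w for an observed token x is chosen to maximise P(w)·P(x|w), a language-model prior times an error (channel) model. The toy sketch below shows only that decision rule, with invented probabilities; the real system estimates both distributions from data.

```python
# Toy illustration of the noisy channel decision rule:
#   best_w = argmax_w P(w) * P(x | w)
# Both distributions here are invented numbers for a single OCR token;
# a real system estimates them from corpora and aligned error data.
import math

observed = "regeermg"                                 # OCR output
prior = {"regering": 0.7, "regeering": 0.3}           # language-model prior P(w)
error_model = {                                       # channel model P(x | w)
    ("regeermg", "regering"): 0.01,
    ("regeermg", "regeering"): 0.05,   # 'in' misread as 'm' is a plausible OCR confusion
}

def score(candidate: str) -> float:
    return math.log(prior[candidate]) + math.log(error_model[(observed, candidate)])

best = max(prior, key=score)
print(best)   # 'regeering' under these toy numbers
```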
Publications
[Betti et al. 2017] Arianna Betti, Martin Reynaert, and Hein van den Berg. 2017. @PhilosTEI: Building Corpora for Philosophers. In CLARIN in the Low Countries, edited by Jan Odijk and Arjan van Hessen. London: Ubiquity Press.
[Brugman et al. 2016] Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang, and Antal van den Bosch. 2016. Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora. In Proceedings of LREC 2016, 1277–1281.
[Eryiğit 2014] Gülşen Cebiroğlu Eryiğit. 2014. ITU Turkish NLP Web Service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).
[Jurish 2012] Bryan Jurish. 2012. Finite-State Canonicalization Techniques for Historical German. PhD dissertation. Universität Potsdam.
[Ljubešić et al. 2016] Nikola Ljubešić, Katja Zupan, Darja Fišer, and Tomaž Erjavec. 2016. Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 146–155.
[Reynaert et al. 2015] Martin Reynaert, Maarten van Gompel, Ko van der Sloot, and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries.
[Reynaert 2010] Martin Reynaert. 2010. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition 14 (2): 173–187.
[Richter et al. 2012] Michal Richter, Pavel Stranak, and Alexandr Rosen. 2012. Korektor – A System for Contextual Spell-checking and Diacritics Completion. In Proceedings of COLING 2012, 1019–1027.
[Vadász and Simon 2018] Noémi Vadász and Eszter Simon. 2018. NORMO: An Automatic Normalization Tool for Middle Hungarian. In Proceedings of the Second Workshop on Corpus-Based Research in the Humanities, 227–236.