Introduction
Text normalization is the process of transforming parts of a text into a single canonical form. It is one of the key stages of linguistic processing for texts in which spelling variation abounds or deviates from the contemporary norm, as in historical documents or social media posts. After text normalization, standard tools can be used for all further stages of text processing. Another important advantage of text normalization is improved search: a query for a single, standard variant of a word also takes into account all of its spelling variants, be they historical, dialectal, colloquial or slang.
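As a minimal illustration of this search scenario, the Python sketch below normalizes both the query and the documents before matching; the variant lexicon is invented for illustration.

```python
# Minimal sketch of normalization-backed search, with an invented
# variant lexicon: a query for the standard form also retrieves
# documents that only contain a spelling variant.
CANONICAL = {
    "colur": "color", "colour": "color",          # historical/variant spellings
    "2morrow": "tomorrow", "gonna": "going to",   # colloquial/slang
}

def normalize(token: str) -> str:
    """Map a token to its canonical form; unknown tokens pass through."""
    return CANONICAL.get(token.lower(), token.lower())

def search(query: str, documents: list[str]) -> list[str]:
    """Return the documents containing any spelling variant of the query."""
    target = normalize(query)
    return [doc for doc in documents
            if target in (normalize(tok) for tok in doc.split())]

docs = ["the colur of the sky", "see you 2morrow", "a plain sentence"]
print(search("color", docs))     # ['the colur of the sky']
print(search("tomorrow", docs))  # ['see you 2morrow']
```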
The CLARIN infrastructure offers 14 tools for text normalization. Most of the tools are aimed at normalizing texts within a single language (3 Dutch, 1 English, 3 German, 1 Hungarian, 1 Icelandic, 1 Slovenian, 1 Turkish), while the rest have a very broad multilingual scope. Half of the tools are dedicated normalizers, while the others provide additional functionalities such as PoS-tagging, lemmatization and named entity recognition.
For comments, changes to the existing content, or the inclusion of new tools, please send us an email.
This website was last updated on 3 June 2021.
Tools for normalization in the CLARIN infrastructure
Tool | Language | Description |
---|---|---|
Functionality: tokenization, segmentation, lemmatization, PoS-tagging, normalization, syntax analysis, NER, format transformations | Afrikaans, Albanian, Armenian, Basque, Bosnian, Breton, Bulgarian, Catalan, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, … | Automatic construction and execution of several workflows, which include normalisation. |
FoLiA-wordtranslate. Functionality: normalization | Dutch | This tool performs word-by-word lookups in a bilingual lexicon and applies a set of transformation rules; it can additionally consult the INT Historical Lexicon (to be obtained separately due to licensing restrictions). The aim is to use the resulting modernisation layer for further linguistic enrichment with contemporary models. The tool is part of the FoLiA-Utils collection, as it operates on documents in the FoLiA format; as a standalone tool it is of limited interest. (See the lexicon-lookup sketch below the table.) |
Functionality: modernisation, normalisation, tokenisation, conversion, PoS-tagging, lemmatisation, NER with entity linking (all functionality is derived from the individual components rather than being an inherent part of the workflow) | Dutch | This is a linguistic enrichment pipeline for historical Dutch, developed for and used in the Nederlab project. The workflow, powered by Nextflow, invokes various tools, including Frog and FoLiA-wordtranslate, as well as ucto (a tokeniser), folialangid (language identification) and tei2folia (conversion from a subset of TEI to FoLiA, which serves as the exchange format for all the tooling as well as the final corpus format for Nederlab). Due to the complexity of the tooling, this workflow and all its dependencies are distributed as part of the LaMachine distribution. |
TICCL. Functionality: corpus processing, normalization | Dutch | TICCL (Text-Induced Corpus Clean-up) is a system intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper and the scanned images are turned into digital text files, errors occur: for instance, the letter combination `in' can be read as `m', so that the word `regeering' is incorrectly rendered as `regeermg'. The tool searches a corpus for all existing variants of (potentially) all words occurring in that corpus, which can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, recording for each word type how often it occurs in the corpus; the frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. (See the frequency-list sketch below the table.) |
@PhilosTEI. Functionality: corpus processing, normalization | Dutch, English, Finnish, French, German, German (Fraktur), Classical Greek, Modern Greek, Icelandic, Italian, Latin, Polish, Portuguese, Russian, Spanish, Swedish | This tool combines a Tesseract web service for text layout analysis and OCR with a multilingual version of TICCL for normalization. (See the OCR-plus-normalization sketch below the table.) |
PICCL: Philosophical Integrator of Computational and Corpus Libraries. Functionality: OCR, normalization, tokenisation, dependency parsing, shallow parsing, lemmatization, morphological analysis, NER, PoS-tagging | Dutch, Swedish, Russian, Spanish, Portuguese, English, German, French, Italian, Finnish, Modern Greek, Classical Greek, Icelandic, German (Fraktur), Latin, Romanian | This is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and natural language processing. It combines Tesseract optical character recognition, TICCL and Frog functionality in a single pipeline. |
Functionality: normalization | English | This tool performs manual and automatic spelling normalisation based on letter-replacement rules, phonetic matching (extended Soundex), edit distance, and variant mappings. (See the sketch combining these techniques below the table.) |
Functionality: normalisation, PoS-tagging, lemmatisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation, PoS-tagging and lemmatisation for historical German. |
CAB orthographic canonicalizer. Functionality: normalisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation for historical German. |
DTA::CAB. Functionality: lemmatization, PoS-tagging, normalization | German | This is an abstract framework for robust linguistic annotation, with a public web service that includes normalization and lemmatization for historical German. (See the hedged web-service sketch below the table.) |
Normo. Functionality: normalization | Hungarian | This tool is an automatic pre-normalizer for Middle Hungarian Bible translations. It employs a memory-based module and a rule-based module, the latter consisting of character- and token-level rewrite rules. The tool was used for building the Old Hungarian Corpus. (See the two-module sketch below the table.) |
Functionality: OCR, normalization | Icelandic | This tool is a spell-checking application based on a noisy channel model. It can be used both to recover the original spelling of historical OCR texts and to produce a parallel text with modern spelling. (See the noisy-channel sketch below the table.) |
Functionality: normalization | Slovenian | This is a trainable tool for text normalisation, based on the Moses statistical machine translation system. (See the character-level training sketch below the table.) |
Turkish Natural Language Processing Pipeline. Functionality: tokenisation, sentence splitting, normalisation, de-asciification, vowelisation, spelling correction, morphological analysis/disambiguation, named entity recognition, dependency parsing | Turkish | This is a pipeline of state-of-the-art Turkish NLP tools. |
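To make the lexicon-plus-rules approach of FoLiA-wordtranslate concrete, here is a minimal Python sketch. The lexicon entries and rewrite rules are invented illustrations, not the tool's actual resources, and the real tool operates on FoLiA documents rather than plain strings.

```python
# Sketch of word-by-word modernisation in the style of FoLiA-wordtranslate:
# a bilingual (historical -> contemporary) lexicon lookup with fallback
# transformation rules. Entries and rules are invented examples.
import re

LEXICON = {"mensch": "mens", "visch": "vis"}   # historical -> modern Dutch

RULES = [
    (re.compile(r"ae"), "aa"),                 # 'waerheid' -> 'waarheid'
    (re.compile(r"sch$"), "s"),                # 'bosch'    -> 'bos'
]

def modernize(word: str) -> str:
    """Prefer the lexicon; otherwise apply the transformation rules."""
    if word in LEXICON:
        return LEXICON[word]
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word

print([modernize(w) for w in ["mensch", "waerheid", "huis"]])
# ['mens', 'waarheid', 'huis']
```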
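The frequency-list step described for TICCL can be sketched as follows; the variant mapping is a toy stand-in for TICCL's actual character-confusion-based variant detection.

```python
# Count word types in a corpus, then sum the frequencies of attested
# variants under their normalized form, as in TICCL's frequency lists.
from collections import Counter

corpus = "de regeering en de regeermg en de regeering".split()
variant_of = {"regeermg": "regeering"}   # OCR confusion: 'in' read as 'm'

type_freq = Counter(corpus)              # frequency per attested word type
normalized_freq = Counter()              # frequency per normalized form
for word, freq in type_freq.items():
    normalized_freq[variant_of.get(word, word)] += freq

print(type_freq["regeering"], type_freq["regeermg"])  # 2 1
print(normalized_freq["regeering"])                   # 3 (= 2 + 1)
```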
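The OCR-plus-normalization combination can be sketched with the pytesseract wrapper around Tesseract (which must be installed locally); `page.png` is a hypothetical input and `normalize` is a toy placeholder for TICCL-style post-correction.

```python
# OCR a page image with Tesseract, then post-correct the raw text.
from PIL import Image
import pytesseract

def normalize(text: str) -> str:
    """Toy placeholder for TICCL-style OCR post-correction."""
    return text.replace("regeermg", "regeering")

raw = pytesseract.image_to_string(Image.open("page.png"), lang="nld")
print(normalize(raw))
```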
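The four techniques listed for the English normaliser can be combined as in the following sketch, which uses a simplified plain Soundex (not the tool's extended variant) and difflib's similarity matching as a stand-in for edit distance; the word list, rules and mappings are invented.

```python
# Combine variant mappings, letter-replacement rules, phonetic matching
# and string similarity into one normalization function.
import difflib

VARIANTS = {"vertu": "virtue"}            # known variant mappings
RULES = [("ie", "y"), ("oun", "on")]      # letter-replacement rules
WORDLIST = ["virtue", "very", "mercy", "month"]

def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4].upper()

def normalize(word: str) -> str:
    if word in VARIANTS:                  # 1. direct variant mapping
        return VARIANTS[word]
    for old, new in RULES:                # 2. letter-replacement rules
        word = word.replace(old, new)
    candidates = [w for w in WORDLIST     # 3. phonetically similar words
                  if soundex(w) == soundex(word)]
    close = difflib.get_close_matches(word, candidates or WORDLIST, n=1)
    return close[0] if close else word    # 4. closest match wins

print(normalize("vertu"), normalize("mercie"))  # virtue mercy
```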
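Querying a normalization web service such as DTA::CAB might look roughly like the following; the endpoint URL and parameter names are assumptions for illustration only, so consult the DTA::CAB documentation for the actual interface.

```python
# Build a request URL for a hypothetical normalization web service.
import urllib.parse

CAB_URL = "https://example.org/cab/query"   # hypothetical endpoint
request_url = CAB_URL + "?" + urllib.parse.urlencode(
    {"q": "Freyheit", "fmt": "json"})       # assumed parameter names
print(request_url)

# An actual call would then be, e.g.:
# import urllib.request
# with urllib.request.urlopen(request_url) as response:
#     print(response.read().decode("utf-8"))
```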
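Normo's two-module design, as described above, can be sketched as a memory lookup with rule-based fallback; all entries and rules below are invented illustrations, not Normo's actual resources.

```python
# Memory-based module first, then token- and character-level rewrites.
MEMORY = {"eggy": "egy"}                  # memory of earlier decisions

TOKEN_RULES = {"vala": "volt"}            # token-level rewrite rules
CHAR_RULES = [("cz", "c"), ("ew", "ö")]   # character-level rewrite rules

def pre_normalize(token: str) -> str:
    if token in MEMORY:                   # memory-based module
        return MEMORY[token]
    if token in TOKEN_RULES:              # rule-based, token level
        return TOKEN_RULES[token]
    for old, new in CHAR_RULES:           # rule-based, character level
        token = token.replace(old, new)
    return token

print([pre_normalize(t) for t in ["eggy", "vala", "czifra"]])
# ['egy', 'volt', 'cifra']
```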
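Noisy-channel spelling correction, as used by the Icelandic tool, chooses the dictionary word w that maximizes P(w) · P(observed | w). The following sketch uses invented toy probabilities and a crude mismatch-based error model.

```python
# Pick the dictionary word maximizing P(w) * P(observed | w).
import math

P_WORD = {"hús": 0.7, "haus": 0.2, "hans": 0.1}   # language model P(w)

def p_observed_given(observed: str, word: str) -> float:
    """Toy error model: probability decays with character mismatches."""
    mismatches = sum(a != b for a, b in zip(observed, word))
    mismatches += abs(len(observed) - len(word))
    return math.exp(-2.0 * mismatches)

def correct(observed: str) -> str:
    return max(P_WORD, key=lambda w: P_WORD[w] * p_observed_given(observed, w))

print(correct("hvs"))   # 'hús' under these toy weights
```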
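For the Moses-based Slovenian normaliser, normalization is typically cast as character-level machine translation (cf. Ljubešić et al. 2016): each character becomes a "token", and Moses is trained on the resulting parallel corpus. The sketch below prepares such training data from invented word pairs.

```python
# Turn variant/standard word pairs into character-level Moses bitext.
pairs = [("jst", "jaz"), ("kva", "kaj")]   # variant -> standard (toy)

def to_char_level(word: str) -> str:
    """'kva' -> 'k v a': one character per Moses token."""
    return " ".join(word)

with open("train.src", "w", encoding="utf-8") as src, \
     open("train.trg", "w", encoding="utf-8") as trg:
    for variant, standard in pairs:
        src.write(to_char_level(variant) + "\n")
        trg.write(to_char_level(standard) + "\n")
# Moses training then proceeds on train.src / train.trg as on any bitext.
```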
Publications
[Betti et al. 2017] Arianna Betti, Martin Reynaert, and Hein van den Berg. 2017. @PhilosTEI: Building Corpora for Philosophers. In CLARIN in the Low Countries, edited by Jan Odijk and Arjan van Hessen. London: Ubiquity Press.
[Brugman et al. 2016] Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang, and Antal van den Bosch. 2016. Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora. In Proceedings of LREC 2016, 1277–1281.
[Eryiğit 2014] Gülşen Cebiroğlu Eryiğit. 2014. ITU Turkish NLP Web Service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).
[Jurish 2012] Bryan Jurish. 2012. Finite-State Canonicalization Techniques for Historical German. PhD dissertation. Universität Potsdam.
[Ljubešić et al. 2016] Nikola Ljubešić, Katja Zupan, Darja Fišer, and Tomaž Erjavec. 2016. Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 146–155.
[Reynaert et al. 2015] Martin Reynaert, Maarten van Gompel, Ko van der Sloot, and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries.
[Reynaert 2010] Martin Reynaert. 2010. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition 14 (2): 173–187.
[Vadász and Simon 2018] Noémi Vadász and Eszter Simon. 2018. NORMO: An Automatic Normalization Tool for Middle Hungarian. In Proceedings of the Second Workshop on Corpus-Based Research in the Humanities, 227–236.