Text normalisation is the process of transforming parts of a text into a single canonical form. It is one of the key stages of linguistic processing for texts in which spelling variation abounds or deviates from the contemporary norm, such as texts in historical documents or on social media. After text normalisation, standard tools can be used for all further stages of text processing. Another important advantage of text normalisation is improved search: a query for a single, standard variant can take into account all of its spelling variants, be they historical, dialectal, colloquial or slang.
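To make the core idea concrete, here is a minimal sketch in Python (with an invented variant-to-canonical mapping and example documents) of how a normalisation lookup can power variant-aware search: tokens are mapped to a canonical form before indexing, so a query for the standard spelling also retrieves its variants.

```python
# Minimal sketch: dictionary-based normalisation for variant-aware search.
# The variant-to-canonical mapping below is invented for illustration;
# real normalisers derive it from lexicons, rules or statistical models.

NORMALISATION_TABLE = {
    "regeering": "regering",   # historical Dutch spelling
    "vnto": "unto",            # Early Modern English spelling
    "gonna": "going to",       # colloquial English
}

def normalise(token: str) -> str:
    """Map a token to its canonical form; unknown tokens pass through."""
    return NORMALISATION_TABLE.get(token.lower(), token.lower())

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index over normalised tokens."""
    inverted: dict[str, set[str]] = {}
    for doc_id, text in documents.items():
        for token in text.split():
            inverted.setdefault(normalise(token), set()).add(doc_id)
    return inverted

docs = {"d1": "de regeering besloot", "d2": "de regering besloot"}
idx = build_index(docs)
# A single query for the modern form finds both spellings:
print(idx[normalise("regering")])   # {'d1', 'd2'}
```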
The CLARIN infrastructure offers 15 tools for text normalisation. Most of the tools are aimed at normalising texts within a single language (1 Czech, 3 Dutch, 1 English, 3 German, 1 Hungarian, 1 Icelandic, 1 Slovenian, 1 Turkish), while the remaining three have a very broad multilingual scope. Roughly half of the tools are dedicated normalisers, while the others provide additional functionalities such as PoS-tagging, lemmatisation and named entity recognition.
For comments, changes to the existing content or inclusion of new tools, send us an email at resource-families [at] clarin.eu.
Tools for Normalisation in the CLARIN Infrastructure
Tool | Language | Description |
---|---|---|
Functionality: tokenisation, segmentation, lemmatisation, PoS-tagging, normalisation, syntax analysis, NER, format transformations | Afrikaans, Albanian, Armenian, Basque, Bosnian, Breton, Bulgarian, Catalan, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Middle Low German, Haitian, Hindi, Hungarian, Icelandic, Indonesian, Inuktitut, Irish, Italian, Javanese, Kannada, Kurdish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Malayalam, Maltese, Norwegian, Occitan, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, Yiddish | Automatic construction and execution of several workflows, which include normalisation. |
Korektor. Functionality: normalization. Licence: CC BY-NC-SA (models) | Czech | Korektor is a statistical spellchecker and (occasional) grammar checker released under the 2-Clause BSD licence and versioned using Semantic Versioning. It started with Michal Richter's diploma thesis Advanced Czech Spellchecker, but is being developed further. There are two versions: a command-line utility (tested on Linux, Windows and OS X) and a REST service with a publicly available API and an HTML front end (a hedged sketch of calling the API appears below the table). |
Functionality: normalisation | Dutch | This tool does word-by-word lookups in a bilingual lexicon and applies some transformation rules; in addition, it may consult the INT Historical Lexicon (to be obtained separately due to licensing restrictions). The aim is to use the modernisation layer for further linguistic enrichment with contemporary models. The tool is part of the FoLiA-Utils collection, as it operates on documents in the FoLiA format; as a standalone tool it is of only very limited interest to others. |
Functionality: modernisation, normalisation, tokenisation, conversion, PoS-tagging, lemmatisation, NER with entity linking (all functionality is provided by the individual components rather than being an inherent part of the workflow) | Dutch | This is a linguistic enrichment pipeline for historical Dutch, developed for and used in the Nederlab project. The workflow, powered by Nextflow, invokes various tools, including Frog and FoLiA-wordtranslate, as well as ucto (a tokeniser), folialangid (language identification) and tei2folia (conversion from a subset of TEI to FoLiA, which serves as the exchange format for all the tooling as well as the final corpus format for Nederlab). Due to the high complexity of the tooling, this workflow and all its dependencies are distributed as part of the LaMachine distribution. |
Functionality: corpus processing, normalisation | Dutch | This tool is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists, recording for each word type how often it occurs in the corpus; the frequencies of the normalised word forms are the sums of the frequencies of the actual word forms found in the corpus (a sketch of this aggregation appears below the table). TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper and the scans, i.e. images, are turned into digital text files, errors occur: for instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly rendered as 'regeermg'. |
Functionality: corpus processing, normalisation | Dutch, English, Finnish, French, German, German (Fraktur), Classical Greek, Modern Greek, Icelandic, Italian, Latin, Polish, Portuguese, Russian, Spanish, Swedish | This tool uses a combination of a Tesseract web service for text layout analysis and OCR, and a multilingual version of TICCL for normalisation. |
PICCL: Philosophical Integrator of Computational and Corpus Libraries. Functionality: OCR, normalisation, tokenisation, dependency parsing, shallow parsing, lemmatisation, morphological analysis, NER, PoS-tagging | Dutch, Swedish, Russian, Spanish, Portuguese, English, German, French, Italian, Finnish, Modern Greek, Classical Greek, Icelandic, German (Fraktur), Latin, Romanian | This is a set of workflows for corpus building through OCR, post-correction, modernisation of historical language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL and Frog functionality in a single pipeline. |
Functionality: normalisation | English | This tool performs manual and automatic spelling normalisation based on letter-replacement rules, phonetic matching (extended Soundex), edit distance and variant mappings (a candidate-ranking sketch appears below the table). |
Functionality: normalisation, PoS-tagging, lemmatisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation, PoS-tagging and lemmatisation for historical German. |
CAB orthographic canonicalizer. Functionality: normalisation | German | This tool is a WebLicht stub for the DTA::CAB service and provides orthographic normalisation for historical German. |
Functionality: lemmatization, PoS-tagging, normalization | German | This is an abstract framework for robust linguistic annotation, with a public web service that includes normalization and lemmatization for historical German. |
Normo. Functionality: normalization | Hungarian | This tool is an automatic pre-normalizer for Middle Hungarian Bible translations. It employs a memory-based module and a rule-based module, the latter consisting of character- and token-level rewrite rules. The tool was used for building the Old Hungarian Corpus. |
Functionality: OCR, normalization | Icelandic | This tool is a spell-checking application based on a noisy channel model, which can be used to obtain a faithful copy of the original spelling of historical OCR texts and to produce a parallel text with modern spelling (a noisy channel sketch appears below the table). |
Functionality: normalization | Slovenian | This is a trainable tool for text normalisation, based on Moses. |
Turkish Natural Language Processing Pipeline. Functionality: tokenisation, sentence splitting, normalisation, de-asciification, vowelisation, spelling correction, morphological analysis/disambiguation, named entity recognition, dependency parsing | Turkish | This is a pipeline of state-of-the-art Turkish NLP tools. |
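Illustrative Code Sketches

Korektor REST service: the Korektor entry above mentions a REST service with a publicly available API. The sketch below shows one way such a service could be called from Python; the endpoint URL and query parameter are assumptions modelled on the publicly hosted LINDAT/CLARIN instance, so check the service documentation for the exact contract before relying on it.

```python
# Hedged sketch of calling the Korektor REST service. The endpoint URL and the
# "data" query parameter are assumptions modelled on the publicly hosted
# LINDAT/CLARIN instance; consult the API documentation for the exact contract.
import requests

KOREKTOR_URL = "https://lindat.mff.cuni.cz/services/korektor/api/correct"  # assumed endpoint

def correct_czech(text: str) -> dict:
    """Send text to the (assumed) correction endpoint and return the raw JSON reply."""
    response = requests.get(KOREKTOR_URL, params={"data": text})  # assumed parameter name
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Deliberately misspelled Czech input; the service should return a corrected form.
    print(correct_czech("Přílyš žluťoučky kůň"))
```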
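TICCL frequency aggregation: the TICCL entry describes how the frequency of a normalised word form is the sum of the frequencies of the variant forms attested in the corpus. The sketch below illustrates only that aggregation step, with an invented variant mapping; TICCL itself derives the mapping from character-confusion and edit-distance evidence.

```python
# Sketch of the frequency-aggregation step described for TICCL:
# the frequency of a normalised form is the sum of the frequencies of the
# corpus variants mapped onto it. The variant mapping is given here;
# TICCL derives it automatically from the corpus.
from collections import Counter

corpus_tokens = ["regeering", "regeermg", "regeering", "regering", "besluit"]
variant_to_canonical = {"regeering": "regering", "regeermg": "regering"}  # illustrative

raw_freq = Counter(corpus_tokens)
normalised_freq = Counter()
for word, freq in raw_freq.items():
    normalised_freq[variant_to_canonical.get(word, word)] += freq

print(raw_freq)         # Counter({'regeering': 2, 'regeermg': 1, 'regering': 1, 'besluit': 1})
print(normalised_freq)  # Counter({'regering': 4, 'besluit': 1})
```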
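Edit distance and phonetic matching: the English spelling normaliser combines letter-replacement rules, phonetic matching (extended Soundex), edit distance and variant mappings. As a rough, simplified illustration of how such evidence can be combined, the sketch below ranks invented candidate modern forms by Levenshtein distance and breaks ties with a basic (non-extended) Soundex code; the real tool's rules and mappings are considerably richer.

```python
# Rough sketch of ranking normalisation candidates with edit distance plus a
# basic Soundex tie-breaker. The candidate lexicon is invented; the real tool
# uses letter-replacement rules, an *extended* Soundex and curated mappings.

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def soundex(word: str) -> str:
    """Very basic Soundex: first letter plus up to three consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        last = code
    return (encoded + "000")[:4]

def best_candidate(variant: str, lexicon: list[str]) -> str:
    """Prefer the smallest edit distance, then a matching Soundex code."""
    return min(lexicon, key=lambda w: (levenshtein(variant, w),
                                       soundex(variant) != soundex(w)))

print(best_candidate("moste", ["must", "most", "mote"]))  # 'most'
```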
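Noisy channel scoring: the Icelandic spell-checker is based on a noisy channel model, in which a candidate form w for an observed token x is chosen to maximise P(w)·P(x|w), a language-model prior times an error (channel) model. The toy sketch below shows only that decision rule, with invented probabilities; the real system estimates both distributions from data.

```python
# Toy illustration of the noisy channel decision rule:
#   best_w = argmax_w P(w) * P(x | w)
# Both distributions here are invented numbers for a single OCR token;
# a real system estimates them from corpora and aligned error data.
import math

observed = "regeermg"                                 # OCR output
prior = {"regering": 0.7, "regeering": 0.3}           # language-model prior P(w)
error_model = {                                       # channel model P(x | w)
    ("regeermg", "regering"): 0.01,
    ("regeermg", "regeering"): 0.05,   # 'in' misread as 'm' is a plausible OCR confusion
}

def score(candidate: str) -> float:
    return math.log(prior[candidate]) + math.log(error_model[(observed, candidate)])

best = max(prior, key=score)
print(best)   # 'regeering' under these toy numbers
```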
Publications
[Betti et al. 2017] Arianna Betti, Martin Reynaert, and Hein van den Berg. 2017. @PhilosTEI: Building Corpora for Philosophers. In CLARIN in the Low Countries, edited by Jan Odijk and Arjan van Hessen. London: Ubiquity Press.
[Brugman et al. 2016] Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang, and Antal van den Bosch. 2016. Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora. In Proceedings of LREC 2016, 1277–1281.
[Eryiğit 2014] Gülşen Cebiroğlu Eryiğit. 2014. ITU Turkish NLP Web Service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).
[Jurish 2012] Bryan Jurish. 2012. Finite-State Canonicalization Techniques for Historical German. PhD dissertation. Universität Potsdam.
[Ljubešić et al. 2016] Nikola Ljubešić, Katja Zupan, Darja Fišer, and Tomaž Erjavec. 2016. Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 146–155.
[Reynaert et al. 2015] Martin Reynaert, Maarten van Gompel, Ko van der Sloot, and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries.
[Reynaert 2010] Martin Reynaert. 2010. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition 14 (2): 173–187.
[Richter et al. 2012] Michal Richter, Pavel Stranak, and Alexandr Rosen. 2012. Korektor – A System for Contextual Spell-checking and Diacritics Completion. In Proceedings of COLING 2012, 1019–1027.
[Vadász and Simon 2018] Noémi Vadász and Eszter Simon. 2018. NORMO: An Automatic Normalization Tool for Middle Hungarian. In Proceedings of the Second Workshop on Corpus-Based Research in the Humanities, 227–236.