
Tour de CLARIN: CLARIN Knowledge Centre for Systems and Frameworks for Morphologically Rich Languages (SAFMORIL)

Submitted by Jakob Lenardič on

 


Written by Erik Axelson, Sjur Moshagen, Jurgita Vaičenonienė, Inguna Skadina, Therese Lindström Tiedemann, and Krister Lindén
 
SAFMORIL was officially recognized as a K-centre by CLARIN on June 19, 2019. The K-centre operates as a distributed virtual centre supported by four CLARIN member institutions: FIN-CLARIN, CLARINO, CLARIN-LV, and CLARIN-LT.
 

SAFMORIL brings together linguists as well as researchers and developers in the area of computational morphology and its application in language processing. The focus of SAFMORIL is on actual, working systems and frameworks that are based on linguistic principles and that provide linguistically motivated analysis and/or generation on the basis of linguistic categories. Such systems are particularly relevant for languages with rich morphologies, as is the case for Nordic and Baltic languages (such as Finnish, Swedish, Norwegian, Latvian, Lithuanian as well as the Sámi languages) and more generally Fenno-Ugric languages, Inuit languages, Canadian First Nation languages and Babylonian languages.
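To make "analysis and/or generation on the basis of linguistic categories" concrete, here is a minimal toy sketch in Python. It is not how HFST or GiellaLT work internally; real systems compile full lexicons and rules into finite-state transducers. The stem list, the suffix table, the tag names (+N, +Sg, +Ine and so on) and the function names analyse and generate are all assumptions made for illustration. The sketch covers only four forms of two Finnish nouns and ignores vowel harmony and consonant gradation.

```python
# Toy morphological analyser/generator for a handful of Finnish noun forms.
# Illustration only: stems, suffixes and tag names are assumptions, and the
# sketch ignores vowel harmony, consonant gradation and everything else a
# real finite-state morphology has to handle.

# Hypothetical mini-lexicon: stem -> part-of-speech tag.
STEMS = {"talo": "N", "auto": "N"}  # 'house', 'car'

# Hypothetical suffix table: surface suffix -> morphological tags.
SUFFIXES = {
    "": "+Sg+Nom",     # talo    'house'
    "n": "+Sg+Gen",    # talon   'of the house'
    "ssa": "+Sg+Ine",  # talossa 'in the house'
    "t": "+Pl+Nom",    # talot   'houses'
}


def analyse(word):
    """Map a surface form to all matching analyses (lemma plus tags)."""
    analyses = []
    for stem, pos in STEMS.items():
        if word.startswith(stem):
            suffix = word[len(stem):]
            if suffix in SUFFIXES:
                analyses.append(f"{stem}+{pos}{SUFFIXES[suffix]}")
    return analyses


def generate(analysis):
    """Map an analysis string such as 'auto+N+Pl+Nom' to surface forms."""
    stem, pos, *tags = analysis.split("+")
    if STEMS.get(stem) != pos:
        return []
    tag_string = "".join("+" + tag for tag in tags)
    return [stem + suffix for suffix, t in SUFFIXES.items() if t == tag_string]


if __name__ == "__main__":
    print(analyse("talossa"))
    print(generate("auto+N+Pl+Nom"))
```

The same table is used in both directions, which is the property that makes transducer-based morphologies attractive: one linguistic description serves analysis and generation alike.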

SAFMORIL offers online courses for developing and teaching morphologies, tokenizers and spell-checkers, a repository for storing morphologies, and an environment for creating tokenizers and spell-checkers. SAFMORIL serves linguists and computational linguists developing and adapting morphologies as well as digital humanities scholars, linguists, and computer scientists processing language data. Researchers are welcome to get in touch with SAFMORIL regarding any matters related to morphology (computational or otherwise) via the SAFMORIL Helpdesk (safmoril[at]kielipankki[dot]fi).

The four member institutions of SAFMORIL (FIN-CLARIN, CLARINO, CLARIN-LV, and CLARIN-LT) each offer their own technologies and services for working with morphology. The Finnish member FIN-CLARIN focuses on creating novel morphology systems and frameworks. The two main tools that FIN-CLARIN contributes as a member of SAFMORIL are Mylly, which is used for analyzing and visualizing data sets, and HFST (Helsinki Finite-State Technology), which is compilation and runtime software that comes with some source morphologies.

An exercise from Finland Swedish Online which focuses on morphology.

FIN-CLARIN offers online tutorials for XFST-based Morphology Development (by Erik Axelson, Kimmo Koskenniemi and Mathias Creutz at the University of Helsinki) and Morphology Construction (developed by Jack Rueter at FIN-CLARIN), as well as documentation for experimental two-level rule compilation using Python HFST (by Kimmo Koskenniemi at FIN-CLARIN). Lastly, FIN-CLARIN offers Finland Swedish Online, a free online course in Swedish as spoken in Finland. It is modelled on Icelandic Online and includes a variety of texts, videos, sound clips and exercises to help you learn Swedish. Morphology is practised implicitly through reading and listening to Swedish, with care taken to repeat the forms and patterns being practised, and it can also be practised in self-correcting exercises. Finland Swedish Online currently consists of two courses; a third course and a special course designed for librarians are to be launched soon.

CLARINO contributes to SAFMORIL mainly via the Arctic University of Norway (UiT), which is one of the institutions comprising the Norwegian CLARIN consortium. UiT offers the GiellaLT infrastructure, which hosts language resources and tools for more than 140 different languages in more than 180 repositories. The infrastructure includes a development environment for morphologies and morphology-based tools and the morphology teaching service GiellaLT ICALL, and offers tutorials for making computer tools for your language. In addition, the aforementioned HFST (Helsinki Finite-State Technology) toolkit has been applied extensively in the GiellaLT infrastructure, and is also a core part of the proofing tools provided by it.

Find forms of the verb 'būt' (to be) together with its morphological annotation in the LVK 2018 corpus (http://nosketch.korpuss.lv/#dashboard?corpname=LVK2018).

As part of SAFMORIL, CLARIN-LV aims to provide support not only for Latvian as a morphologically rich language, but also for other morphologically rich languages spoken and researched in Latvia (e.g., Latgalian). The CLARIN-LV repository includes LVK2018, a morphologically annotated 10-million-word corpus of modern Latvian; the Saeima corpus of parliamentary proceedings; Senie, a 900-word corpus of historical Latvian texts from the 16th to the 18th century; and the Latgalian language corpus MuLa. To support users of digital resources for the Latvian language, CLARIN-LV organizes practical workshops and hands-on sessions on different topics (e.g., regular expressions, morphological annotation, and searching in syntactically annotated corpora). All seminar materials are available on the CLARIN-LV website. These materials are actively used in different courses at the University of Latvia and Liepāja University, as well as at Digital Humanities summer schools.

A screenshot of the North Sámi grammar checker, highlighting a congruence error and a correction suggestion.

In 2020, another morphologically rich language, Lithuanian, was included in SAFMORIL. Although the CLARIN-LT consortium already had a Helpdesk on corpus linguistics and natural language processing methods for Lithuanian, the team of Lithuanian researchers was very glad to share its knowledge of Lithuanian morphology, syntax, semantics and tools for linguistic analysis (e.g., those produced within the project SEMANTIKA-2) with an international audience. As a member of SAFMORIL, the CLARIN-LT team looks forward to exchanging experiences, opening new opportunities for cooperation, and further developing resources and tools relevant for the analysis of morphologically rich languages.