Tour de CLARIN: The CLARIN Knowledge Centre for South Slavic Languages (CLASSLA)

Submitted by Jakob Lenardič on 18 November 2021

Written by Nikola Ljubešić, Taja Kuzman, Tomaž Erjavec, and Petya Osenova

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) was recognized by CLARIN on 19 March 2019. The centre offers support for the automated processing of South Slavic languages and is operated jointly by the Slovenian CLARIN.SI and the Bulgarian CLaDA-BG.

CLASSLA recognizes the need to develop language resources and technologies not only for Slovene and Bulgarian, but also for the other under-resourced South Slavic languages. The centre therefore aims to support researchers in Computational and Corpus Linguistics and Digital Humanities, as well as interested individuals from other scientific and business areas, who use and produce language data for Slovene, Croatian, Serbian, Bosnian, Montenegrin, Macedonian, and Bulgarian.

Space for a productive cooperation of small nations

The languages supported by CLASSLA each have a relatively small number of speakers. Estimates of the worldwide number of speakers range from fewer than half a million for Montenegrin (Hlavac 2013), through 1.4 million for Macedonian (Wikipedia) and 2.5 million each for Slovene (Krek 2012) and Bosnian (Hlavac 2013), to over 5.5 million for Croatian (Tadić et al. 2012) and around 9 million each for Serbian (Hlavac 2013) and Bulgarian (Blagoeva et al. 2012). Together, the seven CLASSLA languages are used by around 30 million speakers. They form a dialect continuum with varying degrees of mutual intelligibility between neighbouring languages.

Figure 1: Map of the countries in which a South Slavic language is the most prominent. The size of the circles represents the number of speakers worldwide and the colour encodes the level of similarity between languages (created with Datawrapper).

Producing resources and tools for South Slavic languages costs just as much as producing them for global languages such as English, which has more than a billion speakers. Despite the small number of speakers and the consequently very small language technology communities, South Slavic languages must be supported with the same technologies as global languages if they are to maintain an equal status in future digital environments. This is where CLASSLA plays an important role. The knowledge centre provides a space for cooperation among researchers interested in any of the South Slavic languages, and for a rational and economical approach to solving common problems, especially in light of the mutual intelligibility of most of the languages.

To stimulate the development of language resources and technologies, CLASSLA provides information on freely available dictionaries, corpora, concordancers, (manually annotated) datasets, tools, and pipelines. The information is provided in the form of frequently asked questions (FAQs) and is aimed at both non-technical and more technically educated audiences. FAQs are currently available for Slovene, Croatian, Serbian, Macedonian, and Bulgarian. The information is regularly updated to cover newly emerging resources and technologies.

Figure 2: List of topics covered in the FAQs.

In addition, CLASSLA supports researchers in producing resources and technologies for South Slavic languages via its help desk, which can be contacted at helpdesk.classla [at] clarin.si. So far, it has provided individual help to more than 50 researchers.

To share knowledge and enlarge the South Slavic language technology community, CLASSLA organizes workshops and raises awareness of its activities at home and abroad. The first CLASSLA workshop, held in 2020, brought together 42 researchers.

Figure 3: The first CLASSLA workshop, which took place in May 2020.

Developing and providing freely available technologies and resources for under-resourced languages

The findings of the META-NET White Paper Series “Europe's Languages in the Digital Age” revealed that although some language technologies and resources exist for Slovene, Croatian, Serbian, and Bulgarian, these languages are only fragmentarily or weakly supported by machine translation, speech recognition, and text analysis (Krek 2012). CLASSLA aims to help close the technological gap between South Slavic languages and Western European languages by giving researchers easy and long-term access to language resources via the CLARIN.SI repository. The repository comprises, for example, literary, news, web, spoken, parallel, parliamentary, and computer-mediated communication corpora. For some of the South Slavic languages, these corpora are the first of their kind ever made. For instance, the first ever linguistically annotated Macedonian corpus (CLASSLAWiki-mk) was created in 2020 as part of CLASSLA’s project of generating Wikipedia corpora in seven South Slavic languages. All the corpora can be downloaded and used locally, or queried via the two CLARIN.SI concordancers, NoSketch Engine and KonText.

The recently built CLASSLA neural pipeline, an adaptation of the highly popular Stanza package, offers state-of-the-art language processing of Slovene, Croatian, Serbian, Macedonian, and Bulgarian. The pipeline handles both standard and non-standard language and covers the steps from tokenization to syntactic parsing and named entity recognition for most of the supported languages, with semantic parsing currently being added. It is designed to suit the needs of researchers with various backgrounds, from non-technical linguists, who can simply run the pipeline as described in the documentation, to more technically sophisticated engineers, who can use it to train their own language models. This year, a state-of-the-art transformer model, BERTić, was also trained; it covers Bosnian, Croatian, Montenegrin, and Serbian. Transformer models are large language models consisting of millions or even billions of parameters. They produce a general numerical representation of a portion of text, which is then used for various tasks, from part-of-speech tagging, through text classification and machine translation, to text summarization and question answering.

Figure 4: The user-friendly CLASSLA pipeline code (example for Macedonian) brings language processing closer to the non-technical researchers.
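For readers without access to the figure, running the pipeline might look roughly like the sketch below. It assumes the `classla` Python package is installed and that its Stanza-style calls (`classla.download`, `classla.Pipeline`, and the document's `to_conll` method) behave as in the package documentation. The two-letter language codes are likewise taken from that documentation rather than from this article, and the helper names `check_language` and `annotate_conllu` are hypothetical, introduced only for illustration.

```python
# Sketch only: assumes the `classla` package is installed (pip install classla)
# and that its models can be downloaded. Helper names are hypothetical.

# Assumed language codes for the five languages the pipeline is said to support.
SUPPORTED_LANGUAGES = {
    "sl": "Slovene",
    "hr": "Croatian",
    "sr": "Serbian",
    "mk": "Macedonian",
    "bg": "Bulgarian",
}


def check_language(lang: str) -> str:
    """Return the language name for a code, or raise ValueError if unsupported."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return SUPPORTED_LANGUAGES[lang]


def annotate_conllu(text: str, lang: str = "mk") -> str:
    """Annotate text with the default pipeline and return CoNLL-U output."""
    check_language(lang)
    import classla  # imported lazily so the helpers above work without it

    classla.download(lang)        # fetch the standard models for the language
    nlp = classla.Pipeline(lang)  # build the default pipeline for that language
    doc = nlp(text)               # run the configured processors on the input
    return doc.to_conll()         # serialize the annotations as CoNLL-U text
```

Under these assumptions, calling `annotate_conllu` with a Macedonian sentence and the code `"mk"` would return the annotated sentence as a CoNLL-U string, while an unsupported code fails early with a `ValueError` before any model is downloaded.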

In the two years since the inception of CLASSLA, South Slavic languages have gained support from many new technologies and resources, and many more are planned for the near future. CLASSLA is currently part of the MaCoCu project, which will produce large, high-quality monolingual and bilingual web corpora for under-resourced languages, South Slavic languages included. CLASSLA is also following current advances in speech technologies and is working to ensure that South Slavic languages receive technological coverage in that area comparable to that of their larger counterparts. Finally, CLASSLA plans to add a newsflash and other dissemination channels to continue supporting and enlarging the South Slavic language technology community.

References:

Blagoeva, D., S. Koeva, and V. Murdarov. 2012. The Bulgarian Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.

Hlavac, J. 2013. Interpreting in one’s own and in closely related languages: Negotiation of linguistic varieties amongst interpreters of the Bosnian, Croatian and Serbian languages. Interpreting 15 (1): 94–125.

Krek, S. 2012. The Slovene Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.

Tadić, M., D. Brozović-Rončević, and A. Kapetanović. 2012. The Croatian Language in the Digital Age. META-NET White Paper Series, edited by G. Rehm and H. Uszkoreit. Berlin, Heidelberg: Springer.

Wikipedia contributors, Geographical distribution of Macedonian speakers, Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Geographical_distribution_of_Macedonian_speakers&oldid=1041157079 (accessed 17 September 2021).