
Tour de CLARIN: BTB-Pipe, a language pipeline for Bulgarian

Submitted by Jakob Lenardič on

Blog post written by Petya Osenova and Kiril Simov, edited by Darja Fišer and Jakob Lenardič


The BTB-Pipe language pipeline for Bulgarian has been developed incrementally over the last twenty years, starting with the Bulgarian-German BulTreeBank project for the creation of a Bulgarian treebank. The BTB-Pipe comprises the following modules:

  • Tokenizer and sentence splitter
  • Morphosyntactic tagger
  • Lemmatizer
  • Dependency parser

Bulgarian is an analytical language with rich word inflection, predominantly in the verbal system. This rich morphology inevitably leads to considerable morphological ambiguity, so morphosyntactic tagging is more complex for Bulgarian than for languages like English. The tagger in BTB-Pipe is a hybrid system that combines a rule-based module with a statistical module (Simov and Osenova 2001) and uses the BulTreeBank Morphosyntactic Tagset (Simov, Osenova, and Slavcheva 2004).

The lemmatizer in BTB-Pipe comprises a set of transformation rules derived from an inflectional lexicon (Popov, Simov, and Vidinska 1998). Since the rules in the lexicon are implemented through the CLaRK system, they can also be applied to unknown words in order to guess their lemmas.

The following is an illustrative example of a lemmatization rule:

if pos-tag = Vpitf-o1s then

   { remove -ох; concatenate -а }

When the lemmatizer applies this rule to the verb form четох (roughly /četoh/), where the inflection -ох encodes the features 1st person singular and the past indefinite tense (“I read”), it produces the lemma чета (/četa/).
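The rule above can be sketched in a few lines of Python. This is only an illustration of the idea of tag-conditioned suffix rewriting, not the CLaRK implementation used in BTB-Pipe: the `RULES` table and the `lemmatize` function are hypothetical names, and the appended ending -а is inferred from the example четох → чета.

```python
from typing import Dict, Tuple

# Hypothetical rule table: morphosyntactic tag -> (ending to remove, ending to append).
# The tag and the removed ending come from the rule in the text; the appended
# ending "а" is an assumption inferred from the example четох -> чета.
RULES: Dict[str, Tuple[str, str]] = {
    "Vpitf-o1s": ("ох", "а"),
}


def lemmatize(form: str, tag: str) -> str:
    """Return a lemma guess for `form`, or the form itself if no rule applies."""
    rule = RULES.get(tag)
    if rule is None:
        return form  # no rule for this tag
    remove, append = rule
    if not form.endswith(remove):
        return form  # the rule's ending does not match this word form
    return form[: len(form) - len(remove)] + append


print(lemmatize("четох", "Vpitf-o1s"))  # чета
```

Because the rule depends only on the tag and the ending, the same function also returns a guess for a form that is absent from the lexicon, for example `lemmatize("плетох", "Vpitf-o1s")`, which is how rules of this kind can be applied to unknown words.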

The parser uses MaltParser and the Mate Dependency Parser to train dependency parsing models. Its input is the output of the tagger and the lemmatizer, and its output is a dependency tree for each sentence in the text, using either an internal set of dependency relations developed for the CoNLL 2006 Shared Task or the Universal Dependencies relations.

The current version of BTB-Pipe can be used in three different modes: as a standalone application, as a command-line tool, and as a web service. The output of the pipeline can be in the WebLicht standard developed within CLARIN-D (Hinrichs et al. 2010) or in the NAF format (Fokkens et al. 2014). Currently, ClaDA-BG is redesigning and reimplementing some of the modules using spaCy with the goal of improving the performance of the pipeline.


Linguistic Annotation in BTB-Pipe

References

Antske Fokkens, Aitor Soroa, Zuhaitz Beloki, German Rigau, Willem Robert van Hage, and Piek Vossen. 2014. NAF: the NLP Annotation Format. Technical Report NWR-2014-3. Version 1.1. NewsReader project: Building structured event indexes of large volumes of financial and economic data for decision making - ICT 316404.

Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow. 2010. WebLicht: Web-Based LRT Services for German. In Proceedings of the ACL 2010 System Demonstrations, pages 25-29, Uppsala, Sweden.

Dimitar Popov, Kiril Simov, and Svetlomira Vidinska. 1998. A Dictionary of Writing, Pronunciation and Punctuation of Bulgarian Language. Atlantis LK, Sofia, Bulgaria.

Kiril Simov, Petya Osenova, and Milena Slavcheva. 2004. BTB:TR03: BulTreeBank morphosyntactic tagset BTB-TS version 2.0. Technical Report.

Kiril Simov and Petya Osenova. 2001. A Hybrid System for MorphoSyntactic Disambiguation in Bulgarian. In Proceedings of the RANLP 2001 Conference, pages 288-290, Tzigov Chark, Bulgaria, 5-7 September 2001.

