
Tour de CLARIN: CLARIN-LT presents a Lithuanian Dependency Treebank

Submitted by Jakob Lenardič on

Blog post written by Agnė Bielinskienė


ALKSNIS is a syntactically annotated corpus of Lithuanian that serves as a gold standard for the syntactic analysis of the language. It currently consists of 2,355 syntactically annotated sentences in the PML (Prague Markup Language) format, which allows researchers to visualise and edit the syntactic trees with the tree editor TrEd.


Figure 1. Using TrEd to show the syntactic structure of a sentence from the ALKSNIS corpus

Figure 1 above shows the syntactic tree of the Lithuanian sentence Kovo mėnesį Lietuvos gyventojai labiausiai pasitikėjo Priešgaisrine gelbėjimo tarnyba (“In March, Lithuanian residents mostly trusted the Fire and Rescue Service”), as presented by TrEd. Each node corresponds to a word, a punctuation mark or another text element (symbol, digit, etc.) in the sentence, while the links show the syntactic dependencies. The list of abbreviations for syntactic labels and the presentation of the syntactic relations and dependencies draw on the experience of Czech researchers (Hajič et al. 1999). The editor TrEd presents the following information for each node:

  1. the form used in the sentence (e.g., gyventojai “residents” in the given example);

  2. the corresponding lemma (e.g., gyventojas “resident”, which is the singular form of the plural gyventojai);

  3. the morphology tag (e.g., gyventojai “residents” has the tag Ncmpnn-, whose seven characters stand, in order, for noun, common, masculine, plural, nominative, non-reflexive and, for the final -, indistinctive); and

  4. the syntactic function (e.g., gyventojai “residents” is the grammatical subject in the given example).
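
The positional reading of the morphology tag in item 3 can be sketched in a few lines of Python. This is only an illustration: the lookup table below contains just the seven values glossed in the example above, not the full ALKSNIS tagset, and the names EXAMPLE_GLOSSES and decode_tag are hypothetical helpers introduced here.

```python
# One lookup table per tag position. Only the values glossed in the text
# for the tag "Ncmpnn-" are included; a real tagset has many more values.
EXAMPLE_GLOSSES = [
    {"N": "noun"},
    {"c": "common"},
    {"m": "masculine"},
    {"p": "plural"},
    {"n": "nominative"},
    {"n": "non-reflexive"},
    {"-": "indistinctive"},
]


def decode_tag(tag):
    """Return one gloss per character of a seven-character positional tag."""
    if len(tag) != len(EXAMPLE_GLOSSES):
        raise ValueError(f"expected {len(EXAMPLE_GLOSSES)} characters, got {len(tag)}")
    # Characters missing from this small table are reported as unknown
    # rather than guessed.
    return [
        table.get(char, f"unknown({char})")
        for char, table in zip(tag, EXAMPLE_GLOSSES)
    ]


print(", ".join(decode_tag("Ncmpnn-")))
```

The same character can mean different things in different positions (here n is "nominative" in the fifth slot and "non-reflexive" in the sixth), which is why the sketch keeps a separate table per position instead of one global mapping.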

The corpus can also be searched via the ANNIS interface (Krause and Zeldes, 2016). The interface visualises the syntactic dependencies of a sentence and lists its morphosyntactic features. Figure 2 shows this for the passage Patalpos jau išnuomotos. Taip pat jau rezervuota pusė ploto kitais metais iškilsiančiame statinyje. Dauguma didmenine (“The premises have already been leased. Also, half of the area of the building to be finished next year has already been reserved. Mostly wholesale”).


Figure 2. Using ANNIS to display the syntactic annotation of a sentence in ALKSNIS

So far, the syntactically annotated corpus has been used successfully by different user groups. For example, at Vytautas Magnus University, students are taught to work with ALKSNIS as part of the curriculum and use corpus data for assignments and for their theses. Kristina Brokaitė’s Master’s thesis, for instance, used the corpus to analyse the grammatical forms of complex and non-complex predicates in Lithuanian.

The corpus will be enriched with new texts and converted to the Universal Dependencies (UD) format. The CoNLL-U format defined by the UD guidelines will serve as the core version of the ALKSNIS treebank. We also plan to annotate the corpus for multiword expressions (MWEs; see also Lithuania’s Tour de CLARIN post on Colloc, a tool for annotating MWEs). This will make the corpus more useful for parsing and for data-driven MWE processing models, and will give linguists information about the syntactic behaviour of Lithuanian MWEs. Finally, a syntactic parser will be trained on the ALKSNIS corpus.
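
To give a rough idea of what a CoNLL-U version of the treebank would look like, the sketch below reads a single token line in Python. CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The example line is an assumption made for illustration: it was built by hand from the word gyventojai discussed above and is not taken from the converted corpus, and the function name parse_token_line is hypothetical.

```python
# The ten CoNLL-U columns, in the order defined by the UD guidelines.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    token = dict(zip(FIELDS, columns))
    # FEATS is a "|"-separated list of Name=Value pairs, or "_" when empty.
    feats = {}
    if token["feats"] != "_":
        for pair in token["feats"].split("|"):
            name, value = pair.split("=", 1)
            feats[name] = value
    token["feats"] = feats
    return token


# Hand-made example line (an assumption, not real treebank data): the
# fourth word of the sentence, attached as subject to the sixth word.
example_line = "\t".join([
    "4", "gyventojai", "gyventojas", "NOUN", "Ncmpnn-",
    "Case=Nom|Gender=Masc|Number=Plur", "6", "nsubj", "_", "_",
])

token = parse_token_line(example_line)
print(token["form"], token["lemma"], token["head"], token["deprel"])
```

In such a line the existing morphology tag could be kept in the XPOS column while the UD-style features go in FEATS, although how the conversion will actually map the ALKSNIS annotation is not specified here.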

References:

Bielinskienė A., Boizou L., Kovalevskaitė J., and Rimkutė E. 2016. Lithuanian Dependency Treebank ALKSNIS. In Proceedings of the Seventh International Conference Baltic HLT 2016. Amsterdam: IOS Press, 107–114. http://ebooks.iospress.nl/volumearticle/45523

Hajič J., Panevová J., Buráňová E., Urešová Z., and Bémová A. 1999. Annotations at Analytical Level. Instructions for Annotators (11.10.1999). Praha: UK MFF ÚFAL.

Krause T. and Zeldes A. 2016. ANNIS3: A New Architecture for Generic Corpus Query and Visualization. Digital Scholarship in the Humanities 31(1). http://dsh.oxfordjournals.org/content/31/1/118
 

