Tour de CLARIN: Bulgarian Child Language Corpus

Submitted by Jakob Lenardič on 16 November 2019

Blog post written by Petya Osenova and Kiril Simov, edited by Darja Fišer and Jakob Lenardič

The first systematic study of child speech in Bulgaria is attributed to the early-20th-century philosopher Prof. Ivan Georgov, but interest in child language truly took off in the last decade of the 20th century and the beginning of the 21st with the work of Yuliana Stoyanova and Velka Popova on longitudinal data. The contemporary systematic study of child speech, based on solid empirical material, is also associated with the creation of the first Bulgarian corpus in the CHILDES framework by the research team of the Laboratory of Applied Linguistics (LABLING) at Shumen University. The corpus is based on an array of longitudinal data from Popova’s personal archive.

The CHILDES framework is known for its openness and rationality, which are leading factors in the processes of cooperation and globalization in the Humanities. This guarantees both the broad social validity of corpus-based research results and their integration into initiatives for exchanging linguistic data and technologies aimed at overcoming the current fragmentation of the research field. Moreover, CHILDES and its sister initiative TalkBank are already integrated into CLARIN as one of the Knowledge Centres. The Bulgarian child language corpus enables cross-lingual research and contributes to a modern, convenient standard for the study of linguistic ontogeny. Thanks to its universal parameters, this standard enables rapid, accurate and reliable comparison with a large number of languages and the development of solid typologies and modern theories.

The corpus comprises two types of speech resources: CORPUS A (spontaneous speech material from four children at an early age, from one to three years old) and CORPUS B (stories based on a series of pictures, collected from 90 children of pre-school age, from three to six years old). For the sake of integrity and processing, the speech resources are presented in two formats: Cyrillic and Latin. Figure 1 shows how the speech of two children is encoded.

Figure 1: Excerpt from the Bulgarian CHILDES corpus

Future development of the corpus includes annotation with part-of-speech and morphological information, and integration with the WebCLaRK online service, a Bulgarian portal for language services on the web. Video data also exists for the same material; its processing is in progress, and it will be included in one ClassTalk session in the TalkBank database. The data comprises recorded classroom interactions in a number of kindergarten groups. Transcription of the video data will follow the same basic principles as audio transcription. These Bulgarian corpora could be used not only for research on classroom interactions between the teacher and children, but also as sample material for training students of pedagogy.
