Tour de CLARIN: CoLaJE Corpus

Submitted by Jakob Lenardič on 1 March 2021

Written by Nicolas Larousse and Christophe Parisse. 

One of the most prominent CLARIN-FR resources is the CoLaJE corpus, which was developed as part of the ANR project CoLaJE. The corpus pieces together the emergence and development of communication and language in young children, using an interdisciplinary and multimodal approach. The project provides a crossroads for researchers and artists from all over the world (France, Belgium, Italy, England, the United States, Canada, Brazil) and from a variety of fields (linguists, psychologists, speech therapists, filmmakers, musicians, composers) whose paths lead them to child language.

The corpus builds upon a shared database made up of the first-ever longitudinal recordings of children's spontaneous productions from age 1 to age 7 and consists of data from children learning French as well as French Sign Language. The corpus is available for download from the ORTOLANG repository and can be accessed online through a companion website. It contains more than 250 hours of video-recorded spontaneous child-parent interaction, and the available transcriptions in the CHAT format (CHILDES) contain more than 1 million words. All the data is available for research and teaching purposes, and it has already been used for several PhD theses and master's dissertations.

The corpus has been used to analyze many features of language development. The simultaneous study of phonology, prosody, gesture, and dialogue has allowed the project team to enrich their perspective on the linguistic development of children. For instance, one working group that was part of the original CoLaJE project team successfully used the corpus to examine the interaction between prosody and gesture in deaf-signing children. Another group focused on the emergence of discursive, pragmatic, and intersubjective competence in children’s communication, while a third group studied the development of grammatical markers, such as the acquisition of the French verbal system, through a comparative study of all children included in the corpus.

The data in the corpus has also been used by external researchers working with various methodologies, emphasizing the multimodality of language development. A large number of publications have used the CoLaJE corpus, and a special issue of the journal French Language Studies (volume 22, issue 1) presents some of the main developments in the research based on this corpus. Amongst others, Leroy-Collombel and Morgenstern (2012) have used the corpus to prepare a longitudinal study showing how a French-speaking child acquired possessive markers between the ages of one and a half and three, while Sekali (2012) has used the corpus to study the acquisition of French adverbial clauses. A more detailed description of the corpus can be found in Morgenstern and Parisse (2012).