Tour de CLARIN: The COCOON Factory

Submitted by Jakob Lenardič on 3 February 2021

Written by Nicolas Larouse and Christophe Parisse

The COCOON platform is aimed at individual researchers and research teams in the humanities and social sciences for the management of their digital oral resources. It combines the functionalities of a data repository, archive and discovery portal. 

COCOON contains a large collection of 13000 recordings and 4600 transcriptions, amounting to 5600 hours of speech in 248 different languages from all over the world, as well as historical French language data, all of which can be browsed and downloaded. Prominent collections include corpora prepared on the basis of dialectological surveys carried out in France and elsewhere, such as the Linguistic and ethnographic atlas of Gascony, the Corpus of Parisian spoken French from 2000 onwards, and the Oral Corpus of Afro-Asian Languages. The COCOON corpora can be found through search engines such as CLARIN's Virtual Language Observatory. The transcriptions can also be searched through CLARIN's Federated Content Search.

COCOON also offers a chain of services for navigating and archiving the oral resources. A prominent navigation tool is the geographic search function, which provides a scalable map that shows the distribution of the oral recordings both across the world (Figure 1) and, when zoomed in, within each country (Figure 2). Each recording can be listened to directly from the map and comes with metadata showing the year the recording took place, information about the collection that contains it, and biographical information about the speaker. Thanks to such detailed metadata, COCOON is particularly valuable for typologists and sociolinguists and as such represents a key tool for the linguistics community in France.

Figure 1: The distribution of the COCOON oral resources across the world; for instance, there are 6810 oral recordings in Europe.

As an archiving tool, COCOON ensures that researchers' data (mainly audio or video recordings) are automatically normalized into formats supported by the system's archiving operator, CINES (National Computer Center for Higher Education); the list of supported formats is available online. A lower-quality copy in a broadcast format is also produced to facilitate use on the web.

Figure 2: The distribution of the resources in Paris, Orleans, and surrounding areas.

COCOON has been successfully re-used in several research projects. For instance, researchers in the Epic Nepal project have taken existing COCOON resources as a basis for building a corpus of huḍkelī, which are sung performances of oral epics in the shamanic tradition of Western Nepal. These recordings can now be accessed online with the Karaoke Tool in COCOON, a user-friendly online interface for listening to the recordings, which are presented with sentence-aligned transcriptions of the original sung Nepalese, their translations, and annotations for the musical instruments. Similarly, the Pangloss collection, an online archive of the endangered and under-documented languages of the world, has been built from COCOON resources and can also be listened to with the Karaoke Tool.