Tour de CLARIN: Interview with Gerd Carling

Submitted by Jakob Lenardič on 21 October 2020

Gerd Carling is Associate Professor at the Department of Linguistics at Lund University who specializes in linguistic phylogenetics. She is the main editor of the Diachronic Atlas of Comparative Linguistics, which is hosted by the LUND CLARIN Knowledge Centre. The interview was conducted via email. [Photo: Idun Blomé.]

1. Please introduce yourself. What sparked your interest in historical linguistics? What motivated you to take a digital approach in this field?

My name is Gerd Carling and I work as Associate Professor at the Department of Linguistics at Lund University. My interests within linguistics have always been in the direction of typology and historical linguistics. Even though I began my PhD and postdoctoral studies within a more traditional, philological orientation, I was always interested in the digital aspects of humanities, including both corpus and phylogenetic linguistics. Languages, like biological populations, inherit traits from their ancestors, diverge into different lineages, go extinct, and engage in horizontal transfer. Therefore, linguistic phylogenetics and similar computational historical methods draw extensively on techniques pioneered in biology.

2. You are the main editor of the Diachronic Atlas of Comparative Linguistics (DiACL) database. How did this project come about?

The compilation of a typological and lexical database started as a project involving teachers and students in courses at the Department of Linguistics in Lund. We were all dealing with the typological aspects of language, both of grammar and of lexicon, and our field sites and expert areas were distributed around the world. Some researchers and students were more interested in grammar, others inclined more towards lexicon, some worked on Indo-European modern and ancient languages, and some worked with Austronesian languages, while others worked with Amazonian ones.

To bring all of these approaches together, we decided to compile and pool data in a more consistent way, so that we could more easily collaborate and compare our findings. I developed two projects that allowed us to do this in a more systematic fashion: we were able to hire students to compile data and started planning the database infrastructure. At the beginning (2013–2015), the know-how for building this type of database in Lund was relatively limited, and we discussed several possible solutions for designing the resource, mainly with the Lund University GIS Centre and the IT Department at the Faculty of Humanities and Theology, which provided support for licences and the database infrastructure.

3. What is the role of the Lund University Humanities Lab CLARIN Knowledge Centre in DiACL?

Very early in the project, we ran into a number of difficult obstacles, not least digital ones. The first pilot infrastructure for DiACL, which we built with a programmer from the IT department, was not able to meet the demands of the linguistic database that we wanted to build. Therefore, I had to recruit a programmer from Leiden University with education in historical linguistics, who took over the programming and building of the infrastructure, with the support of programmers from the IT department. The hosting of the database was taken over by the Lund University CLARIN Knowledge Centre Lab together with the Swedish CLARIN consortium. In particular, the building of the lexical database turned out to be very complex and tricky. Last but not least, the migration of data (2016–2017), which had been compiled in CSV files, into the database infrastructure generated an enormous number of errors, which required many rounds of recoding and screening of the database.

The Lund Humanities Lab and the CLARIN consortium played an important role in overcoming these problems, mainly in the later phase of the project (2016–2019), when we worked towards getting the data ready for publication. Researchers from the Humanities Lab, in particular Johan Frid, who is the coordinator of the Lund K-Centre, were very important in designing the code for extracting and analysing the data.

4. Could you briefly describe the make-up of this database (e.g., overall size, the languages and time periods covered by the database)?

The database currently has 569 languages, including 21,542 grammatical data points (i.e., generalizations of grammatical structure in languages) and 72,033 lexemes (words in languages), which are connected by 43,095 cognacy links (connections of words to a joint ancestor in a tree-like structure) and 71,730 links to concepts (connections between a word in a language to a prototype meaning, such as MOTHER). The data spans 25 language families and covers a period of 3,500 years, making it one of the largest diachronic databases, as well as one of the most well-annotated ones.
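As a rough illustration of how the link types described above relate to one another, the following minimal Python sketch models lexemes, concept links, and cognacy links, then groups the words for one concept by shared ancestor. All names, fields, identifiers, and data in it are hypothetical toy examples chosen for illustration; it is not DiACL's actual schema or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """A word in one language (hypothetical fields, not DiACL's schema)."""
    lexeme_id: int
    language: str
    form: str


# Toy data, invented for illustration only.
lexemes = [
    Lexeme(1, "Latin", "mater"),
    Lexeme(2, "Sanskrit", "mātṛ"),
    Lexeme(3, "Old English", "mōdor"),
    Lexeme(4, "Finnish", "äiti"),
]

# Concept links: lexeme id -> prototype meaning.
concept_links = {1: "MOTHER", 2: "MOTHER", 3: "MOTHER", 4: "MOTHER"}

# Cognacy links: lexeme id -> id of a joint ancestor node ("E1" is a
# made-up identifier). Lexeme 4 has no cognacy link in this toy data.
cognacy_links = {1: "E1", 2: "E1", 3: "E1"}


def cognate_sets(concept: str) -> dict[str, list[str]]:
    """Group the languages whose word for `concept` shares an ancestor."""
    sets: dict[str, list[str]] = defaultdict(list)
    for lex in lexemes:
        if concept_links.get(lex.lexeme_id) != concept:
            continue  # this lexeme is linked to a different concept
        ancestor = cognacy_links.get(lex.lexeme_id)
        if ancestor is not None:
            sets[ancestor].append(lex.language)
    return dict(sets)


print(cognate_sets("MOTHER"))
```

In this sketch, a word without a cognacy link is simply left out of every cognate set, which reflects the figures quoted above: there are fewer cognacy links than lexemes, so not every word is connected to an ancestor.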

5. Why is DiACL important for the historical linguistics community? What are the main features of the database that are especially useful for conducting diachronic research in lexicography, phonology or morphosyntax?

The DiACL database is special since it is a joint grammatical and lexical database. In addition, the database has a strictly comparative and diachronic focus. Many other databases of the same type, such as the World Atlas of Language Structures (WALS) or South American Indigenous Language Structures (SAILS), are synchronic and focus on either typology, morphology, or lexicon. Additionally, the DiACL database is now an integrated part of two larger attempts to bring together all similar databases globally – CLICS for lexicon and Grammaticon for grammar.

6. Have you used the DiACL database in any of your own recent research or publications? Could you briefly discuss some of the results?

DiACL has been used in several recent publications, first and foremost the monograph Mouton Atlas of Languages and Cultures (Carling 2019). This monograph compiles and publishes the part of the data that covers Eurasia and discusses both the motivations for this work and findings from the database in a non-technical manner that is accessible to researchers, students, and interested non-academics. A second volume to come will deal with South America in the same fashion.

My colleagues and I have also published several papers on the phylogenetic trends of typology on the basis of our database. For instance, in one paper (Cathcart, Carling et al. 2018), we present a case study using the Indo-European data in DiACL to show that linguistic complexity is dependent on the notion of “areality”, which means that genealogically unrelated languages come to share linguistic features, often because they are spoken in the same geographic area. In another paper (Carling, Larsson et al. 2018), we examine the Eurasian subset of DiACL, which includes Indo-European, adjacent languages from different families, and earlier states of contemporary languages, dead branches, as well as later stages of migrated languages, from the earliest sources up to the modern period, to show how languages developmentally cluster by areality on the one hand and genealogy on the other.

In a coming study, we will demonstrate that universal hierarchies of grammar, connected to the frequency and economy of language (e.g., the singular is more common than plural), affect the general evolution of grammar. In another study, we will show that grammatical gender correlates with environmental aspects and spreads by migration, due to its stability within the language family (i.e., gender systems do not change or disappear). In lexicology we have demonstrated that borrowing depends on cultural factors and is highly affected by sociolinguistic factors, including language size and power (Carling, Cronhamn et al. 2019). Other studies to come will investigate various causes for gender assignment, based on large amounts of lexical data.

7. Do your graduate students use DiACL in their own work, and would you like to highlight any of their recent findings or publications?

The database and the research involved in the database are highly integrated with the work by graduate students. Often, students work in parallel on a project involving the big data of the database, and their own, more limited datasets, which focus on an individual language or a branch of a family. One MA student, Anne Goergens, investigated alignment in Arawakan languages. Another student, Sandra Cronhamn, looked at lexical borrowing in Tupí, using DiACL data. Filip Larsson wrote his MA thesis on areal structure in typology. All based their work on DiACL data.

8. Aside from DiACL, have you collaborated with the Lund University Humanities Lab in any other research project? And could you please briefly describe what you’ve done in this regard?

My work with the Lund University Humanities CLARIN Centre is mostly connected to the type of research that relates to work with the DiACL database. However, I’ve involved the computational know-how of the Lund University Humanities CLARIN Centre in other projects where we have used data from other international databases, including a recent project with the NorthEuralex lab, where we find that the stability of sounds over time correlates with preference in first language acquisition.