Tour de CLARIN: Interview with Thomas Gaillat

Submitted by Jakob Lenardič on 19 October 2021

This interview is with Thomas Gaillat, who is a lecturer in corpus linguistics and a teacher of English for specific purposes. He is part of the steering committee of the French CORLI K-centre.

Interview by Jakob Lenardič

Could you please introduce yourself – your background and current academic position?

Today I work as a lecturer in corpus linguistics and as a teacher of English for Specific Purposes (ESP) at Rennes 2 University in France. These two roles illustrate my deep interest in understanding how people acquire a foreign language. I have been teaching English for 20 years at all levels of the curriculum, from children to university students. I have always wondered what could make a teaching method better, and research naturally caught my interest. I love the blend between the practitioner's experience and the distant, objective analysis of the researcher.

In 2004, I joined the University of Rennes 1 language department as an ESP teacher, where I also acted as director for two years. This position gave me the opportunity to enter the academic world and understand the linguistic needs and interests of learners of English specialised in other disciplines. I got involved in the design of online course material. I also took charge of a master’s class in the design of language learning online courses. Through these activities I felt the need to go further in the analysis of learner language. My initial experience in the computer industry, back in the nineties, pushed me towards computer methods applied to language learning. And so I embarked on the doctoral odyssey.

I received my doctorate in 2016 at the University of Sorbonne Paris Cité, awarded summa cum laude. The thesis focused on corpus interoperability as a method to explore how this, that and it, as referential forms, are used by learners of English. As my work intersected the domains of corpus linguistics and Natural Language Processing, it was supervised by Profs. Pascale Sébillot (INSA Rennes), specialized in Natural Language Processing, and Nicolas Ballier (Université de Paris), specialized in Linguistics.

In 2017, I was fortunate to be awarded a postdoctoral position at the Insight Data Centre NUI Galway, Ireland, as part of an H2020 project (SSIX) focused on Sentiment Analysis. As a linguist I joined Brian Davis’s Knowledge Discovery Unit and contributed to the development of an AI-based pipeline dedicated to predicting sentiments in financial tweets. This work involved tight collaboration between statisticians, programmers and linguists. This first-hand experience showed the benefits of combining complementary skills for the design of online systems.

In 2019 I joined the Linguistics and Didactics research team of Rennes 2 University. I am now in a position to leverage my previous experience in order to contribute to the research on data analytics related to language learners.

You represent your lab LIDILE in the CORLI steering committee: why have you decided to join CORLI and how does the centre support your research and that of your colleagues?

CORLI is a French organization that positions corpora at the centre of its activities, very much like our research team. My PhD relied on making several corpora interoperable. Corpora are the fuel for machine learning approaches. Unsupervised and supervised learning methods do not just rely on efficient algorithms but also on linguistic metadata that enrich the source data. In the context of policies fostering AI projects, this message needs to be conveyed and, to me, CORLI fulfils this role in France. I am happy to contribute to this.

What expertise do you and your laboratory bring to the K-centre?

Our lab includes three main research domains, i.e. translation studies, linguistics and foreign language didactics. Most of our projects rely on the development of multilingual corpora. One of our focuses is learner corpora and their exploitation. We have expertise in the many ways corpora can be used i) for language training purposes, ii) as sources in the training of future language teachers and iii) as sources for the design of ICALL systems. These are questions tackled by the corpus community.

We are also in the process of designing a database infrastructure supporting dynamic corpus querying. To this end, we are transferring some of our corpora to the Nakala repository, which is a CLARIN-FR Huma-Num node. This architecture will rely on persistent data, making corpus items queryable and retrievable. Nakala’s APIs will give controlled access to corpus items. Through this experience, we have acquired knowledge regarding architectural design constraints. We have faced issues regarding the classification of corpus files related to specific individuals and their metadata such as L1, L2s or mode. The solutions we found might be of interest to other researchers in the corpus community.

You’ve been involved in the creation of linguistic resources and the development of machine learning methods related to multilingual learner corpora. Could you introduce some that you feel are most noteworthy? What are their main features?

That’s right. Over the course of the last three years I’ve taken part in three projects based on learner corpora.

First, I was the Principal Investigator of a University of Rennes project, which aimed at developing a tool to automatically extract and visualize linguistic profiles in texts written by learners of English. We showed how learner writings in English can be exploited with NLP tools to compute and visualize complexity metrics. An ORTOLANG corpus annotated in terms of the Common European Framework of Reference (CEFR) proficiency was used as a benchmark for comparisons with new learner writings. The system creates visualizations superimposing metric values and proficiency levels. This form of feedback aims at giving teachers and learners tools that provide objective measurements of specific linguistic features.
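To make the idea of complexity metrics concrete, here is a minimal Python sketch. It is only an illustration: the three surface measures, the function name `complexity_metrics` and the sample text are assumptions chosen for this example, not the metrics or code used in the project.

```python
import re

def complexity_metrics(text):
    """Compute a few simple surface complexity metrics for a learner text.

    The metric choice is illustrative (an assumption); real systems
    typically use many more measures computed with NLP tools.
    """
    # Naive sentence split on end punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenisation: lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    n_tokens = len(tokens)
    return {
        "tokens": n_tokens,
        "type_token_ratio": len(types) / n_tokens if n_tokens else 0.0,
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "mean_word_length": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
    }

# Hypothetical learner text, invented for the example.
sample = "I like my school. My school is big. I go to school every day."
metrics = complexity_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Values like these could then be plotted against the average values observed at each CEFR level in a benchmark corpus, which is the kind of superimposed view described above.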

Secondly, I played an active role in a PHC Ulysses Franco-Irish programme between the universities of Paris-Diderot and Insight NUI Galway. The project, led by Nicolas Ballier (France) and Manel Zarrouk (Ireland), investigated the use of linguistic complexity metrics for the automatic detection of proficiency levels in English writing. We implemented a supervised learning approach as part of an Automatic Essay Scoring system. The objective was to uncover CEFR-criterial features in texts written by learners of English as a foreign language. The method relied on the concept of microsystems with features related to learner-specific linguistic systems in which several forms operate paradigmatically. Based on a dataset extracted from the EF-CAMDAT corpus, we trained a multinomial logistic regression model and an elastic net model. Evaluation was conducted on an internal test set as well as an external data set extracted from the ASAG corpus. The results showed that microsystems combined with other features help with proficiency prediction.
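For readers unfamiliar with these models, the following Python sketch shows what a multinomial logistic regression with an elastic net penalty looks like on a toy scale. It is a sketch under stated assumptions, not the project's implementation: the two features, the nine data points, the three levels and all hyperparameters are invented, and combining both techniques in a single model is a simplification made for this example.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, biases, x):
    """Probability of each class for one feature vector x."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)

def train(X, y, n_classes, lam=0.01, l1_ratio=0.5, lr=0.5, epochs=2000):
    """Batch gradient descent on cross-entropy plus an elastic net penalty.

    The penalty mixes L1 (subgradient: sign of the weight) and L2
    (gradient: the weight itself); biases are left unpenalised.
    """
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err / len(X)
                for j in range(n_features):
                    grad_w[k][j] += err * x[j] / len(X)
        for k in range(n_classes):
            biases[k] -= lr * grad_b[k]
            for j in range(n_features):
                w = weights[k][j]
                sign = (w > 0) - (w < 0)
                penalty = lam * (l1_ratio * sign + (1 - l1_ratio) * w)
                weights[k][j] -= lr * (grad_w[k][j] + penalty)
    return weights, biases

# Invented, standardised toy features (say, two complexity metrics);
# labels 0, 1, 2 stand for three hypothetical proficiency levels.
X = [[-1.0, -1.0], [-1.2, -0.8], [-0.9, -1.1],
     [0.0, 0.1], [0.1, -0.1], [-0.1, 0.0],
     [1.0, 1.1], [1.2, 0.9], [0.9, 1.0]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

weights, biases = train(X, y, n_classes=3)
predictions = []
for x in X:
    probs = predict_proba(weights, biases, x)
    predictions.append(probs.index(max(probs)))
print(predictions)
```

In practice one would use an established statistics or machine learning library and evaluate on held-out data, as the project did with its internal and external test sets.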

Finally, the Corpus InterLangue (CIL) is a resource that has long been developed in our team here in Rennes. It was initiated more than ten years ago and now includes more than a hundred audio recordings and writings from learners of French or English. As explained previously, we are going to make this corpus available online. What’s more, it will be accompanied by a series of R scripts forming a pipeline allowing for data curation, data set creation, and data visualization. The idea is to have a modular approach for flexible data sets depending on the research questions.

How are such resources and methods related to CORLI, i.e. has the centre been involved in their development? If so, how?

By nature, these resources are related to CORLI. The corpus which was developed for the first project is available on ORTOLANG and CORLI provides an inventory of all language-corpus resources available in French labs.

We also intend to have the CIL corpus referenced by CORLI via the resources page of the website. The R scripts will also be referenced as part of the tools section of the website.

Aside from your developmental work, you have also conducted your own qualitative research on the basis of L2-learner corpora (for instance, the paper “A multifactorial analysis of this, that and it proforms in anaphoric constructions in learner English”). Could you describe such work? What were the main findings/results?

The main focus of this work is to better understand Interlanguage. This concept, introduced by Larry Selinker in 1972, hypothesizes the existence of developmental patterns and learning stages in foreign language learning. The idea is to use corpus-based analyses to identify some of these stages and patterns. The publication you mention is one of the outcomes of my PhD, in which I showed that learners have trouble using this, that and it in English. Based on comparisons between two learner corpora (of different L1s) and one native English corpus, I developed different types of models showing the probabilities of use of a form compared with its two competitors. The results showed evidence of the existence of a paradigmatic microsystem in which the forms compete functionally as proforms. This has implications for language teaching, as most of the time this and that are taught as a binary opposition, i.e. the spatial distinction (nearness vs. farness). And yet learners also hesitate with it in some contexts. This shows that the forms’ anaphoric value plays an important role and that learners tend to be confused when referring to discourse entities. Teaching materials thus need to adapt to this evidence.

Taking a broader view, the concept of microsystems was initially theorized in the late seventies by two linguists – Yves Gentilhomme and Bernard Py. They showed that learners use such forms in an unstable manner. Learners group forms unexpectedly and the delineations of the groupings evolve. By using corpora we are in a position to explore and capture many microsystems. For instance, as an English teacher I see students hesitate between may, might and can. By applying modelling methods, it is possible to compute the probabilities of use of each of these modals depending on contexts and proficiency levels. This can help to diagnose what learners need to adjust. And to do this, we need annotated corpora that include rich linguistic features.
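As a minimal illustration of the modal example, the Python sketch below estimates the probability of use of may, might and can at each proficiency level from relative frequencies. The observations, the level labels and the function name `usage_probabilities` are invented for this example. It is the simplest possible estimate and ignores context, which a full model would include as predictors.

```python
from collections import Counter, defaultdict

# Hypothetical annotated occurrences: (proficiency level, modal used).
observations = [
    ("A2", "can"), ("A2", "can"), ("A2", "can"), ("A2", "may"),
    ("B1", "can"), ("B1", "can"), ("B1", "may"), ("B1", "might"),
    ("B2", "can"), ("B2", "may"), ("B2", "might"), ("B2", "might"),
]
FORMS = ("may", "might", "can")

def usage_probabilities(observations, forms):
    """Estimate P(form | level) as a relative frequency within each level."""
    counts = defaultdict(Counter)
    for level, form in observations:
        if form in forms:  # ignore forms outside the microsystem
            counts[level][form] += 1
    probabilities = {}
    for level, form_counts in counts.items():
        total = sum(form_counts.values())
        probabilities[level] = {f: form_counts[f] / total for f in forms}
    return probabilities

probs = usage_probabilities(observations, FORMS)
for level in sorted(probs):
    print(level, probs[level])
```

Comparing such distributions across levels, or against a native reference corpus, gives a first indication of which form learners over- or under-use.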

The broader horizon would be the establishment of what Sylvain Auroux calls the third revolution of “grammatization”. Based on quantitative methods applied to linguistics, probabilistic models will be trained on rich linguistic data sets. The models will encapsulate the grammar of a language, and this will support AI applications.