Tour de CLARIN: Interview with Michaela Mahlberg

Submitted by Jakob Lenardič on 29 October 2020

Michaela Mahlberg is Chair of Corpus Linguistics at the Department of English Language and Linguistics at the University of Birmingham. She is Principal Investigator of the project that has developed the CLiC search system, which is one of the flagship resources of CLARIN-UK.

1. Please introduce yourself. What is your research background, and what inspired you to approach literature from a corpus stylistic perspective, both in research and teaching?

I did a degree in English and Mathematics at Bonn University, and then went on to complete a PhD in English Linguistics at the University of Saarbrücken. I got into using corpus linguistics for the study of literature through my interest in Charles Dickens, my favourite author. From a corpus linguistics point of view, Dickens is quite fascinating, too. He is a master of using language to great effect, and his use of repetition in particular has often been commented on. Repetition, and hence frequency, is obviously right up our street as corpus linguists. Dickens also seems to have been especially aware of the typical language use and patterns that are common in the language in general. An observation in David Copperfield almost sounds like a corpus linguistic comment: “conventional phrases are a sort of fireworks, easily let off, and liable to take a great variety of shapes and colours not at all suggested by their original form”. In practical terms, an important catalyst for my focus on literature was a workshop that Martin Wynne, who is now the coordinator of CLARIN-UK, organized in Oxford – more than 14 years ago – to look at “Corpus Approaches to the Language of Literature”. This opportunity really made me think through the fundamental principles of what it means to study literature with corpus methods. These kinds of questions have kept me busy ever since!

2. What were the biggest bottlenecks in corpus-assisted approaches to study literary texts when you first started, and how has the field evolved since?

Examples from literary texts have always been used in corpus linguistics, but initially mainly as examples – to illustrate, for instance, the difference between types and tokens – rather than with a focus on the literariness of these texts. Equally, general reference corpora would typically contain samples from fiction – not necessarily full texts but only text samples. One reason for text samples can be copyright restrictions, but another was the belief that text samples are sufficient to study the phenomena that are of interest to corpus linguists. Overall, fiction was mainly treated as a register to compare other registers to. Notably, researchers from literary stylistics have contributed to demonstrating that literary qualities and features of individual texts are worth investigating with corpus methods. There is now also increasing interest from literary scholars. This exchange across disciplines is rather important, so that corpus linguists can demonstrate the benefits of methods but also learn about how literature is approached in other fields.

3. What are the main challenges today?

Today the key challenge is to bring developments in corpus linguistics and Digital Humanities closer together. It is amazing how much the two fields share, yet researchers in each seem largely unaware of these similarities. If you look at the research literature, there is very little cross-referencing, and separate terminology is being developed that veils methodological similarities, especially now that technology is developing much faster than it has ever done.

4. Could you briefly introduce the Corpus Linguistics in Context (CLiC) search system, which is one of the flagship resources of CLARIN-UK? What distinguishes CLiC from other well-known corpus concordancers?

CLiC was designed with users in mind who want to focus on the literary properties of texts, so that the concordance function can also be seen as an aid for close reading and engagement with a text. We aimed to consider how literary scholars, English teachers or pupils in schools might find it useful to draw on standard corpus methods. So for CLiC, the text view, which shows how the search result appears in the running text, is as important as the concordance view. Moreover, for the study of concordances there is a KWICgrouper function to help users sort concordance lines according to specific context words. It is further possible to add user-defined “tags” to a concordance analysis to support the classification of lines in an easy format. The most distinctive feature of CLiC, however, is that the corpora it accesses have been annotated for direct speech and narration so that concordance searches can be run for specific subsections of texts. The main distinction is between “quotes”, i.e. text within quotation marks, and “non-quotes”, i.e. text outside of quotation marks, which roughly equates to direct speech and narration – as the corpora mainly contain fiction from the 19th century, where direct speech still tends to be prevalent. It is also possible to focus searches on “suspensions”, i.e. stretches of narration that interrupt the speech of characters. The ability to search fiction in this way is crucial to study textual features that are characteristic of this specific register. In fiction, different discourse levels come together. If the voice of the narrator and the speech of fictional characters are just treated the same in the textual analysis, important information will be missed.
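To make the quote/non-quote distinction concrete, the following is a minimal Python sketch of the general idea, not CLiC's actual code or API. It assumes that direct speech is marked with curly quotation marks, and the function names `split_quotes` and `kwic`, the sample sentence and the context window are all hypothetical choices made for illustration. CLiC's own annotation is more elaborate (it also identifies suspensions, for instance).

```python
import re

# Assumption: direct speech is enclosed in curly double quotation marks.
QUOTE_RE = re.compile(r"“[^”]*”")


def split_quotes(text):
    """Split text into ('quote' | 'non-quote', segment) pairs."""
    segments = []
    pos = 0
    for match in QUOTE_RE.finditer(text):
        if match.start() > pos:
            segments.append(("non-quote", text[pos:match.start()]))
        segments.append(("quote", match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append(("non-quote", text[pos:]))
    return segments


def kwic(segments, keyword, subset, window=3):
    """Return (left, node, right) concordance lines for one subset only."""
    lines = []
    for label, segment in segments:
        if label != subset:
            continue
        tokens = re.findall(r"\w+", segment.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append((left, token, right))
    return lines


sample = (
    "“I hope you are well,” said the old man, looking well pleased. "
    "“Very well indeed,” she replied."
)
segments = split_quotes(sample)
quote_hits = kwic(segments, "well", "quote")
narration_hits = kwic(segments, "well", "non-quote")
```

Here `quote_hits` contains only the occurrences of "well" spoken by characters, while `narration_hits` contains only the narrator's use of the word, so the two discourse levels can be examined separately.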

5. And how does CLiC serve the academic community from a research infrastructure perspective?

The CLiC web application is freely accessible – without a need to log in. Importantly, we made the corpora as well as the code for CLiC openly available through GitHub. We have also created extensive documentation to further support open research and reproducibility. A good example of the effectiveness of this approach is the way in which the Corpus of African American Writers 1892–1912 (AAW) came to be added to the CLiC corpora. Claiborne Rice and Nicholas J. Rosato from the University of Louisiana at Lafayette had come across our documentation for the quote and non-quote annotation and compiled this corpus, which we then jointly incorporated into CLiC.

6. In your recent work, you and your colleagues have used CLiC to study speech-bundles in 19th century English novels, primarily Charles Dickens’s novels. How did you use CLiC to extract and study the speech bundles? Could you briefly present the main aims of the research, and how does fictional speech in the CLiC corpora relate to real spoken language?

In our study, we generated 5-grams for the quotes subcorpus of the Dickens novels corpus. Lexical bundles are n-grams that are highly frequent, as determined by specified frequency thresholds. So far, lexical bundles in fiction have mainly been looked at to compare fiction as a register against other registers. In such comparisons, lexical bundles are considered across the whole texts. In our study, we specifically focus on frequent 5-grams in the quotes subcorpus. We compare Dickens to other 19th century fiction, as well as to the BNC1994, which is also available through the Oxford Text Archive CLARIN-UK repository. This allows us to identify fictional speech-bundles, i.e. bundles that have particular functions in creating fictional worlds. Such bundles include those that generally appear in fictional speech across a range of authors, as well as bundles that reflect idiosyncratic authorial features. Most interesting are the bundles that are shared between fictional and real speech. It has been a common view that fictional speech is really rather different from real spoken language. But speech-bundles show that both in fiction and real life people say things like “it seems to me that”, “what do you think of” or “and all that sort of”. Such phrases probably receive less attention in the study of literature precisely because they are generally common in everyday speech.
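The notion of a lexical bundle as an n-gram that passes a frequency threshold can be illustrated with a short Python sketch. This is not the procedure used in the study: the function names `ngrams` and `lexical_bundles`, the raw-count threshold `min_freq`, the simple tokenisation and the three invented quotes are all assumptions made for illustration. The sketch counts 5-grams within each quote separately, so that no n-gram crosses a quote boundary.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(quotes, n=5, min_freq=2):
    """Count n-grams across quotes and keep those at or above min_freq.

    min_freq is a hypothetical raw-count threshold chosen for this toy
    example; real studies typically set thresholds relative to corpus size.
    """
    counts = Counter()
    for quote in quotes:
        tokens = re.findall(r"[a-z']+", quote.lower())
        counts.update(ngrams(tokens, n))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}


# Invented stand-ins for text extracted from a quotes subcorpus.
quotes = [
    "It seems to me that you are mistaken",
    "Well, it seems to me that he is right",
    "What do you think of it",
]
bundles = lexical_bundles(quotes, n=5, min_freq=2)
```

With these three invented quotes, the only 5-gram that occurs at least twice is "it seems to me that", so it is the only bundle returned. Running the same function over a corpus of real spoken language would make it possible to check which bundles the two share.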

7. Why is it important to contrastively compare fictional corpora with non-fictional corpora in the context of a Digital Humanities approach to literary theory?

Patterns that are shared across the language of fiction and non-fiction are an important pointer to the link between fiction and the real world. This link works in two ways. For instance, when fictional people use phrases like real people do, these phrases trigger readerly responses that draw on the reader’s linguistic background knowledge of how people speak. So information is read into a character. On the other hand, fiction can portray real world phenomena in specific ways and it can equally affect the reader’s perception of the real world. Approaches in Digital Humanities that focus on the study of literary and cultural history provide such big picture views of fictional worlds. Such approaches are very similar to corpus studies of non-fiction texts that identify specific discourses.

8. How is CLiC supporting the new generations of researchers, and how is it advancing the state of the art in research methods? What in your opinion are the key next steps for the CLARIN research infrastructure in the coming years in order to best serve its user base?

In addition to supporting open research and replicability, as I described above, we have also been running a CLiC blog for a couple of years now. On this blog, researchers and educators present examples of how they use CLiC or discuss related topics, methods and resources. The aim of the blog is to facilitate the sharing of ideas as well as encourage new ways of thinking about digital resources. Most contributors are early career researchers, but we also get blog posts from teachers – who in a way thus start very early with supporting new generations of researchers.

CLARIN already provides amazing resources and opportunities. I find the Federated Content Search – the ability to search across a range of CLARIN resources at the same time – particularly fascinating! In the coming years, CLARIN can probably achieve even more in driving forward standards of interoperability. There is also still a lot of potential to set agendas with funders and support them in understanding the infrastructural needs of digital projects, as well as what’s needed to ensure the sustainability of infrastructures once a funded period ends.