
Using large-scale text collections for research


NeDiMAH, the Network for Digital Methods in the Arts and Humanities, recently organised a workshop in Würzburg on using large-scale text collections for research. Martin Wynne reports.

I had the opportunity to give a short introduction to some aspects of my interest in this topic. I outlined some of the current problems: the fragmentation of available resources across different digital silos, with a variety of barriers to their combination and use, and the lack of easily available tools for the textual analysis of standardized online resources. I also briefly referred to the plans of the CLARIN research infrastructure to address some of these problems.

Christian Thomas explained how the Deutsches Textarchiv (DTA) is facilitating research with large-scale historical German text collections. The DTA has funding for 2007-15, and now includes resources of more than 200 million words from the period 1600 to 1900. Page images and text are available, and automatic linguistic analysis is possible. The DTA is a CLARIN-D service centre. Integration into the CLARIN infrastructure means that resources can be discovered via the Virtual Language Observatory, searched via the Federated Content Search, and analysed and processed via WebLicht workflows. The DTA also contributes to discipline-specific working groups as part of its outreach and dissemination strategy. The majority of texts are keyed in (see more at http://www.deutschestextarchiv.de/dtaq/stat/digimethod). The workflow for OCR texts is interesting: structural markup is added to the electronic text (using a subset of TEI P5), and then OCR errors are corrected. They find that it is easier to identify and correct errors in structured text. The Cascaded Analysis Broker provides a normalization of historical forms to allow orthography-independent and lemma-based corpus searches, and this is integrated into the DTAQ quality assurance platform. Christian's slides can be found here.
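To make the idea of orthography-independent, lemma-based searching concrete, here is a minimal Python sketch of the general principle, not the DTA's actual Cascaded Analysis Broker: historical spellings are mapped to modernised forms before lemma lookup, so a search by lemma also matches historical variants. The mapping table and lemma list are invented placeholders for illustration only.

```python
# Minimal sketch (not the actual Cascaded Analysis Broker): map historical
# German spellings to modernised forms so that a lemma-based search also
# matches orthographic variants. The tables below are invented placeholders.

HISTORICAL_TO_MODERN = {
    "seyn": "sein",
    "thun": "tun",
    "giebt": "gibt",
}

LEMMAS = {
    "sein": "sein", "ist": "sein", "war": "sein",
    "tun": "tun", "tut": "tun",
    "gibt": "geben", "geben": "geben",
}

def normalise(token: str) -> str:
    """Replace a historical spelling with its modern equivalent, if known."""
    return HISTORICAL_TO_MODERN.get(token, token)

def lemma(token: str) -> str:
    """Look up the lemma of a (modernised) token; fall back to the token itself."""
    modern = normalise(token)
    return LEMMAS.get(modern, modern)

def lemma_search(tokens: list[str], query_lemma: str) -> list[int]:
    """Return positions of all tokens whose lemma matches the query."""
    return [i for i, t in enumerate(tokens) if lemma(t) == query_lemma]

print(lemma_search(["Es", "giebt", "nichts", "zu", "thun"], "geben"))  # [1]
```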

The DTA is also a key partner in the Digitales Wörterbuch der deutschen Sprache (DWDS), an excellent and very well implemented concept which allows cross-searching of resources held at different centres. This offers a view of the future of corpus linguistics and the study of historical texts online.

Jan Rybicki from the Jagiellonian University in Kraków told us about a benchmark English corpus to compare the success or failure of stylometric tools. There was a very interesting discussion of the idea of how to build representative and comparable literary corpora, which put me in mind of the work of Gideon Toury in descriptive translation studies. There was also discussion of a possible project to build comparable benchmark corpora for multiple European literary traditions.

René van Stipriaan (Nederlab) outlined the background: the study of history in the Netherlands is characterised by a fragmented environment of improvised resources. The Nederlab project is funded by NWO from 2013 to 2017 to address the integration of historical textual resources for research. Some very interesting statistics were presented: for the period up to the end of the twentieth century there are 500 million surviving pages printed in Dutch, and 70 million of these are digitized, but only 5-10 million have good quality text – most are rather poor quality OCR. Nederlab brings together linguists, literary scholars and historians, and integrated access to resources will go online in the summer of 2015.

Allen Riddell from Dartmouth College in the US took an interesting and highly principled approach to building a representative literary corpus. He randomly selected works from bibliographic indexes, then went out and found the works, scanning them if necessary. This seems to me to be a positive step, in contrast to the usual, rather more opportunistic approach of basing corpus composition on the most easily available texts. The approach to correcting the OCR text was also innovative and interesting – he used Amazon Mechanical Turk. Allen also referred to a paper on this topic at http://journal.code4lib.org/articles/6004. This also raised an interesting question: can a randomly selected corpus be representative, or do we need more manual intervention in selection (at the risk of personal bias)?
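As an illustration of the sampling step, here is a minimal sketch assuming a hypothetical CSV bibliographic index; it is not Riddell's actual code or data.

```python
# Minimal sketch: draw a reproducible random sample of entries from a
# bibliographic index. The index file and field names are hypothetical.

import csv
import random

def sample_titles(index_path: str, n: int, seed: int = 42) -> list[dict]:
    """Read a bibliographic index (CSV) and return n randomly chosen entries."""
    with open(index_path, newline="", encoding="utf-8") as f:
        entries = list(csv.DictReader(f))
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(entries, k=min(n, len(entries)))

# Hypothetical usage: each sampled entry then has to be located and, if no
# digital text exists, scanned and OCR-corrected.
# for entry in sample_titles("bibliography_1890s_novels.csv", n=100):
#     print(entry["author"], entry["title"])
```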

Tom van Nuenen from Tilburg University described how he scraped professional travel blogs from a Dutch site and has started to analyse their language. Puck Wildschut from Radboud University Nijmegen described the early stages of her PhD work comparing Nabokov novels using a mixture of corpus and cognitive stylistic approaches.
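For readers unfamiliar with this kind of data collection, the following is a minimal sketch of scraping blog-post text for later linguistic analysis; the URL and HTML structure are invented, and the talk did not describe the actual site or tooling.

```python
# Minimal sketch: download one blog post and keep only its visible paragraph
# text for later linguistic analysis. URL and page structure are invented.

import requests
from bs4 import BeautifulSoup

def fetch_post_text(url: str) -> str:
    """Download a blog post and return the text of its paragraphs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assume posts keep their body text in <p> tags inside an <article> element.
    article = soup.find("article") or soup
    return "\n".join(p.get_text(strip=True) for p in article.find_all("p"))

# Hypothetical usage:
# text = fetch_post_text("https://example.org/travel-blog/post-1")
# print(len(text.split()), "tokens")
```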

The discussion at the end of the first day focussed on an interesting and important question: how do we make corpus-building more professional? Reusability was seen to be key, and dependent on making sure that data is released in an orderly way, with clear documentation, and under a licence allowing reuse. And since what we are increasingly dealing with is large collections of entire texts (rather than the sampled and truncated smaller corpora of the past), we should ensure that the texts that make up corpora are themselves reusable, so that others can take them to build different ad hoc corpora. This requires metadata at the level of the individual texts, and would be enhanced by the standardization of textual formats.
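As a concrete illustration of text-level metadata, here is a minimal sketch; the field names are illustrative assumptions rather than a proposed standard, and the point is simply that an ad hoc corpus then becomes a filter over well-described texts.

```python
# Minimal sketch of per-text metadata that would let individual texts be
# recombined into ad hoc corpora. Field names are illustrative assumptions.

from dataclasses import dataclass, asdict
import json

@dataclass
class TextRecord:
    identifier: str   # stable ID for citing and retrieving the text
    title: str
    author: str
    date: str         # publication date, ISO 8601 where possible
    language: str     # e.g. ISO 639-1 code
    genre: str
    source: str       # provenance: who digitised it, from which edition
    licence: str      # reuse terms, e.g. "CC BY 4.0"
    encoding: str     # text format, e.g. "TEI P5"

records = [
    TextRecord(
        identifier="example:0001",
        title="An Example Novel",
        author="A. Author",
        date="1854",
        language="en",
        genre="fiction",
        source="keyed from the 1854 first edition",
        licence="CC BY 4.0",
        encoding="TEI P5",
    ),
]

# Building an ad hoc corpus becomes a simple filter over the records.
corpus_1850s_fiction = [r for r in records
                        if r.genre == "fiction" and r.date.startswith("185")]
print(json.dumps([asdict(r) for r in corpus_1850s_fiction], indent=2))
```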

Maciej Eder from the Institute of Polish Studies at the Pedagogical University of Kraków introduced and demonstrated Stylo, a tool for stylometric analysis of texts. In this presentation, and in one on the following day, I found some of the assumptions underlying stylometric research difficult to reconcile with what I think of as interesting and valid research questions in the humanities. How many literary scholars are comfortable with the notion that the frequencies of word tokens, and the co-occurrence of those tokens, give an insight into style? And the conclusion of a stylometric study always seems to be about testing and refining the methods. Conclusions like “stylometric methods are too sensitive to be applied to any big dataset” don’t actually engage with anyone outside of stylometry. Until someone comes up with a conclusion more relevant to textual studies, this is likely to remain a marginal activity, but maybe I’ve missed the point.
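For readers who have not used such tools, the following is a minimal sketch of the kind of measurement stylometry rests on: relative frequencies of the most frequent words, z-scored across texts and compared with a Burrows-style Delta distance. It is not the Stylo package itself (which is written in R), and the toy texts are invented.

```python
# Minimal sketch of a Burrows-style Delta comparison over most-frequent-word
# (MFW) frequencies. Toy texts are invented for illustration.

from collections import Counter
import math

texts = {
    "author_A_1": "the cat sat on the mat and the dog sat too",
    "author_A_2": "the dog and the cat sat on the old mat",
    "author_B_1": "a bird flew over a tree while a fox watched quietly",
}

tokenised = {name: t.lower().split() for name, t in texts.items()}

# 1. Choose the N most frequent words across the whole collection.
overall = Counter(w for toks in tokenised.values() for w in toks)
mfw = [w for w, _ in overall.most_common(10)]

# 2. Relative frequency of each MFW in each text.
rel = {name: {w: toks.count(w) / len(toks) for w in mfw}
       for name, toks in tokenised.items()}

# 3. z-score each word's frequencies across the texts.
def zscores(values):
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / sd for v in values]

names = list(rel)
z = {w: dict(zip(names, zscores([rel[n][w] for n in names]))) for w in mfw}

# 4. Delta: mean absolute difference of z-scores between two texts.
def delta(a, b):
    return sum(abs(z[w][a] - z[w][b]) for w in mfw) / len(mfw)

print(round(delta("author_A_1", "author_A_2"), 3))  # smaller: similar style
print(round(delta("author_A_1", "author_B_1"), 3))  # larger: different style
```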

The focus on looking for and trying to prove differences between the writing of men and women also strikes me as a little odd, and certainly contentious. Why prioritize this particular aspect of variation among writers? Why try to essentialize the differences between men and women, and why not consider other factors? I’d be more interested in an approach which identified stylistic differences and then tried to find what the relevant variables might be, rather than one which starts from the assumption that men and women write differently and tries to “prove” it by looking for differences.

On the second day of the workshop, Florentina Armaselu from the Centre Virtuel de la Connaissance de l’Europe (CVCE) described how they are making TEI versions of official documents on EU integration for research use. I suggested that there might be interesting connections with the Talk of Europe project, which will be seeking to connect together datasets of this type for research use with language technologies and tools.

Karina van Dalen-Oskam from the Huygens Institute in the Netherlands, one of the workshop organisers, introduced the project entitled The Riddle of Literary Quality, which is investigating whether literariness can be identified in distributions of surface linguistic features. The current phase is focussing on lexical and syntactic features which can be identified automatically, although a later phase might investigate harder-to-identify stylistic features, such as speech presentation. In the discussion Maciej Eder suggested that the traces of literariness might reside not in absolute or relative frequencies of features, but in variation from norms (either up or down).
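Eder's suggestion can be illustrated with a minimal sketch: score a text by how far a surface feature deviates (up or down) from a corpus norm, rather than by the raw value of the feature. The feature chosen here (mean sentence length) and the reference values are invented for illustration.

```python
# Minimal sketch: measure a text's deviation from a corpus norm for one
# surface feature. Feature choice and reference values are invented.

import math

def mean_sentence_length(text: str) -> float:
    """Average number of words per sentence, using a crude sentence split."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

# Hypothetical reference corpus: feature values for many "ordinary" texts.
reference_values = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 14.4]
norm = sum(reference_values) / len(reference_values)
sd = math.sqrt(sum((v - norm) ** 2 for v in reference_values) / len(reference_values))

candidate = ("The rain fell. It fell for days on the grey roofs of the "
             "silent town. Nobody spoke.")
deviation = (mean_sentence_length(candidate) - norm) / sd

# A large |deviation| in either direction is what would be flagged,
# not the absolute value of the feature itself.
print(round(deviation, 2))
```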

Gordan Ravancic (Institute of History in Zagreb) joined us via Skype to introduce his project on crime records in Dubrovnik, “Town in Croatian Middle Ages”, which was fascinating, although not clearly linked to the topic of the workshop, as far as I could tell.

Some interesting notions and terminological distinctions were raised in discussions. Maciej Eder suggested that “big data” in textual studies is data where the files can’t be downloaded, examined or verified in any systematic way. This seems like a useful definition, and it immediately raised questions in the following talk. Emma Clarke from Trinity College Dublin presented work on topic modelling. This approach to distant reading can only be used on a corpus that can be downloaded, normalized and categorized, and would be difficult to use on big data as defined by Eder, although it could potentially be used as a discovery tool to explore indeterminate datasets. Christof Schöch from the Computerphilologie group in Würzburg differentiated “smart data” from “big data”, and suggested that smart data is what we mostly want to be working with: data that is cleaned up and normalized to a certain extent, and of known provenance, quality and extent.
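As an illustration of the kind of topic modelling Clarke described, here is a minimal sketch using scikit-learn's LDA implementation; the talk did not specify a toolkit, and the toy documents are invented.

```python
# Minimal sketch of topic modelling on a small, downloadable corpus using
# scikit-learn's LDA. Toy documents are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the ship sailed across the sea to distant ports",
    "sailors feared the storm at sea and turned the ship home",
    "the court debated the new law and the rights of citizens",
    "the judge read the law before the assembled court",
]

# Bag-of-words counts, dropping very common English function words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Fit a two-topic model; real corpora need many more topics and documents.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the most heavily weighted words for each topic.
vocab = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {', '.join(top_words)}")
```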

The workshop concluded with discussions about potential outcomes of this and a previous NeDiMAH workshop. A possible stylometry project to build benchmark text collections and to promote the use of stylometric tools for genre analysis and attribution was outlined, with perhaps the ultimate goal of an ambitious European atlas of the history of the style of fiction. We also discussed the possible publication of a companion to the creation and use of large-scale text collections.

Read more about the workshop on the NeDiMAH webpages at http://www.nedimah.eu/call-for-papers/using-large-scale-text-collections-research-workshop-university-wurzburg-1st-and-2nd.

This blog article first appeared on Martin Wynne's own blog.