A Recap on the CLARIN and Libraries Workshop

Submitted by e.gorgaini@uu.nl on 8 September 2022

The CLARIN and Libraries workshop took place at the KB, National Library of the Netherlands, on 9 and 10 May 2022. This was the first workshop with the explicit aim of bringing together the CLARIN community across Europe and research libraries to discuss issues relating to the delivery of digital content for researchers, and to plan practical steps for future collaboration.

There were 30 participants in the workshop, the majority with library-based roles, from 15 different European countries, which led to a stimulating and fruitful discussion. Participants were able to reflect on a number of major library initiatives (past, present and future) involving the delivery of textual content from large text collections. These projects, usually with target audiences of both readers and researchers, are often somewhat disconnected from each other, and also disconnected from research infrastructures.

Bridging Two Cultures

While new research infrastructures for the arts, humanities and social sciences, such as CLARIN and DARIAH, have emerged in recent decades, libraries have for many centuries been the most important resource for researchers, and remain so today in the digital age. For virtual, digital, distributed research infrastructures such as CLARIN to be effective, they need to work closely with libraries, which play key roles as creators and curators of digital data, and as intermediaries between researchers and digital data, tools and expertise.

Although existing collaborations have already broken down some of the separation between the new and old infrastructures, it was acknowledged that there are different communities of practice, used to working with different datasets, software environments and tools, and to using different methods. The workshop explored a number of past, current and future initiatives to overcome these barriers.

Nederlab

Hennie Brugman presented the Nederlab research portal, which offers a platform where researchers can access library textual data and perform a number of operations on it, and which was developed in collaboration with CLARIAH-NL. The platform offers access to digital Dutch historical text collections that are aggregated, harmonised and collectively made searchable and analysable. The project to develop Nederlab ended in 2018, and although it remains live and is regularly used by scholars, there is no continued software development, and there are limited updates to collections. Nevertheless, the scope is impressive, with 24 collections, 19 billion words and almost 100 different annotation layers.

The lessons learned from the project to build and deliver the portal included the necessity of keeping a firm grip on the difference between running a project and a service, keeping know-how on board, the importance of delivering enrichments to collections back to the providers, and offering services that users want, including direct access to text files, access via APIs, and flexible ways of segmenting documents.

Over the period in which Nederlab was developed, and since, the architects detected a shift of emphasis among researchers, from wanting interactive research environments to needing online accessible data and an ‘IIIF for text’: effective, flexible and robust ways to reference and use fragments of text, in the same way that the International Image Interoperability Framework (IIIF) makes this possible for images (including images of text on pages).

Text+

Peter Leinen from the German National Library presented Text+, a new German research data activity which is being developed in a major project involving a range of partners from academia, including CLARIN and DARIAH, and infrastructure institutions, including libraries. Text+ is a part of the National Research Data Infrastructure. The aim is to build a research data infrastructure focused on language and text data, for a wide range of disciplines in the humanities and social sciences. The data which Text+ aims to deliver includes not only collections of historical texts, but also contemporary language corpora, lexical resources, and digital editions. Text+ will not be just a network of repositories, but will offer a comprehensive support infrastructure for all issues regarding collections, including interfaces, standards, authority data and long-term preservation.

Access to Data for Researchers at a National Library

The KB, National Library of the Netherlands, has a relatively long history of offering online data services, which has included access to datasets of historical printed books, newspapers, mediaeval manuscripts, transcripts of radio news, and parliamentary papers. This history spans more than 30 years of digitisation and 10 years of providing interfaces to support distant reading, including projects such as Delpher, KB Lab Datasets and Linked Open Data. These initiatives have resulted in data-driven humanities research projects, and also in the development of new research tools and environments.

Looking to the future, the KB is seeking more and better ways to make data available to researchers via a variety of routes, with activities such as the FAIR@KB manifesto and the CLARIAH FAIR dataset register, a new text and data mining room, and plans for developing a text suite for corpus selection operations and a tools-to-data solution for in-copyright collections.

Linking with Cultural Heritage

A current project in Belgium, entitled DATA-KBR-BE, is an interdisciplinary collaboration between cultural heritage experts, digital humanities researchers and data scientists. It is highly relevant in this context, as it addresses many of the same issues highlighted by the other projects. The project is taking place in collaboration with the DARIAH and CLARIN consortia in Flanders and Belgium, and builds on much recent and ongoing work in DARIAH, and in the digital humanities community more widely, relating to the topic of ‘collections as data’. The vision behind DATA-KBR-BE of the optimal set of conditions for the proper exploitation of collections by researchers was an important point of reference for the discussion in the KB workshop. DATA-KBR-BE will offer data-level access to digitised collections for digital humanities research.

Unlocking Digital Texts

Neil Jefferies (Bodleian Libraries, University of Oxford) presented Unlocking Digital Texts, a new collaboration between the Universities of Cambridge (UK), Oxford (UK), and Notre Dame (USA), with contributions from other institutions, and part of the AHRC/NEH New Directions in Digital Scholarship in Cultural Institutions programme. The project aims to make it easier to use a variety of textual formats as data in research, and will develop outline standards, prototypes, and proofs-of-concept, emulating the approach used with IIIF. It will build on existing standards and technologies (such as Text Encoding Initiative XML, IIIF, and the Oxford Common File Layout), rather than creating new formats or specific code dependencies. The project has links to Text+ and Nederlab, and is looking for further collaboration and knowledge exchange opportunities.

The workshop also reflected on the digital libraries landscape and differing levels of ongoing collaboration with CLARIN in Bulgaria, Czechia, Finland, Lithuania, Norway, Poland and Sweden.

Next Steps

An initial list of possible areas for collaboration included sharing CLARIN technologies in areas such as:

interactive online corpus linguistics platforms, many now curated and developed by CLARIN centres, e.g. Korp, Corpuscle
linguistic annotation of texts to enable more effective search
higher-level processing of texts, e.g. stylometry, named entity recognition
platforms to connect tools and texts to each other and in processing pipelines, via services such as the Language Resource Switchboard and WebLicht.

Discussion at the workshop identified further areas where more collaboration could be useful, which included:

making use of the libraries’ role in providing front-line research support embedded in universities and research institutions
working to overcome barriers presented by copyright and other legal and ethical restrictions on the use of digital texts
other parts of the research life-cycle: technologies, formats and tools in the digitisation and representation of texts.

The discussion started at the workshop will undoubtedly be continued in new projects such as Text+ and DATA-KBR-BE, via existing forums such as the Conference of European National Librarians, and in emerging initiatives such as the European data space for cultural heritage.

The organisers were very happy to have taken part in CLARIN’s first post-pandemic international in-person gathering, and to have had the opportunity once again to meet old friends and make new ones, after such a long suspension of normal social activity. We look forward to more!

More details of the event, including the slides from the presentations, are available on the event page.