Written by Twan Goosen (Software developer, CLARIN ERIC)
This post was published earlier on Europeana Pro
In 2017, CLARIN carried out a pilot exploring the possibilities of integrating Europeana Collections’ material into its infrastructure and thus opening up new possibilities for the discovery and linguistic processing of textual cultural heritage content for a social sciences and humanities research audience. This integration is now entering a new stage, offering improved quality and increased processing potential.
Picture: [Fàbrica Gròber] - Thomas Bigas, Josep, 1910/1920, Ajuntament de Girona, Spain, Public Domain
Books, manuscripts, historical newspapers and many other kinds of textual cultural heritage objects (CHOs) provide valuable input for a broad range of research topics. The mission of CLARIN is to make digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines. As partners in the Europeana Digital Service Infrastructure (DSI), Europeana and CLARIN have worked together to embed cultural heritage material into CLARIN’s infrastructure. Based on the experience gained during the pilot and building on improved dissemination services and metadata quality offered by Europeana, CLARIN recently carried out a new evaluation of the available datasets and made a new selection. The selection process focused on full-text content such as digitised books, periodicals and newspapers with textual content obtained through optical character recognition (OCR). Other types of objects that were also considered are high-resolution scans of manuscripts and speech audio. In order to qualify, resources had to be directly available in their raw form and have no legal restrictions on reuse. Currently, 22 collections containing about 135,000 cultural heritage objects have been identified as meeting these criteria.
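The qualification criteria described above can be read as a simple filter. The following is a minimal sketch in Python, not CLARIN's actual selection code: the `CulturalHeritageObject` record, its field names and the `kind` labels are all hypothetical and chosen only to mirror the criteria named in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


# Hypothetical record type: the field names are illustrative and do not
# reflect the Europeana or CLARIN metadata schemas.
@dataclass
class CulturalHeritageObject:
    identifier: str
    kind: str                 # e.g. "full_text", "manuscript_scan", "speech_audio"
    directly_available: bool  # the raw resource can be retrieved directly
    reuse_restricted: bool    # True if legal restrictions on reuse apply


# Object types considered in the selection, following the description above.
ELIGIBLE_KINDS = {"full_text", "manuscript_scan", "speech_audio"}


def qualifies(cho: CulturalHeritageObject) -> bool:
    """Return True if the object meets all three criteria from the text."""
    return (
        cho.kind in ELIGIBLE_KINDS
        and cho.directly_available
        and not cho.reuse_restricted
    )


def select(chos: Iterable[CulturalHeritageObject]) -> List[CulturalHeritageObject]:
    """Keep only the qualifying objects, preserving their input order."""
    return [cho for cho in chos if qualifies(cho)]
```

In practice such checks would have to be made against the actual metadata of each collection, which is considerably less uniform than this sketch suggests.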
Connected tools for seamless processing
After finalising the selection, CLARIN set up a mechanism for regular retrieval of metadata for the selected collections. Once retrieved, the metadata is ingested into CLARIN’s language resource catalogue, the Virtual Language Observatory (VLO).
Straightaway, we can see that the newly introduced resources contribute substantially to the number of relevant search results for certain queries. For example, in a search for Slovenian text resources, almost all of the 73,000+ results originate from a Europeana data provider, in this case the Digital Library of Slovenia. Similarly, the availability of Hungarian and Polish text resources has been greatly enhanced.
As well as offering researchers a familiar way of discovering cultural heritage objects relevant to their research, the VLO also provides a direct path to analysis of discovered resources. For example, this 18th century pamphlet, offered as a PDF with embedded full text content by the Irish Manuscripts Commission and the Oireachtas Library, can now be found via the VLO.
By going to the Resources view and selecting the ‘Process with the Language Resource Switchboard’ option, you see a list of invokable tools, nine at the time of writing. Among the options are grammatical analysis through the WebLicht Dependency Parsing chain and the Voyant suite for computer-assisted text analysis. Note that, although the Language Resource Switchboard (LRS) can be invoked for any resource, it does not have linked tools for all languages or resource types, and that a file size limitation applies in the current version. An upcoming version will see this limitation lifted.
Picture: A resource and two examples of processing. Left: First page of the pamphlet 'To all the good people of Ireland, friendly and seasonable advice'. The Oireachtas Library & Research Service, Ireland, Public Domain; Top-right: partial view of WebLicht Easy Chain for Dependency Parsing output; Bottom-right: Relative frequency of the terms 'evil', 'friend', and 'good' as plotted by the Voyant tool for distant reading. Screenshots by the author.
Newly integrated content will further fulfil the potential
Now that production-quality integration of a sizeable selection of good-quality, well-described resources has been achieved, we can see the contours of what such integration could offer on a larger scale. Current efforts to make full-text content available for large collections of digitised newspapers in the Europeana Newspapers project make it likely that this potential will be further fulfilled at a substantial scale in the near future. Furthermore, CLARIN will proceed to evaluate additional collections beyond the ‘low-hanging fruit’ and aim to keep expanding the volume of cultural heritage resources at researchers’ fingertips.
Search, find and process full-text cultural heritage resources with the VLO now!
If you are curious about the collections available in the Virtual Language Observatory and would like to find out what tools are available for processing them, simply go to vlo.clarin.eu, enter some search terms and start exploring.