The CLARIN Bazaar 2021

Below you will find a list of the virtual stalls that can be visited at the CLARIN Bazaar. Without the need to travel, the CLARIN2021 Bazaar is an even more global event than previously, with many countries and continents represented. Please go and talk to stallholders, see their virtual material, and share your ideas!

Bazaar Presentations (sorted by topical category)

Training and Knowledge Exchange

Put Yourself on the Map: Join the DH Course Registry! (Slides)

Anna Woldrich, Iulianna van der Lek

Are you a student or a graduate seeking to improve your skills in Digital Humanities (DH) abroad?

Are you a lecturer teaching a course in DH or a related field?

Do you want to make your course more visible outside your university network?

Do you want to attract more (foreign) students?

Do you want to shape the landscape of DH teaching in Europe and beyond and put your own activities on the map?

Then come to our stand at the CLARIN Bazaar and get to know the DH Course Registry (https://dhcr.clarin-dariah.eu/), a joint effort of CLARIN and DARIAH-EU, designed to showcase DH classes and encourage enrollment across Europe and beyond.

Increasing the visibility of DH training activities – both on the local and the European level – is of great concern to the DH community, not only in order to attract more students, but also as a way of consolidating DH as an academic discipline.

As traditional academic structures are rather resistant to the inherent interdisciplinarity of DH initiatives, we have to look to other dissemination channels and go beyond barriers to reach out to the public outside an individual university. The DH Course Registry was built for just this purpose.

In addition, we offer a way to export the course data in the DHCR for further analysis, such as diachronic research or custom web and/or data visualization. The ACDH-CH organized a virtual hackathon, funded by CLARIAH-AT, with the data of the DH Course Registry, in which the task was to develop a creative mode of visualizing data and metadata about teaching activities: https://digital-humanities.at/en/dha/s-project/acdh-ch-open-data-virtual-hackathon-round-two

We are happy to tell you more about the DH Course Registry and answer your questions at the CLARIN Bazaar.

DH Course Registry promo video.

How CLARIN Went Virtual: 1.5 Years Online, in Data and Images (Poster)

Francesca Frontini

In March 2020, with the outbreak of the COVID pandemic, CLARIN's activities moved predominantly online. This shift concerned the day-to-day management and internal meetings, but also, and crucially, user involvement activities. During these months, two main conferences, seven workshops and twelve cafés were organised by the central office. Thanks to a new support instrument, CLARIN has also been able to assist researchers in various countries in the organisation of five events, including one large conference (AIUCD 2021, the Italian Digital Humanities conference). In addition to these, a number of project-related training events and webinars were held and various activities were carried out by consortia. This shift is not unique to CLARIN, of course, but it is important to reflect upon what these eighteen months of virtual events have taught us, and what lessons can be learned for the post-COVID future. By visiting our stall you will be able to learn more about the variety of virtual events and activities (conferences, cafés, webinars and tutorials, hackathons), and the possibilities that these new channels have opened up, both in reaching out to new communities and in addressing new topics. Visitors to our Bazaar poster are welcome to share their impressions and thoughts about the advantages and disadvantages of virtual user involvement events, as well as about the new formats and solutions that are emerging. Suggestions of topics, models and tools for future initiatives are also particularly welcome.

UPSKILLS: Towards New Teaching Resources for a New Professional Profile (Slides)

Maja Miličević Petrović

In September 2020, an Erasmus+ strategic partnership called 'UPgrading the SKIlls of Linguistics and Language Students' (UPSKILLS) was started. The partners in the project are the University of Malta (coordinator), University of Belgrade, University of Bologna, University of Graz, University of Rijeka and CLARIN ERIC, with the Universities of Zurich and Geneva as key associate partners. The core objective of UPSKILLS is to identify and tackle the gaps between the knowledge and skills linguistics and language students typically acquire at universities, and those commonly required by the labour market. The 'identifying' part has been completed, through a comprehensive needs analysis comprising curricula surveys, the study of a corpus of job adverts, and interviews with representatives of the language industry. The 'tackling' part will involve the development of a new curriculum component and supporting materials ready to be embedded in existing courses and degree programmes. In the UPSKILLS stall, I will present (1) the final outcome of the needs analysis, which is a newly defined overarching target profile for students of linguistics and languages – the 'language data and project specialist', and (2) the initial steps towards a new set of innovative teaching resources, which will be connected to the CLARIN infrastructure in multiple ways.

TRIPLE Training on Open Science (Slides)

Lottie Provost, Francesca Di Donato

TRIPLE is an H2020 project with the primary aim of developing a discovery platform for SSH. Within the project, the need for increased exchanges and a common understanding of recent European Open Science advancements called for a specific task dedicated to training on Open Science and the EOSC. While the training sessions were initially designed to fit the consortium members’ needs, we chose to open them up to the community and to focus on topics relating both to TRIPLE activities (e.g. the EOSC onboarding) and to services and solutions which are relevant to the whole community (e.g. Open Research Europe, and the EOSC state-of-the-art objectives). We will focus on activities performed within this task and in particular on the organisation of training sessions in the form of webinars, the production of guidelines and training material, and the close collaboration with research institutions and training communities. We will introduce the synergies established with the main RIs in the SSH field (OPERAS, CLARIN, CESSDA, DARIAH) and with communities of training coordinators (EOSC Skills and Training Working Group, OpenAIRE CoP of Training coordinators, SSHOC Training community, ICDI Competence Center). We will also present the strategies adopted to provide support to TRIPLE members on Open Science and the EOSC via adequate training, to engage new potentially interested audiences in TRIPLE’s events, and to produce FAIR training materials whose reusability by the general public will be ensured.

Digital Humanities

AIUCD2021 Conference Went Virtual Through the Support of CLARIN: Experience, Opportunities and Suggestions (Poster)

Enrica Salvatori (University of Pisa), Monica Monachini (CNR-ILC), Francesca Frontini (CNR-ILC), Angelo Mario Del Grosso (CNR-ILC), Federico Boschetti (CNR-ILC & VeDPH)

AIUCD is the Italian Association for Digital Humanities and Digital Culture, a member of EADH and ADHO. Its mission is to foster methodological and theoretical research, scientific collaboration and the development of shared practices, resources and tools in the field of Digital Humanities. Additionally, AIUCD seeks to stimulate reflection on the theoretical foundations of computational methods within digital culture.

The 10th AIUCD annual conference focused on 'DHs for society: e-quality, participation, rights and values in the digital age', and was originally planned to be held in Pisa.

As CLARIN is of relevance to the AIUCD community, CLARIN-IT was part of the programme committee and part of the team for the overall conference support. A keynote by the CLARIN ERIC Director was also foreseen from the beginning.

When CLARIN ERIC announced its new programme of support for virtual events, crucial during the pandemic emergency, CLARIN-IT and the AIUCD2021 PC members immediately applied for it. Thus, AIUCD2021 was to be the inaugural event of this new CLARIN instrument.

During the Bazaar we will illustrate our experience from many points of view: as organizers of this new conference format, as members of the association, as speakers, chairs and moderators. We will talk about the format of the sessions and the innovative solutions adopted (e.g. visuals recording and scientific journal pitches), and about the advantages, impact and benefits coming from the CLARIN ERIC circle. We will discuss the statistical data of the conference: number of subscribers, average number of participants per session, parallel activities etc. We will also reflect on the opportunities offered by the virtualization of the conference (e.g. the increase in participants from abroad and in early-career researchers, thanks to the absence of fees). Finally, we will share our concluding remarks and our suggestions for the virtualization of similar events. https://aiucd2021.labcd.unipi.it/en/home-english/

AFOr: Collecting, Analysing, Sharing Memories through a Multidisciplinary Open Archive of Oral Sources (Slides)

AFOr Research Group

We present the status of three years of work on AFOr (https://afor.dev), an archive of oral sources meant to collect, preserve, analyse, and foster the re-use of the memories of a historical neighbourhood in Modena (Italy). With a strong multidisciplinary approach, the project has experimented with different technologies and methods to encourage the research on the territory, and to establish Villaggio Artigiano as historical heritage in terms of material and immaterial goods. Starting from Corpus Linguistics practices, and through the collection of audio/video interviews, written documents, working tools, and the creation of ad-hoc open source software tools, AFOr aims at exploring ways in which oral sources can be used in and outside of historical studies, with a particular focus on urban regeneration and bottom-up processes of territorial re-activation. This latter application has also brought about a software tool prototype presented at Festival Filosofia 2021 (Modena), Paesaggio di voci ('Landscape of voices') to map how spectators and inhabitants perceive the places and spaces experienced in their daily lives.

AFOr was co-developed by a team of sociologists, economists, historians, digital librarians, linguists, architects, IT engineers, and mathematicians, and is partnered with the University of Modena and Reggio Emilia, Istituto Storico di Modena, as well as AISO (Associazione Italiana Storia Orale). Paesaggio di voci is available at https://voci.afor.dev.

FONTI 4.0: Towards an Unsupervised Transcription Chain for Accessible Analog Oral Archives (Slides)

Roberta Bianca Luzietti, Alessandra Origani, Niccolò Pretto, Sergio Canazza

FONTI 4.0 aims at creating an innovative tool for the preservation, access and use of historical sound documents. The project focuses on oral sources recorded on magnetic tape, important for the history, culture, and language from the early decades of the 20th century until today. The number of analog archives that are being discovered is rapidly growing, and in order to save them from degradation and make use of their content, recordings must be analysed and digitally preserved. However, considering the vastness of the heritage, 'manual' analysis operations are unsustainable for most archives. The FONTI 4.0 project aims at exploiting cutting-edge technologies for the correction of digitisation errors, speech transcription, and information extraction in order to make historical oral sources available for use and re-use across multiple research fields. During the first part of the project, we developed a workflow and digital filters for correcting speed and equalisation errors during the digitisation process. Subsequently, we conducted a transcription experiment to test the performance of different speech-to-text software applied to oral sources. The goal was to identify which factors had a major impact on automatic transcription accuracy. To do so, we created a corpus made of transcribed and annotated historical speech recording material. Given the necessity to improve the preservation process, we believe that unsupervised transcription chains can make a real difference for the future of oral archives. http://csc.dei.unipd.it/fonti40en/

Data Curation Using NLP

Lithuanian Arbitrary Collocations: Recognition Criteria and Methods (Poster)

Erika Rimkutė, Loïc Boizou, Ieva Bumbulienė, Jolanta Kovalevskaitė, Jurgita Vaičenonienė

We aim to present the methodology of Lithuanian arbitrary collocation recognition developed during an ongoing project 'Arbitrary Collocations of Lithuanian: Identification, Description and Usage (ARKA)', funded by a grant (No. S-LIP-20-18) from the Research Council of Lithuania.

Domain Specific Languages on Editing Papyri: The GreekSchools Case Study (Slides)

Simone Zenzaro, Federico Boschetti, Angelo Mario Del Grosso

Within the ERC AdG 885222-GreekSchools, we aim to manage the editing of multiple papyrological texts: diplomatic and literary editions and the corresponding apparatuses and their translations. To endow scholars with automatic consistency and coherence of editorial choices and to support the whole editing process, we leverage Domain Specific Languages (DSLs), that is, formal languages defined for a bounded domain. Digital text editing can be handled in multiple ways depending on the editorial purpose. We identify four possible editing approaches to digital textual scholarship: (1) word processor, (2) structured text (e.g. XML), (3) GUI-centric, (4) Domain-Specific Language (DSL). Each of them has pros and cons. In particular, we analyse five dimensions: familiarity, compactness, completeness, data elaboration support, and the need for technical training. Familiarity refers to how far scholars can avoid shifting from their established working paradigm/environment. Compactness is the ratio between quantity of information and formalisation size. Completeness refers to the information the content represents. Data elaboration support is the capability to extract or deduce information from the data. Finally, we consider it important to evaluate the amount of technical training needed for text editing. For example, structured texts grant completeness of information, while requiring extensive technical training. In this context only the DSL approach encompasses all these dimensions, while the other approaches compromise on some of them. We propose a DSL-based editor that will support and improve the editing workflow in the context of the ERC project.

The WageIndicator Collective Agreements Database and its Machine Learning-Based Developments (Poster)

Daniela Ceccon

Since 2012, the WageIndicator Foundation has maintained a Collective Agreements Database, where the texts of 1600 collective agreements (CBAs) from sixty-one countries and in twenty-seven languages have been uploaded, coded and annotated. This database is a unique example at the global level: collective agreements are documents containing conditions of employment that result from negotiations between independent unions and employers, and their content is often surrounded by an atmosphere of secrecy. Under the SSHOC project and with the support of the CLARIN Research Infrastructure, the agreements have been manually and automatically annotated on several levels: for each agreement, the team answers a series of questions and selects the appropriate piece of text (clause) for each. Within this project, machine learning techniques and models are being used to identify where in a CBA a specific topic is addressed. In this stall you will learn more about the database and see, through an experimental tool, how machine learning can help improve it.

Metadata, Citation, RDM

Component Metadata Infrastructure (CMDI) Help Desk, News, Discussion and Rumours (Slides)

CMDI Task Force

The CMDI task force has been continuing the work on the design and implementation of a set of ‘core metadata components’, with the aim of simplifying the creation of metadata that is both FAIR and optimised for the CLARIN infrastructure, for a wide variety of use cases. Although this work is now at a more advanced development stage, we still welcome ideas and feedback from metadata creators and modellers, repository managers, developers of software processing metadata, and anyone else with experience with or interest in metadata in the context of CLARIN and the broader research infrastructure landscape. Of course, we are also there to answer your CMDI-related questions, or discuss any other matters related to metadata in CLARIN.

Much Ado About Data Citation (Slides)

Edward J. Gray, Nicolas Larrousse, Daan Broeder, Cesare Concordia, Athina Papadopoulou, Jan Brase

The SSHOC project is bringing together the key research data infrastructure players in SSH communities. The SSHOC task 'Making Data Findable by Being Citable' (3.4), part of the CLARIN-led work package 'Lifting Technologies and Services into the SSH Cloud', is working on the use of citation in the SSH. The workshop 'Data Citation in Practice', organised in June 2021, presented solutions for efficient data citation from different perspectives within the SSH. Following this event, a study was carried out on eighty-five data repositories from the SSH domain, investigating their approach to and facilities for data citation (see SSHOC Deliverable D 3.5). This work made extensive use of the 'FAIR SSH Citation Prototype', developed in the SSHOC citation task, to harvest metadata in a normalised way from the heterogeneous technologies provided by these different repositories. The presentation will provide insights on how these results can be adapted for use by CLARIN as recommendations to improve data citation practice and thus foster the visibility of CLARIN resources. A complementary topic is the potential use of the citation prototype, already integrated into the CLARIN Language Resource Switchboard, and its further development for CLARIN, for example with the emerging DOG (Digital Object Gateway) CLARIN project.

Ask your SIS: Collecting Centre Recommendations on Data Deposition Formats (Slides)

CLARIN Standards Committee

When users wish to deposit data with a CLARIN centre, it would be proper to let them know which centres are potentially interested in those data and – for the centre of their choice – which format is ideal (or merely acceptable, or not-very-acceptable) for encoding the data in question.

When centres that offer data deposition services undergo assessment in order to gain, or to maintain, the status of a B-centre, CoreTrustSeal requirement 8 asks, 'Does the repository publish a list of preferred formats?'.

As all modern research infrastructures do, CLARIN needs a set of indicators that make it possible to perform objective measurements of its performance. One of CLARIN's Key Performance Indicators has the following measurement defined: 'Percentage of centres offering repository services that have published an overview of formats that can be processed in their repository'.

The bottom line is: publishing format recommendations by centres with data deposition services is good for users (and consequently outreach), good for the centres in question, and good for CLARIN as a federated infrastructure. As creatures that live with data, on data, and for data, we appreciate being able to collect it in one place – for ease of querying, comparisons and statistics across the board. For that purpose, the CLARIN Standards Committee has rebooted the Standards Information System, which is now equipped with a format-oriented module, about which we are going to tell you at our stall.

Catching up with the DELAD Initiative to Share Corpora of Speech of Individuals with Communication Disorders

Henk van den Heuvel, Esther Hoorn, Satu Salaasti

What is DELAD? DELAD stands for Database Enterprise for Language And speech Disorders, and is also Swedish for SHARED. DELAD is an initiative to share corpora of speech of individuals with communication disorders (CSD) among researchers. In our stall at the Bazaar we look forward to presenting and discussing with our visitors the results of our latest workshop in January, see: https://www.clarin.eu/blog/outcomes-fifth-delad-workshop. As a very recent result of this workshop, the DELAD steering group recorded a video which addresses the basic elements of a Data Protection Impact Assessment (DPIA) for sharing sensitive patient research data, see https://delad.ruhosting.nl/wordpress/dpia-role-play-with-video/. We would be pleased to show this to you, too. And of course any other topic of your choice when it comes to sharing CSD can be put on the table.

Advanced Technologies and Resource Harmonisation

The Switchboard: Demo and Discussion of Existing and New Features

Emanuel Dima

The Switchboard is a web application serving as a broker between datasets and data processing/analysis tools, publicly available at https://switchboard.clarin.eu/. In this stall I will demonstrate the latest features, which include content inspection and text extraction from certain file formats. The purpose of the demo is to gather feedback on the current features and discuss possible future ones.

CLARIN Resource and Tool Families

Jakob Lenardič, Darja Fišer

The CLARIN Resource and Tool Families initiative provides user-friendly, manually curated overviews of prominent language resources and technologies (LRTs) deposited across the distributed CLARIN infrastructure. On the one hand, the families offer researchers from digital humanities, social sciences and human language technologies aggregated, user-friendly overviews of LRTs of similar kinds, including a unified, human-readable presentation of their metadata. On the other hand, this initiative aims to raise awareness of the importance of good metadata documentation and to facilitate better curation of the LRTs by their authors.

In this presentation, we present this latter, curatorial aspect of the initiative, focusing particularly on work done in 2021. We present (i) the ongoing curation of the Resource and Tool families, (ii) proposals on streamlining new deposits, which include the drafting of a 'best-practice' guide for depositing that focuses on the qualitative description of the LRTs, (iii) efforts to encourage the depositing of new tools and resources with CLARIN, and (iv) a gap analysis for identifying new, valuable families to be included as new CRF overviews in the future.

UmbrellaBird, Building a Data Stewardship Organisation for an International Multilingual Large Language Model Training Corpus (Poster)

Yacine Jernite, Margaret Mitchell, Huu Nguyen

BigScience is a year-long collaborative workshop bringing together nearly 600 participants from fifty countries to work on the questions raised by the growing importance of Large Language Models in language technology worldwide. In particular, much of our focus has been on the collection and management of a training dataset, with four of the twenty-five working groups that make up the project dedicated to that aspect. In the course of this work, we have proposed and are starting to implement a global collaborative data governance structure that involves data rights holders, various data providers and data hosts (data custodians), legal scholars and data activists, data modelers, and a Data Stewardship Organisation (DSO) to help disseminate norms and tools and provide a conversation space for all stakeholders. Given the similarities between the expressed values and mission statements of CLARIN and our proposed DSO, we welcome the opportunity to discuss our work so far with participants of the Bazaar and to scope out possible future collaborations on the scholarship and implementation of modern human data governance best practices.