CLARIN in the EOSC: Demonstration

Diagram showing the workflow in this demonstration

Introduction

This demonstration is intended to showcase the use of CLARIN data and tools in the context of the European Open Science Cloud. It was made for the EOSC launch event in November 2018. More information on CLARIN's role in the EOSC can be found here.

Update: Please note that as of 2024, the EOSC portal has been discontinued. However, the records on CLARIN resources remain findable via the SSH Open Marketplace.

The Research Question

Do parliamentary speeches of female and male members of parliament differ?

If they do, what are typical topics for each group?

The Dataset

Dr. Maciej Ogrodniczuk selected the following dataset based on the Polish Parliamentary Corpus: utterances from male and female Members of Parliament (MPs), extracted from the current (8th) term of the Sejm, between 2015-11-12 (session 1, day 1) and 2016-10-21 (session 28, day 3).

For both groups, the following sample was extracted from his dataset:

Number of	Female MPs	Male MPs
Utterances	19,983	19,983
Speakers	126	291
Words ("tokens")	957,228	969,514
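The two samples are balanced by number of utterances, not by number of speakers. A minimal sketch of the arithmetic behind that observation, using only the figures from the table above (the dictionary layout and the function name are illustrative choices, not part of the dataset):

```python
# Figures copied from the sample table above.
sample = {
    "female": {"utterances": 19_983, "speakers": 126, "tokens": 957_228},
    "male": {"utterances": 19_983, "speakers": 291, "tokens": 969_514},
}


def summarize(group):
    """Return average tokens per utterance and utterances per speaker."""
    return {
        "tokens_per_utterance": group["tokens"] / group["utterances"],
        "utterances_per_speaker": group["utterances"] / group["speakers"],
    }


summary = {name: summarize(figures) for name, figures in sample.items()}

for name, values in summary.items():
    print(
        f"{name}: {values['tokens_per_utterance']:.1f} tokens per utterance, "
        f"{values['utterances_per_speaker']:.1f} utterances per speaker"
    )
```

Average utterance length comes out nearly identical in both groups, while the female sample draws the same number of utterances from fewer than half as many speakers.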

The individual utterances per MP are stored in separate text files, with a prefix indicating the sex of the speaker, e.g. f-AgataBorowiec.txt stands for utterances from the female MP named Agata Borowiec.

All these files are available as a ZIP file that can be downloaded from B2SHARE.
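For readers who want to inspect the archive locally, the naming convention can be used to split the files into the two groups. The following is a sketch under assumptions: only the "f-" prefix is confirmed by the example above, the "m-" prefix for male MPs is assumed by analogy, and the second file name in the usage line is hypothetical.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# "f" is taken from the example file name; "m" is an assumption.
PREFIX_TO_GROUP = {"f": "female", "m": "male"}


def group_by_sex(filenames):
    """Map each group to the speaker names found in the given file names."""
    groups = defaultdict(list)
    for name in filenames:
        path = PurePosixPath(name)
        if path.suffix != ".txt":
            continue  # skip directories and non-text members
        prefix, separator, speaker = path.stem.partition("-")
        if separator and prefix in PREFIX_TO_GROUP:
            groups[PREFIX_TO_GROUP[prefix]].append(speaker)
    return dict(groups)


def group_zip_members(zip_path):
    """Apply group_by_sex to the member names of a downloaded ZIP file."""
    with zipfile.ZipFile(zip_path) as archive:
        return group_by_sex(archive.namelist())


# Usage with one real and one hypothetical file name:
print(group_by_sex(["f-AgataBorowiec.txt", "m-JanKowalski.txt", "README.md"]))
```

Working on the member names alone means the grouping can be checked before any file is extracted.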

Learn more

... about the Polish parliamentary corpus in the following publications:

[Ogrodniczuk 2012] Maciej Ogrodniczuk. 2012. The Polish Sejm Corpus.
[Ogrodniczuk 2018] Maciej Ogrodniczuk. 2018. The Polish Parliamentary Corpus.

... about parliamentary corpora and their applications on the CLARIN website.

... about other parliamentary corpora on the CLARIN website.

Searching for tools & processing the dataset

1. EOSC Portal Search

At the EOSC Portal marketplace, the researcher searches for "language analysis". This yields one result: the Language Resource Switchboard.
The description of the Language Resource Switchboard lists several relevant analysis methods (e.g. Topic Modelling and Stylometry). It also states that the Switchboard can be invoked from B2DROP.
The researcher goes to the B2DROP landing page in the EOSC Portal marketplace and from there opens the actual application.

2. B2DROP file upload

The researcher uses single sign-on (B2ACCESS) to log in to his B2DROP workspace.
There he uploads the dataset and shares the file with a Share Link.

Now he clicks on the … icon next to the file and selects Switchboard.

3. Language Resource Switchboard

After being redirected to the Language Resource Switchboard, he indicates that the input file contains Polish data.

Then he clicks on Show Tools.

This combination results in one available tool, in the category Stylometry, called WebSty. He clicks on this tool to see more details.
Now he invokes WebSty via the Click to start tool button.

4. WebSty tool

Now that the WebSty application is displayed, the researcher selects the appropriate parameters to run a noun-based comparison between the female and male MPs:
- As Method of Analysis he chooses Content Similarity.
- In advanced options he selects Choice of features and then clicks on the tab BOW ("bag of words").

Then he clicks on the Analyze button. A highly efficient parallel computation process now starts. This computation entails:
- Performing a linguistic analysis of all sentences to find the part of speech of each word, which is required to identify the nouns.
- At the same time, a morphosyntactic analysis determines the lemma (the uninflected base form) of each noun. This step is especially important for languages with a rich case system, such as Polish.
- Based on the results of the linguistic analysis described above, the similarity between the nouns used by the female and the male group is calculated.
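The three steps above can be illustrated with a small sketch. It is not WebSty's implementation. It assumes the tagging and lemmatisation have already been done and that tokens arrive as (lemma, part of speech) pairs. Cosine similarity is used here as one common way to compare two bags of words, and the source does not state which measure WebSty applies. The toy tokens are invented for illustration.

```python
import math
from collections import Counter


def noun_bag(tagged_tokens):
    """Count noun lemmas in a sequence of (lemma, part_of_speech) pairs."""
    return Counter(lemma for lemma, pos in tagged_tokens if pos == "noun")


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    # Counter returns 0 for lemmas that are missing from bag_b.
    dot = sum(count * bag_b[lemma] for lemma, count in bag_a.items())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag has no direction to compare
    return dot / (norm_a * norm_b)


# Invented toy input; real input would come from a tagger and lemmatiser.
female_tokens = [("dziecko", "noun"), ("mieć", "verb"),
                 ("rodzina", "noun"), ("dziecko", "noun")]
male_tokens = [("praca", "noun"), ("dziecko", "noun"), ("być", "verb")]

female_bag = noun_bag(female_tokens)
male_bag = noun_bag(male_tokens)
similarity = cosine_similarity(female_bag, male_bag)
print(f"Noun-based similarity: {similarity:.3f}")
```

Counting lemmas instead of surface forms is what makes the comparison meaningful for Polish: all inflected forms of a noun are merged into a single feature.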

Once the processing is over, the researcher scrolls down to the Results section.
He now clicks on Importance of features to inspect which nouns are used differently by the two groups.

As grouping method he now selects first level (the female vs. male group) and then he clicks the Analyze button.

Now the researcher has access to a table in the application with the nouns that are statistically more likely to be used by the female MPs, in descending order of statistical significance. In the Results section he can also download the outcomes as an Excel table (available via B2SHARE).

Word (lemma)	English translation
dziecko	child
niepełnosprawna	disabled woman
niepełnosprawność	disability
opieka	care
pracownik	employee
rodzina	family
edukacja	education
kobieta	woman
praca	work
aborcja	abortion
matka	mother
placówka	establishment
rodzic	parent
cel	goal
mała	small one (child)
zdrowie	health

Using these results, the researcher can conclude that there is indeed a significant difference in the topics that the female MPs address. They talk more than their male colleagues about topics such as healthcare and family structures.
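A ranking of this kind can be approximated outside the tool. The source does not say which statistic WebSty uses for Importance of features, so the sketch below swaps in the log-likelihood ratio (G2), a keyness measure commonly used in corpus linguistics. The function names are illustrative and the counts are invented, not taken from the demonstration.

```python
import math
from collections import Counter


def log_likelihood(count_a, count_b, total_a, total_b):
    """Log-likelihood ratio (G2) for one word observed in two corpora."""
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    if count_a > 0:
        score += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        score += count_b * math.log(count_b / expected_b)
    return 2 * score


def overrepresented(bag_a, bag_b):
    """Lemmas relatively more frequent in bag_a, most significant first."""
    total_a = sum(bag_a.values())
    total_b = sum(bag_b.values())
    if total_a == 0 or total_b == 0:
        return []  # nothing to compare against
    ranked = []
    for lemma, count_a in bag_a.items():
        count_b = bag_b[lemma]  # Counter gives 0 for unseen lemmas
        if count_a / total_a > count_b / total_b:
            score = log_likelihood(count_a, count_b, total_a, total_b)
            ranked.append((lemma, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented counts, for illustration only.
female_counts = Counter({"dziecko": 30, "budżet": 10, "rodzina": 20})
male_counts = Counter({"dziecko": 10, "budżet": 40, "rodzina": 10})

ranking = overrepresented(female_counts, male_counts)
for lemma, score in ranking:
    print(f"{lemma}\t{score:.2f}")
```

Only lemmas with a higher relative frequency in the first group are kept, which mirrors the one-sided table above (nouns more likely to be used by the female MPs).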

Learn more

... about WebSty:

in this Tour-de-CLARIN blog article
in these articles:
- Piasecki M., Walkowiak T., Eder M.: Open Stylometric System WebSty: Integrated Language Processing, Analysis and Visualisation. In: CMST, Vol. 24 (1) 2018, 43-58
- Eder, M., Piasecki, M., & Walkowiak, T.: An open stylometric system based on multilevel text analysis. Cognitive Studies | Études cognitives, 2017(17)
- Walkowiak, T.: Language Processing Modelling Notation – Orchestration of NLP Microservices. In: Advances in Dependability Engineering of Complex Systems: Proceedings of the Twelfth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX, 2017, Springer International Publishing, pp. 464-473
in this video presentation

5. Publishing the results in B2SHARE

At the EOSC Portal marketplace, the researcher searches for "publish research data". He then finds B2SHARE as a potential publication platform.
From there he navigates to B2SHARE and uses single sign-on to authenticate. Since he already authenticated to B2DROP, he no longer needs to enter a username and password.
He clicks on the Create a new record button, enters a title, selects the CLARIN community and finally clicks on Create Draft Record.
After entering the necessary metadata, he checks the Submit draft for publication checkbox and clicks on the Save and Publish button. This will make the dataset available via a persistent identifier.
His submissions will be findable in the Virtual Language Observatory and B2FIND within a day.

Alternative data publication option

Many CLARIN centres provide depositing services.

Acknowledgements

Maciej Ogrodniczuk – providing the dataset and expertise, presenting the case at the EOSC launch event
Tomasz Walkowiak – providing support and suggestions for WebSty and related CLARIN-PL tools
Claus Zinn – designing, implementing and configuring the Language Resource Switchboard
Darja Fišer – providing input on the research question and feedback on the implementation of this demonstration