Guest blog post: 'Empirical data in linguistics'

Submitted by karolina@clarin.eu on 17 July 2017

From 21 to 23 June 2017, the national project JANES, in cooperation with the Slovene consortium CLARIN.SI, organised the fifth ReLDI seminar, “Empirical data in linguistics”. The seminar, which took place at the Faculty of Electrical Engineering in Ljubljana, was attended by 50 participants from five former Yugoslav republics.

ReLDI (Regional Linguistic Data Initiative) is a two-year institutional partnership between research organizations that work with language data and are based in Switzerland, Serbia and Croatia. It is financed by the Swiss National Science Foundation in the framework of the SCOPES programme. The ReLDI webpage serves as a repository of resources and tools used for linguistic analysis; in the future, it will also host a series of online lectures on experimental and corpus-based research methods, programming and statistics in linguistic research.

The seminar was given in English, while the slides, which are available here, were in Serbian. On the first day, the lecturers Maja Miličević and Tanja Samardžić gave two introductory talks, the first on data prediction in linguistics and the second on corpus-based linguistic research. Like other research fields, linguistics uses empirical data to predict certain events, such as the likelihood of a linguistic element appearing in a certain position in corpus data. In the afternoon, the participants learnt how to query corpora with regular expressions and CQL in NoSketch Engine. Before the end of the first day, the participants formed groups and were given instructions for the practical assignment to be completed over the next two days. In the hands-on sessions, they used CQL to gather data from either the Slovene reference corpus Kres or the social media corpus Janes, and then used the data to test their hypotheses.

On the second day, Dr Maja Miličević lectured on the role of experimentation in linguistics. She discussed how researchers should approach large sets of language data, how to ensure a high degree of control in quantitative analyses, and how to discern and analyse relationships in the data. The afternoon hands-on session introduced the R software environment for statistical computation and visualisation, which the participants then used to process and visualise the data they had collected.

The third and final day of the seminar focused on descriptive statistics, inferential methods, and the use of statistical tests for language data, and ended with the groups presenting the results of their practical work. Because the seminar aimed to bridge the gap between statistics and corpus-based research, it proved helpful both for language researchers without a background in statistics and for those who had done statistical work before but had no prior experience with corpus-based research.

Blog post written by: Ana Slavec and Jakob Lenardič