Phonetics, corpus linguistics & dialectometry

Submitted by Linda Stokman on 16 April 2019

Blog post by Wilbert Heeringa (Fryske Akademy, Leeuwarden, The Netherlands / Center for Frisian Language & Culture, Groningen, The Netherlands) who received a CLARIN Mobility Grant in January 2019 to visit Språkbanken (University of Gothenburg, Sweden, SWE-CLARIN)

1. Introduction

At the CLARIN Annual Conference 2018 in Pisa, Italy, I presented Visible Vowels, a web app for the analysis, normalization and visualization of acoustic vowel measurements: f0, formants and duration. My presentation drew the attention of Dimitrios Kokkinakis (Researcher in Natural Language Processing at Språkbanken), and he asked me to present the program more extensively in Gothenburg. At the same time, I was interested in learning more about the tools that were developed at Språkbanken for processing and searching Swedish language corpora. The CLARIN Mobility Grant made it possible for me to visit Språkbanken, compare tools and exchange knowledge and ideas. The visit took place from 28 January to 1 February 2019.

During the visit, Dana Dannélls and Yvonne Adesam presented the Språkbanken tools Korp, Strix, Sparv, Karp, and Lärka. I presented Visible Vowels and WoordWaark in a workshop on 30 January, and Gabmap in a workshop on 31 January. Both workshops were attended by Dimitrios Kokkinakis, Lars Borin, Charalambos (Haris) Themistocleous and others who took an interest. I also met with the programmers of the tools: Martin Hammerstedt and Anne Schumacher.

I was able to take a closer look at all of these tools, which are described below.

2. Språkbanken tools

Språkbanken (the Swedish Language Bank) was established in 1975 as a national center located in the Faculty of Arts, University of Gothenburg. The task of Språkbanken is to collect, develop, and store (Swedish) text corpora, and to make linguistic data extracted from the corpora available to researchers and to the public.

Språkbanken offers open and free access to sophisticated (linguistic) search in digital, richly annotated language resources (written Swedish representing all historical periods and all genres). The resources include text corpora (monolingual and parallel) and lexical resources (modern and historical, mono- and multilingual).

The corpora and linguistic data of Språkbanken are primarily presented in the form of concordances, accessed through a search interface. Språkbanken's lexical resources can be browsed online and most of them can be freely downloaded in standard formats. The resources and tools are made available to researchers through several web-based graphical user interfaces and through web service APIs (see also: https://spraakbanken.gu.se/eng/about-us/about-spr%C3%A5kbanken).

Språkbanken aims to support language technology researchers, linguists and lexicographers working on Swedish, educators and students, and the public.

Dana Dannélls and Yvonne Adesam presented the Språkbanken tools Korp, Strix, Sparv, Karp, and Lärka. All of these tools are open source (see: https://github.com/spraakbanken/). Below I briefly discuss each of them.

Korp

Using Korp, the user can search about 15 billion words of contemporary and historical Swedish. Korp is open source.

Search types

Korp allows three types of searches: simple, extended and advanced.

In simple search the user can search for a word form in the corpora by entering it in the search box. The search box has an auto-complete feature that suggests keywords together with their parts of speech in parentheses (note that this only works for POS-tagged corpora). The keywords displayed in grey are words that do not occur in any of the selected corpora. If a POS-tagged keyword in the list is selected, all word forms whose dictionary form and part of speech match those of the selected keyword will be included in the search results.

Extended search allows the user to search not only for individual word forms, but also for sequences of consecutive words. The values of the attributes of each keyword in the sequence can be defined individually. Part-of-speech tags can be included in a search as well.

Using extended search, it is possible to perform comparative statistics, for example comparing political parties. The Språkbanken colleagues demonstrated this with an example in which they compared the frequencies of the words ‘equality’ and ‘freedom’ for two political parties.

In advanced search, the search criteria and the keywords are expressed as a CQP (Corpus Query Processor) query. This makes it possible, for example, to search for dependencies in dependency-parsed corpora in ways not supported by the extended search.
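To give an impression of what such a query looks like, the following minimal Python sketch builds a CQP query string for a sequence of tokens, each with its own attribute constraints. The helper names are hypothetical, and the attribute names and tag values (word, pos, "JJ") are assumptions for illustration; the attributes actually available depend on the corpus. The sketch does not escape special characters in values.

```python
def cqp_token(**attrs):
    """Build one CQP token expression, e.g. [word = "hus" & pos = "NN"]."""
    if not attrs:
        return "[]"  # an empty token expression matches any token
    conditions = " & ".join(f'{name} = "{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def cqp_sequence(*tokens):
    """Join token expressions into a query for consecutive tokens."""
    return " ".join(tokens)


# Hypothetical query: an adjective directly followed by the word form "hus".
query = cqp_sequence(cqp_token(pos="JJ"), cqp_token(word="hus"))
print(query)  # [pos = "JJ"] [word = "hus"]
```

Each bracketed expression corresponds to one token, so the sequence of keywords defined in extended search maps naturally onto a sequence of bracketed expressions.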

Search result views

The concordance view lists all sentences containing a match, with the matched sequence highlighted in bold text. You can choose the way results are displayed by clicking on one of the three tabs: “KWIC” (Key Word in Context, default), “Statistics” and “Word picture”.

In the KWIC view, each sentence is displayed on its own line, with the matched words aligned in the middle so that they appear directly above one another. The view can be scrolled horizontally if some of the longer sentences do not fit in the browser window.
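The alignment idea behind a KWIC display can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenisation, a single keyword, case-insensitive matching) and is not how Korp itself is implemented; the function name and the example sentences are made up.

```python
def kwic(sentences, keyword, width=30):
    """Return one line per match, with the keyword starting in the same column."""
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                # Keep at most `width` characters of context on each side.
                left = " ".join(tokens[:i])[-width:]
                right = " ".join(tokens[i + 1:])[:width]
                # Right-justify the left context so that matches line up.
                lines.append(f"{left:>{width}}  {token}  {right}")
    return lines


example_sentences = ["Vi bor i ett stort hus", "Hus kan vara dyra"]
for line in kwic(example_sentences, "hus", width=20):
    print(line)
```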

The statistics view shows the total number of occurrences for each matched word in the results as well as the number of occurrences in individual corpora. The number of occurrences is shown as a relative frequency per million tokens, a common measure in corpus linguistics, and (in parentheses) as an absolute frequency. The relative frequency shown in the Trend Diagram is always tied to a specific time period (e.g. year, month or day). It is calculated as the number of search results matching the time period, divided by the number of tokens in all selected corpora and multiplied by one million.
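As a small worked example of this calculation, the Python sketch below computes the relative frequency per million tokens. The function name and all counts are hypothetical; only the formula follows the description above.

```python
def relative_frequency(hits, total_tokens):
    """Occurrences per million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return hits * 1_000_000 / total_tokens


# Hypothetical counts: 1,500 hits in selected corpora of 12 million tokens.
print(relative_frequency(1_500, 12_000_000))  # 125.0

# Hypothetical trend data: hits per year, each divided by the number of
# tokens in all selected corpora, as described for the Trend Diagram.
hits_per_year = {2016: 300, 2017: 600, 2018: 600}
trend = {year: relative_frequency(hits, 12_000_000)
         for year, hits in hits_per_year.items()}
```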

The word picture view shows the words most commonly associated with the keyword by dependency in all of the selected corpora. The “commonness” of a word does not derive directly from its frequency but from a statistical measure known as mutual information.
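The exact formula used by Korp is not given here. As a rough illustration of why such a measure differs from raw frequency, the Python sketch below computes pointwise mutual information (PMI), one common variant of mutual information, from hypothetical co-occurrence counts. The function name, the candidate words and all numbers are assumptions.

```python
import math


def pmi(pair_count, head_count, dependent_count, total_pairs):
    """Pointwise mutual information (in bits) of a word pair.

    Compares how often the pair was observed with how often it would be
    expected if the two words occurred independently of each other.
    """
    if min(pair_count, head_count, dependent_count, total_pairs) <= 0:
        raise ValueError("all counts must be positive")
    expected = head_count * dependent_count / total_pairs
    return math.log2(pair_count / expected)


# Hypothetical counts: the keyword occurs 1,000 times among 1,000,000 pairs.
keyword_count = 1_000
total = 1_000_000
# candidate word -> (co-occurrences with the keyword, overall count)
candidates = {"red": (50, 200), "the": (60, 60_000)}

scores = {word: pmi(pair, keyword_count, count, total)
          for word, (pair, count) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this made-up example "the" co-occurs with the keyword more often than "red" does, but no more often than chance would predict, so "red" is ranked higher.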

Korp is also used by Kielipankki (the Language Bank of Finland). In this version of Korp audio recordings with their transcriptions are included as well.

See: https://spraakbanken.gu.se/korp/ and https://www.kielipankki.fi/support/korp/

Screenshot of Korp. The user searches for hus ‘house’.

Strix

Strix is Språkbanken's tool for document-centric corpus linguistics. While Korp focuses on small linguistic entities such as words and sentences, the domain of Strix is whole documents and their factual and other content (rather than their linguistic form).

See: https://spraakbanken.gu.se/strix/

Sparv

Sparv provides an interface to the text import and annotation pipeline used by Korp and Strix. A user can upload their own texts and have them linguistically annotated and save the results for further offline processing.

The lexical analysis in Sparv consists of several steps: tokenisation, lemmatisation, identification of lemgrams and word senses, and compound analysis. A lemgram is a lexical identifier which refers to an inflection table in the SALDO lexicon. SALDO (Swedish Associative Thesaurus version 2) is an extensive electronic lexicon resource for modern Swedish written language.

For part-of-speech tagging, Sparv uses HunPos (Halácsy et al., 2007), a trigram tagger, with a model trained on the SUC 3.0 corpus. The syntactic analysis of Swedish in Sparv is performed using MaltParser (Nivre et al., 2007), a statistical dependency parser, with a model trained on the Swedish treebank Talbanken (Nivre et al., 2005).

See: https://spraakbanken.gu.se/sparv/ and https://people.cs.umu.se/johanna/sltc2016/abstracts/SLTC_2016_paper_31.pdf.

Karp

Karp is Språkbanken Text’s environment for searching, browsing, editing and developing lexical resources and other formally structured linguistic datasets. The editing environment has been used to develop the Swedish FrameNet, as well as constructicons for Swedish and Russian.

Search types

Karp allows two types of searches: freetext and extended. The freetext search allows the user to search for a word or expression. In the extended search mode, the user can construct more complex queries. For example, the user may search for "wordform equals house". In the menu currently showing "equals", the user can instead choose "begins with", "ends with" etc. The buttons called And, Or and Except let the user add even more criteria.
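The way such criteria combine can be illustrated with a small Python sketch that filters a toy in-memory lexicon. It does not use Karp's own API or query syntax; the function names, field names and entries are hypothetical and only mirror the operators and buttons named above.

```python
# Operators corresponding to the choices in the extended search menu.
OPERATORS = {
    "equals": lambda value, arg: value == arg,
    "begins with": lambda value, arg: value.startswith(arg),
    "ends with": lambda value, arg: value.endswith(arg),
}


def criterion(field, operator, arg):
    """One search criterion, e.g. criterion("wordform", "equals", "house")."""
    test = OPERATORS[operator]
    return lambda entry: test(entry.get(field, ""), arg)


def and_(*criteria):
    return lambda entry: all(c(entry) for c in criteria)


def or_(*criteria):
    return lambda entry: any(c(entry) for c in criteria)


def except_(base, excluded):
    return lambda entry: base(entry) and not excluded(entry)


# Toy lexicon with made-up entries.
lexicon = [
    {"wordform": "house", "pos": "noun"},
    {"wordform": "housing", "pos": "noun"},
    {"wordform": "houses", "pos": "verb"},
    {"wordform": "mouse", "pos": "noun"},
]

# "wordform begins with hous" Except "pos equals verb"
query = except_(criterion("wordform", "begins with", "hous"),
                criterion("pos", "equals", "verb"))
hits = [entry["wordform"] for entry in lexicon if query(entry)]
```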

Search result views

The program shows the hits; for each hit, ‘sense’, ‘lemgram’, ‘part of speech’, ‘primary’, ‘secondary’, ‘children (primary)’ and ‘children (secondary)’ are shown. Additionally, statistics based on the baseform, wordform, part of speech etc. can be obtained.

See: https://spraakbanken.gu.se/karp/

Lärka

Lärka started out as a platform for automated corpus-based language and grammar exercises but is now being developed into a tool for second-language learning research, allowing logging of carefully designed language learning exercises and systematic investigation of the effect of particular parameters of interaction and exercise design on learner progress. Lärka currently includes: exercises for linguists (e.g. parts of speech), exercises for learners (e.g. word guessing), Texteval (a text difficulty evaluation for Swedish as a second (or foreign) language), Hit-Ex (a sentence selection tool) and Cefrit (an annotation editor).

See: https://spraakbanken.gu.se/larka/

3. Visible Vowels

Visible Vowels is a web app for the normalization and visualization of vowel measurements, in particular f0, F1, F2, F3 and duration. During the development the aim was to combine user-friendliness with a maximum of flexibility and functionality. The user can convert Hz values to scales such as Bark, Mel and ST. Additionally, 13 methods for vowel normalization are available. Transformed values can be saved as a data file. Visible Vowels presents the data in 'live view': with every change in the settings the graph is immediately adjusted accordingly. This makes the comparison of, for example, different normalization techniques extremely easy. Line graphs, scatter plots (2D and 3D), dot plots and bar graphs can be created. The generated graphs can be saved in different file formats.
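To make the scale conversions and the idea of normalization concrete, here is a minimal Python sketch. It is not taken from Visible Vowels (which is written in R), and the specific choices are assumptions: Traunmüller's formula for Bark, O'Shaughnessy's formula for Mel, a 50 Hz reference for semitones, and Lobanov's z-score method as one example of vowel normalization. The measurements are hypothetical.

```python
import math
from statistics import mean, stdev


def hz_to_bark(f):
    """Hz to Bark, using Traunmüller's (1990) formula (assumed variant)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def hz_to_mel(f):
    """Hz to Mel, using O'Shaughnessy's formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_st(f, ref=50.0):
    """Hz to semitones relative to a reference frequency (assumed 50 Hz)."""
    return 12.0 * math.log2(f / ref)


def lobanov(values):
    """Lobanov normalization: z-scores of one formant for one speaker."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation; an assumption
    return [(v - m) / s for v in values]


# Hypothetical F1 measurements (Hz) of four vowels from a single speaker.
f1_speaker = [300.0, 450.0, 600.0, 750.0]
f1_bark = [hz_to_bark(f) for f in f1_speaker]
f1_normalized = lobanov(f1_speaker)
```

Because the z-scores are computed per speaker and per formant, normalized values from speakers with different vocal tract sizes can be compared on a common scale.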

The app is implemented in R, using Shiny, a web application framework developed by RStudio. The main packages used are shiny, shinyBS, ggplot2 and plot3D. Visible Vowels is available at visiblevowels.org. The app is also available as the R package 'visvow' in the CRAN repository, which makes it possible to install the app locally and run it as a standalone program.

Although the program was intended to visualize regional language and dialect variation, Dimitrios Kokkinakis mentioned another interesting application, namely the visualisation of the effect of speech impairment due to diseases like dementia.

Screenshot of the ‘Formants’ tab in Visible Vowels. The vowel distributions of Dutch accents from four regions are shown: two regions in Flanders/Belgium (‘FL’) and two regions in the Netherlands (‘NL’).

4. WoordWaark

‘WoordWaark’ literally means ‘WordWork’. The basic idea is: working on words on the web. It is an interactive multimedia language database for the Groningen dialect in which written and spoken material can be searched and/or listened to. The staff members are: Goffe Jensma (supervision and design), Wilbert Heeringa (implementation, programmer) and Eva Smidt (support and PR). WoordWaark is meant to be largely ‘fed’ by the Groningen community, which makes it unique. Therefore, the language community needs to be involved and PR is crucial. The idea behind WoordWaark is moulding linguistic research in the shape of marketing.

The app presently consists of the following components: dictionaries, corpus, speaking map, donation and voices.

a. dictionaries

The dictionaries of S.J.H. Reker (2016, pocket dictionary) and K. Ter Laan (1952) are available, and can be searched via Groningen dialect or Dutch by using an autocomplete menu.

b. corpus

The Digital Corpus of written Groningen dialect can be searched. This corpus will include 800 sources: books, journals, etc. A search can be narrowed by entering region, location, author and/or time interval. The interface is simple compared to the interface of Korp, since it is meant for a broad audience (not just for linguists).

c. speaking map

Via a clickable map the user can choose the location where his or her dialect is spoken. Once a location is chosen, a small dictionary for that location is presented, based on questionnaire material. In the future, users will be able to extend the dictionaries themselves, and recordings of the pronunciations of the words can be added as well. The idea of each location having its own dictionary is unique and assumes a strong involvement of the Groningen community.

Screenshot of WoordWaark. In the speaking map the user chose the dialect of Groningen city (‘Groningen stad’). The screenshot shows the digital dictionary for the Groningen city dialect. The user searched for English ‘blackberries’ and found multiple Groningen dialect translations such as ‘broamm’, ‘brummels’, ‘brommels’ etc.

d. donation

The user can submit his or her favorite Groningen dialect word. Once the word is submitted, it is immediately added to the location where the user's dialect is spoken.

e. voices

In Dutch: ‘stemmen’. This app was developed under the supervision of Dr. Nanna Hilton; the programmer is Daniel Wanitsch. For each of about 15 words the app shows multiple variants. The user has to choose the variant that he or she uses and has to pronounce the word. At the end, the app guesses the location where the dialect of the participant is spoken.

5. Gabmap

Gabmap is a web application that visualizes dialect variation. The app was originally developed by Peter Kleiweg under the supervision of John Nerbonne. The app uses functions of the RuG/L04 package, which has existed since 2001 and has been freely distributed since 2004. Gabmap has been developed since the end of 2010 and was first published on GitHub on June 4, 2011.

The original version is available at: http://www.let.rug.nl/~kleiweg/L04/webapp. A version that was forked and maintained by Çağrı Çöltekin is available at: http://www.gabmap.nl/. Currently, this version is maintained by Martijn Wieling. Peter Kleiweg developed a Docker image of Gabmap installed in Lubuntu 16.04, see: https://github.com/pebbe/Gabmap-docker.

6. Conclusion

It was very inspiring to have a closer look at the Språkbanken tools and to compare Korp with WoordWaark. Korp is primarily meant for researchers, while WoordWaark is meant for a larger audience. However, Korp and Karp each offer both a simple and an extended (and advanced) interface. This approach makes the programs suitable for both ‘laymen’ and researchers. In the long run it may be good to consider a similar approach for WoordWaark.

I was also particularly interested in Sparv, since we are developing a POS tagger for Frisian at the Fryske Akademy in cooperation with the University of Groningen (Gosse Bouma, Information Science). The interface of Sparv may serve as an example when developing an online tool.

It was also nice to see that Dimitrios Kokkinakis may benefit from Visible Vowels when analyzing the speech of dementia patients. It would be wonderful if Visible Vowels were discovered by more colleagues working in the medical sciences.

In short: the meeting with the Språkbanken colleagues Dimitrios Kokkinakis, Lars Borin (who made many useful comments with regard to WoordWaark), Haris Themistocleous, Dana Dannélls, Yvonne Adesam, Ely Matos (FrameNet Brasil), Benjamin Lyngfelt, Anne Schumacher (programmer), Martin Hammerstedt (programmer) and others was very inspiring.

The building in Gothenburg where the Språkbanken team is currently hosted. As can be seen in the picture, Gothenburg was covered in snow when I visited the group.