Donate Speech: Digital Speech Database to Boost Research and Innovation

The Project

Donate Speech is an ambitious collection of digital speech data intended for both academic and commercial use. Underpinned by an award-winning media campaign, Donate Speech (Lahjoita puhetta) has collected around 4000 hours of colloquial Finnish speech from more than 25 000 speakers during 2020 and early 2021. The project, a collaboration between the Finnish Broadcasting Company YLE, the Finnish State Development Company Ilmastorahasto Oy (formerly Vake Oy) and the University of Helsinki, set out to collect the speech data in order to accelerate the development and implementation of language-based Artificial Intelligence (AI) applications for Finnish.

*The website of the Donate Speech (Lahjoita puhetta) project.*

Although several extensive speech databases for linguistic research already exist in Finland, access to them is limited for commercial purposes. This large resource of everyday, spontaneous speech was designed with commercial use in mind and could be useful for the development of a range of IT and speech processing applications, such as subtitling, translation, and voice-enabled interfaces. While automatic speech recognition (speech-to-text) and speech synthesis (text-to-speech) in Finnish have been available in a few devices and applications for several years, implementing or enhancing many end-user services requires better and more reliable processing support for colloquial Finnish.

Methodology

Donate Speech set out to collect everyday, spontaneous speech from as many different groups of Finnish speakers and as many individuals as possible, including speech from second-language Finnish learners.

Participants were able to donate speech via a web browser or Android or iOS mobile app, both of which offered a selection of tasks to encourage users to talk freely. Representatives from both industry and academia developed the general specifications for the app, which was then designed by software solutions development company Solita. The speech processing components are openly accessible and open source.

*The award-winning informercial was key to the campaign's success. View the video* *here*.

‘It was the media coverage and the

infomercials produced by YLE

that made it possible to get the

attention of thousands of speech donors.’

Kristen Lindén

YLE developed around forty tasks with light-hearted themes for stimulating the collection of data, using videos, pictures and textual content, with an easy-to-use single button to start and stop the recording. Some of the most popular themes included Rakkain eläimeni (My dearest pet), Mistä kodikkuus syntyy? (What makes a cozy home?), Tärkeä esineeni (An important object of mine), Lempivaate (My favourite piece of clothing), Mikä suututtaa? (What’s infuriating?), Katson ikkunasta (While I am looking out of the window) and Kerro aamiaisesta? (What was your breakfast like?).

The total amount of time donated was added as a gamification element. The technical platform also asked participants to add some metadata, such as dialectical background, basic demographics such as age group, gender, mother tongue, profession, and level of education.

To raise awareness and encourage widespread participation, YLE launched an extensive public outreach campaign through its TV and radio channels. As part of the campaign, YLE made comical infomercials calling on the general public to donate speech, which were broadcast in the summer and autumn of 2020, with some reruns during the spring of 2021.

‘The software platform has been published as open source, allowing other organisations to build their own systems for collecting similar speech material or for similar campaigns in other countries.’ Krister Lindén

A key aspect of the Donate Speech project were the legal requirements surrounding the protection of personal data under European and national data protection legislation, most notably the General Data Protection Regulation (GDPR).

After careful consideration, it was determined that the project possessed ‘legitimate interest’ for processing personal data. A data protection impact assessment was also carried out, which found that the processing of personal data in the context of Donate Speech did not result in a high risk, based on the measures that had been put in place.

In practice, participants were provided with two documents: a brief statement including the basic terms of participation, which participants were asked to accept, as well as a more comprehensive data protection policy, outlining the way in which personal data are processed in the campaign and how donated speech samples could retrospectively be removed, if required.

The data will be deposited data with CLARIN, which offers a legal metadata classification system for all datasets, including those that cannot be made openly available. Users are informed of potential restrictions they need to be aware of when accessing the data. The material can be shared with individual researchers, universities and research organisations or private companies that need it in order to study language or artificial intelligence, for developing AI solutions or for higher education purposes.

Outcome

In total, more than 25 000 citizens in Finland donated more than 220 000 speech samples comprising roughly 4000 hours of colloquial speech to be used by academia and industry for developing and researching language and AI applications.

Based on the available metadata, people between twenty and sixty years of age made around three quarters of the donations, with young women being most active. Donations were made from all the regions of Finland, with ninety-five per cent of the participants being native speakers.

More than two thirds of the data was donated by students, pensioners, teachers, entrepreneurs, experts and nurses, with the remainder contributed by more than thirty other professions from diverse parts of society. Almost two thirds had a higher education and almost a third a secondary education.

The web interface was more popular than the apps, and was used by two thirds of the participants. Most of the recordings were between ten seconds and three minutes long, with an average length of thirty to sixty seconds.

Initial random sampling and manual transcription produced encouraging results and suggests that the material is on average no more difficult to recognise than previously recorded speech data under more controlled conditions.

In terms of the campaign’s success, the media campaign turned out to be key. Indeed, the media campaign has received two prestigious awards. In the spring of 2021, it won first prize in the mobile service category of the annual marketing competition GrandOne. Later that year, it won the category of Best European Digital Audio Project 2021 in the annual Prix Europa competition for European broadcasters. The prize was awarded for a fresh way to conceptualise broadcasting and its output, the new cooperation model between commercial and public service entities and a broadcasting company such as YLE, as well as a great web service accompanied by a light-hearted and humorous campaign.

It is anticipated that the digital speech data will be useful for the development of a range of lT and speech processing capacities. The data will allow much more precise training of speaker-independent supervised speech recognition, as well as new directions in research in unsupervised or minimally supervised machine learning of speech processing using current neural network technology. These advancements could improve performances of speech bots or automated call centres, as well as speech-enabled form filling, potentially providing access to digital services for people who have typically been excluded, such as those with impaired vision or low finger dexterity.

Several large companies involved in speech recognition, such as Microsoft and IBM, have already shown interest in evaluating the data, as it may enable them to train their systems to become better at recognising Finnish.

The speech material will be stored in and distributed by the Language Bank of Finland (Kielipankki), which is coordinated by the national FIN-CLARIN consortium, and will likely be redistributed in the spring of 2022.

‘The datasets need to be openly available but in a controlled way, and in a way that still fits within the FAIR principles. This nuanced way of distributing data is what CLARIN can offer.’ Krister Lindén

Publications and Future Plans

Donate Speech has already extended the project to include Finnish-Swedish or the Swedish spoken in Finland (Donera Prat), and is thinking of including more languages in the future. Krister Lindén says: ‘The software is there, and the legal framework is there. It only took YLE a couple of person months to start a similar campaign for Finnish-Swedish, because the framework was already there. Now we’re thinking of going ahead and doing this for Sami as well, and then it will require even less time and effort.’

A book chapter on the project is due to appear in the forthcoming edited volume CLARIN. The Infrastructure for Language Resources (Berlin: de Gruyter, 2022), edited by Darja Fišer and Andreas Witt.

Views on CLARIN

‘Donate Speech deposited the material with CLARIN for several reason. First, its licensing policies: CLARIN has readily available licensing and rights management technology, which means that the safeguarding of personal data is available. CLARIN’s licencing scheme means that it’s possible to place restrictions on datasets. Most of the speech data from Donate Speech will fall into the restricted category, which means that they are not publicly available on the internet. And that is the key point: they can be openly accessible, that is, accessible via a known procedure, but not publicly accessible. They need to be openly available, but in a controlled way, and in a way that still fits with the FAIR principles. This nuanced way of distributing data is what CLARIN can offer.’ Krister Lindén

‘CLARIN provides a framework for other countries to follow suit, and also enables transnational collaboration. There may be other countries that would like to mimic this campaign. Speech recognition is of course already done in many languages, so now it is more a case of other countries being able to collect this dataset. There are plenty of funding instruments already within CLARIN ERIC to enable this sort of project. And this of course also applies to other datasets. In general, it would be great to have more cross-border cooperation, like with the ParlaMint dataset, where you harmonise the format and then the content. That could be promoted for other datasets as well, not just parliamentary datasets.’ Krister Lindén

Contributors

The Language Bank of Finland – Kielipankki/FIN-CLARIN: Krister Lindén, University of Helsinki; Mietta Lennes, University of Helsinki; Tommi Jauhiainen, University of Helsinki

FIN-CLARIN Partners: Jörg Tiedemann, University of Helsinki; Mikko Kurimo, Aalto University; Tommi Kurki, University of Turku

National Broadcasting Company – YLE: Ville Alijoki, Jyri Loikkanen, Aleksi Rossi, Katja Solla, Henna-Leena Kallio, Tiia Lappalainen, Minna Lusa, Laura Joutsi, Marjo Kosonen and Jenni Stammeier

Donate Speech campaign

YLE Donate Speech campaign page

Follow-up campaign for Finnish-Swedish

Language Bank of Finland