Spoken Corpora | CLARIN ERIC

Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues. They are often aligned with the accompanying recordings. They are an invaluable resource for various kinds of linguistic research, such as phonology, conversational analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata.

There are 148 spoken corpora in the CLARIN infrastructure, 134 of which contain both the transcriptions of planned or spontaneous speech and the associated recordings, and 14 only the transcriptions. Most of the corpora are monolingual, accounting for the following 16 languages: Arabic, Czech, Dutch, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged, many with markup specific to speech corpora, such as phonemic and prosodic annotation.

For an introduction to oral history and an overview of the technology that can be used in the processing of oral history resources and interview data, visit the Oral History & Technology website. Here, you can find information on anything from analogue tapes and handwritten summaries, to digital recordings, including digital transcripts, speaker allocation/recognition, emotion markers, speech velocity and much more.

Below, we first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Spoken corpora in the CLARIN infrastructure

Corpora with transcriptions and audio recordings

Corpus	Language	Description	Availability
Arabic Speech Corpus Licence: CC BY 4.0	Arabic	This corpus is available for download from the Oxford Text Archive. For the relevant publication, see Halabi (2016)	Download
Audioatlas Siebenbuergisch-Saechsischer Dialekte Size: 450,000 words Annotation: Geomapping, orthographic/partial phonetic transcription, semantic labelling Licence: CLARIN RES	Bavarian, German, Romanian	This corpus contains 2274 recordings (approx. 360h) of spoken dialectal German (Saxonian) recorded in Transylvania (Romania) in approx. 250 different locations. This previously unpublished material was collected on analog tape in the 1960s and 70s by different linguists based at the universities of Bucharest, Hermannstadt and Klausenburg.	Download
ASR training dataset for Croatian ParlaSpeech-HR Size: 1816 hours, 403925 entries Annotation: normalised transcriptions, speaker metadata, word-level alignment to the recordings Licence: CC BY-SA 4.0	Croatian	This corpus is built from parliamentary proceedings available in the Croatian part of the ParlaMint corpus and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. There are two transcripts available: the original one, and the one normalised via a simple rule-based normaliser. Each of the transcripts contains word-level alignments to the recordings. Each segment has a reference to the ParlaMint 2.1 corpus via utterance IDs. There is speaker information available for 381,849 segments, i.e., 95% of all segments. Speaker information consists of all the speaker information available from the ParlaMint 2.1 corpus (name, party, gender, age, status, role). There are altogether 309 speakers in the dataset. The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, with the goal of not losing speaker variety on dev data. Test data consist of 513 segments that come from 3 male (258 segments) and 3 female speakers (255 segments). There are no segments coming from the 6 test speakers in the two remaining subsets. The 22,076 instances not having speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set. This corpus is available for download from the CLARIN.SI repository.	Download
Croatian Adult Spoken Language Corpus (HrAL) Size: 250,000 tokens Annotation: speaker metadata Licence: author attribution required	Croatian	This corpus contains spontaneous conversations among 617 speakers from all Croatian counties, and it comprises more than 250 000 tokens and more than 100 000 types. Data for the corpus were collected from 2010 to 2012, from 2014 to 2015 and during 2016. Participants were adults who spoke Croatian as their mother tongue and first language. Transcripts were annotated with the ages and genders of the speakers, as well as the location of the conversation. A separate spreadsheet lists the speakers' origin, where they have spent most of their life and their level of education. The coverage of metadata for individual samples varies, and is in general more complete for samples collected from 2014 onwards. The corpus is available for download and browsing from a dedicated website. For the relevant publication, see Kuvač Kraljević and Hržica (2016)	Browse Download
DIALEKT v1: dialectal corpus with multi-tier transcription Size: 100,000 words Annotation: orthographically and phonetically (dialect features) transcribed, MSD-tagged, lemmatised Licence: Academic Licence Agreement for Czech National Corpus Data	Czech	This corpus contains traditional dialectological material, mostly unprepared monologue-type speech. The corpus is available for download (upon request) and through the concordancer KonText. For the relevant publication, see Komrsková et al. (2018)	Concordancer Download
ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio) Size: 2.8 million words Annotation: recordings and transcripts anonymised Licence: Academic Licence Agreement for Czech National Corpus Data	Czech	This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Benešová et al. (2015)	Concordancer Download
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions & audio) Size: 1 million words Annotation: orthographically and phonetically transcribed; MSD-tagged, lemmatised Licence: Academic Licence Agreement for Czech National Corpus Data	Czech	This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Komrsková et al. (2018)	Concordancer Download
OVM – Otázky Václava Moravce Size: 35 hours Annotation: word-by-word transcriptions, including the transcription of some non-speech events Licence: CC BY-NC 3.0	Czech	This corpus contains transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. The corpus is available for download from LINDAT and through the concordancer KonText.	Concordancer Download
Prague DaTabase of Spoken Czech 1.0 Size: 770,000 tokens, 7324 minutes Annotation: MSD-tagged, lemmatised Licence: CC BY-NC SA 4.0	Czech	This corpus contains spontaneous dialogue. The corpus is available for download from LINDAT. For the relevant publication, see Hajič et al. (2008)	Download
Spoken corpus of Karel Makoň Size: 1000 hours Licence: CC BY-SA 3.0	Czech	This corpus contains talks on Christian mysticism given by Karel Makoň. The corpus is available for download from LINDAT.	Download
Czech Malach Cross-lingual Speech Retrieval Test Collection Size: 592 hours Annotation: manual annotations of selected topics and interviews' metadata Licence: CC BY-NC-ND 4.0	Czech, English, French, German, Spanish	This corpus contains interviews with survivors of the Holocaust. The corpus is available for download from LINDAT.	Download
IFA Spoken Language Corpus Size: 50,000 words (41 minutes/speaker) Annotation: Hand-segmented speech Licence: CLARIN PUB	Dutch	The corpus is available for download from an informal webpage.	Download
JASMIN Speech Corpus Size: 115 hours Annotation: PoS-tagged, lemmatised, phonetically transcribed Licence: CLARIN RES	Dutch	The corpus contains recordings of human-machine interaction and read speech performed by children, non-native speakers and senior people. The corpus is available for download from the Dutch Language Institute.	Download
Air Traffic Control Communication Size: 20 hours Annotation: speaker information Licence: CC BY-NC-ND 3.0	English	This corpus contains recordings of communication between air traffic controllers and pilots. The corpus is available for download from LINDAT and through the concordancer KonText.	Concordancer Download
Boston University Radio Speech Corpus Size: 7 hours Annotation: PoS-tagged, phonetic alignment, prosodic markers Licence: CLARIN RES	English	This corpus contains recordings and texts from radio news.	Download
Buckeye Corpus of Conversational Speech Annotation: phonetic labels Licence: CLARIN RES	English	This corpus contains an interview. The corpus is available for download from ORTOLANG. For the relevant publication, see Pitt et al. (2005)	Download
ELFA Corpus Size: 13 hours Licence: CLARIN RES, MS-C-No ReD-ND-FF	English	This corpus contains recorded lectures and seminars. The corpus is available for download from FIN-CLARIN.	Download
MultiCHannel Articulatory database: English Size: 5 hours Annotation: orthographically transcribed, Electromagnetic Articulography Licence: CLARIN PUB	English	This corpus features a set of 460 short sentences designed to include the main connected speech processes in English (e.g. assimilations, weak forms ...). All recordings were made in the same sound-damped studio at the Edinburgh Speech Production Facility based in the department of Speech and Language Sciences, Queen Margaret University College, UK. The database contains audio files, laryngograph waveforms, electromagnetic articulograph (EMA) tracks and electropalatograph (EPG) tracks.	Download
Corpus of Lecture Speech Size: 41 hours Annotation: orthographically transcribed Licence: CC-BY-SA	Estonian	This corpus contains recordings of academic lectures and oral conference presentations. The corpus is available for download from a dedicated webpage.	Download
Corpus of Radio Interviews Size: 36 hours Annotation: speech annotation to orthographically transcribed Licence: CC-BY	Estonian	This corpus contains telephone interviews from different radio programmes. The corpus is available for download from META-SHARE (CELR distribution).	Download
Corpus of Radio News Size: 19 hours Annotation: speech annotation to orthographically transcribed	Estonian	This corpus contains public broadcast news. The corpus is available for download from META-SHARE (CELR distribution).	Download
Estonian Dialect Corpus Size: 1.3 million words Annotation: phonetically transcribed, MSD-tagged, partly syntactically parsed Licence: CLARIN ACA	Estonian	This corpus contains interviews. The corpus is available for download from META-SHARE (CELR distribution).	Download
Estonian Emotional Speech Corpus Size: 1234 sentences Licence: CC-BY	Estonian	This corpus contains read sentences that express anger, joy and sadness, or are neutral. The corpus is available for download from META-SHARE (CELR distribution). For the relevant publication, see Altrov and Pajupuu (2012)	Download
Estonian North Wind and the Sun Corpus v.1.0.3 Annotation: word segmentation and phonemes in SAMPA	Estonian	This corpus contains recordings of the tale “Põhjatuul ja päike” (North Wind and the Sun) read by the same speakers who participated in the Phonetic Corpus of Estonian Spontaneous Speech. The corpus is available for download from META-SHARE (CELR distribution).	Download
Phonetic Corpus of Estonian Spontaneous Speech v.1.0.4 Size: 635,000 words, 90 hours Annotation: orthographically and phonetically transcribed, syllables, prosodic feet, intonation phrases, changes in voice quality Licence: CLARIN_RES	Estonian	This corpus contains spontaneous speech by speakers with different dialectological and social backgrounds. The corpus is available for download from META-SHARE (CELR distribution).	Download
Faroese Danish Corpus Hamburg 0.2.dan (FADAC-0.2.dan Hamburg) Annotation: EXMARaLDA Licence: HZSK-RES (restricted, non-commercial only)	Faroese, Danish	This corpus contains informal interviews.
Aalto University DSP Course Conversation Corpus 2013-2016, Downloadable Version Size: 5200 utterances Licence: CLARIN ACA	Finnish	This corpus contains spontaneous conversations. The corpus is available for download from FIN-CLARIN.	Download
Finnish Broadcast Corpus Size: 18 hours Licence: CLARIN RES	Finnish	This corpus contains radio and TV broadcasts. The corpus is available for download from FIN-CLARIN and for online querying through the LAT-platform.	Concordancer Download
Follow-up Study of Dialects of Finnish Size: 12,200 Hours Licence: CLARIN RES	Finnish	This corpus contains interviews. This corpus is available for online querying through the LAT-platform.	LAT platform
Route to A wing Size: 218 tokens Annotation: PoS-tagged Licence: CC-0	Finnish	This corpus contains spontaneous conversations. This corpus is available for online querying through the concordancer Korp.	Concordancer
Samples of Spoken Finnish Size: 100 hours Annotation: syntactically parsed (TDT alpha), named entities (FiNER), PoS-tagged, lemmatized; orthographically transcribed Licence: CC-BY	Finnish	This corpus contains interviews. This corpus is available for online querying through the LAT platform and through the concordancer Korp.	Concordancer LAT platform
The Finnish Dialect Syntax Archive Size: 1.2 million words Annotation: MSD-tagged Licence: CC-BY-NC-ND	Finnish	This corpus contains interviews. The corpus is available for online querying through the LAT platform and through the concordancer Korp.	Concordancer LAT platform
The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s) Size: 210 hours Licence: restricted	Finnish	This corpus contains interviews. The corpus is available for online querying through the LAT platform and through the concordancer Korp.	LAT platform
The Corpus of Border Karelia Size: 120 hours Licence: CC-BY	Finnish, Karelian	This corpus contains interviews. The corpus is available for download from FIN-CLARIN.	Download
Plenary Sessions of the Parliament of Finland Size: 22.5 million words Licence: CC-BY-NC-ND	Finnish, Swedish	This corpus contains the proceedings of the Finnish Parliament. The corpus is available through a dedicated webpage and through the concordancer Korp.	Concordancer Download
CLAPI Size: 323,595 words Licence: CC BY-NC-SA 4.0	French	This is a collection containing around 40 corpora which contain social interactions in different contexts: professional, private, institutional, commercial, medical, and educational situations. Most of the corpora can be downloaded and queried through a dedicated concordancer.	Concordancer
Corpus de Français Parlé Parisien des années 2000 Licence: CC-BY	French	This corpus contains interviews. The corpus is available for download from a dedicated webpage.	Download
Corpus for the Study of Contemporary French Size: 10 million words, 350 hours Annotation: orthographically aligned, PoS-tagged Licence: CC-BY 4.0	French	This corpus contains debates, classroom interactions, literary and scientific texts, regional and national press, etc. The corpus is available through a dedicated concordancer.	Concordancer
Corpus of Orleans Licence: CC BY-NC-SA 3.0	French	This corpus contains recordings of the everyday speech of Orléans residents between 1969 and 1974. The corpus is available for download from the Huma-num repository.	Download
Phonologie du Français Contemporain Licence: CC-BY	French	This corpus is available for download from a dedicated webpage.	Download
AbsolventInnen Size: 2 hours Annotation: orthographically transcribed, phonetic, phonemic transcription Licence: CLARIN ACA	German	This corpus provides data for examining the pronunciation of gender-neutral forms in German. The recordings took place at the IPS in the Munich region. 56 texts were recorded from 40 speakers. The texts came from newspapers, websites, administration offices, social services, etc., and were modified to contain either one of the three gender-neutral forms or the extended form. Each speaker read the 56 sentences, with the target words distributed evenly (25% each) across the asterisk, underscore, uppercase-I and feminine plural forms in a counterbalanced design. Filler sentences for this study are not part of the corpus but will be part of further investigations. This means that there are 56 recordings per session.	Download
aGender Size: 47 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The speech corpus aGender contains speech sample recordings over public telephone lines with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phone, read text and answered some open questions. The purpose of the corpus is the automatic detection of gender and/or age (7 mixed classes ranging from 7 to 80 years). The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions.	Download
Australiendeutsch Size: 330,000 words, 65 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
BAS Alcohol Language Corpus Size: 94 hours Annotation: orthographically transcribed, phonemic, user state Licence: CLARIN ACA	German	This corpus contains recordings of 162 speakers while being sober and intoxicated. Beginning with version 3, this corpus edition also contains an emuR compatible database version of the corpus (with a minor bugfix in the database in version 3.1).	Download
BAS Database for Signer-Independent Continuous Sign Language Recognition Size: 55 hours Annotation: Sign language Licence: CLARIN ACA	German	The corpus contains both isolated and continuous utterances of various signers. Since we use a vision-based approach for sign language recognition, the corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images. The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. Based on this vocabulary, overall 780 sentences were constructed.	Download
BAS Regional Variants of German - Juveniles Size: 100 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains both read and non-scripted German utterances. It comprises the original RVG prompts (telephone numbers, sentences, commands, digits, etc.) plus spellings, date and time expressions, and free form responses to questions, e.g. "What are you wearing?", "How did you get here?", etc. The speakers were adolescents between 13 and 20 years of age, recruited in public schools in Munich and the suburbs. More than 95% of the speakers have German as their mother language, and almost all of them attended school in Bavaria; 89 of them were male and 93 female.	Download
BAS Siemens Hoergeraete Corpus Size: 24 hours Annotation: Turn segmentation Licence: CLARIN ACA	German	This is a corpus of spontaneous, relatively casual dialogues in German. Each pair of dialogue partners is recorded conversing under real-noise conditions (in a noisy cafeteria and in a car going at different velocities), as well as in a studio at various levels of lombard noise played directly into the subjects' ears.	Download
BAS SmartWeb Video Size: 16.2 hours Annotation: orthographically transcribed, user state Licence: CLARIN ACA	German	The corpus comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer world series in 2006. The recordings include 156 field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), 99 field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as 36 mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC).	Download
BAS Verbmobil Emotion Size: 17 hours Annotation: orthographically transcribed, emotions Licence: CLARIN ACA	German	This database contains speech signals of dialogues in which a subject was recorded during a conversation via a spontaneous speech translation system. The response of the system was designed to invoke emotions (e.g. anger) in the subjects. It is part of the larger Verbmobil 2 speech data collection. Starting from BAS CLARIN Repository version 2, the database is also distributed as an emuR-compatible emu database.	Download
BAS ZIPTEL Size: 14 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The ZipTel telephone speech database contains recordings of people applying for a SpeechDat prompt sheet via telephone. For the SpeechDat data collection, calls for participation were published in "phone", the customer magazine of the mobile telephone provider "e-plus", and in numerous newspapers all over Germany. In these calls, a telephone number was given where callers could order a SpeechDat prompt sheet. The calls were recorded by an automatic telephone server; callers were asked to provide name, address and telephone number. The ZipTel telephone speech database consists of 1957 recording sessions with a total of 7746 signal files. A recording session corresponds to one phone call, each signal file contains a single recorded utterance from the recording session.	Download
Belgische TV-Debatten Size: 10 hours Annotation: orthographically transcribed, lemmatised Licence: CLARIN RES	German	This corpus contains broadcast TV debates. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Berliner Wendekorpus Size: 260,000 words, 28 hours Annotation: literal and PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains narrative interviews on German reunification. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Bielefeld Speech and Gesture Alignment Corpus Size: 9881 words Annotation: Annotations of gestures and speech-gesture referents Licence: CLARIN ACA	German	The corpus is made up of 25 dialogues between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data only from the direction giver, are available including audio (.wav) and video (.mp4) data. There are 1764 isolated gestures in the corpus.	Download
Biographische und Reiseerzählungen Size: 50,000 words, 6 hours Annotation: orthographically transcribed Licence: CLARIN RES	German	This corpus contains narrative and biographic interviews. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
CI Articulation Size: 5 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains speech recordings of normal hearing speakers and speakers equipped with Cochlear Implants (CI). Speech data were collected with the software SpeechRecorder; for each recording, a BPF file was generated (*.par).	Download
Cluster Production in Cochlear Implant Patients (diachronic data) Size: 14 min Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains diachronic speech recordings from three cochlear implant (CI) users. For data used in the corresponding synchronic study, please refer to the CI_2 corpora. This corpus contains recordings used for the analysis of the temporal dynamics of the consonant cluster /ʃtr/.	Download
Consonant Cluster Production in Cochlear Implant Patients Size: 2 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG).	Download
Corpus BITS Size: 16.5 hours Annotation: orthographically transcribed, phonetic, phonemic, prosodic Licence: CLARIN RES	German	This is a corpus for speech synthesis using concatenative technique.	Download
Corpus BROTHERS Size: 1.5 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains recordings of pairs of brothers between the ages of 19 and 31. The native and recorded language is German. Recordings consist of minimal pairs in carrier sentences, a different set of sentences aimed at eliciting the full range of German vowels ('Berliner Sätze'), and a spontaneous dialogue about a TV-series. Recordings were made via a table microphone (studio quality) and via telephone (telephone quality).	Download
Corpus FORMTASK Size: 24.5 hours Annotation: orthographically transcribed Licence: CLARIN PUB	German	This is a corpus of telephone conversations including prompted descriptions of typical forms (Berlin public transport ticket, invoices, Austrian parking tickets, newsstand receipts, money transfer forms) found in everyday life.	Download
Corpus HEMPEL Size: 25.5 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus is a collection of more than 3900 spontaneous speech items recorded as extra material during the German SpeechDat-II project. Speakers were asked to report what they had been doing during the last hour: "Was haben Sie in der letzten Stunde gemacht?". This item was recorded as the last item of the recording session. Speakers had become acquainted with the recording procedure and they were quite relaxed because they knew that this item was the last to be recorded. This resulted in quite natural, colloquial speech, sometimes with marked regional accent.	Download
Corpus RVG1_CLARIN Size: 32 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus is a collection of more than 500 speakers of different dialect regions of Germany. The recordings were made using four different microphones (two in low and two in high quality) and consist of single digits, connected digits, phone numbers, phonetically balanced sentences, computer command phrases prompted on a screen, and 1 min spontaneous speech (monologue). The speakers were recorded in normal office environments. The background noise was limited to the usual noise in an office environment, e.g. door slam, background crosstalk, phone ringing, paper rustle, PC noise, etc.	Download
Corpus SC1 Size: 1.5 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains speech of 88 different speakers, reading the German story 'Der Nordwind und die Sonne'. Subcorpus T contains the recordings of 16 native Germans (L1). The other 72 speakers, who were born and educated in other countries (L2), are pooled in subcorpus C. Every speaker has a distinct accent.	Download
Corpus SC10 Size: 10 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains read and non-prompted German and mother tongue speech of 70 different speakers from 17 mother tongues (L1) in a variety of speaking styles e.g. reading, retelling, free talk etc.	Download
Corpus SC2 Size: 9 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains read speech of 10 different speakers with screen prompted 'automobil diagnosis phrases' recorded under real conditions in two different car maintenance halls. The language is German. All speakers are male native Germans and have never participated in such a task before. They are all experts in the field of car diagnosis. Each speaker has spoken 800 3-7 word utterances derived from 100 different sentences (see sc2_ort.txt) resulting in a total of 8000 utterances.	Download
Corpus SHC Size: 30.6 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer world series in 2006. The recordings include field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC).	Download
Corpus SI100 Size: 31.5 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains read speech of 101 different speakers (50 female, 50 male, 1 unknown). Each speaker has read approx. 100 sentences from either the SZ subcorpus or the CeBit subcorpus. The language is German. The subcorpus SZ contains 544 sentences from newspaper articles ("Sueddeutsche Zeitung"). The subcorpus CeBit contains 483 sentences from newspaper articles about the CeBit 1995. Each subcorpus is divided into 5 parts of approx. 100 utterances each. Every speaker read only one part of one subcorpus (with some exceptions), thus resulting in a total of 10,387 recorded utterances.	Download
Corpus SI1000 Size: 32.8 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus contains read speech of 10 different speakers. Each speaker has read approx. 1000 sentences from a German newspaper corpus, thus resulting in a total of approx. 10000 recorded utterances. The recording took place at the Institut fuer Phonetik, University of Munich, Germany in 1994.	Download
Deutsche Hochlautung Size: 10,000 words, 2 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains broadcasts in standard German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Deutsche Mundarten: ehemalige deutsche Ostgebiete Size: 838,000 words, 461 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Deutsche Standardsprache: König-Korpus Size: 50,000 words, 6 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews and elicited speech in standard German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Deutsche Umgangssprachen: Pfeffer-Korpus Size: 646,000 words, 80 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews in regional varieties of German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Dialogstrukturen Size: 140,000 words, 15 hours Annotation: orthographically transcribed, intonation, lemmatised, PoS-tagged, time alignment Licence: CLARIN RES	German	This corpus contains authentic interaction from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Elizitierte Konfliktgespräche Size: 160,000 words, 12 hours Annotation: orthographically transcribed Licence: CLARIN RES	German	This corpus contains elicited conflict interaction. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Emigrantendeutsch in Israel Size: 232,000 words, 285 hours Annotation: orthographically transcribed, lemma, PoS-tagged, time alignment Licence: CLARIN RES	German	This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Emigrantendeutsch in Israel: Wiener in Jerusalem Size: 225,000 words, 51 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Forschungs- und Lehrkorpus gesprochenes Deutsch Size: 2.3 million words, 230 hours Annotation: literal and PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains authentic interactions from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Gesprochenes Wortkorpus für Untersuchungen zur auditiven Verarbeitung von Sprache und emotionaler Prosodie Size: 3 hours Annotation: phonetic Licence: CLARIN ACA	German	WaSeP contains recordings of one female and one male speaker, both professional actors, uttering single German nouns and pseudowords in multiple emotional prosodies. This edition improves the segmentation of the phonetic annotation, adds Praat TextGrid files and removes a few irregular items.	Download
Grundstrukturen: Freiburger Korpus Size: 600,000 words, 70 hours Annotation: orthographically transcribed, intonation, lemmatised, PoS-tagged, time alignment Licence: CLARIN RES	German	This corpus contains authentic interaction from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Hamburg Modern Times Corpus Size: 3 hours Annotation: manual annotation of phonetic phenomena, accent/stress marking Licence: HZSK-ACA (academic, non-commercial only)	German	This corpus contains task-oriented communication (e.g., a film retelling) in the context of studying adult L2 acquisition.	Download
Mehrsprachige Kinder im Vorschulalter Size: 17,000 words, 13 hours Annotation: literal and PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains elicitation tasks with pre-school children. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Natural Media Motion-Capture Corpus Size: 3 hours Annotation: orthographically transcribed, gestures, motion capture of hands Licence: CLARIN ACA	German	The corpus consists of data from 18 participants, whose task was to describe nine objects each to an experimenter, without using everyday vocabulary about forms, sizes or objects. The participants were recorded on audio and several video cameras, and their hand movements were recorded using an optical VICON motion capture system.	Download
Nautilus Speaker Characterization Size: 155 hours Annotation: orthographically transcribed, turn taking, perceived interpersonal speaker characteristics, voice descriptions Licence: CLARIN ACA	German	This corpus contains scripted, semi-spontaneous, and spontaneous human-human dialogs. In total, 300 speakers of German without noticeable accent participated and were recorded in an acoustically-isolated room. Interactions between speakers and their interlocutor are provided in separate mono files, accompanied by timestamps and tags that define the speaker's turns. The speech corresponding to one of the semi-spontaneous dialogs was labeled with respect to perceived interpersonal speaker characteristics and naive voice descriptions. These labels are found alongside the documentation.	Download
PhattSessionz Adolescents Speech Corpus Size: 208 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains recordings of 1019 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 45 locations in Germany. The speech material recorded is a superset of the German SpeechDat-II and RVG-I corpora.	Download
PhonDat 1 Size: 21.4 hours Annotation: orthographically transcribed, phonemic Licence: CLARIN ACA	German	The corpus contains read speech of 201 different speakers. Each speaker read a subcorpus of 450 different sentence equivalents (including alphanumericals and two shorter passages of prose text); 8 speakers read the whole sentence corpus; 40 speakers read the subcorpora BR and MR; 112 speakers read 70 utterances of the rest corpus, including alphabet, numbers 0 to 12 and stories. The corpus contains a total of 21587 recorded utterances.	Download
PhonDat 2 Size: 4.3 hours Annotation: orthographically transcribed, phonemic, phonetic Licence: CLARIN ACA	German	The corpus contains read speech of 16 different speakers, 6 women and 10 men. Each speaker reads a corpus of 200 different sentences from a train query task. They were recorded at three different sites in Germany (University of Kiel, University of Bonn, University of Munich). The language is German. The corpus contains a total of 3200 recorded utterances.	Download
Russlanddeutsche Dialekte Size: 100,000 words, 10 hours Annotation: literal and PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German	This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Sibilant Production in Cochlear Implant Patients Size: 1 hour Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). CI_2_Sibilants contains recordings used for the analysis of /s/ and /ʃ/ in the following words: 'Tasse', 'Tasche'.	Download
Sibilant Production in Cochlear Implant Patients (diachronic data) Size: unknown Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains diachronic speech recordings from three cochlear implant (CI) users. For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_Sibilants contains recordings used for the analysis of /s/ and /ʃ/ in the following words: 'Tasse', 'Tasche'.	Download
SmartKom Home Size: 11 hours Annotation: orthographically transcribed, phonemic, gestures, mimic, emotions Licence: CLARIN ACA	German	This corpus contains multimodal recordings of 65 actors who use the SmartKom system. SmartKom Home should be an intelligent communication assistant for the private environment. Naive users were asked to test a 'prototype' for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 min while they were left alone with the system. The instruction was kept to a minimum; in fact the user only knew that the system is able to understand speech, gestures and even facial expressions and should more or less communicate like a human.	Download
SmartKom Mobil Size: 11 hours Annotation: orthographically transcribed, phonemic, gestures, mimic, emotions Licence: CLARIN ACA	German	This corpus contains multimodal recordings of 73 actors who use the SmartKom system. SmartKom Mobil is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a 'prototype' for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 min while they were left alone with the system. The instruction was kept to a minimum; in fact the user only knew that the system is able to understand speech, gestures and should more or less communicate like a human. Experiments were not performed in the field but rather in a studio-like environment.	Download
SmartKom Public Size: 11 hours Annotation: orthographically transcribed, phonemic, gestures, mimic, emotions Licence: CLARIN ACA	German	This corpus contains multimodal recordings of 86 actors who use the SmartKom system. SmartKom Public is comparable to a traditional public phone booth but equipped with additional intelligent communication devices. Naive users were asked to test a 'prototype' for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 min while they were left alone with the system. The instruction was kept to a minimum; in fact the user only knew that the system is able to understand speech, gestures and even facial expressions and should more or less communicate like a human.	Download
SmartWeb Motorbike Corpus SMC Size: 6.3 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	The corpus comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer world series in 2006. The SMC corpus itself contains 36 mobile recordings performed on a BMW motorbike.	Download
Spoken production of gender-neutral nouns in German Size: 2 hours Annotation: orthographically transcribed Licence: CLARIN PUB	German	This corpus examines the pronunciation of different gender-neutral forms in German. Various source texts were used, such as newspaper articles and websites.	Download
The Karl-Eberhard-Corpus of spontaneously spoken conversations in Southern German Size: 40 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains 79 speakers of Southern German. Two speakers, usually acquainted with each other, had a one-hour-long conversation in separate booths.	Download
The Zurich Tangram Corpus - BAS Edition Size: 48 hours Annotation: orthographically transcribed, word and phonemic segmentation Licence: CLARIN ACA	German	This corpus contains tasks, where one subject (the instructor) describes different Tangram figures to another subject (the receiver) so that the receiver can recreate the same order of figures that the instructor has in front of them. The subjects initially don't know each other and work together to solve these tasks in three consecutive sessions. This edition only features the transcribed segments, not those in between, and uses separate files for the subject.	Download
The Zurich Tangram Corpus - UZH Edition Size: 48 hours Annotation: orthographically transcribed, turn segmentation Licence: CLARIN ACA	German	This corpus contains tasks, where one subject (the instructor) describes different Tangram figures to another subject (the receiver) so that the receiver can recreate the same order of figures that the instructor has in front of them. The subjects initially don't know each other and work together to solve these tasks in three consecutive sessions. This edition features the complete recordings, but lacks phone and word segmentation. The subjects' audio tracks are combined into stereo files. If you would like just the transcribed segments with separate files for the subjects, or want the word and phone segmentation, see corpus ZTC_BAS.	Download
Voice Onset Time in Cochlear Implant Patients Size: 35 min Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). It contains recordings used for the analysis of voice onset time in /t/ in the word 'teilen'.	Download
Voice Onset Time in Cochlear Implant Patients (diachronic data) Size: unknown Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains diachronic speech recordings from three cochlear implant (CI) users. For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_Sibilants contains recordings used for the analysis of /s/ and /ʃ/ in the following words: 'Tasse', 'Tasche'. CI_3_VOT contains recordings used for the analysis of voice onset time in /t/ in the word 'teilen'.	Download
Vowel Production in Cochlear Implant Patients Size: 2 hours Annotation: orthographically transcribed Licence: CLARIN ACA	German	This corpus contains German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). It contains recordings used for the analysis of seven long, lexically stressed vowels in the words 'Taten', 'stetig', 'Toter', 'Stute', 'töten', 'Tüte' and 'kriegen'.	Download
Zweite Generation deutschsprachiger Migranten in Israel Size: 125 hours Annotation: orthographically transcribed, code switching Licence: CLARIN RES	German	This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
GeWiss Size: 1.4 million tokens, 123 hours Annotation: code switching	German (L2 and L1), English, Polish, Italian (L1)	This corpus contains transcripts and audio recordings of spoken academic discourse, primarily talks including discussions and oral exams. For the relevant publication, see Fandrych et al. (2014)	Concordancer
Deutsche Mundarten: Zwirner-Korpus Size: 4 million words, 1076 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German, (some Frisian and Dutch)	This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Deutsche Mundarten: DDR Size: 212,000 words, 385 hours Annotation: PoS-tagged, lemmatised, time-aligned, orthographically transcribed Licence: CLARIN RES	German, (some Sorbian)	This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
EXMARaLDA Demo Corpus 1.0 Size: 2 hours Annotation: suprasegmental information, accentuation/stress marking Licence: HZSK-PUB (public, non-commercial only)	German, English, French, Spanish, Turkish, Polish, Vietnamese, Swedish, Norwegian, Italian, Russian, Afrikaans, Portuguese	This corpus is a demo of the EXMARaLDA system. The corpus is available for download from a CLARIN-D repository.	Download
Corpus Verbmobil I Size: 77 hours Annotation: orthographically transcribed, phonetic, phonemic, prosodic Licence: CLARIN ACA	German, English, Japanese	The Verbmobil (VM) dialog database is a collection of German, American and Japanese dialog recordings in the appointment scheduling task. The data were collected during the first phase (1993 - 1996) of the German VM project funded by the German Ministry of Science and Technology (BMBF). Starting with version 3, the corpus is also provided as an emuR compatible database.	Download
Corpus Verbmobil II Size: 65.8 hours Annotation: orthographically transcribed, phonetic, phonemic, prosodic Licence: CLARIN ACA	German, English, Japanese	Verbmobil 2 contains the speech of 401 speakers participating in 810 recordings. The emotionally tagged recordings are not part of this edition but are collected in the corpus 'BAS VMEmo'. The total VM2 corpus amounts to 17.6GB of data containing 58961 conversational turns distributed on 39 CD-R. VM2 contains dialogs in German, English, Japanese and mixed language pairs (partly with interpreter). The domain is appointment scheduling, travel planning, leisure time planning. Starting from version 3, the corpus is also available in emuR compatible emuDB format (see annotation files ending in *_annot.json).	Download
Gesprochene Wissenschaftssprache Kontrastiv Size: 760,000 words, 92 hours Annotation: literal and PoS-tagged, lemmatised, time-aligned, orthographically transcribed, annotation of discourse phenomena and language mixing Licence: CLARIN RES	German, English, Polish, Bulgarian	This corpus contains academic interaction. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim).	Concordancer Download
Hamburg Adult Bilingual LAnguage (HABLA) Size: 79 hours Licence: HZSK-RES (restricted, non-commercial only)	German, French, Italian	This corpus contains interviews. For the relevant publication, see Kupisch et al. (2012)
Budapest Sociolinguistic Interview - version 2 Size: 270,000 words Annotation: MSD-tagged, spoken language phenomena (hesitation, consonant drops) Licence: CLARIN RES	Hungarian	This corpus contains sociolinguistic interviews conducted with 50 individuals. The corpus is available for download and through a dedicated concordancer. For the relevant publication, see Kontra and Váradi (1997)	Concordancer Download
Gamli: Icelandic Oral History Corpus Size: 146 hours of transcribed audio Annotation: Subset is manually annotated with speaker ID and time alignment Licence: CC BY 4.0	Icelandic	This is an ASR corpus for Icelandic oral histories. The corpus contains 210 unique speakers, 90 women and 120 men (plus the interviewers: 14 men and 1 woman), but the total audio length with each individual speaker varies quite a lot with three men accounting for one third of the entire data. The age ranges from 38 to 99, but most of the speakers are 60+ (94.8%) and the average age of the speakers is 77 years. This ratio is unprecedented in all existing corpora for Icelandic speech (cf. 4.8% of speakers in Samrómur are 60+) and makes Gamli an important addition to that collection. The corpus is available for download from the CLARIN-IS repository. For the relevant publication, see O’Brien et al. (2023)	Download
Kennslurómur Size: 51 hours Annotation: sentence-segmented orthographic transcriptions Licence: CC BY 4.0	Icelandic	This corpus contains recordings of lectures at Reykjavik University and the University of Iceland. The lectures were donated by the lecturers (172 lectures by 14 lecturers), transcribed with an Icelandic speech recognizer and then manually corrected by human transcribers and finally verified by a proofreader.	Download
Samrómur Size: 145 hours, 100,000 utterances Annotation: orthographically transcribed Licence: CC BY 4.0	Icelandic	This corpus contains validated speech recordings and is a result of a crowd-sourcing effort run by the Language and Voice Lab at Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains recordings by 8,392 different speakers, with the average recording length being 5.2 seconds. Transcriptions of the read texts are also available. The corpus is available for download from the CLARIN.IS repository.	Download
Spjallromur - Icelandic Conversational Speech Size: 21 hours Licence: CC BY 4.0	Icelandic	This corpus contains recordings of 54 conversations by 102 speakers, recorded between September 2020 and September 2021. The corpus is available for download from the CLARIN.IS repository.	Download
Talrómur 2 Size: 56,225 utterances Licence: CC BY 4.0	Icelandic	This corpus consists of recordings of forty different speakers reading short sentences and is intended for modelling prosody. The corpus is available for download from the CLARIN.IS repository.	Download
The Icelandic Spoken Language Corpus Size: 536,000 tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC BY 4.0	Icelandic	This corpus contains four different subcorpora: (1) Spontaneous conversations, from the project ÍSTAL (An Icelandic Spoken Language Bank), (2) Group conversations, from the project MIN (Modern loanwords in the Nordic languages), (3) Parliamentary debates, (4) Conversations of teenagers with other teenagers and adults. The corpus is available for download from CLARIN-IS (as a part of the Icelandic Gigaword Corpus) and for search through the concordancer Korp. For the relevant publication, see Steingrímsson et al. (2018)	Concordancer Download
CLIPS : corpora e lessici di italiano parlato e scritto Size: 100 hours	Italian	This corpus contains speech from 15 different cities in Italy.	Download
Corpus CLIPS_MT_MANUAL Size: 3 hours Annotation: orthographically transcribed, phonemic, phonetic Licence: CLARIN ACA	Italian	This is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell'Italiano Parlato e Scritto) that is manually annotated and covers only 15 maptask dialogues recorded in 15 locations by local speaker pairs. This corpus contains 3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (.wav), 3228 corrected original CLIPS annotation files (.acs, .phn, .std, .wrd), 3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (.par), 3228 EMU database annotation files (.vot, .hlb) covering 30 maptask dialogues performed by 30 speakers (each speaker pair performing two different map tasks) recorded in 15 different locations in Italy in 2000-2004.	Download
LMU AsiCa Size: 47 hours Annotation: phonetic transcription Licence: CLARIN RES	Italian	This corpus documents the South Italian dialect 'Calabrese'. The main objectives in building this corpus were the analysis of syntactic structures and their geolinguistic mapping in the form of interactive, web-based cartography. The corpus consists of several audio files containing recordings of some sixty speakers of Calabrese, half of whom have migration experience in Germany, while the other half have almost always stayed in Calabria. Furthermore, the informants were selected to be balanced with regard to gender, age and geographical origin. For most of the informants there is at least one recording of spontaneous speech and one recording based on stimuli.	Download
Nganasan Spoken Language Corpus (NSLC) Size: 32 hours Annotation: alignment of transcriptions and audio recordings Licence: HZSK-RES (restricted, non-commercial only)	Nganasan, Russian	This second version 0.2 of the corpus is a subcorpus that comprises 177 communications, 136 of which contain an aligned audio recording, with glossed (Toolbox/FLEx) and annotated (EXMARaLDA) transcripts from 57 speakers. All texts have been translated into Russian and English, some also into German. The corpus also contains rich metadata on the communications and speakers.
LIA Size: 1.5 million tokens Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised Licence: CLARIN ACA	Norwegian	This corpus contains interviews and conversation in Norwegian dialects. The corpus is available through a Tekstlab concordancer (account needed).	Concordancer
NoTa-Oslo Size: 1 million tokens Annotation: orthographically transcribed, MSD-tagged, lemmatised Licence: CLARIN ACA	Norwegian	This corpus contains interviews and conversations in Oslo sociolects. The corpus is available through a Tekstlab concordancer (account needed).	Concordancer
TAUS Size: 270,000 tokens Annotation: MSD-tagged, lemmatised, orthographically and partially phonetically transcribed Licence: CLARIN ACA	Norwegian	This corpus contains informal interviews in Oslo sociolects. The corpus is available through a Tekstlab concordancer (account needed).	Concordancer
The BigBrother Corpus Size: 440,300 tokens Annotation: orthographically transcribed, MSD-tagged, lemmatised Licence: CLARIN ACA	Norwegian	This corpus contains recordings and transcripts from the Norwegian Big Brother in 2001. The corpus is available through a Tekstlab concordancer.	Concordancer
Corpus of American Nordic Speech (CANS) Size: 251,000 tokens Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised Licence: CLARIN ACA	Norwegian, Swedish	This corpus contains interviews and conversations in Norwegian and Swedish dialects spoken in America. The corpus is available through a Tekstlab concordancer. For the relevant publication, see Johannessen (2015)	Concordancer
Nordic Dialect Corpus v. 4.0 Size: 2,754,289 tokens Annotation: MSD-tagged, phonetically transcribed, orthographically transcribed Licence: CLARIN ACA	Norwegian, Swedish, Danish, Faroese, Icelandic, Övdalian	This corpus consists of spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries. The linguistic data in the corpus comes from a variety of sources (see homepage - Data Collection), recorded in 1998-2015. The corpus is transcribed and linked to audio and video, has a map function, and can be searched in a large variety of ways. The corpus can be accessed online via a concordancer provided by the TekstLab (a CLARINO node).	Concordancer
Hamburg Corpus of Polish in Germany (HamCoPoliG) Size: 38 hours Licence: HZSK-RES (restricted, non-commercial only)	Polish	This corpus contains spontaneous speech and reading tasks. For the relevant publication, see Czachór (2012)
Consecutive and Simultaneous Interpreting (CoSi) Size: 6 hours Licence: HZSK-RES (restricted, non-commercial only)	Portuguese, English	This corpus contains lectures in Portuguese with simultaneous interpretation in English.
ASR training dataset for Serbian JuzneVesti-SR Size: 50.55 hours, 10811 entries Annotation: normalised transcriptions (lowercased, punctuation removed, numerals spelled out), speaker metadata, word-level alignment to the recordings Licence: CC BY-SA 4.0	Serbian	This corpus consists of audio recordings and manual transcripts from the Južne Vesti website and its host show called the 15 minuta. The processing of the audio and its alignment to the manual transcripts followed the pipeline of the ParlaSpeech-HR dataset as closely as possible. Segments in this dataset range from 2 to 30 seconds. Train-dev-test split has been performed with 80:10:10 ratio. As with the ParlaSpeech-HR dataset, two transcriptions are provided; one with transcripts in their raw form (with punctuation, capital letters, numerals) and another normalised with the same rule-based normaliser as was used in ParlaSpeech-HR dataset creation, which is lowercased, punctuation is removed and numerals are replaced with words. The speaker_info attribute is less abundant due to the fact that compared to parliamentary corpora less data is available in this domain, so it covers only the guest name, guest description, host name, and speaker breakdown (when the host or the guest are speaking). This corpus is available for download from the CLARIN.SI repository.	Download
Skolt Saami Documentation Corpus (2016) Size: 19 hours Annotation: MSD-tagged Licence: CLARIN RES	Skolt Saami	This corpus contains interviews. This corpus is available for online querying through the LAT platform.	LAT platform
ASR database ARTUR 1.0 Size: 884 hours Annotation: orthographically transcribed Licence: CC BY-SA 4.0	Slovenian	This corpus was designed for the needs of developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. The audio files are available in a separate repository entry. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool which was used for making the transcriptions. All transcriptions were made manually or manually corrected. The data are structured as follows: (1) Artur-B, read speech, 573 hours in total. It includes: (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural or the actual distribution of triphones in the words. They were distributed between 1,000 speakers, so that approx. 30 min of read speech was recorded from each speaker. The speakers were balanced according to gender, age, region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file. (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters. (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file. (1d) Artur-B-Izloceno, 27 hours: The recordings include different types of errors, typically, incorrect reading of sentences or a noisy environment. (2) Artur-J, public speech, 62 hours in total. It includes: (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, educational videos, etc. (3) Artur-N, private speech, 74 hours in total. It includes: (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition. (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition. (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics. (4) Artur-P, parliamentary speech, 201 hours in total. It includes: (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.	Download (transcriptions) Download (audio files)
Hamburg Corpus of Argentinean Spanish (HaCASpa) Size: 19 hours Annotation: orthographically transcribed Licence: HZSK-RES (restricted, non-commercial only)	Spanish (Argentinian)	This corpus contains spontaneous speech and reading tasks. For the relevant publication, see Gabriel et al. (2010)
Catalan in a bilingual context (PhonCAT) Size: 144 hours Annotation: orthographically and phonetically transcribed Licence: HZSK-RES (restricted, non-commercial only)	Spanish (Catalan)	This corpus contains read, elicited and spontaneous speech. For the relevant publication, see Benet et al. (2012)
Schweizer Jugendsprache Size: 92 hours Annotation: orthographically transcribed Licence: CLARIN RES	Swiss German	This corpus contains recordings of adolescent pupils in Switzerland.	Download

Corpora with transcriptions only

Corpus	Language	Description	Availability
Map task corpus of heritage BCMS 1.0 Size: 12,988 tokens Annotation: PoS-tagged (UD), MSD-tagged (UD & MULTEXT-East), lemmatised, annotated with corpus-specific annotations Licence: CC BY-NC-SA 4.0	Bosnian, Croatian, Montenegrin, Serbian	This corpus of heritage Bosnian/Croatian/Montenegrin/Serbian (BCMS) consists of elicited conversations (map tasks) by 29 second-generation BCMS speakers originating from different regions of former Yugoslavia and living in German-speaking Switzerland. The corpus is suited for researchers of heritage BCMS, as well as students and teachers of BCMS living in diaspora. The corpus contains 30 turn-aligned transcripts with an average length of 6 minutes. The texts are annotated with the CLASSLA pipeline on the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, and Universal Dependencies part-of-speech and morphological features. The corpus is enriched with corpus-specific annotations of truncations, elongations, stutters and code-switches. It is distributed in source and derived vertical formats. The corpus is available for download from CLARIN.SI as well as through the noSketchEngine and KonText concordancers.	Concordancer (noSketchEngine) Concordancer (KonText) Download
ORAL2008: Balanced corpus of informal spoken Czech Size: 1 million tokens Licence: CC BY-NC-SA 3.0	Czech	This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Benešová et al. (2015)	Concordancer Download
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions) Size: 1 million tokens Annotation: orthographically and phonetically transcribed, MSD-tagged, lemmatised Licence: CC BY-NC-SA 4.0	Czech	This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For the relevant publication, see Komrsková et al. (2018)	Concordancer Download
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5 Size: 120,000 words Annotation: syntactic dependencies Licence: ACADEMIC (PDTSL)	Czech	This corpus is available for download from LINDAT.	Download
Languages in Migration Annotation: syntactic dependencies Licence: Czech National Corpus (Shuffled Corpus Data)	Czech, German	This corpus is a representation of authentic spoken Czech and German. It contains transcriptions of informal speech (private environment, spontaneity, unpreparedness etc.) by Czech-German bilingual speakers who were born in Czechoslovakia around 1955 and left for Germany after the age of 12. The corpus is composed of interviews on language biographies, conducted between 2018 and 2020 with 20 speakers and narrated in Czech and German respectively. The corpus is available for download from LINDAT and for online browsing through the KonText concordancer. For the relevant publication, see this list of publications	Concordancer Download
ParCorFull: A Parallel Corpus Annotated with Full Coreference Size: 160,000 tokens Annotation: coreference (nominal and clausal) Licence: CC BY-NC-ND 4.0	English, German	This corpus contains planned speech and newswire. The corpus is available for download from LINDAT.	Download
The Spoken Wikipedia Corpora Size: 1005 hours Annotation: text segmentation, normalization, time-alignment Licence: CC BY-SA 4.0	English, German, Dutch	This corpus contains transcripts of read Wikipedia articles. The corpus is available for download from a CLARIN-D repository. For the relevant publication, see Köhn et al. (2016)	Download
Corpus of Spoken Estonian Size: 1 million words Annotation: unspecified tagging	Estonian	This corpus contains transcripts of recordings from various domains.
ALCEBLA Size: 72 hours Annotation: orthographic and phonetic transcription Licence: HZSK-RES (restricted, non-commercial only)	German, Spanish	This corpus contains speech tasks performed by bilingual children. For the relevant publication, see Ulloa Saceda et al. (2012)
Corpus of Doctor-Patient Conversations from Ahus Size: 958,830 tokens Annotation: orthographically transcribed, MSD-tagged, lemmatised Licence: CLARIN ACA	Norwegian	This corpus contains doctor-patient conversations. The corpus is available through a Tekstlab concordancer (account needed).	Concordancer
Corpus of Serbian Forms of Address 1.1 Size: 171,546 words Annotation: MSD-tagged, lemmatised, normalised Licence: CC BY-NC-SA 4.0	Serbian	This corpus consists of transcripts of audio-recorded biographical interviews with 19 participants. The interviews are about forms of address that speakers use in colloquial and in formal settings, and about their attitudes and evaluations concerning particular forms of address. We provide original transcripts (written according to GAT conventions), as well as transcripts in CoNLL-U and TEI-XML format. The corpus has been normalised, tagged with morphosyntactic and lemma information using the CLASSLA-StanfordNLP tagger, and aligned with the respective turns in the audio files. Time alignments as well as partial annotation corrections are stored in TEI-XML. The corpus is available for download from CLARIN.SI as well as through the noSketchEngine and KonText concordancers.	Concordancer (noSketchEngine) Concordancer (KonText) Download
Spoken corpus Gos 2.0 Size: 1534 texts; 127,604 utterances; 2,462,368 words Annotation: phonetic and orthographic transcription, PoS tagging, lemmatisation Licence: CC BY-SA 4.0	Slovenian	This corpus contains transcripts from radio and TV shows, school lessons, private conversations, and business meetings. It is composed of three different sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), and a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 million words). The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer. For the relevant publication, see Verdonik and Zwitter-Vitez (2011)	Concordancer Download
Spoken corpus Gos VideoLectures 3.0 (transcription) Size: 126,000 words Annotation: PoS-tagged, lemmatised, orthographically and phonetically transcribed Licence: CC BY 4.0	Slovenian	This corpus contains public academic speech. The corpus is available for download from CLARIN.SI and through the concordancer KonText. For the version with audio recordings, click here. For the relevant publication, see Verdonik (2018)	Concordancer Download
Gothenburg Dialogue Corpus Size: 1,470,000 tokens Annotation: MSD-tagged, lemmatised Licence: CC-BY	Swedish	This corpus is available through the concordancer Korp (account needed).	Concordancer

Other spoken corpora

Corpus	Language	Description	Availability
Griffith Corpus of Spoken Australian English Size: 32,134 words	English	This corpus is available for download and through the concordancer of the Australian National Corpus.	Concordancer Download
Spoken BNC2014 Size: 10 million words	English	This corpus contains face-to-face conversations between people who speak British English as their first language. The corpus is available through the CQP concordancer.	Concordancer
The Aston Corpus of West Midlands English (ACWME) Annotation: orthographically transcribed	English	This corpus contains recordings of performances - comedy, drama, poetry, song and story-telling - and related interviews with performers, members of the audience and local and national celebrities. The corpus is available for download from a dedicated webpage.	Download
Vienna-Oxford International Corpus of English	English	This corpus contains naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). The corpus is available through a dedicated concordancer.	Concordancer
AN.ANA.S._MT	English, Italian, Spanish	This corpus contains TV-broadcasts and elicited dialogues.
Babel - A Multi Language Database Annotation: orthographically transcribed	Hungarian	This corpus contains various elicited speech tasks.
BEA (Hungarian Spontaneous Speech Database) Size: 465 recordings Annotation: partial transcription Licence: restricted	Hungarian	This corpus contains spontaneous speech.
Hungarian Broadcast News Database Size: 25,000 words, 3.5 hours Annotation: audio-level annotations Licence: META-SHARE NC-NoReD	Hungarian	This corpus is available for download (upon request) from META-SHARE.	Download
Hungarian Gigaword Corpus / "spoken language" subcorpus Size: 76 million words Annotation: PoS-tagged, MSD-tagged	Hungarian	This corpus contains radio broadcasts (reading aloud and spontaneous conversation). The corpus is available through the Hungarian Gigaword Corpus concordancer.	Concordancer
Hungarian Kindergarten Language Corpus Size: 192,000 words Annotation: PoS-tagged, MSD-tagged Licence: restricted	Hungarian	This corpus contains elicited speech tasks (picture descriptions) and guided conversation with children. The corpus is available for download through META-SHARE.	Download
Hungarian Reference Speech Database Size: 6 hours Annotation: partial phonemic-level annotation Licence: META-SHARE No-Redistribution Commercial FF	Hungarian	This corpus contains reading tasks. The corpus is available for download (upon request) from META-SHARE.	Download
Medical Speech Database Annotation: phonetic transcription Licence: META-SHARE C-NoReD-FF	Hungarian	This corpus is available for download (upon request) from META-SHARE.	Download
Corpus LIP Size: 490,000 words	Italian	This corpus is available through a dedicated concordancer.	Concordancer
Corpus AVIP-API Annotation: orthographically transcribed	Italian	This corpus contains quasi-spontaneous dialogues (a map task). The corpus is available for download from a dedicated webpage.	Download
Corpus Lips Size: 700,000 words, 100 hours Annotation: PoS-tagged, lemmatised	Italian	This is an L2-learner corpus. The corpus is available for download from a dedicated webpage.	Download
Selezione dal "Corpus di parlato telegiornalistico. Anni Sessanta vs. 2005" Annotation: orthographically transcribed	Italian	This corpus contains news broadcasts. The corpus is available for download from a dedicated webpage.	Download
SpIt-MDb (Spoken Italian - Multilevel Database) Annotation: orthographically transcribed	Italian	This corpus contains spontaneous speech. The corpus is available for download from a dedicated webpage.	Download
ESLORA 2.0 Size: 83 documents, 768,005 words, 898,914 tokens Annotation: manual alignment, orthographic transcription, PoS-tagging, lemmatisation Licence: academic, non-commercial	Spanish	This corpus consists of spontaneous conversations and semi-structured interviews recorded in Galicia between 2007 and 2015, which were orthographically transcribed and manually aligned to the audio files. The transcripts have been morphologically tagged and lemmatized with the statistical PoS-tagger XIADA. The corpus can be browsed via a dedicated search engine. The multiple functions of the search engine are fully described in the User Guide. For the relevant publications, see Barcala et al. (2018) and Vázquez Rozas and Barcala (2020)	Concordancer
Uralic Languages under the Influence (UraLUID) database Size: 108,000 tokens, 4 hours Annotation: MSD-tagged, time-alignment, phonetic and orthographic transcription	Udmurt, Tundra Nenets, Synya Khanty, Surgut Khanty	This corpus contains narratives (e.g., folk stories). The corpus is available for download from a dedicated website.	Download

Publications on Spoken Corpora

[Altrov and Pajupuu 2012] Rene Altrov, Hille Pajupuu. 2012. Estonian Emotional Speech Corpus: theoretical base and implementation.

[Barcala et al. 2018] Mario Barcala, Eva Domínguez, Alba Fernández, Raquel Rivas, Maria Paula Santalla, Victoria Vázquez, Rebeca Villapol. 2018. El corpus ESLORA de español oral: diseño, desarrollo y explotación. CHIMERA: Romance Corpora and Linguistic Studies 5 (2): 217–237.

[Benešová et al. 2015] Lucie Benešová, Michal Křen, Martina Waclawičová. 2015. Korpus spontánní mluvené češtiny ORAL2013.

[Benet et al. 2012] Ariadna Benet, Susana Cortés, Conxita Lleó. 2012. Phonoprosodic Corpus of Spoken Catalan (PhonCAT).

[Czachór 2012] Agnieszka Czachór. 2012. Corpus of Polish Spoken in Germany. Collecting and Analyzing Written & Spoken Data for Investigating Contact-Induced Change.

[Gabriel et al. 2010] Christoph Gabriel, Ingo Feldhausen, Andrea Pešková, Laura Colantoni, Su-Ar Lee, Valeria Arana, Leopoldo Labastía. 2010. Argentinian Spanish Intonation.

[Hajič et al. 2008] Jan Hajič, Silvie Cinková, Marie Mikulová, Petr Pajas, Jan Ptáček, Josef Toman, Zdeňka Urešová. 2008. PDTSL: An Annotated Resource For Speech Reconstruction.

[Halabi 2016] Nawar Halabi. 2016. Modern Standard Arabic Phonetics for Speech Synthesis.

[Johannessen 2015] Janne Bondi Johannessen. 2015. The Corpus of American Norwegian Speech (CANS).

[Komrsková et al. 2018] Zuzana Komrsková, Marie Kopřivová, David Lukeš, Petra Poukarová, Hana Goláňová. 2018. New Spoken Corpora of Czech: ORTOFON and DIALEKT.

[Kontra and Váradi 1997] Miklós Kontra and Tamás Váradi. 1997. The Budapest Sociolinguistic Interview.

[Köhn et al. 2016] Arne Köhn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond.

[Kupisch et al. 2012] Tanja Kupisch, Dagmar Barton, Giulia Bianchi, Ilse Stangen. 2012. The HABLA-Corpus (German-French and German-Italian).

[Pitt et al. 2005] Mark Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, William Raymond. 2005. The Buckeye Corpus of Conversational Speech: Labeling Conventions and a Test of Transcriber Reliability.

[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus.

[Ulloa Saceda et al. 2012] Marta Ulloa Saceda, Conxita Lleó, Izarbe García Sánchez. 2012. Corpora of spoken Spanish by simultaneous and successive German-Spanish bilingual and Spanish monolingual children.

[Vázquez Rozas and Barcala 2020] Victoria Vázquez Rozas and Mario Barcala. 2020. Computational tools and spoken corpus design: An ongoing dialogue. Caplletra 69 (2): 221–240.

[Verdonik 2018] Darinka Verdonik. 2018. Corpus and database GOS Videolectures.

[Verdonik and Zwitter-Vitez 2012] Darinka Verdonik and Ana Zwitter-Vitez. 2012. Slovenski govorni korpus Gos.