Introduction
Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues, and are often aligned with the accompanying recordings. They are an invaluable resource for many kinds of linguistic research, such as phonology, conversation analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata.
There are 90 spoken corpora in the CLARIN infrastructure, 79 of which contain both the transcriptions of planned or spontaneous speech and the associated recordings, and 11 only the transcriptions. Most of the corpora are monolingual, accounting for the following 16 languages: Arabic, Czech, Dutch, Estonian, Finnish, French, German, Hungarian, Italian, Nepali, Norwegian, Polish, Skolt Saami, Slovenian, Spanish, and Swedish. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged, many with mark-up specific to speech corpora, such as phonemic and prosodic annotation.
We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email.
This website was last updated on 9 July 2020.
Spoken corpora in the CLARIN infrastructure
Corpora with transcriptions and audio recordings
Corpus | Language | Description | Availability |
---|---|---|---|
Licence: CC BY 4.0 |
Arabic |
The corpus is available for download from a dedicated webpage. For a relevant publication, see Halabi (2016). |
Download |
DIALEKT v1: dialectal corpus with multi-tier transcription
Size: 100,000 words |
Czech |
This corpus contains traditional dialectological material, mostly unprepared monologue-type speech. The corpus is available for download (upon request) and through the concordancer KonText. For a related publication, see Komrsková et al. (2018). |
|
ORAL2013: balanced corpus of informal spoken Czech (transcriptions & audio)
Size: 2.8 million words |
Czech |
This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For a related publication, see Benešová et al. (2015). |
|
Size: 1 million words |
Czech |
This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For a related publication, see Komrsková et al. (2018). |
|
Size: 35 hours |
Czech |
This corpus contains transcribed recordings from the Czech political discussion broadcast “Otázky Václava Moravce”. The corpus is available for download from LINDAT and through the concordancer KonText. |
|
Prague DaTabase of Spoken Czech 1.0
Size: 770,000 tokens, 7324 minutes |
Czech |
This corpus contains spontaneous dialogue. The corpus is available for download from LINDAT. For a related publication, see Hajič et al. (2008). |
Download |
Size: 1000 hours |
Czech |
The corpus contains talks on Christian mysticism given by Karel Makoň. The corpus is available for download from LINDAT. |
Download |
Czech Malach Cross-lingual Speech Retrieval Test Collection
Size: 592 hours |
Czech, English, French, German, Spanish |
This corpus contains interviews with survivors of the Holocaust. The corpus is available for download from LINDAT. |
Download |
Size: 50,000 words (41 minutes/speaker)
|
Dutch |
The corpus is available for download from an informal webpage. |
Download |
Size: 115 hours |
Dutch |
The corpus contains recordings of human-machine interaction and read speech performed by children, non-native speakers and elderly people. The corpus is available for download from the Dutch Language Institute. |
Download |
Air Traffic Control Communication
Size: 20 hours |
English |
This corpus contains recordings of communication between air traffic controllers and pilots. The corpus is available for download from LINDAT and through the concordancer KonText. |
|
Boston University Radio Speech Corpus
Size: 7 hours |
English |
This corpus contains recordings and texts from radio news. The corpus is available for download from the UPenn repository. |
Download |
Buckeye Corpus of Conversational Speech
Annotation: phonetic labels |
English |
This corpus contains an interview. The corpus is available for download from ORTOLANG. For a related publication, see Pitt et al. (2005). |
Download |
Size: 13 hours |
English |
This corpus contains recorded lectures and seminars. The corpus is available for download from FIN-CLARIN. |
Download |
Size: 41 hours |
Estonian |
This corpus contains recordings of academic lectures and oral conference presentations. The corpus is available for download from META-SHARE (CELR distribution).
|
Download |
Size: 36 hours |
Estonian |
This corpus contains telephone interviews from different radio programmes. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Size: 19 hours |
Estonian |
This corpus contains public broadcast news. The corpus is available for download from META-SHARE (CELR distribution). |
|
Size: 1.3 million words
|
Estonian |
This corpus contains interviews. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Estonian Emotional Speech Corpus
Size: 1234 sentences |
Estonian |
This corpus contains read sentences that express anger, joy and sadness, or are neutral. The corpus is available for download from META-SHARE (CELR distribution). For a related publication, see Altrov and Pajupuu (2012). |
Download |
Estonian North Wind and the Sun Corpus v.1.0.3 Annotation: words in standard orthography and phonemes in SAMPA |
Estonian |
This corpus contains recordings of the tale “Põhjatuul ja päike” (North Wind and the Sun) read by the same speakers who participated in the Phonetic Corpus of Estonian Spontaneous Speech. The corpus is available for download from META-SHARE (CELR distribution). |
Download |
Phonetic Corpus of Estonian Spontaneous Speech v.1.0.4
Size: 635,000 words, 90 hours |
Estonian |
This corpus contains spontaneous speech by speakers with different dialectological and social backgrounds. The corpus is available for download from META-SHARE (CELR distribution).
|
Download |
Faroese Danish Corpus Hamburg 0.2.dan (FADAC-0.2.dan Hamburg) |
Faroese, Danish |
This corpus contains informal interviews involving 82 speakers (27 female, 33 male).
|
|
Aalto University DSP Course Conversation Corpus 2013-2016, Downloadable Version
Size: 5200 utterances
|
Finnish |
This corpus contains spontaneous conversations. The corpus is available for download from FIN-CLARIN. |
Download |
Size: 18 hours |
Finnish |
This corpus contains radio and TV broadcasts. The corpus is available for download from FIN-CLARIN and for online querying through the LAT-platform. |
|
Follow-up Study of Dialects of Finnish
Size: 12,200 hours |
Finnish |
This corpus contains interviews. This corpus is available for online querying through the LAT-platform. |
LAT platform |
Size: 218 tokens |
Finnish |
This corpus contains spontaneous conversations. This corpus is available for online querying through the concordancer Korp. |
Concordancer |
Licence: CC-BY |
Finnish |
This corpus contains interviews. This corpus is available for online querying through the LAT platform and through the concordancer Korp. |
|
The Finnish Dialect Syntax Archive
Size: 1.2 million words |
Finnish |
The corpus contains interviews. The corpus is available for online querying through the LAT platform and through the concordancer Korp. |
|
The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s)
Size: 210 hours |
Finnish |
This corpus contains spontaneous speech and interviews. The corpus is available for online querying through the LAT platform. |
LAT platform |
Size: 120 hours
|
Finnish, Karelian |
This corpus contains interviews. The corpus is available for download from FIN-CLARIN. |
Download |
Plenary Sessions of the Parliament of Finland
Size: 22.5 million words |
Finnish, Swedish |
This corpus contains the proceedings of the Finnish Parliament. The corpus is available through a dedicated webpage and through the concordancer Korp. |
|
Licence: CC BY-NC-SA 4.0 |
French |
This collection comprises around 40 corpora of social interactions in different contexts: professional, private, institutional, commercial, medical, and educational situations. Most of the corpora can be queried through a dedicated concordancer. |
Concordancer |
Corpus de Français Parlé Parisien des années 2000 Licence: CC-BY |
French |
This corpus contains interviews. The corpus is available for download from a dedicated webpage. |
Download |
Corpus for the Study of Contemporary French Size: 10 million words, 350 hours Annotation: orthographically aligned, PoS-tagged Licence: CC-BY 4.0 |
French |
This corpus contains debates, classroom interactions, literary and scientific texts, regional and national press, etc. The corpus is available through a dedicated concordancer. |
Concordancer |
Licence: CC BY-NC-SA 3.0 |
French |
This corpus contains recordings of the everyday speech of Orléans residents between 1969 and 1974. The corpus is available for download from the Huma-num repository. |
Download |
Phonologie du Français Contemporain
|
French |
The corpus is available for download from a dedicated webpage. |
Download |
Size: 3 hours |
German |
This corpus contains task-oriented communication (e.g., a film retelling) in the context of studying adult L2 acquisition. |
Download |
Corpora of the Bavarian Archive for Speech Signals
Size: 47 corpora |
German, English |
This corpus collection contains a wide variety of spoken discourse, such as elicited speech tasks and spontaneous conversations in different settings (e.g., in a taxi, over the telephone), involving a variety of speakers (e.g., from adolescents to adults, as well as speakers who are hard of hearing). The corpora are available for download from the BAS CLARIN B Centre. |
Download |
Size: 1.4 million tokens |
German (L2 and L1), English, Polish, Italian (L1) |
This corpus contains transcripts and audio recordings of spoken academic discourse, primarily talks including discussions and oral exams. For a related publication, see Fandrych et al. (2014). |
|
Size: 330,000 words, 65 hours |
German |
This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 10 hours |
German |
This corpus contains broadcast TV debates. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 260,000 words, 28 hours |
German |
This corpus contains narrative interviews on German reunification. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Biographische und Reiseerzählungen
Size: 50,000 words, 6 hours |
German |
This corpus contains narrative and biographic interviews. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 10,000 words, 2 hours |
German |
This corpus contains broadcasts in standard German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Deutsche Mundarten: ehemalige deutsche Ostgebiete
Size: 838,000 words; 461 hours |
German |
This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Deutsche Standardsprache: König-Korpus
Size: 50,000 words; 6 hours |
German |
This corpus contains interviews and elicited speech in standard German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). Note: this is an excerpt; the complete corpus (around 120 hours) is currently being curated. |
|
Deutsche Umgangssprachen: Pfeffer-Korpus
Size: 646,000 words, 80 hours |
German |
This corpus contains interviews in regional varieties of German. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 140,000 words, 15 hours |
German |
This corpus contains authentic interaction from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 160,000 words, 12 hours |
German |
This corpus contains elicited conflict interaction. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 232,000 words, 285 hours |
German |
This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Emigrantendeutsch in Israel: Wiener in Jerusalem
Size: 225,000 words, 51 hours |
German |
This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Forschungs- und Lehrkorpus gesprochenes Deutsch
Size: 2.3 million words, 230 hours |
German |
This corpus contains authentic interactions from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Grundstrukturen: Freiburger Korpus
Size: 600,000 words, 70 hours |
German |
This corpus contains authentic interaction from various domains. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Mehrsprachige Kinder im Vorschulalter
Size: 17,000 words, 13 hours |
German |
This corpus contains elicitation tasks with pre-school children. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 100,000 words, 10 hours |
German |
This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Zweite Generation deutschsprachiger Migranten in Israel
Size: 125 hours |
German |
This corpus contains interviews in German extraterritorial varieties. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Deutsche Mundarten: Zwirner-Korpus
Size: 4 million words; 1076 hours |
German (some Frisian and Dutch) |
This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 212,000 words, 385 hours |
German (some Sorbian) |
This corpus contains interviews and elicited speech in German dialects. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Gesprochene Wissenschaftssprache Kontrastiv
Size: 760,000 words, 92 hours |
German, English, Polish, Bulgarian |
This corpus contains academic interaction. The corpus is available for download and online browsing via the Database of Spoken German (AGD @ IDS Mannheim). |
|
Size: 2 hours |
German, English, French, Spanish, Turkish, Polish, Vietnamese, Swedish, Norwegian, Italian, Russian, Afrikaans, Portuguese |
This corpus is a demo of the EXMARaLDA system. The corpus is available for download from a CLARIN-D repository.
|
Download |
Hamburg Adult Bilingual LAnguage (HABLA)
Size: 79 hours |
German, French, Italian |
This corpus contains interviews. For a related publication, see Kupisch et al. (2012).
|
|
Budapest Sociolinguistic Interview - version 2
Size: 270,000 words |
Hungarian |
This corpus contains sociolinguistic interviews conducted with 50 individuals. The corpus is available for download and through a dedicated concordancer. For a related publication, see Kontra and Váradi (1997). |
|
Licence: ELRA |
Hungarian |
This corpus contains speech tasks involving adults and children. The corpus is available for download from the ELRA catalogue. |
Download |
The Icelandic Spoken Language Corpus Size: 536,000 tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY 4.0 |
Icelandic |
This corpus contains four different subcorpora: (1) spontaneous conversations, from the project ÍSTAL (An Icelandic Spoken Language Bank); (2) group conversations, from the project MIN (Modern loanwords in the Nordic languages); (3) parliamentary debates; (4) conversations of teenagers with other teenagers and adults. The corpus is available for download from CLARIN-IS (as a part of the Icelandic Gigaword Corpus) and for search through the concordancer Korp. For a related publication, see Steingrímsson et al. (2018). |
|
CLIPS : corpora e lessici di italiano parlato e scritto Size: 100 hours |
Italian |
This corpus contains speech from 15 different cities in Italy. |
Download |
Size: 5000 sentences, 4.5 hours |
Mbochi, French |
The corpus is available for download from the ELRA catalogue. |
Download |
Size: 31 hours 26 minutes |
Nepali |
The corpus is available for download from the ELRA Catalogue.
|
Download |
Nganasan Spoken Language Corpus (NSLC)
Size: 32 hours |
Nganasan, Russian |
Version 0.2 of the corpus is a subcorpus that comprises 177 communications, 136 of which contain an aligned audio recording, with glossed (Toolbox/FLEx) and annotated (EXMARaLDA) transcripts from 57 speakers. All texts have been translated into Russian and English, some also into German. The corpus also contains rich metadata on the communications and speakers. |
|
Size: 1.5 million tokens
|
Norwegian |
This corpus contains interviews and conversation in Norwegian dialects. The corpus is available through the Tekstlab concordancer Glossa (account needed). |
Concordancer |
Size: 1 million tokens |
Norwegian |
This corpus contains interviews and conversations in Oslo sociolects. The corpus is available through the Tekstlab concordancer Glossa (account needed). |
Concordancer |
Size: 270,000 tokens |
Norwegian |
This corpus contains informal interviews in Oslo sociolects. The corpus is available through the Tekstlab concordancer Glossa (account needed). |
Concordancer |
Size: 440,300 tokens |
Norwegian |
This corpus contains recordings and transcripts from the Norwegian Big Brother in 2001. The corpus is available through the Tekstlab concordancer Glossa. |
Concordancer |
Corpus of American Nordic Speech (CANS)
Size: 251,000 tokens |
Norwegian, Swedish |
This corpus contains interviews and conversations in Norwegian and Swedish dialects spoken in America. The corpus is available through the Tekstlab concordancer Glossa. For a related publication, see Johannessen (2015). |
Concordancer |
Size: 2,754,289 tokens
|
Norwegian, Swedish, Danish, Faroese, Icelandic, Övdalian |
This corpus consists of spontaneous speech data from dialects of the North Germanic languages across all of the Nordic countries. The linguistic data in the corpus come from a variety of sources (see homepage, Data Collection), recorded between 1998 and 2015. The corpus is transcribed and linked to audio and video, has a map function, and can be searched in a large variety of ways. The corpus can be accessed online via a concordancer provided by the TekstLab (a CLARINO node). |
|
Hamburg Corpus of Polish in Germany (HamCoPoliG)
Size: 38 hours |
Polish |
This corpus contains spontaneous speech and reading tasks. For a related publication, see Czachór (2012).
|
|
Consecutive and Simultaneous Interpreting (CoSi)
Size: 6 hours |
Portuguese, English |
This corpus contains lectures in Portuguese with simultaneous interpretation in English.
|
|
Skolt Saami Documentation Corpus (2016)
|
Skolt Saami |
This corpus contains interviews. This corpus is available for online querying through the LAT platform. |
LAT platform |
Hamburg Corpus of Argentinean Spanish (HaCASpa)
Size: 19 hours |
Spanish (Argentinian) |
This corpus contains spontaneous speech and reading tasks. For a related publication, see Gabriel et al. (2010). |
|
Catalan in a bilingual context (PhonCAT)
Size: 144 hours |
Spanish (Catalan) |
This corpus contains read, elicited and spontaneous speech. For a related publication, see Benet et al. (2012). |
Corpora with transcriptions only
Corpus | Language | Description | Availability |
---|---|---|---|
ORAL2008: Balanced corpus of informal spoken Czech
Size: 1 million tokens |
Czech |
This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For a related publication, see Benešová et al. (2015). |
|
ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)
Size: 1 million tokens |
Czech |
This corpus contains informal conversations. The corpus is available for download from LINDAT and through the concordancer KonText. For a related publication, see Komrsková et al. (2018). |
|
Prague Dependency Treebank of Spoken Language (PDTSL) 0.5
Size: 120,000 words |
Czech |
The corpus is available for download from LINDAT. |
Download |
ParCorFull: A Parallel Corpus Annotated with Full Coreference
Size: 160,000 tokens |
English, German |
This corpus contains planned speech and newswire. The corpus is available for download from LINDAT. |
Download |
Annotation: text segmentation, normalization, time-alignment
|
English, German, Dutch |
The corpus contains transcripts of read Wikipedia articles. The corpus is available for download from a CLARIN-D repository. For a related publication, see Köhn et al. (2016). |
Download |
Size: 1 million words |
Estonian |
The corpus contains transcripts of recordings from various domains. |
|
Size: 72 hours |
German, Spanish |
This corpus contains speech tasks performed by bilingual children. For a related publication, see Ulloa Saceda et al. (2012). |
|
Corpus of Doctor-Patient Conversations from Ahus
Size: 958,830 tokens |
Norwegian |
This corpus contains doctor-patient conversations. The corpus is available through a Tekstlab concordancer (account needed). |
|
Size: 1 million words, 120 hours |
Slovenian |
This corpus contains transcripts from radio and TV shows, school lessons, private conversations, and business meetings. The corpus is available for download from CLARIN.SI as well as through a dedicated web concordancer. For a related publication, see Verdonik and Zwitter-Vitez (2011). |
|
Spoken corpus Gos VideoLectures 3.0 (transcription)
Size: 126,000 words |
Slovenian |
This corpus contains public academic speech. The corpus is available for download from CLARIN.SI and through the concordancer KonText. A version with audio recordings is also available. For a related publication, see Verdonik (2018). |
|
Size: 1,470,000 tokens |
Swedish |
The corpus is available through the concordancer Korp (account needed). |
Concordancer |
Other spoken corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Griffith Corpus of Spoken Australian English Size: 32,134 words |
English |
The corpus is available for download and through the concordancer of the Australian National Corpus. |
|
Size: 10 million words |
English |
The corpus contains face-to-face conversations between people who speak British English as their first language. The corpus is available through the CQP concordancer. |
Concordancer |
The Aston Corpus of West Midlands English (ACWME) Annotation: orthographically transcribed |
English |
The corpus contains recordings of performances - comedy, drama, poetry, song and story-telling - and related interviews with performers, members of the audience and local and national celebrities. The corpus is available for download from a dedicated webpage. |
Download |
English |
The corpus contains naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). The corpus is available through a dedicated concordancer. |
Concordancer | |
English, Italian, Spanish |
The corpus contains TV-broadcasts and elicited dialogues. |
||
Babel - A Multi Language Database Annotation: orthographically transcribed |
Hungarian |
This corpus contains various elicited speech tasks. |
|
BEA (Hungarian Spontaneous Speech Database)
Size: 465 recordings |
Hungarian |
This corpus contains spontaneous speech. |
|
Hungarian Broadcast News Database
Size: 25,000 words, 3.5 hours |
Hungarian |
The corpus is available for download (upon request) from META-SHARE. |
Download |
Hungarian Gigaword Corpus / "spoken language" subcorpus
Size: 76 million words |
Hungarian |
The corpus contains radio broadcasts (reading aloud and spontaneous conversation). The corpus is available through the Hungarian Gigaword Corpus concordancer. |
Concordancer |
Hungarian Kindergarten Language Corpus
Size: 192,000 words |
Hungarian |
This corpus contains elicited speech tasks (picture descriptions) and guided conversation with children. The corpus is available for download through META-SHARE. |
Download |
Hungarian Reference Speech Database
Size: 6 hours |
Hungarian |
This corpus contains reading tasks. The corpus is available for download (upon request) from META-SHARE. |
Download |
Annotation: phonetic transcription |
Hungarian |
The corpus is available for download (upon request) from META-SHARE. |
Download |
Size: 490,000 words |
Italian |
The corpus is available through a dedicated concordancer. |
Concordancer |
Annotation: orthographically transcribed |
Italian |
The corpus contains quasi-spontaneous dialogues (a map task). The corpus is available for download from a dedicated webpage. |
Download |
Size: 700,000 words, 100 hours |
Italian |
This is a L2-learner corpus. The corpus is available for download from a dedicated webpage. |
Download |
Selezione dal "Corpus di parlato telegiornalistico. Anni Sessanta vs. 2005" Annotation: orthographically transcribed |
Italian |
This corpus contains news broadcasts. The corpus is available for download from a dedicated webpage. |
Download |
SpIt-MDb (Spoken Italian - Multilevel Database) Annotation: orthographically transcribed |
Italian |
This corpus contains spontaneous speech. The corpus is available for download from a dedicated webpage. |
Download |
Uralic Languages under the Influence (UraLUID) database
Size: 108,000 tokens, 4 hours |
Udmurt, Tundra Nenets, Synya Khanty, Surgut Khanty |
This corpus contains narratives (e.g., folk stories). The corpus is available for download from a dedicated website. |
Download |
Publications on the spoken corpora
[Altrov and Pajupuu 2012] Rene Altrov, Hille Pajupuu. 2012. Estonian Emotional Speech Corpus: theoretical base and implementation.
[Benešová et al. 2015] Lucie Benešová, Michal Křen, Martina Waclawičová. 2015. Korpus spontánní mluvené češtiny ORAL2013.
[Benet et al. 2012] Ariadna Benet, Susana Cortés, Conxita Lleó. 2012. Phonoprosodic Corpus of Spoken Catalan (PhonCAT).
[Czachór 2012] Agnieszka Czachór. 2012. Corpus of Polish Spoken in Germany. Collecting and Analyzing Written & Spoken Data for Investigating Contact-Induced Change.
[Gabriel et al. 2010] Christoph Gabriel, Ingo Feldhausen, Andrea Pešková, Laura Colantoni, Su-Ar Lee, Valeria Arana, Leopoldo Labastía. 2010. Argentinian Spanish Intonation.
[Hajič et al. 2008] Jan Hajič, Silvie Cinková, Marie Mikulová, Petr Pajas, Jan Ptáček, Josef Toman, Zdeňka Urešová. 2008. PDTSL: An Annotated Resource For Speech Reconstruction.
[Halabi 2016] Nawar Halabi. 2016. Modern Standard Arabic Phonetics for Speech Synthesis.
[Johannessen 2015] Janne Bondi Johannessen. 2015. The Corpus of American Norwegian Speech (CANS).
[Komrsková et al. 2018] Zuzana Komrsková, Marie Kopřivová, David Lukeš, Petra Poukarová, Hana Goláňová. 2018. New Spoken Corpora of Czech: ORTOFON and DIALEKT.
[Kontra and Váradi 1997] Miklós Kontra and Tamás Váradi. 1997. The Budapest Sociolinguistic Interview.
[Köhn et al. 2016] Arne Köhn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond.
[Kupisch et al. 2012] Tanja Kupisch, Dagmar Barton, Giulia Bianchi, Ilse Stangen. 2012. The HABLA-Corpus (German-French and German-Italian).
[Pitt et al. 2005] Mark Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, William Raymond. 2005. The Buckeye Corpus of Conversational Speech: Labeling Conventions and a Test of Transcriber Reliability.
[Steingrímsson et al. 2018] Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus.
[Ulloa Saceda et al. 2012] Marta Ulloa Saceda, Conxita Lleó, Izarbe García Sánchez. 2012. Corpora of spoken Spanish by simultaneous and successive German-Spanish bilingual and Spanish monolingual children.
[Verdonik 2018] Darinka Verdonik. 2018. Corpus and database GOS Videolectures.
[Verdonik and Zwitter-Vitez 2012] Darinka Verdonik and Ana Zwitter-Vitez. 2012. Slovenski govorni korpus Gos.