Introduction
Computer-mediated communication (CMC) covers public and private communication online, such as blog posts, forum posts, comments on online news sites, social media and networking sites such as Twitter and Facebook, mobile phone applications such as WhatsApp, e-mail and chat rooms. Because corpora of computer-mediated communication often include very informal styles of writing, they are of interest to a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also very important for the development of robust tools that can deal with non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, a problem exacerbated by content providers' rapidly changing terms of service.
The CLARIN infrastructure offers 16 CMC corpora. Most are for Slovenian, but there are also corpora for Czech, Dutch, Estonian, Finnish, French, German, Italian, and Lithuanian. Most of the corpora are richly tagged and available under public licences.
We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content or inclusion of new corpora, send us an email.
This website was last updated on 30 August 2021.
CMC corpora in the CLARIN infrastructure
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 1 million tokens | Czech | This corpus contains blog posts. The corpus is available for download from LINDAT. | |
Size: 35 million tokens | Dutch | This corpus contains tweets, chats and SMS from 2005 to 2012. The corpus is available for searching online through the OpenSONAR environment. For the relevant publication, see Sanders (2012). | |
Size: 660,798,199 tokens | English | This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016). | |
DIDI – The DiDi corpus of South Tyrolean CMC 1.0.0 Size: 600,000 tokens | English, German, Italian, Ladin | This corpus consists of Facebook posts gathered from 136 Facebook users from South Tyrol. All texts are anonymised. The corpus is available for download from the EURAC Research CLARIN repository. For the relevant publication, see Frey et al. (2016). | |
Size: 25 million tokens | Estonian | This corpus contains chat room messages, forum posts and news comments from 2000 to 2008. The corpus is available for download from a dedicated webpage associated with CLARIN Estonia and through a dedicated concordancer. | |
Size: 2.6 billion tokens | Finnish | This corpus contains forum posts from the Suomi24 website from 2001 to 2016. The corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp. | |
Size: 80 million tokens | French | This corpus contains e-mails, forum posts, online chats, tweets and SMS. The corpus is available for download from Ortolang. For the relevant publication, see Panckhurst (2017). | |
Size: 1,506,064,082 words | French | This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016). | |
Size: 1 million tokens | German | This corpus contains online chats from 2000 to 2006. The corpus is available for download from the repository of CLARIN-D. For the relevant publication, see Beißwenger (2013). | |
PAISÀ Corpus of Italian Web Text Size: 380,000 pages, 250 million words | Italian | This corpus contains approximately 380,000 documents from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and an estimated 65,000 documents come from blog services. The corpus is available for download from the EURAC Research CLARIN repository. | |
Size: 190,000 comments | Lithuanian | This corpus contains forum posts from the portals delfi.lt and lrytas.lt from 2010 to 2014. The corpus is available for download from the CLARIN-LT repository. | |
Blog post and comment corpus Janes-Blog 1.0 Size: 34 million tokens | Slovenian | This corpus contains blog posts from RTV Slovenija and Publishwall. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | |
Size: 47 million tokens | Slovenian | This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | |
News comment corpus Janes-News 1.0 Size: 14 million tokens | Slovenian | This corpus contains news comments from RTV Slovenija, Mladina and Reporter. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | |
Twitter corpus Janes-Tweet 1.0 Size: 139 million tokens | Slovenian | This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | |
Wikipedia talk corpus Janes-Wiki 1.0 Size: 5 million tokens | Slovenian | This corpus contains Slovenian Wikipedia user and talk pages. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | |
Other CMC corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 2.9 million tokens | Dutch | This corpus contains Facebook posts and WhatsApp messages from 2015 and 2016. For the relevant publication, see Hilte et al. (2016). | |
Size: 100,000 tokens | French | This corpus contains eBay listings from 2005, 2017, and 2018. The corpus is manually annotated. The corpus is available for download from a dedicated webpage. For the relevant publication, see Gerstenberg, Hekkel, and Hewett (2019). | |
Dereko – News and Wikipedia subcorpus Size: 670 million tokens | German | This corpus contains content from newsgroup posts and Wikipedia. The corpus is available through a dedicated concordancer. | |
Size: 102 million tokens | German | This corpus contains blog posts. The corpus is available through a dedicated concordancer. | |
Monitor corpus of tweets from Austrian users Size: 40 million tweets | German, English | The corpus contains tweets from 2007 to 2017. For the relevant publication, see Barbaresi (2016). | |
Size: 600,000 tokens | Lithuanian | The corpus contains forum posts from the lrytas.lt portal from 2014. The corpus is available for download from a dedicated webpage. For the relevant publication, see Kapočiūtė-Dzikienė et al. (2015). | |
Size: 4 million tokens | Lithuanian | This corpus contains comments from the delfi.lt portal from 2015. The corpus is available for download from a dedicated webpage. For the relevant publication, see Kapočiūtė-Dzikienė et al. (2015). | |
Corpus of Highly Emotive Internet Discussions Size: 160 million tokens | Polish | The corpus contains tweets. For the relevant publication, see Sobkowicz (2016). | For access, contact the authors. |
Size: 0.5 million tokens | Swiss German, German, French, Italian, Romansh | This corpus contains around 25,000 SMS from 2009. The corpus comes in two versions, which are available through separate concordancers: SMS Navigator and ANNIS. The version accessible through ANNIS is more richly annotated and includes PoS-tagging, normalization, annotation of nonce borrowings, etc. Access through the concordancers requires free registration. For the relevant publication, see Dürscheid and Stark (2011). | |
Size: 5 million tokens | Swiss German, German, French, Italian, Romansh | This corpus contains 216 WhatsApp chats from 2014. The corpus is accessible online through the ANNIS system. For the relevant publication, see Ueberwasser and Stark (2017). | |
The Corpus of Welsh Language Tweets Size: 7 million tokens | Welsh | The corpus contains tweets. The corpus is available for download from a dedicated webpage. | |
Additional materials
Tutorial at CMC-Corpora 2017: "How to use for the annotation of CMC and social media resources: a practical introduction", 4 October 2017, Bolzano, Italy. [html]
CLARIN-PLUS workshop "Creation and Use of Social Media Resources", 18-19 May 2017, Kaunas, Lithuania. [html]
Videolectures of the CLARIN-PLUS workshop. [html]
Publications on the CMC corpora
[Barbaresi 2016] Adrien Barbaresi. 2016. Collection and Indexing of Tweets with a Geographical Focus.
[Beißwenger 2013] Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation.
[Dürscheid and Stark 2011] Christa Dürscheid and Elisabeth Stark. 2011. sms4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland.
[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.
[Frey et al. 2016] Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W. Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data. A multilingual corpus of Facebook texts.
[Gerstenberg, Hekkel, and Hewett 2019] Annette Gerstenberg, Valerie Hekkel, and Freya Hewett. 2019. Online Auction Listings Between Community and Commerce.
[Hilte et al. 2016] Lisa Hilte, Reinhild Vandekerckhove, and Walter Daelemans. 2016. Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation.
[Kapočiūtė-Dzikienė et al. 2015] Jurgita Kapočiūtė-Dzikienė, Ligita Šarkutė, and Andrius Utka. 2015. The Effect of Author Set Size in Authorship Attribution for Lithuanian.
[Panckhurst 2017] Rachel Panckhurst. 2017. A digital corpus resource of authentic anonymized French text messages: 88milSMS—What about transcoding and linguistic annotation?
[Salway et al. 2016] Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem, and Lubos Steskal. 2016. Topically-focused Blog Corpora for Multiple Languages.
[Sanders 2012] Eric Sanders. 2012. Collecting and Analysing Chats and Tweets in SoNaR.
[Sobkowicz 2016] Antoni Sobkowicz. 2016. Political Discourse in Polish Internet - Corpus of Highly Emotive Internet Discussions.
[Ueberwasser and Stark 2017] Simone Ueberwasser and Elisabeth Stark. 2017. What’s up, Switzerland? A corpus-based research project in a multilingual country.