CMC Corpora

Computer-mediated communication (CMC) encompasses public and private communication online, for instance blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, mobile phone applications such as WhatsApp, as well as email and chat rooms. Because corpora that compile computer-mediated communication often include very informal styles of writing, they are interesting for a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also important for the development of robust tools that can deal with non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, as well as by the rapidly changing terms of service of content providers.

The CLARIN infrastructure offers 23 CMC corpora. Most are available for Slovenian, but there are also corpora for Czech, Dutch, Estonian, Finnish, French, German, Italian, and Lithuanian. Most of the corpora are richly tagged and are available under public licences.

The table first provides an overview of the corpora that are already part of the CLARIN infrastructure and then lists those that have not yet been integrated.

For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

CMC Corpora in the CLARIN Infrastructure

Corpus	Language	Description	Availability
Web corpus MaCoCu Annotation: annotated with extensive metadata Licence: CC0-No Rights Reserved	Albanian, Bosnian, Bulgarian, Catalan, Croatian, Modern Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, Turkish, Ukrainian	These corpora are a collection containing web texts and were built by crawling national internet top-level domains (specified below) and by extending the crawl dynamically to other domains as well. The crawler is available on the MaCoCu GitHub channel. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate and near-duplicated paragraphs, and by discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the largest 1,500 domains were manually checked and bad domains, such as machine-translated domains, were removed. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria, making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs that are accompanied by metadata on whether a paragraph is a heading or not, metadata on the paragraph quality (labels, such as 'short' or 'good', assigned based on paragraph length, URL and stopword density via the jusText tool) and fluency (score between 0 and 1, assigned with the Monocleaner tool), the automatically identified language of the text in the paragraph, and information on whether the paragraph contains sensitive information (identified via the Biroamer tool). For corpora released in version 2.0, the metadata on the languages of the texts is more accurate than in the previous version, which was achieved by using Google's Compact Language Detector 2 (CLD2), a high-performance language detector supporting many languages. Other tools used for web corpus creation and curation have been updated as well, resulting in a cleaner and larger corpus. The corpus is available for download from the Slovenian repository CLARIN.SI and can be easily read with the prevert parser. For the relevant publication, see Bañón et al. (2022)	Download (Albanian) Download (Bosnian) Download (Bulgarian) Download (Catalan) Download (Croatian) Download (Modern Greek) Download (Icelandic) Download (Macedonian) Download (Maltese) Download (Montenegrin) Download (Serbian) Download (Slovenian) Download (Turkish) Download (Ukrainian)
Corpus of contemporary blogs Size: 1 million tokens Annotation: tokenised, sentence tagging Licence: CC-BY	Czech	This corpus contains blog posts. The corpus is available for download from LINDAT.	Download
SoNaR New Media Size: 35 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CLARIN ACA	Dutch	This corpus contains tweets, chats and SMS from 2005 to 2012. The corpus is available for searching online through the OpenSONAR environment. For the relevant publication, see Sanders (2012)	Concordancer
Corpus of Global Web-Based English Size: 1.8 billion words; 1.8 million texts Licence: CLARIN RES (download); CLARIN ACA (online)	English	This corpus contains texts from web pages in the United States, Great Britain, Australia, India, and 16 other countries. About 60% of the texts come from blogs. The corpus is available for download from the Finnish Language Bank and for online browsing through the concordancer Korp.	Concordancer Download
NTAP English Size: 660,798,199 tokens	English	This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016)	Concordancer
DIDI – The DiDi corpus of South Tyrolean CMC 1.0.0 Size: 600,000 tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: ACA-BY-NC-NORED 1.0	English, German, Italian, Ladin	This corpus consists of Facebook posts gathered from 136 Facebook users from South Tyrol. All texts are anonymised. The corpus is available for download from the EURAC Research CLARIN repository. For the relevant publication, see Frey et al. (2016)	Download
The Mixed Corpus: New Media Size: 25 million tokens Annotation: tokenised	Estonian	This corpus contains chat room messages, forum posts and news comments from 2000 to 2008. The corpus is available for download from a dedicated webpage associated with CLARIN Estonia and through a dedicated concordancer.	Concordancer Download
SFNET Corpus Size: 100 million words Annotation: PoS-tagged, sentence and word segmentation Licence: CLARIN ACA – NC	Finnish	This corpus contains written posts from the SFNET forum in Finnish from 2002 to 2003. The PoS-tagging has been done with the FI-FDG Parser, which uses a computational implementation of Functional Dependency Grammar. The corpus is available for download from META-SHARE (the Finnish Language Bank)	Download
Suomi 24 Corpus Size: 2.6 billion tokens Annotation: tokenised, MSD-tagged Licence: CLARIN ACA	Finnish	This corpus contains forum posts from the Suomi24 website from 2001 to 2016. The corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp.	Concordancer Download
The HS.fi News and Comments Corpus Size: 8 million tokens; 593,760 sentences; 93,602 texts Annotation: PoS-tagged, lemmatised, syntactically parsed Licence: CLARIN ACA – NC	Finnish	This corpus contains the domestic news of the Helsingin Sanomat website and their comments from 5 September 2011 to 4 September 2012. The corpus has been syntactically parsed using TDT alpha. The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.	Concordancer Download
Ylilauta Corpus Size: 26.9 million words Annotation: PoS-tagged, lemmatised, syntactically parsed, named entities Licence: CC BY-NC	Finnish	The corpus contains text from discussions of the Ylilauta online discussion board from 2012 to 2014. The corpus has been syntactically annotated with the TDT alpha parser, while the named entities have been assigned using the FiNER tool. The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.	Concordancer Download
CoMeRe repository Size: 80 million tokens Annotation: tokenised, mostly untagged Licence: CC-BY	French	This corpus contains e-mails, forum posts, online chats, tweets and SMS. The corpus is available for download from Ortolang. For the relevant publication, see Panckhurst (2017)	Download
NTAP French Size: 1,506,064,082 words	French	This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016)	Concordancer
Dortmund Chat Corpus Size: 1 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY	German	This corpus contains online chats from 2000 to 2006. The corpus is available for download from the repository of CLARIN-D. For the relevant publication, see Beißwenger (2013)	Download
PAISÀ Corpus of Italian Web Text Size: 380,000 pages, 250 million words Licence: CC BY-NC-SA 4.0	Italian	This corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services. The corpus is available for download from the EURAC Research CLARIN repository.	Download
LITIS v.1 Size: 190,000 comments Licence: CLARIN_ACA	Lithuanian	This corpus contains forum posts from portals delfi.lt and lrytas.lt from 2010 to 2014. The corpus is available for download from the CLARIN-LT repository.	Download
Serbian Web Corpus PDRS 1.0 Size: 715 million tokens Annotation: tokenised, MSD-tagged (MULTEXT-East & UD), lemmatised, annotated for text source Licence: CC BY	Serbian	This corpus contains texts from the web obtained by crawling the .rs domain. Crawling was done in September and October 2022 with BootCat. As search terms, approx. 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC were used. The texts are deduplicated, and Cyrillic texts have been transliterated into the Latin alphabet. The linguistic processing was done with the CLASSLA package for tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies). In addition, some 80% of the URLs are manually tagged for 10 different types of sources ("area"): media (media outlets with several posts daily), inform (topic-centered sites with infrequent posts, maximum 3 per day), company (presentations of companies), state (websites of government bodies on national, regional and local level), forum (forum posts), portal (topic-centered portals without daily coverage), science (scientific publications), shop (with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and other). The corpus is distributed in the CoNLL-U format in batches of approx. 2x50 million tokens. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through the noSketchEngine and KonText concordancers.	Concordancer (noSketchEngine) Concordancer (KonText) Download
Blog post and comment corpus Janes-Blog 1.0 Size: 34 million tokens Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised Licence: CC-BY	Slovenian	This corpus contains blog posts from RTV Slovenija and Publishwall. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018)	Concordancer Download
Forum corpus Janes-Forum 1.0 Size: 47 million tokens Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised Licence: CC-BY	Slovenian	This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018)	Concordancer Download
Monitor corpus of Slovene Trendi 2023-02 Size: 700 million tokens Annotation: PoS-tagged, lemmatised, syntactically parsed, annotated for named entities and topics	Slovenian	This corpus contains news from 107 different media websites, published by 72 different publishers, and is a monitor corpus of Slovene. Trendi 2023-02 covers the period from January 2019 to February 2023, complementing the Gigafida 2.0 reference corpus of written Slovene. All the contents of the Trendi corpus are currently obtained using the Jožef Stefan Institute Newsfeed service. The texts have been annotated using the CLASSLA-Stanza pipeline, including syntactic parsing according to Universal Dependencies and named entity annotation. An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. Text classification models are available at Text classification model SloBERTa-Trendi-Topics 1.0, Text classification model fastText-Trendi-Topics 1.0, and SloBERTa model. At the moment, the corpus is not available as a dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus can be queried through the noSketchEngine and KonText concordancers. For the relevant publication, see Kosem et al. (2022)	Concordancer (noSketchEngine) Concordancer (KonText)
News comment corpus Janes-News 1.0 Size: 14 million tokens Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised Licence: CC-BY	Slovenian	This corpus contains news comments from RTV Slovenija, Mladina and Reporter. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018)	Concordancer Download
Twitter corpus Janes-Tweet 1.0 Size: 139 million tokens Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised Licence: CC-BY	Slovenian	This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018)	Concordancer Download
Wikipedia talk corpus Janes-Wiki 1.0 Size: 5 million tokens Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised Licence: CC-BY	Slovenian	This corpus contains Slovenian Wikipedia user and talk pages. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018)	Concordancer Download
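
The MaCoCu entry above describes document-level and paragraph-level metadata that can be used to filter the corpus by quality, fluency, language and sensitivity. The following is a minimal sketch of such filtering, using only the Python standard library on a small in-memory sample. The element names (`doc`, `p`) and attribute names (`quality`, `lm_score`, `lang`, `heading`, `sensitive`) are illustrative assumptions, not the documented schema; check them against the files downloaded from CLARIN.SI, which in practice can be read with the prevert parser mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumptions made for
# illustration and may differ from the real MaCoCu XML files.
SAMPLE = """<corpus>
  <doc title="Example" url="https://example.si/a" domain="example.si" lm_score="0.93">
    <p quality="good" lm_score="0.95" lang="sl" heading="no" sensitive="no">Prvi odstavek.</p>
    <p quality="short" lm_score="0.40" lang="sl" heading="yes" sensitive="no">Naslov</p>
    <p quality="good" lm_score="0.91" lang="en" heading="no" sensitive="no">An English paragraph.</p>
    <p quality="good" lm_score="0.88" lang="sl" heading="no" sensitive="yes">Kontakt: ...</p>
  </doc>
</corpus>"""


def select_paragraphs(xml_text, lang, min_fluency=0.5):
    """Return (url, text) pairs for paragraphs that pass all metadata filters."""
    root = ET.fromstring(xml_text)
    kept = []
    for doc in root.iter("doc"):
        for p in doc.iter("p"):
            passes = (
                p.get("quality") == "good"          # paragraph quality label
                and p.get("lang") == lang           # identified paragraph language
                and p.get("sensitive") == "no"      # no sensitive information flagged
                and float(p.get("lm_score", "0")) >= min_fluency  # fluency score in [0, 1]
            )
            if passes:
                kept.append((doc.get("url"), (p.text or "").strip()))
    return kept


paragraphs = select_paragraphs(SAMPLE, lang="sl")
```

Here `select_paragraphs` is a hypothetical helper; the threshold `min_fluency` is an arbitrary example value and would need to be chosen for the task at hand.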

Other CMC Corpora

Corpus	Language	Description	Availability
Flemish Online Teenage Talk Size: 2.9 million tokens Annotation: tokenised	Dutch	This corpus contains Facebook posts and WhatsApp messages from 2015 and 2016. For the relevant publication, see Hilte et al. (2016).
eBay petites annonces Size: 100,000 tokens Annotation: see here Licence: CC BY-NC-SA 4.0	French	This corpus contains eBay listings from 2005, 2017, and 2018. The corpus is manually annotated. The corpus is available for download from a dedicated webpage. For the relevant publication, see Gerstenberg, Hekkel, and Hewett (2019)	Download
Dereko – News and Wikipedia subcorpus Size: 670 million tokens Annotation: tokenised	German	This corpus contains content from newsgroup posts and Wikipedia. The corpus is available through a dedicated concordancer.	Concordancer
DWDS – Blogs Size: 102 million tokens Annotation: tokenised	German	This corpus contains blog posts. The corpus is available through a dedicated concordancer.	Concordancer
Monitor corpus of tweets from Austrian users Size: 40 million tweets Annotation: tokenised, lemmatised	German, English	The corpus contains tweets from 2007 to 2017. For the relevant publication, see Barbaresi (2016).
Corpus of Highly Emotive Internet Discussions Size: 160 million tokens Annotation: tokenised	Polish	The corpus contains tweets. For the relevant publication, see Sobkowicz (2016).	For access, contact the authors at antoni.sobkowicz [at] opi.org.pl.
sms4science Size: 0.5 million tokens Annotation: tokenised, PoS-tagged, lemmatised	Swiss German, German, French, Italian, Romansh	This corpus contains around 25,000 SMS from 2009. The corpus comes in two different versions, which are available through separate concordancers: SMS Navigator and ANNIS. The version accessible through ANNIS is more richly annotated and includes PoS-tagging, normalization and annotation of nonce borrowings, among others. Access through the concordancers requires free registration. For the relevant publication, see Dürscheid and Stark (2011).	Concordancer
What's up, Switzerland? Size: 5 million tokens	Swiss German, German, French, Italian, Romansh	This corpus contains 216 WhatsApp chats from 2014. The corpus is accessible online through the ANNIS system. For the relevant publication, see Ueberwasser and Stark (2017).	Browse
The Corpus of Welsh Language Tweets Size: 7 million tokens Annotation: tokenised Licence: unclear	Welsh	The corpus contains tweets. The corpus is available for download from a dedicated webpage.	Download
