
CMC Corpora

Computer-mediated communication (CMC) encompasses public and private communication online, for instance blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, mobile applications such as WhatsApp, as well as email and chat rooms. Because corpora that compile computer-mediated communication often include very informal styles of writing, they are of interest to a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also important for the development of robust tools that can deal with non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, as well as by the rapidly changing terms of service of content providers.

The CLARIN infrastructure offers 23 CMC corpora. The largest share are available for Slovenian, but there are also corpora for Czech, Dutch, English, Estonian, Finnish, French, German, Italian, Lithuanian and Serbian, as well as for the languages covered by the multilingual MaCoCu web corpora. Most of the corpora are richly tagged and are available under public licences.

The table first provides an overview of the corpora that are already part of the CLARIN infrastructure and then lists those that have not yet been integrated.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

CMC Corpora in the CLARIN Infrastructure

Corpus Language Description Availability

Web corpus MaCoCu


Annotation: annotated with extensive metadata
Licence: CC0-No Rights Reserved

Albanian, Bosnian, Bulgarian, Catalan, Croatian, Modern Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, Turkish, Ukrainian

These corpora are a collection of web texts built by crawling national internet top-level domains (specified below) and by dynamically extending the crawl to other domains. The crawler is available in the MaCoCu GitHub repository. Considerable effort was devoted to cleaning the extracted text in order to provide a high-quality web corpus. This was achieved by removing boilerplate and near-duplicated paragraphs, and by discarding very short texts as well as texts that are not in the target language. Furthermore, samples from the 1,500 largest domains were manually checked, and bad domains, such as machine-translated ones, were removed.

The dataset is characterized by extensive metadata which allows filtering based on text quality and other criteria, making the corpus highly useful for corpus linguistics studies as well as for training language models and other language technologies. In the XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels such as 'short' or 'good', assigned on the basis of paragraph length, URL and stopword density via the jusText tool), its fluency (a score between 0 and 1, assigned with the Monocleaner tool), the automatically identified language of the paragraph, and whether the paragraph contains sensitive information (identified via the Biroamer tool).

Compared to the previous version, the corpora in version 2.0 have more accurate metadata on the languages of the texts, achieved by using Google's Compact Language Detector 2 (CLD2), a high-performance language detector supporting many languages. The other tools used for web corpus creation and curation have been updated as well, resulting in an even cleaner and larger corpus.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be easily read with the prevert parser.
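
As a minimal illustration of how the document- and paragraph-level metadata described above is structured, the sketch below parses a small MaCoCu-style prevertical excerpt with Python's standard library and filters paragraphs by quality and fluency. The attribute names (url, crawl_date, quality, fluency, etc.) mirror the fields listed above but are assumptions, and the sample text is invented; for real corpus files, the official prevert parser is the recommended route.

```python
# Hedged sketch: parse a small MaCoCu-style prevertical excerpt with the
# standard library only. Attribute names (url, crawl_date, quality, fluency)
# mirror the metadata fields described above but are assumptions; for real
# corpus files, prefer the official prevert parser.
import re

SAMPLE = """\
<doc title="Example page" url="http://example.si/page" crawl_date="2023-01-15" lang_distr="sl:0.97">
<p heading="no" quality="good" fluency="0.91" lang="sl">Prvi odstavek besedila.</p>
<p heading="yes" quality="short" fluency="0.55" lang="sl">Naslov</p>
</doc>
"""

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_prevertical(text):
    """Yield (doc_metadata, paragraphs) pairs; each paragraph is (metadata, text)."""
    doc_meta, paragraphs = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("<doc"):
            doc_meta, paragraphs = dict(ATTR_RE.findall(line)), []
        elif line.startswith("<p"):
            head, _, rest = line.partition(">")
            para_meta = dict(ATTR_RE.findall(head))
            para_text = rest.rsplit("</p>", 1)[0]
            paragraphs.append((para_meta, para_text))
        elif line.startswith("</doc>") and doc_meta is not None:
            yield doc_meta, paragraphs
            doc_meta, paragraphs = None, []

for meta, paras in parse_prevertical(SAMPLE):
    # Keep only paragraphs labelled 'good' (jusText) with a fluency score >= 0.8 (Monocleaner)
    good = [t for m, t in paras if m.get("quality") == "good" and float(m.get("fluency", 0)) >= 0.8]
    print(meta["url"], len(good), "good paragraph(s)")
```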

For the relevant publication, see Bañón et al. (2022)

Download (Albanian)

Download (Bosnian)

Download (Bulgarian)

Download (Catalan)

Download (Croatian)

Download (Modern Greek)

Download (Icelandic)

Download (Macedonian)

Download (Maltese)

Download (Montenegrin)

Download (Serbian)

Download (Slovenian)

Download (Turkish)

Download (Ukrainian)

Corpus of contemporary blogs

Size: 1 million tokens
Annotation: tokenised, sentence tagging
Licence: CC-BY

Czech

This corpus contains blog posts.

The corpus is available for download from LINDAT.

Download

SoNaR New Media

Size: 35 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CLARIN ACA

Dutch

This corpus contains tweets, chats and SMS from 2005 to 2012.

The corpus is available for searching online through the OpenSONAR environment.

For the relevant publication, see Sanders (2012)

Concordancer

Corpus of Global Web-Based English

Size: 1.8 billion words; 1.8 million texts
Licence: CLARIN RES (download); CLARIN ACA (online)

English

This corpus contains texts from web pages from the United States, Great Britain, Australia, India, and 16 other countries. About 60% of the texts come from blogs.

The corpus is available for download from the Finnish Language Bank and for online browsing through the concordancer Korp.

Concordancer

Download

NTAP English

Size: 660,798,199 tokens

English

This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards.

The corpus is available for searching online through the Corpuscle concordancer.

For the relevant publication, see Salway et al. (2016)

Concordancer

DIDI – The DiDi corpus of South Tyrolean CMC 1.0.0

Size: 600,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: ACA-BY-NC-NORED 1.0

English, German, Italian, Ladino

This corpus consists of Facebook posts gathered from 136 Facebook users from South Tyrol. All texts are anonymised.

The corpus is available for download from the EURAC Research CLARIN repository.

For the relevant publication, see Frey et al. (2016)

Download

The Mixed Corpus: New Media

Size: 25 million tokens
Annotation: tokenised

Estonian

This corpus contains chat room messages, forum posts and news comments from 2000 to 2008.

The corpus is available for download from a dedicated webpage associated with CLARIN Estonia and through a dedicated concordancer.

Concordancer

Download

SFNET Corpus

Size: 100 million words
Annotation: PoS-tagged, sentence and word segmentation
Licence: CLARIN ACA – NC

Finnish

This corpus contains written posts from the SFNET forum in Finnish from 2002 to 2003.

The PoS-tagging has been done with the FI-FDG Parser, which uses a computational implementation of Functional Dependency Grammar.

The corpus is available for download from META-SHARE (the Finnish Language Bank).

Download

Suomi 24 Corpus

Size: 2.6 billion tokens
Annotation: tokenised, MSD-tagged
Licence: CLARIN ACA

Finnish

This corpus contains forum posts from the Suomi24 website from 2001 to 2016.

The corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp.

Concordancer

Download

The HS.fi News and Comments Corpus

Size: 8 million tokens; 593,760 sentences; 93,602 texts
Annotation: PoS-tagged, lemmatised, syntactically parsed
Licence: CLARIN ACA – NC

Finnish

This corpus contains domestic news articles from the Helsingin Sanomat website and their comments, covering the period from 5 September 2011 to 4 September 2012.

The corpus has been syntactically parsed using TDT alpha.

The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.

Concordancer

Download

Ylilauta Corpus

Size: 26.9 million words
Annotation: PoS-tagged, lemmatised, syntactically parsed, named entities
Licence: CC BY-NC

Finnish

The corpus contains text from discussions of the Ylilauta online discussion board from 2012 to 2014.

The corpus has been syntactically annotated with the TDT alpha parser, while the named entities have been assigned using the FiNER tool.

The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp.

Concordancer

Download

CoMeRe repository

Size: 80 million tokens
Annotation: tokenised, mostly untagged
Licence: CC-BY

French

This corpus contains e-mails, forum posts, online chats, tweets and SMS.

The corpus is available for download from Ortolang.

For the relevant publication, see Panckhurst (2017)

Download

NTAP French

Size: 1,506,064,082 words

French

This corpus contains blog posts that are related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards.

The corpus is available for searching online through the Corpuscle concordancer.

For the relevant publication, see Salway et al. (2016)

Concordancer

Dortmund Chat Corpus

Size: 1 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CC-BY

German

This corpus contains online chats from 2000 to 2006.

The corpus is available for download from the repository of CLARIN-D.

For the relevant publication, see Beißwenger (2013)

Download

PAISÀ Corpus of Italian Web Text

Size: 380,000 pages, 250 million words
Licence: CC BY-NC-SA 4.0

Italian

This corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.

The corpus is available for download from the EURAC Research CLARIN repository.

Download

LITIS v.1

Size: 190,000 comments
Licence: CLARIN ACA

Lithuanian

This corpus contains forum posts from portals delfi.lt and lrytas.lt from 2010 to 2014.

The corpus is available for download from the CLARIN-LT repository.

Download

Serbian Web Corpus PDRS 1.0

Size: 715 million tokens
Annotation: tokenised, MSD-tagged (MULTEXT-East & UD), lemmatised, annotated for text source
Licence: CC BY

Serbian

This corpus contains texts from the web obtained by crawling the .rs domain. The crawling was done in September and October 2022 with BootCat, using approximately 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC as search terms. The texts were deduplicated, and Cyrillic texts were transliterated into the Latin alphabet. The linguistic processing was done with the CLASSLA package, covering tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies).

In addition, some 80% of the URLs are manually tagged for 10 different types of sources ("area"): media (media outlets with several posts daily), inform (topic-centered sites with infrequent posts, at most 3 per day), company (presentations of companies), state (websites of government bodies at the national, regional and local level), forum (forum posts), portal (topic-centered portals without daily coverage), science (scientific publications), shop (sites with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and other). The corpus is distributed in the CoNLL-U format in batches of approximately 2 x 50 million tokens.
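
Because the corpus ships as CoNLL-U batches, it can be processed with any CoNLL-U reader. The sketch below is one possible way to stream a batch and tally lemma frequencies per Universal Dependencies PoS tag, using the third-party conllu package; the file name is a placeholder, not an actual batch name from the release.

```python
# Hedged sketch: stream one CoNLL-U batch of the corpus and tally lemmas per
# universal PoS tag. Requires the third-party 'conllu' package
# (pip install conllu); the file name below is a placeholder.
from collections import Counter
from conllu import parse_incr

counts = Counter()

with open("pdrs-batch-01.conllu", encoding="utf-8") as fh:  # placeholder path
    for sentence in parse_incr(fh):        # yields one TokenList per sentence
        for token in sentence:
            # 'upos' holds the Universal Dependencies PoS tag,
            # 'lemma' the lemma assigned during linguistic processing
            counts[(token["upos"], token["lemma"])] += 1

for (upos, lemma), freq in counts.most_common(10):
    print(f"{upos:6} {lemma:20} {freq}")
```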

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through noSketchEngine and KonText concordancers.

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

Blog post and comment corpus Janes-Blog 1.0

Size: 34 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains blog posts from RTV Slovenija and Publishwall.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

For the relevant publication, see Fišer et al. (2018)

Concordancer

Download

Forum corpus Janes-Forum 1.0

Size: 47 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

For the relevant publication, see Fišer et al. (2018)

Concordancer

Download

Monitor corpus of Slovene Trendi 2023-02

Size: 700 million tokens
Annotation: PoS-tagged, lemmatised, syntactically parsed, annotated for named entities and topics

Slovenian

This monitor corpus of Slovene contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from January 2019 to February 2023, complementing the Gigafida 2.0 reference corpus of written Slovene. All the contents of the Trendi corpus are currently obtained using the Jožef Stefan Institute Newsfeed service. The texts have been annotated using the CLASSLA-Stanza pipeline, including syntactic parsing according to Universal Dependencies and named entity annotation.

An important addition are topics, i.e. thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification models used, SloBERTa-Trendi-Topics 1.0 and fastText-Trendi-Topics 1.0, as well as the underlying SloBERTa model, are available separately. At the moment, the corpus is not available as a dataset due to copyright restrictions, but we hope to make at least some of it available in the near future.
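
For readers who want to apply the same kind of linguistic annotation to their own Slovene texts, the sketch below shows a plausible use of the CLASSLA-Stanza pipeline (the classla Python package) mentioned above. The processor list mirrors the annotation layers described in this entry; exact model versions and options may differ between releases, and the topic labels are assigned by the separate classification models, not by this pipeline.

```python
# Hedged sketch: annotate a Slovene sentence with the CLASSLA-Stanza pipeline
# (pip install classla). The processor list mirrors the annotation layers
# described above (tokenisation, PoS, lemmas, dependency parsing, named entities).
import classla

classla.download("sl")  # fetch the standard Slovene models (run once, needs internet)
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse,ner")

doc = nlp("Vlada je včeraj sprejela nov zakon o medijih.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.deprel)
    # Named entities are available per sentence
    print([(ent.text, ent.type) for ent in sentence.ents])
```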

The corpus can be queried through noSketchEngine and KonText concordancers.

For the relevant publication, see Kosem (2022) and Kosem et al. (2022)

Concordancer (noSketchEngine)

Concordancer (KonText)

News comment corpus Janes-News 1.0

Size: 14 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains news comments from RTV Slovenija, Mladina and Reporter.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

For the relevant publication, see Fišer et al. (2018)

Concordancer

Download

Twitter corpus Janes-Tweet 1.0

Size: 139 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

For the relevant publication, see Fišer et al. (2018)

Concordancer

Download

Wikipedia talk corpus Janes-Wiki 1.0

Size: 5 million tokens
Annotation: tokenised, sentence segmented, MSD-tagged, lemmatised
Licence: CC-BY

Slovenian

This corpus contains Slovenian Wikipedia user and talk pages.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText.

For the relevant publication, see Fišer et al. (2018)

Concordancer

Download

Other CMC Corpora

Corpus Language Description Availability

Flemish Online Teenage Talk

Size: 2.9 million tokens
Annotation: tokenised

Dutch

This corpus contains Facebook posts and WhatsApp messages from 2015 and 2016.

For the relevant publication, see Hilte et al. (2016).

 

eBay petites annonces

Size: 100,000 tokens
Annotation: see here
Licence: CC BY-NC-SA 4.0

French

This corpus contains eBay listings from 2005, 2017, and 2018. The corpus is manually annotated.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Gerstenberg, Hekkel, and Hewett (2019)

Download

DeReKo – News and Wikipedia subcorpus

Size: 670 million tokens
Annotation: tokenised

German

This corpus contains content from newsgroup posts and Wikipedia.

The corpus is available through a dedicated concordancer.

Concordancer

DWDS – Blogs

Size: 102 million tokens
Annotation: tokenised

German

This corpus contains blog posts.

The corpus is available through a dedicated concordancer.

Concordancer

Monitor corpus of tweets from Austrian users

Size: 40 million tweets
Annotation: tokenised, lemmatised

German, English

The corpus contains tweets from 2007 to 2017.

For the relevant publication, see Barbaresi (2016).

 

Corpus of Highly Emotive Internet Discussions

Size: 160 million tokens
Annotation: tokenised

Polish

The corpus contains tweets.

For the relevant publication, see Sobkowicz (2016).

For access, contact the authors at antoni.sobkowicz [at] opi.org.pl.

sms4science

Size: 0.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised

Swiss German, German, French, Italian, Romansh

This corpus contains around 25,000 SMS messages from 2009.

The corpus comes in two versions, which are available through separate concordancers: SMS Navigator and ANNIS. The version accessible through ANNIS is more richly annotated and includes PoS-tagging, normalization, annotation of nonce borrowings, etc. Access through the concordancers requires free registration.

For the relevant publication, see Dürscheid and Stark (2011).

Concordancer

What's up, Switzerland?

Size: 5 million tokens

Swiss German, German, French, Italian, Romansh

This corpus contains 216 WhatsApp chats from 2014.

The corpus is accessible online through the ANNIS system.

For the relevant publication, see Ueberwasser and Stark (2017).

Browse

The Corpus of Welsh Language Tweets

Size: 7 million tokens
Annotation: tokenised
Licence: unclear

Welsh

The corpus contains tweets.

The corpus is available for download from a dedicated webpage.

Download

Additional Materials

Tutorial at CMC-Corpora 2017: 'How to use TEI for the annotation of CMC and social media resources: a practical introduction', 4 October 2017, Bolzano, Italy. [html]

CLARIN-PLUS workshop 'Creation and Use of Social Media Resources', 18-19 May 2017, Kaunas, Lithuania. [html]

Videolectures of the CLARIN-PLUS workshop. [html]

Publications on CMC Corpora

[Barbaresi 2016] Adrien Barbaresi. 2016. Collection and Indexing of Tweets with a Geographical Focus.

[Beißwenger 2013] Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation.

[Dürscheid and Stark 2011] Christa Dürscheid and Elisabeth Stark. 2011. sms4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland.

[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.

[Frey et al. 2016] Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W. Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data. A multilingual corpus of Facebook texts.

[Gerstenberg, Hekkel, and Hewett 2019] Annette Gerstenberg, Valerie Hekkel, and Freya Hewett. 2019. Online Auction Listings Between Community and Commerce.

[Hilte et al. 2016] Lisa Hilte, Reinhild Vandekerckhove, and Walter Daelemans. 2016. Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation.

[Kapočiūtė-Dzikienė et al. 2015] Jurgita Kapočiūtė-Dzikienė, Ligita Šarkutė, and Andrius Utka. 2015. The Effect of Author Set Size in Authorship Attribution for Lithuanian.

[Panckhurst 2017] Rachel Panckhurst. 2017. A digital corpus resource of authentic anonymized French text messages: 88milSMS—What about transcoding and linguistic annotation?

[Salway et al. 2016] Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem, and Lubos Steskal. 2016. Topically-focused Blog Corpora for Multiple Languages.

[Sanders 2012] Eric Sanders. 2012. Collecting and Analysing Chats and Tweets in SoNaR.

[Sobkowicz 2016] Antoni Sobkowicz. 2016. Political Discourse in Polish Internet - Corpus of Highly Emotive Internet Discussions. 

[Ueberwasser and Stark 2017] Simone Ueberwasser and Elisabeth Stark. 2017. What’s up, Switzerland? A corpus-based research project in a multilingual country.