Computer-mediated communication (CMC) encompasses public and private communication online: blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, mobile phone applications such as WhatsApp, email, and chat rooms. Because corpora that compile computer-mediated communication often include very informal styles of writing, they are interesting for a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also important for the development of robust tools that can deal with non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, as well as by rapidly changing terms of service of content providers.
The CLARIN infrastructure offers 23 CMC corpora; most are available for Slovenian, but there are also corpora for Czech, Dutch, Estonian, Finnish, French, German, Italian, and Lithuanian. Most of the corpora are richly tagged and are available under public licences.
The tables below first provide an overview of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
CMC Corpora in the CLARIN Infrastructure
| Corpus | Language | Description | Availability |
|---|---|---|---|
| MaCoCu web corpora 2.0 | Albanian, Bosnian, Bulgarian, Catalan, Croatian, Modern Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, Turkish, Ukrainian | These corpora are a collection of web texts built by crawling the national internet top-level domains and by extending the crawl dynamically to other domains. The crawler is available on the MaCoCu GitHub channel. Considerable effort was devoted to cleaning the extracted text to provide high-quality web corpora: boilerplate and near-duplicated paragraphs were removed, and very short texts as well as texts not in the target language were discarded. Furthermore, samples from the largest 1,500 domains were manually checked, and bad domains, such as machine-translated ones, were removed. The corpora come with extensive metadata that allows filtering by text quality and other criteria, making them highly useful for corpus linguistics as well as for training language models and other language technologies. In the XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata indicating whether the paragraph is a heading, its quality (labels such as 'short' or 'good', assigned based on paragraph length, URL and stopword density via the jusText tool), its fluency (a score between 0 and 1, assigned with the Monocleaner tool), the automatically identified language of the paragraph, and whether it contains sensitive information (identified via the Biroamer tool). Compared to the previous release, version 2.0 has more accurate metadata on the languages of the texts, obtained with Google's Compact Language Detector 2 (CLD2), a high-performance language detector supporting many languages. Other tools used for web corpus creation and curation have been updated as well, resulting in even cleaner and larger corpora. The corpora are available for download from the Slovenian repository CLARIN.SI and can be easily read with the prevert parser. For the relevant publication, see Bañón et al. (2022). | Download |
| Size: 1 million tokens | Czech | This corpus contains blog posts. The corpus is available for download from LINDAT. | Download |
| Size: 35 million tokens | Dutch | This corpus contains tweets, chats and SMS messages from 2005 to 2012. The corpus is available for searching online through the OpenSONAR environment. For the relevant publication, see Sanders (2012). | Concordancer |
| Corpus of Global Web-Based English (Size: 1.8 billion words; 1.8 million texts) | English | This corpus contains texts from web pages from the United States, Great Britain, Australia, India, and 16 other countries. About 60% of the texts come from blogs. The corpus is available for download from the Finnish Language Bank and for online browsing through the concordancer Korp. | Download; Concordancer |
| Size: 660,798,199 tokens | English | This corpus contains blog posts related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016). | Concordancer |
| DiDi – The DiDi corpus of South Tyrolean CMC 1.0.0 (Size: 600,000 tokens) | English, German, Italian, Ladino | This corpus consists of Facebook posts gathered from 136 Facebook users from South Tyrol. All texts are anonymised. The corpus is available for download from the EURAC Research CLARIN repository. For the relevant publication, see Frey et al. (2016). | Download |
| Size: 25 million tokens | Estonian | This corpus contains chat room messages, forum posts and news comments from 2000 to 2008. The corpus is available for download from a dedicated webpage associated with CLARIN Estonia and through a dedicated concordancer. | Download; Concordancer |
| Size: 100 million words | Finnish | This corpus contains written posts from the SFNET forum in Finnish from 2002 to 2003. The PoS-tagging has been done with the FI-FDG parser, which uses a computational implementation of Functional Dependency Grammar. The corpus is available for download from META-SHARE (the Finnish Language Bank). | Download |
| Size: 2.6 billion tokens | Finnish | This corpus contains forum posts from the Suomi24 website from 2001 to 2016. The corpus is available for download from the FIN-CLARIN repository and through the concordancer Korp. | Download; Concordancer |
| The HS.fi News and Comments Corpus (Size: 8 million tokens; 593,760 sentences; 93,602 texts) | Finnish | This corpus contains the domestic news of the Helsingin Sanomat website and their comments from 5 September 2011 to 4 September 2012. The corpus has been syntactically parsed using TDT alpha. The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp. | Download; Concordancer |
| Size: 26.9 million words | Finnish | The corpus contains text from discussions on the Ylilauta online discussion board from 2012 to 2014. The corpus has been syntactically annotated with the TDT alpha parser, while the named entities have been assigned using the FiNER tool. The corpus is available for download from META-SHARE (the Finnish Language Bank) and for online browsing through the concordancer Korp. | Download; Concordancer |
| Size: 80 million tokens | French | This corpus contains e-mails, forum posts, online chats, tweets and SMS messages. The corpus is available for download from Ortolang. For the relevant publication, see Panckhurst (2017). | Download |
| Size: 1,506,064,082 words | French | This corpus contains blog posts related to climate change issues across science, politics, and the environment. The vast majority of the posts are from 2005 onwards. The corpus is available for searching online through the Corpuscle concordancer. For the relevant publication, see Salway et al. (2016). | Concordancer |
| Size: 1 million tokens | German | This corpus contains online chats from 2000 to 2006. The corpus is available for download from the repository of CLARIN-D. For the relevant publication, see Beißwenger (2013). | Download |
| PAISÀ Corpus of Italian Web Text (Size: 380,000 pages; 250 million words) | Italian | This corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia and approximately 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and an estimated 65,000 documents come from blog services. The corpus is available for download from the EURAC Research CLARIN repository. | Download |
| Size: 190,000 comments | Lithuanian | This corpus contains forum posts from the portals delfi.lt and lrytas.lt from 2010 to 2014. The corpus is available for download from the CLARIN-LT repository. | Download |
| Size: 715 million tokens | Serbian | This corpus contains texts from the web obtained by crawling the .rs domain. Crawling was done in September and October 2022 with BootCaT. As search terms, approx. 2,800 word forms with a frequency between 5,000 and 500,000 in srWaC were used. The texts were deduplicated, and Cyrillic texts were transliterated into the Latin alphabet. The linguistic processing was done with the CLASSLA package: tokenization, lemmatization and morphosyntactic tagging (both MULTEXT-East and Universal Dependencies). In addition, some 80% of the URLs were manually tagged for 10 different types of sources ("area"): media (media outlets with several posts daily), inform (topic-centred sites with infrequent posts, at most 3 per day), company (presentations of companies), state (websites of government bodies on the national, regional and local level), forum (forum posts), portal (topic-centred portals without daily coverage), science (scientific publications), shop (sites with descriptions of products), database (knowledge bases, dictionaries, databases and similar) and community (NGOs, fan clubs, associations and others). The corpus is distributed in the CoNLL-U format in batches of approx. 2×50 million tokens. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through the noSketch Engine and KonText concordancers. | Download; Concordancer |
| Blog post and comment corpus Janes-Blog 1.0 (Size: 34 million tokens) | Slovenian | This corpus contains blog posts from RTV Slovenija and Publishwall. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | Download; Concordancer |
| Size: 47 million tokens | Slovenian | This corpus contains forum posts from Avtomobilizem.com, MedOver.net and RTV Slovenija. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | Download; Concordancer |
| Monitor corpus of Slovene Trendi 2023-02 (Size: 700 million tokens) | Slovenian | This corpus is a monitor corpus of Slovene containing news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from January 2019 to February 2023, complementing the Gigafida 2.0 reference corpus of written Slovene. All the contents of the Trendi corpus are at the moment obtained using the Jožef Stefan Institute Newsfeed service. The texts have been annotated using the CLASSLA-Stanza pipeline, including syntactic parsing according to Universal Dependencies and named entities. An important addition are topics, i.e. thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification models are available separately: SloBERTa-Trendi-Topics 1.0, fastText-Trendi-Topics 1.0, and the SloBERTa model. At the moment, the corpus is not available as a dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus can be queried through the noSketch Engine and KonText concordancers. For the relevant publications, see Kosem (2022) and Kosem et al. (2022). | Concordancer |
| News comment corpus Janes-News 1.0 (Size: 14 million tokens) | Slovenian | This corpus contains news comments from RTV Slovenija, Mladina and Reporter. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | Download; Concordancer |
| Twitter corpus Janes-Tweet 1.0 (Size: 139 million tokens) | Slovenian | This corpus contains tweets written by Slovenian Twitter users from 2013 to 2017. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | Download; Concordancer |
| Wikipedia talk corpus Janes-Wiki 1.0 (Size: 5 million tokens) | Slovenian | This corpus contains Slovenian Wikipedia user and talk pages. The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through KonText. For the relevant publication, see Fišer et al. (2018). | Download; Concordancer |
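The MaCoCu web corpora listed above ship paragraph-level quality labels (e.g. 'short', 'good') and a fluency score between 0 and 1, which makes subcorpus filtering straightforward. The sketch below, under the assumption that these values appear as paragraph attributes in the prevert-style XML (the attribute names `quality` and `fluency` and the inline sample are illustrative; check the actual files, or use the prevert parser mentioned above), keeps only high-quality paragraphs:

```python
import re

# Illustrative prevert-style fragment; real MaCoCu files carry richer
# document metadata (title, crawl date, URL, domain, language distribution).
sample = """<doc url="https://example.org/page" crawl_date="2022-01-15">
<p quality="good" fluency="0.95">A well-formed, high-quality paragraph.</p>
<p quality="short" fluency="0.40">Too short.</p>
</doc>"""

# Assumed attribute layout: <p quality="..." fluency="...">text</p>
P_TAG = re.compile(
    r'<p quality="(?P<quality>[^"]+)" fluency="(?P<fluency>[^"]+)">'
    r'(?P<body>.*?)</p>',
    re.S,
)

def good_paragraphs(text, min_fluency=0.9):
    """Yield paragraphs labelled 'good' whose fluency meets the threshold."""
    for m in P_TAG.finditer(text):
        if m.group("quality") == "good" and float(m.group("fluency")) >= min_fluency:
            yield m.group("body").strip()

print(list(good_paragraphs(sample)))  # keeps only the 'good', fluent paragraph
```

The same filter can be tightened (e.g. `min_fluency=0.99`) when training data quality matters more than corpus size.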
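The Serbian web corpus above is distributed in CoNLL-U batches. As a minimal sketch of working with such files, the snippet below counts word tokens using only the standard CoNLL-U conventions (tab-separated token lines, `#` comment lines, multiword-token ranges like `3-4`); the sample sentences are invented for illustration:

```python
# Tiny invented CoNLL-U fragment (columns after LEMMA truncated to '...').
sample = """# sent_id = 1
1\tOvo\tovaj\tDET\t...
2\tje\tbiti\tAUX\t...
3\tprimer\tprimer\tNOUN\t...

# sent_id = 2
1\tDrugi\tdrugi\tADJ\t...
2\tred\tred\tNOUN\t...
"""

def count_tokens(conllu_text):
    """Count word tokens, skipping comments, blank lines and multiword ranges."""
    n = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():  # skips multiword-token ranges like '3-4'
            n += 1
    return n

print(count_tokens(sample))  # → 5
```

For serious use, a dedicated CoNLL-U reader is preferable, but the format is simple enough that batch-level statistics like this need no extra dependencies.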
Other CMC Corpora
| Corpus | Language | Description | Availability |
|---|---|---|---|
| Size: 2.9 million tokens | Dutch | This corpus contains Facebook posts and WhatsApp messages from 2015 and 2016. For the relevant publication, see Hilte et al. (2016). | |
| Size: 100,000 tokens | French | This corpus contains eBay listings from 2005, 2017, and 2018. The corpus is manually annotated. The corpus is available for download from a dedicated webpage. For the relevant publication, see Gerstenberg, Hekkel, and Hewett (2019). | Download |
| Dereko – News and Wikipedia subcorpus (Size: 670 million tokens) | German | This corpus contains content from newsgroup posts and Wikipedia. The corpus is available through a dedicated concordancer. | Concordancer |
| Size: 102 million tokens | German | This corpus contains blog posts. The corpus is available through a dedicated concordancer. | Concordancer |
| Monitor corpus of tweets from Austrian users (Size: 40 million tweets) | German, English | The corpus contains tweets from 2007 to 2017. For the relevant publication, see Barbaresi (2016). | |
| Corpus of Highly Emotive Internet Discussions (Size: 160 million tokens) | Polish | The corpus contains tweets. For the relevant publication, see Sobkowicz (2016). | For access, contact the authors at antoni.sobkowicz [at] opi.org.pl |
| Size: 0.5 million tokens | Swiss German, German, French, Italian, Romansh | This corpus contains around 25,000 SMS messages from 2009. The corpus comes in two versions, available through separate concordancers: SMS Navigator and ANNIS. The version accessible through ANNIS is more richly annotated and includes PoS-tagging, normalisation, annotation of nonce borrowings, etc. Access through the concordancers requires free registration. For the relevant publication, see Dürscheid and Stark (2011). | Concordancer |
| Size: 5 million tokens | Swiss German, German, French, Italian, Romansh | This corpus contains 216 WhatsApp chats from 2014. The corpus is accessible online through the ANNIS system. For the relevant publication, see Ueberwasser and Stark (2017). | Browse |
| The Corpus of Welsh Language Tweets (Size: 7 million tokens) | Welsh | The corpus contains tweets. The corpus is available for download from a dedicated webpage. | Download |
Additional Materials
Tutorial at CMC-Corpora 2017: 'How to use for the annotation of CMC and social media resources: a practical introduction', 4 October 2017, Bolzano, Italy. [html]
CLARIN-PLUS workshop 'Creation and Use of Social Media Resources', 18-19 May 2017, Kaunas, Lithuania. [html]
Videolectures of the CLARIN-PLUS workshop. [html]
Publications on CMC Corpora
[Barbaresi 2016] Adrien Barbaresi. 2016. Collection and Indexing of Tweets with a Geographical Focus.
[Beißwenger 2013] Michael Beißwenger. 2013. Das Dortmunder Chat-Korpus: ein annotiertes Korpus zur Sprachverwendung und sprachlichen Variation in der deutschsprachigen Chat-Kommunikation.
[Dürscheid and Stark 2011] Christa Dürscheid and Elisabeth Stark. 2011. sms4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland.
[Fišer et al. 2018] Darja Fišer, Nikola Ljubešić, and Tomaž Erjavec. 2018. The Janes project: language resources and tools for Slovene user generated content.
[Frey et al. 2016] Jennifer-Carmen Frey, Aivars Glaznieks, and Egon W. Stemle. 2016. The DiDi Corpus of South Tyrolean CMC Data. A multilingual corpus of Facebook texts.
[Gerstenberg, Hekkel, and Hewett 2019] Annette Gerstenberg, Valerie Hekkel, and Freya Hewett. 2019. Online Auction Listings Between Community and Commerce.
[Hilte et al. 2016] Lisa Hilte, Reinhild Vandekerckhove, Walter Daelemans. 2016. Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation.
[Kapočiūtė-Dzikienė et al. 2015] Jurgita Kapočiūtė-Dzikienė, Ligita Šarkutė, Andrius Utka. 2015. The Effect of Author Set Size in Authorship Attribution for Lithuanian.
[Panckhurst 2017] Rachel Panckhurst. 2017. A digital corpus resource of authentic anonymized French text messages: 88milSMS—What about transcoding and linguistic annotation?
[Salway et al. 2016] Andrew Salway, Dag Elgesem, Knut Hofland, Øystein Reigem, Lubos Steskal. 2016. Topically-focused Blog Corpora for Multiple Languages.
[Sanders 2012] Eric Sanders. 2012. Collecting and Analysing Chats and Tweets in SoNaR.
[Sobkowicz 2016] Antoni Sobkowicz. 2016. Political Discourse in Polish Internet - Corpus of Highly Emotive Internet Discussions.
[Ueberwasser and Stark 2017] Simone Ueberwasser and Elisabeth Stark. 2017. What’s up, Switzerland? A corpus-based research project in a multilingual country.