Introduction
Digital newspaper collections are a rich source of information for researchers across the Humanities and Social Sciences. They are especially valuable for synchronic as well as diachronic studies in fields ranging from history and media and communication studies to lexicography, where newspapers provide abundant evidence of neologisms and other lexicographic phenomena.
The CLARIN infrastructure gives access to 34 newspaper corpora, 27 of which are monolingual and 7 multilingual. The monolingual corpora cover 10 languages: Arabic, Czech, Finnish, French, German, Greek, Italian, Norwegian, Polish and Swedish. Almost a third of the newspaper corpora are historical, with the oldest articles dating from the 18th century. The majority of them are richly tagged and available under public licences.
We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.
This website was last updated on 7 September 2021.
Newspaper corpora in the CLARIN infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
An-Nahar Newspaper Text Corpus (Size: 24 million tokens) | Arabic | This corpus contains articles from the Arabic newspaper An-Nahar from 1995 to 2000. The corpus is available for download from the ELRA catalogue. | |
SYN2006PUB: corpus of Czech newspapers (Size: 300 million tokens) | Czech | This corpus contains articles from 11 Czech newspapers from 1989 to 2004. The corpus is available for download from the Czech repository LINDAT. | |
SYN2013PUB: corpus of written Czech newspapers (Size: 935 million tokens) | Czech | This corpus contains articles from Czech newspapers from 2005 to 2009. The corpus is available for download from the Czech repository LINDAT. | |
The Karelian Finnish Newspaper Corpus (Size: 500,000 tokens) | Finnish | This corpus contains articles from the Finnish newspaper Karjalan Sanomat from 2012 to 2014. The corpus is available through the concordancer Korp. | |
Size: 100 hours of speech materials | French | This corpus contains recorded readings of articles from the French newspaper Le Monde. The corpus is available for download from the ELRA catalogue. | |
Corpus journalistique issu de l'Est Républicain (Annotation: MSD-tagged, lemmatised) | French | This corpus contains articles from the French newspaper l'Est Républicain from 1999 to 2003. The corpus is available for download from Ortolang. | |
Tübingen Treebank of Written German / Newspaper Corpus (Size: 1.8 million tokens) | German | This corpus contains articles from the German newspaper Die Tageszeitung. The corpus is available through a dedicated concordancer with an institutional account. | |
Size: 900,000 tokens | German | This corpus contains articles from the German newspaper Frankfurter Rundschau. The corpus is available for download from a dedicated webpage. | |
MTP Annotated German corpus - tagged version (Size: 500,000 tokens) | German | This corpus contains articles from two German newspapers, Die Frankfurter Allgemeine Zeitung and Die Zeit, from 1992. The corpus can be downloaded from the ELRA catalogue. | |
Mannheim Corpus of Historical Newspapers and Magazines (Size: 4.1 million tokens) | German | This corpus contains articles from 21 German newspapers from the 18th and 19th centuries. The corpus is available for download from the CLARIN-D repository. | |
Corpus "Library and Information Centre - Newspapers" (Size: 20 units) | Greek | The corpus contains newspaper articles. The corpus is available for download from the CLARIN:EL repository. | |
Modern Greek Texts Corpus - "Makedonia" newspaper (Size: 3 million tokens) | Greek | This corpus contains newspaper articles on various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository. | |
Modern Greek Texts Corpus - "Ta Nea" newspaper (Size: 2 million words) | Greek | This corpus contains newspaper articles on various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository. | |
Size: 3.1 million words | Italian | This corpus contains texts collected from four different domains. | |
The Norwegian Newspaper Corpus (Size: 700 million tokens) | Norwegian | This corpus contains articles from 24 Norwegian newspapers from 1998 onwards. The corpus is available through the concordancer Corpuscle. | |
ChronoPress Corpus of Polish Press Texts (Size: 20 million tokens) | Polish | This corpus contains articles from various Polish newspapers from 1945 and 1962. The corpus is available through a dedicated concordancer. | |
Size: 678,000 tokens | Swedish | This corpus contains articles from the Swedish newspaper 8 sidor from 2003 to 2012. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 8.1 million tokens | Swedish | This corpus contains articles from the newspaper Dagny from 1886 to 1913. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 5 million tokens | Swedish | This corpus contains articles from the Swedish newspaper Dagens Nyheter from 1987. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 271 million tokens | Swedish | This corpus contains articles from the Swedish newspaper Göteborgsposten from 1994 and from 2001 to 2011. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 3.8 million tokens | Swedish | This corpus contains articles from the newspaper Hertha from 1914 to 2015. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 2 million tokens | Swedish | This corpus contains articles from the newspaper Idun from 1887 to 1917. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 5.5 million tokens | Swedish | This corpus contains articles from the newspaper Kvinnornas Tidning from 1921 to 1925. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 3.5 million tokens | Swedish | This corpus contains articles from the newspaper Morgonbris from 1904 to 1924. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 2.2 million tokens | Swedish | This corpus contains articles from the newspaper Rösträtt för Kvinnor from 1912 to 1919. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 691,000 tokens | Swedish | This corpus contains articles from the newspaper Smittskydd from 2002 to 2010. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Size: 272 million tokens | Swedish | This corpus contains articles from various Swedish online newspapers from 2001 to 2013. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. | |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 8 million units | 40 languages | This corpus contains articles from the https://globalvoices.org/ website, where volunteers publish and translate news stories in more than 40 languages. | |
MLCC Multilingual and Parallel Corpora (Size: 100 million tokens) | Dutch, English, French, German, Italian, Spanish | This corpus contains articles from newspapers in Dutch, English, French, German, Italian and Spanish from 1986 to 1994. The corpus is available for download from the ELRA catalogue. | |
ACCURAT corpus of comparable sentences (Size: 23,820 sentences) | English-Croatian, English-Greek, English-Estonian, English-Latvian, English-Lithuanian, English-Romanian, English-Slovenian, Greek-Romanian, Latvian-Lithuanian, Romanian-German, Romanian-Lithuanian and German-English | This comparable corpus contains sentence pairs extracted from comparable news corpora. The corpus is available for download from the CLARIN:EL repository. | |
SETIMES - A parallel corpus of the Balkan languages (Size: 341.83 million tokens) | Romanian, Turkish, Serbian, English, Bulgarian, Macedonian, Croatian, Greek, Albanian | This parallel corpus contains online news articles extracted from the SETimes webpage. The corpus is available for download from the CLARIN:EL repository. | |
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (Size: 8.8 billion tokens) | Swedish and Finnish | This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1770 to 2011. The corpus can be accessed through the concordancer Korp. | |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) (Licence: CC-BY) | Swedish and Finnish | This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1771 to 1874. The corpus can be downloaded from FIN-CLARIN. | |
Size: 435 million tokens | Swedish, English and Finnish | This corpus contains articles from a variety of Swedish, English and Finnish newspapers. The corpus can be found in the FIN-CLARIN repository, although its availability and licence are still under negotiation. | |
Other newspaper corpora
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Zurich English Newspaper Corpus (Size: 1.6 million tokens) | English | This corpus contains articles from various English newspapers (mainly newspapers from London) from the 17th and 18th centuries. | For access, contact the authors. |
Size: 426 million tokens | German | This corpus contains articles from various German newspapers from 2011. The corpus is available through a dedicated concordancer. | |
Size: 43,000 documents | Italian | This corpus contains articles from the Italian newspaper L’Adige from 1999 to 2006. The corpus is available for download through META-SHARE. | |
Size: 380 million tokens | Italian | The corpus contains articles from the Italian newspaper La Repubblica. The corpus is available through the NoSketch Engine concordancer. | |
WItaC - NewsReader Wikinews Italian Corpus (Size: 40,231 tokens) | Italian | This corpus contains Italian translations of 120 English Wikinews articles. The corpus is available for download from a dedicated website. For the relevant publication, see Minard et al. (2016). | |
Corpus of Contemporary Serbian Newspapers and Magazines (Size: 916 million tokens) | Serbian | This corpus contains articles from over 100 Serbian newspapers from 2004 to 2012. | For access, contact the resource manager. |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Europeana Newspapers NER Corpora (Size: 500,000 tokens: 182,483 Dutch; 207,000 French; 96,735 German) | Dutch, French and German | This corpus contains articles from Europeana newspapers for the following time periods: 1811-1856 for the Dutch subcorpus, 1871-1916 for the French subcorpus, and 1926 for the German subcorpus. The corpus is available for download from the KB Lab. For the relevant publication, see Neudecker (2016). | |
Size: 35 billion tokens | 18 languages | This corpus contains articles collected from news feeds from 2014 to 2017. The corpus is available through NoSketch Engine. For the relevant publication, see Bušta et al. (2017). | |
Additional materials
CLARIN-PLUS workshop: "Working with Digital Collections of Newspapers", 19-21 September 2016, Leuven, Belgium. [html]
Videolectures of the CLARIN-PLUS workshop. [html]
Workshop "Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining", 5-6 March 2018, Helsinki, Finland. [html]
Slides for "Hacking the News" workshop. [gdoc]
Publications on the newspaper corpora
[Bušta et al. 2017] Jan Bušta, Ondřej Herman, Miloš Jakubíček, Simon Krek, Blaž Novak. 2017. JSI Newsfeed Corpus. [pdf]
[Minard et al. 2016] Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, Chantal van Son. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus.
[Neudecker 2016] Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers.