Newspaper Corpora

Collections of newspapers in digital form are a rich source of information for researchers across the Humanities and Social Sciences. They are especially valuable for synchronic as well as diachronic studies in fields ranging from history and media and communication studies to lexicography, for which newspapers supply neologisms and other lexicographic phenomena.

The CLARIN infrastructure gives access to 31 newspaper corpora, 6 of which are multilingual and 25 monolingual. The available corpora contain newspaper articles in languages such as Arabic, Czech, Finnish, French, German, Greek, Italian, Norwegian, Polish and Swedish. Almost a third of the newspaper corpora are historical, with the oldest articles dating from the 18th century. The majority of them are richly tagged and available under public licences. We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

Additionally, the CLARIN infrastructure gives access to the entire Europeana historical newspaper collection, which is listed here under the section The Europeana Collection. The collection is divided into 9 subsets by country. Each subset corresponds to a CLARIN Virtual Collection, which includes a link to the VLO with parameters to select the relevant country’s newspaper records, a link to the full metadata archive and links to the metadata records for all the newspaper titles. The latter provide access to the records for specific years, where you can directly browse the individual newspaper issues.

The Europeana collection can be accessed directly through the VLO. For instance, the newspaper Sakala, which is part of the Estonian collection, consists of 64 annual issues published between 1878 and 1944. Each issue has its own VLO entry, which is part of a nested hierarchy under the main newspaper entry, and the individual issues can be both browsed in the form of scans and downloaded.

The newspaper issues included in the Europeana Newspapers collection can also be browsed and viewed through the thematic collection on Europeana’s portal.

For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Newspaper Corpora in the CLARIN Infrastructure

Monolingual Corpora

Corpus	Language	Description	Availability
SYN2006PUB: corpus of Czech newspapers Size: 300 million tokens Annotation: tokenised, lemmatised, PoS-tagged Licence: CC-BY	Czech	This corpus contains articles from 11 Czech newspapers from 1989 to 2004. The corpus is available for download from the Czech repository LINDAT.	Download
SYN2013PUB: corpus of written Czech newspapers Size: 935 million tokens Annotation: tokenised, lemmatised, MSD-tagged Licence: Czech National Corpus (Shuffled Corpus Data)	Czech	This corpus contains articles from Czech newspapers from 2005 to 2009. The corpus is available for download from the Czech repository LINDAT.	Download
The Karelian Finnish Newspaper Corpus Size: 500,000 tokens Licence: CLARIN ACA	Finnish	This corpus contains articles from the Finnish newspaper Karjalan Sanomat from 2012 to 2014. The corpus is available through the concordancer Korp.	Concordancer
Corpus journalistique issu de l'Est Républicain Annotation: MSD-tagged, lemmatised Licence: CC-BY	French	This corpus contains articles from the French newspaper l'Est Républicain from 1999 to 2003. The corpus is available for download from Ortolang.	Download
Tübingen Treebank of Written German / Newspaper Corpus Size: 1.8 million tokens Annotation: tokenised, MSD tagged, lemmatised, syntactic constituency, named-entities Licence: CLARIN RES	German	This corpus contains articles from the German newspaper Die Tageszeitung. The corpus is available through a dedicated concordancer with an institutional account.	Concordancer
TIGER Corpus Size: 900,000 tokens Annotation: tokenised, PoS-tagged, parsed, lemmatised Licence: CLARIN PUB	German	This corpus contains articles from the German newspaper Frankfurter Rundschau. The corpus is available for download from a dedicated webpage.	Download
Mannheim Corpus of Historical Newspapers and Magazines Size: 4.1 million tokens Annotation: tokenised	German	This corpus contains articles from 21 German newspapers from the 18th and 19th centuries. The corpus is available for download from the CLARIN-D repository.	Download
Corpus "Library and Information Centre - Newspapers" Size: 20 units Licence: CC-BY-NC-SA	Greek	The corpus contains newspaper articles. The corpus is available for download from the CLARIN:EL repository.	Download
The image of Germany in the Greek press Size: 3.5 million tokens, 7650 texts Annotation: tokenised, lemmatised	Greek	The corpus consists of newspaper articles from three Greek newspapers (Ta Nea, Risospastis, and To Vima) dealing with Germany from the Greek perspective. Bibliographical information is encoded in the path to each file, which is composed of the title of the newspaper, year, month, day, and rubric. The lemmata are stored in a separate tree of the same structure, in which each text file contains one lemma per line. The corpus is available for download from CLARIN-D (Saarland University B-centre). For the relevant publication, see Tsotsou (2019).	Download
Modern Greek Texts Corpus - "Makedonia" newspaper Size: 3 million tokens Licence: CC-BY-NC-SA	Greek	This corpus contains newspaper articles in various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository.	Download
Modern Greek Texts Corpus - "Ta Nea" newspaper Size: 2 million words Licence: CC-BY-NC-SA	Greek	This corpus contains newspaper articles in various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository.	Download
The Norwegian Newspaper Corpus Size: 700 million tokens Annotation: multitagged Licence: CC-BY	Norwegian	This corpus contains articles from 24 Norwegian newspapers from 1998 onwards. The corpus is available through the concordancer Corpuscle.	Concordancer
ChronoPress Corpus of Polish Press Texts Size: 20 million tokens Annotation: tokenised, PoS-tagged, named entities Licence: CLARIN PUB	Polish	This corpus contains articles from various Polish newspapers from 1945 to 1962. The corpus is available through a dedicated concordancer.	Concordancer
8 sidor Size: 678,000 tokens Annotation: tokenised, PoS-tagged, parsed, compounds Licence: CC-BY	Swedish	This corpus contains articles from the Swedish newspaper 8 sidor from 2003 to 2012. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Dagny Size: 8.1 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Dagny from 1886 to 1913. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
DN 1987 Size: 5 million tokens Annotation: tokenised, PoS-tagged, parsed, compounds Licence: CC-BY	Swedish	This corpus contains articles from the Swedish newspaper Dagens Nyheter from 1987. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
GP 1994 and 2001-2011 Size: 271 million tokens Annotation: tokenised, PoS-tagged, parsed, compounds Licence: CC-BY	Swedish	This corpus contains articles from the Swedish newspaper Göteborgsposten from 1994 and from 2001 to 2011. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Hertha Size: 3.8 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Hertha from 1914 to 2015. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Idun Size: 2 million tokens Annotation: tokenized, PoS-tagged, parsed	Swedish	This corpus contains articles from the newspaper Idun from 1887 to 1917. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Kvinnornas Tidning Size: 5.5 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Kvinnornas Tidning for the period between 1921 and 1925. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Morgonbris Size: 3.5 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Morgonbris from 1904 to 1924. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Rösträtt för Kvinnor Size: 2.2 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Rösträtt för Kvinnor from 1912 to 1919. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
Smittskydd Size: 691,000 tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from the newspaper Smittskydd from 2002 to 2010. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
The Webbnyheter corpus Size: 272 million tokens Annotation: tokenized, PoS-tagged, parsed Licence: CC-BY	Swedish	This corpus contains articles from various Swedish online newspapers from 2001 to 2013. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.	Concordancer Download
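
The file layout described above for "The image of Germany in the Greek press" (bibliographical information encoded in the file path, plus a parallel tree holding one lemma per line) could be read along the following lines. This is only a minimal sketch: the corpus description names the fields but not their exact form, so the directory order, the numeric date components, the root directory names and all function names here are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, NamedTuple


class ArticleInfo(NamedTuple):
    newspaper: str
    year: int
    month: int
    day: int
    rubric: str


def parse_article_path(path: str) -> ArticleInfo:
    """Read the bibliographical fields from the directories of a file path.

    ASSUMPTION: the layout is
    <root>/<newspaper>/<year>/<month>/<day>/<rubric>/<file>
    with numeric year, month and day; the real corpus may differ.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 6:
        raise ValueError(f"path has too few components: {path!r}")
    # The five directories directly above the file carry the metadata.
    newspaper, year, month, day, rubric = parts[-6:-1]
    return ArticleInfo(newspaper, int(year), int(month), int(day), rubric)


def lemma_path(text_path: str, text_root: str, lemma_root: str) -> str:
    """Map a text file to its counterpart in the lemma tree.

    ASSUMPTION: both trees share the same relative structure and
    differ only in their (hypothetical) root directory.
    """
    relative = PurePosixPath(text_path).relative_to(text_root)
    return str(PurePosixPath(lemma_root) / relative)


def read_lemmata(lines: Iterable[str]) -> List[str]:
    """Collect lemmata from lines holding one lemma each, skipping blanks."""
    return [line.strip() for line in lines if line.strip()]
```

With a hypothetical path such as `texts/ToVima/2010/05/03/politics/0001.txt`, `parse_article_path` would return the newspaper title, the date components as integers and the rubric, while `lemma_path(path, "texts", "lemmata")` would point to the matching file in the lemma tree. Any open text file can be passed to `read_lemmata`, since iterating over a file yields its lines.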

Multilingual Corpora

Corpus	Language	Description	Availability
Parallel Global Voices Size: 8 million units Licence: CC BY	40 languages	This corpus contains articles from the Global Voices website (https://globalvoices.org/), where volunteers publish and translate news stories in more than 40 languages.	Download
ACCURAT corpus of comparable sentences Size: 23,820 sentences Licence: CC BY	English-Croatian, English-Greek, English-Estonian, English-Latvian, English-Lithuanian, English-Romanian, English-Slovenian, Greek-Romanian, Latvian-Lithuanian, Romanian-German, Romanian-Lithuanian and German-English	This comparable corpus contains sentence pairs extracted from comparable news corpora. The corpus is available for download from the CLARIN:EL repository.	Download
SETIMES - A parallel corpus of the Balkan languages Size: 341.83 million tokens Annotation: sentence-aligned Licence: Open For Reuse With Restrictions	Romanian, Turkish, Serbian, English, Bulgarian, Macedonian, Croatian, Greek, Albanian	This parallel corpus contains online news articles extracted from the SETimes webpage. The corpus is available for download from the CLARIN:EL repository.	Download
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 8.8 billion tokens Annotation: tokenised, MSD-tagged, syntactically parsed Licence: CC-BY	Swedish and Finnish	This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1770 to 2011. The corpus can be accessed through the concordancer Korp.	Concordancer
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Licence: CC-BY	Swedish and Finnish	This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1771 to 1874. The corpus can be downloaded from FIN-CLARIN.	Download
Corpora of Newspaper Texts Size: 435 million tokens Annotation: tokenised Licence: under negotiation	Swedish, English and Finnish	This corpus contains articles from a variety of Swedish, English and Finnish newspapers. The corpus can be found in the FIN-CLARIN repository although its availability and licence are still under negotiation.

The Europeana Collection

Corpus	Language	Description	Availability
Europeana historical newspapers: Netherlands Size: 2,869,483,985 words Licence: Public	Dutch, French, English, Spanish (Castilian), Hebrew, Western Frisian, German, Punjabi, Arabic	This corpus contains 4266 issues of 164 newspapers published in the Netherlands between 1618 and 1940.	VLO
Europeana historical newspapers: Estonia Size: 351,656,185 words Licence: Public	Estonian, Russian, German	This corpus contains 92,558 issues of 40 newspapers published in Estonia between 1852 and 1946.	VLO
Europeana historical newspapers: Finland Size: 393,776,815 words Licence: Public	Finnish, Swedish	This corpus contains 24,164 issues of 10 newspapers published in Finland between 1900 and 1910.	VLO
Europeana historical newspapers: Luxembourg Size: 29,266,765 words Licence: Public	French	This corpus contains 1225 issues of 2 newspapers published in Luxembourg between 1704 and 1794.	VLO
Europeana historical newspapers: Germany Size: 5,593,768,847 words Licence: Public	German, English	This corpus contains 126,564 issues of 11 newspapers published in Germany (chiefly Berlin and Hamburg) between 1792 and 1945.	VLO
Europeana historical newspapers: Austria Size: 2,351,079,191 words Licence: Public	German, Modern Greek, Croatian	This corpus contains 147,515 issues of 77 newspapers published in Austria between 1683 and 1930.	VLO
Europeana historical newspapers: Latvia Size: 964,243,746 words Licence: Public	Latvian, Russian, German, Polish, Estonian	This corpus contains 67,870 issues of 77 newspapers published in Latvia between 1868 and 1955.	VLO
Europeana historical newspapers: Poland Size: 181,102,489 words Licence: Public	Polish, German, Ukrainian, Russian	This corpus contains 15,130 issues of 10 newspapers published in Poland between 1866 and 1939.	VLO
Europeana historical newspapers: Serbia Size: 338,080,416 words Licence: Public	Serbian	This corpus contains 22,087 issues of 44 newspapers published in Serbia between 1830 and 1944.	VLO

Other Newspaper Corpora

Monolingual Corpora

Corpus	Language	Description	Availability
Zurich English Newspaper Corpus Size: 1.6 million tokens Annotation: tokenised Licence: public	English	This corpus contains articles from various English newspapers (mainly newspapers from London) from the 17th and 18th centuries.	For access, contact the authors.
deu_newscrawl_2011 Size: 426 million tokens Annotation: tokenised	German	This corpus contains articles from various German newspapers from 2011. The corpus is available through a dedicated concordancer.	Concordancer
CRIPCO Size: 43,000 documents Annotation: coreference resolution Licence: proprietary	Italian	This corpus contains articles from the Italian newspaper L’Adige from 1999 to 2006. The corpus is available for download through META-SHARE.	Download
"LA REPUBBLICA" CORPUS Size: 380 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY	Italian	The corpus contains articles from the Italian newspaper La Repubblica. The corpus is available through the noSketch Engine concordancer.	Concordancer
WItaC - NewsReader Wikinews Italian Corpus Size: 40,231 tokens Annotation: entities, events, event factuality, temporal information, semantic roles, and intra-document and cross-document event and entity coreference Licence: CC-BY	Italian	This corpus contains Italian translations of 120 English Wikinews articles. The corpus is available for download from a dedicated website. For the relevant publication, see Minard et al. (2016).	Download
Corpus of Contemporary Serbian Newspapers and Magazines Size: 916 million tokens Annotation: tokenised, PoS-tagged and lemmatised Licence: CC-BY	Serbian	This corpus contains articles from over 100 Serbian newspapers from 2004 to 2012.	For access, contact the resource manager.

Newspaper Corpora in the CLARIN Infrastructure

Monolingual Corpora

Multilingual corpora

The Europeana Collection

Other Newspaper Corpora

Monolingual Corpora

Multilingual Corpora

Additional Materials

Publications on the Newspaper Corpora