Digital newspaper collections are a rich source of information for researchers in many disciplines in the Humanities and Social Sciences. They are especially valuable for synchronic as well as diachronic studies in fields ranging from history and media and communication studies to lexicography, for which newspapers are a rich source of neologisms and other lexicographic phenomena.
The CLARIN infrastructure gives access to 31 newspaper corpora, 6 of which are multilingual and 25 monolingual. The available corpora contain newspaper articles in languages such as Arabic, Czech, Finnish, French, German, Greek, Italian, Norwegian, Polish and Swedish. Almost a third of the newspaper corpora are historical, with the oldest articles dating from the 18th century. The majority of them are richly tagged and are available under public licences. We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
Additionally, the CLARIN infrastructure gives access to the entire Europeana historical newspaper collection, which is listed here under the section The Europeana Collection. The collection is divided into 9 subsets by country. Each subset corresponds to a CLARIN Virtual Collection, which includes a link to the VLO with parameters to select the relevant country’s newspaper records, a link to the full metadata archive and links to the metadata records for all the newspaper titles. The latter provide access to the records for specific years, where you can directly browse the individual newspaper issues.
The Europeana collection can be accessed directly through the VLO. For instance, the newspaper Sakala, which is part of the Estonian collection, consists of 64 annual issues published between 1878 and 1944. Each issue has its own VLO entry that is part of a nested hierarchy with the main newspaper issue, from which the individual issues can be both browsed in the form of scans and downloaded.
The newspaper issues included in the Europeana Newspapers collection can also be browsed and viewed through the thematic collection on Europeana’s portal.
For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Newspaper Corpora in the CLARIN Infrastructure
Monolingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
SYN2006PUB: corpus of Czech newspapers Size: 300 million tokens |
Czech |
This corpus contains articles from 11 Czech newspapers from 1989 to 2004. The corpus is available for download from the Czech repository LINDAT. |
Download |
SYN2013PUB: corpus of written Czech newspapers Size: 935 million tokens |
Czech |
This corpus contains articles from Czech newspapers from 2005 to 2009. The corpus is available for download from the Czech repository LINDAT. |
Download |
The Karelian Finnish Newspaper Corpus Size: 500,000 tokens |
Finnish |
This corpus contains articles from the Finnish newspaper Karjalan Sanomat from 2012 to 2014. The corpus is available through the concordancer Korp. |
Concordancer |
Corpus journalistique issu de l'Est Républicain Annotation: MSD-tagged, lemmatised |
French |
This corpus contains articles from the French newspaper l'Est Républicain from 1999 to 2003. The corpus is available for download from Ortolang. |
Download |
Tübingen Treebank of Written German / Newspaper Corpus Size: 1.8 million tokens |
German |
This corpus contains articles from the German newspaper Die Tageszeitung. The corpus is available through a dedicated concordancer with an institutional account. |
Concordancer |
Size: 900,000 tokens |
German |
This corpus contains articles from the German newspaper Frankfurter Rundschau. The corpus is available for download from a dedicated webpage. |
Download |
Mannheim Corpus of Historical Newspapers and Magazines Size: 4.1 million tokens |
German |
This corpus contains articles from 21 German newspapers from the 18th and 19th centuries. The corpus is available for download from the CLARIN-D repository. |
Download |
Corpus "Library and Information Centre - Newspapers" Size: 20 units |
Greek |
The corpus contains newspaper articles. The corpus is available for download from the CLARIN:EL repository. |
Download |
The image of Germany in the Greek press Size: 3.5 million tokens, 7650 texts |
Greek |
The corpus consists of newspaper articles from three Greek newspapers (Ta Nea, Risospastis, and To Vima) dealing with Germany from the Greek perspective. Bibliographical information is encoded in the path to the file, which is composed of the title of the newspaper, year, month, day, and rubric. The lemmata are stored in a separate tree of the same structure; the text files in that tree contain one lemma per line. The corpus is available for download from CLARIN-D (Saarland University B-centre). For the relevant publication, see Tsotsou (2019). |
Download |
Modern Greek Texts Corpus - "Makedonia" newspaper Size: 3 million tokens |
Greek |
This corpus contains newspaper articles on various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository. |
Download |
Modern Greek Texts Corpus - "Ta Nea" newspaper Size: 2 million words |
Greek |
This corpus contains newspaper articles on various topics (politics, economy, sports). The corpus is available for download from the CLARIN:EL repository. |
Download |
The Norwegian Newspaper Corpus Size: 700 million tokens |
Norwegian |
This corpus contains articles from 24 Norwegian newspapers from 1998 onwards. The corpus is available through the concordancer Corpuscle. |
Concordancer |
ChronoPress Corpus of Polish Press Texts Size: 20 million tokens |
Polish |
This corpus contains articles from various Polish newspapers from 1945 to 1962. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 678,000 tokens |
Swedish |
This corpus contains articles from the Swedish newspaper 8 sidor from 2003 to 2012. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 8.1 million tokens |
Swedish |
This corpus contains articles from the newspaper Dagny from 1886 to 1913. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 5 million tokens |
Swedish |
This corpus contains articles from the Swedish newspaper Dagens Nyheter from 1987. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 271 million tokens |
Swedish |
This corpus contains articles from the Swedish newspaper Göteborgsposten from 1994 and from 2001 to 2011. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 3.8 million tokens |
Swedish |
This corpus contains articles from the newspaper Hertha from 1914 to 2015. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 2 million tokens |
Swedish |
This corpus contains articles from the newspaper Idun from 1887 to 1917. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 5.5 million tokens |
Swedish |
This corpus contains articles from the newspaper Kvinnornas Tidning for the period between 1921 and 1925. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 3.5 million tokens |
Swedish |
This corpus contains articles from the newspaper Morgonbris from 1904 to 1924. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 2.2 million tokens |
Swedish |
This corpus contains articles from the newspaper Rösträtt för Kvinnor from 1912 to 1919. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 691,000 tokens |
Swedish |
This corpus contains articles from the newspaper Smittskydd from 2002 to 2010. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
|
Size: 272 million tokens |
Swedish |
This corpus contains articles from various Swedish online newspapers from 2001 to 2013. The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp. |
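
The description of the corpus The image of Germany in the Greek press notes that bibliographical information is encoded in the file path and that a parallel tree holds one lemma per line. The following Python sketch shows how such a layout could be read. It assumes a hypothetical path shape of `newspaper/year/month/day/rubric/file`, which is not confirmed by the corpus documentation; the names `ArticleInfo`, `parse_article_path` and `parse_lemmata` are illustrative only, and the actual directory names and separators should be checked against the downloaded corpus.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ArticleInfo:
    newspaper: str
    year: int
    month: int
    day: int
    rubric: str


def parse_article_path(relative_path: str) -> ArticleInfo:
    """Split an assumed newspaper/year/month/day/rubric/file path into fields."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 path components, got {len(parts)}")
    # Assumption: the five directories directly above the file carry the metadata.
    newspaper, year, month, day, rubric = parts[-6:-1]
    return ArticleInfo(newspaper, int(year), int(month), int(day), rubric)


def parse_lemmata(text: str) -> list[str]:
    """Return the lemmata from a file body that holds one lemma per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Because the lemma tree is described as having the same structure as the text tree, the same `parse_article_path` call could be applied to files in either tree, so that a text file and its lemma file can be matched on identical metadata.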
Multilingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 8 million units |
40 languages | This corpus contains articles from the https://globalvoices.org/ website, where volunteers publish and translate news stories in more than 40 languages. | Download |
ACCURAT corpus of comparable sentences Size: 23,820 sentences |
English-Croatian, English-Greek, English-Estonian, English-Latvian, English-Lithuanian, English-Romanian, English-Slovenian, Greek-Romanian, Latvian-Lithuanian, Romanian-German, Romanian-Lithuanian and German-English |
This comparable corpus contains sentence pairs extracted from news comparable corpora. The corpus is available for download from the CLARIN:EL repository. |
Download |
SETIMES - A parallel corpus of the Balkan languages Size: 341.83 million tokens |
Romanian, Turkish, Serbian, English, Bulgarian, Macedonian, Croatian, Greek, Albanian |
This parallel corpus contains online news articles extracted from the SETimes webpage. The corpus is available for download from the CLARIN:EL repository. |
Download |
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 8.8 billion tokens |
Swedish and Finnish |
This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1770 to 2011. The corpus can be accessed through the concordancer Korp. |
Concordancer |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Licence: CC-BY |
Swedish and Finnish |
This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1771 to 1874. The corpus can be downloaded from FIN-CLARIN. |
Download |
Size: 435 million tokens |
Swedish, English and Finnish |
This corpus contains articles from a variety of Swedish, English and Finnish newspapers. The corpus can be found in the FIN-CLARIN repository although its availability and licence are still under negotiation. |
The Europeana Collection
Corpus | Language | Description | Availability |
---|---|---|---|
Europeana historical newspapers: Netherlands Size: 2,869,483,985 words |
Dutch, French, English, Spanish (Castilian), Hebrew, Western Frisian, German, Punjabi, Arabic | This corpus contains 4266 issues of 164 newspapers published in the Netherlands between 1618 and 1940. | VLO |
Europeana historical newspapers: Estonia Size: 351,656,185 words |
Estonian, Russian, German | This corpus contains 92,558 issues of 40 newspapers published in Estonia between 1852 and 1946. | VLO |
Europeana historical newspapers: Finland Size: 393,776,815 words |
Finnish, Swedish | This corpus contains 24,164 issues of 10 newspapers published in Finland between 1900 and 1910. | VLO |
Europeana historical newspapers: Luxembourg Size: 29,266,765 words |
French | This corpus contains 1225 issues of 2 newspapers published in Luxembourg between 1704 and 1794. | VLO |
Europeana historical newspapers: Germany Size: 5,593,768,847 words |
German, English | This corpus contains 126,564 issues of 11 newspapers published in Germany (chiefly Berlin and Hamburg) between 1792 and 1945. | VLO |
Europeana historical newspapers: Austria Size: 2,351,079,191 words |
German, Modern Greek, Croatian | This corpus contains 147,515 issues of 77 newspapers published in Austria between 1683 and 1930. | VLO |
Europeana historical newspapers: Latvia Size: 964,243,746 words |
Latvian, Russian, German, Polish, Estonian | This corpus contains 67,870 issues of 77 newspapers published in Latvia between 1868 and 1955. | VLO |
Europeana historical newspapers: Poland Size: 181,102,489 words |
Polish, German, Ukrainian, Russian | This corpus contains 15,130 issues of 10 newspapers published in Poland between 1866 and 1939. | VLO |
Europeana historical newspapers: Serbia Size: 338,080,416 words |
Serbian | This corpus contains 22,087 issues of 44 newspapers published in Serbia between 1830 and 1944. | VLO |
Other Newspaper Corpora
Monolingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Zurich English Newspaper Corpus Size: 1.6 million tokens |
English | This corpus contains articles from various English newspapers (mainly newspapers from London) from the 17th and 18th centuries. | For access, contact the authors. |
Size: 426 million tokens |
German |
This corpus contains articles from various German newspapers from 2011. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 43,000 documents |
Italian |
This corpus contains articles from the Italian newspaper L’Adige from 1999 to 2006. The corpus is available for download through META-SHARE. |
Download |
Size: 380 million tokens |
Italian |
The corpus contains articles from the Italian newspaper La Repubblica. The corpus is available through the noSketch Engine concordancer. |
Concordancer |
WItaC - NewsReader Wikinews Italian Corpus Size: 40,231 tokens |
Italian |
This corpus contains Italian translations of 120 English Wikinews articles. The corpus is available for download from a dedicated website. For the relevant publication, see Minard et al. (2016). |
Download |
Corpus of Contemporary Serbian Newspapers and Magazines Size: 916 million tokens |
Serbian | This corpus contains articles from over 100 Serbian newspapers from 2004 to 2012. | For access, contact the resource manager. |
Multilingual Corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Europeana Newspapers NER Corpora Size: 500,000 tokens (182,483 Dutch; 207,000 French; 96,735 German) |
Dutch, French and German |
This corpus contains articles from Europeana newspapers for the following time periods: 1811-1856 for the Dutch subcorpus, 1871-1916 for the French subcorpus, and 1926 for the German subcorpus. The corpus is available for download from the KB Lab. For the relevant publication, see Neudecker (2016). |
Download |
Size: 35 billion tokens |
18 languages |
This corpus contains articles from the JSI Newsfeed from 2014 to 2017. The corpus is available through the noSketch Engine concordancer. For the relevant publication, see Bušta et al. (2017). |
Concordancer |
Additional Materials
CLARIN-PLUS workshop: 'Working with Digital Collections of Newspapers', 19-21 September 2016, Leuven, Belgium. [html]
Video lectures of the CLARIN-PLUS workshop. [html]
Workshop 'Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining', 5-6 March 2018, Helsinki, Finland. [html]
Slides for 'Hacking the News' workshop. [gdoc]
Publications on the Newspaper Corpora
[Bušta et al. 2017] Jan Bušta, Ondřej Herman, Miloš Jakubíček, Simon Krek, Blaž Novak. 2017. JSI Newsfeed Corpus. [pdf]
[Minard et al. 2016] Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, Chantal van Son. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus.
[Neudecker 2016] Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers.