Historical corpora

Introduction

The CLARIN infrastructure offers access to 74 historical corpora, covering almost all of the languages spoken in countries that are either members or observers in CLARIN ERIC. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes of the existing content or inclusion of new corpora, send us an email.

This website was last updated on 27 July 2021.

Historical corpora in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

Open Richly Annotated Cuneiform Corpus, Korp Version

Size: 741,100 tokens
Annotation: tokenised
Licence: CC-BY-SA

Akkadian

This corpus contains cuneiform texts from Ancient history.

The corpus is available through the concordancer Korp.

Concordancer

Greek Medieval Texts

Size: 3.4 million words
Licence: CC-BY

Ancient Greek

This corpus contains texts from the 4th to the 16th century.

The corpus is available for download from the clarin:el repository.

Download

Sheffield Corpus of Chinese

Size: 148,876 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

Chinese

This corpus contains fictional and non-fictional texts from the Medieval and Modern Chinese periods.

The corpus is available for download from the Oxford Text Archive.

Download

Brieven als buit (Letters as loot)

Size: 460,000 words
Annotation: lemmatised, PoS-tagged, grammatically tagged
Licence: CLARIN PUB

Dutch

This corpus contains 1,000 letters from the 17th to the 18th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rutten and van der Wal (2014).

Concordancer

Corpus Gysseling

Size: 1.5 million words
Annotation: PoS-tagged, lemmatised
Licence: INT Licence for researchers

Dutch

This corpus contains texts from the 13th century.

The corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.

Concordancer

Download

A Corpus of English Dialogues 1560-1760 (CED)

Size: 1.2 million words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains dialogues from literary and didactic works from 1560 to 1760.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Early English Correspondence Sampler (CEECS)

Size: 450,000 words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains 1147 letters from 1418 to 1680.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Late Modern English prose / David Denison

Size: 580,056 words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains fictional texts from 1837 to 1926.

The corpus is available for download from the Oxford Text Archive.

Download

Hansard Corpus

Size: 1.6 billion tokens
Annotation: tokenised, PoS-tagged, lemmatised, semantic tags

English

This corpus contains parliamentary debates from 1803 to 2005.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rayson et al. (2015).

Concordancer

Helsinki Corpus of Scottish Correspondence (1540-1750)

Size: 500,000 tokens
Annotation: tokenised
Licence: CLARIN ACA

English

This corpus contains personal correspondence from 1540 to 1750.

The corpus is available through the concordancer Korp.

Concordancer

Older Scottish texts : the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith

Size: 877,000 tokens
Annotation: tokenised
Licence: CC-BY-NC-SA 3.0

English

This corpus contains texts from 1450 to 1600.

The corpus is available for download from the Oxford Text Archive.

Download

Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn

Size: 431,013 words
Licence: CC-BY-NC-SA 3.0

English

This corpus contains pamphlets of the American Revolution from 1750 to 1776.

The corpus is available for download from the Oxford Text Archive.

Download

Parsed Corpus of Early English Correspondence (PCEEC)

Size: 2.2 million words
Annotation: tokenised, PoS-tagged, syntactically parsed
Licence: Oxford Text Archive licence

English

This corpus contains correspondence from around 1410 to 1681.

This corpus is available for download from the Oxford Text Archive.

Download

Royal Society Corpus (Version 4.0)

Size: 35 million tokens
Annotation: PoS-tagged using PennTreebank tagset, lemmatised, normalised
Licence: CC-BY-NC-SA-4.0

English

This corpus contains articles from the Philosophical Transactions of the Royal Society of London journal from 1665 to 1869.

The corpus is available for download from the CLARIN-D repository as well as through a concordancer.

Concordancer

Download

The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose

Size: 300,000 words
Annotation: COCOA-style
Licence: Oxford Text Archive licence

English

This corpus contains texts from 1761 to 1790.

The corpus is available for download from the Oxford Text Archive.

Download

The Lampeter Corpus of Early Modern English Tracts

Size: 50,797,916 words
Annotation: no linguistic annotation
Licence: CC-BY-NC-SA 3.0

English

This corpus contains tracts from 1640 to 1740.

The corpus is available for download from the Oxford Text Archive.

Download

The Lancaster Newsbooks Corpus

Size: 3,001,604 words
Licence: CC-BY-NC-SA 3.0

English

This corpus contains two collections of English newsbooks from 1654 to 1655.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Historical American English - Kielipankki Korp version 2017H1

Size: 385 million tokens
Annotation: tokenised
Licence: CLARIN ACA

English (American)

This corpus contains texts from 1810 to 2009.

The corpus is available through the concordancer Korp.

Concordancer

The Corpus of Late Modern English Texts, version 3.1

Size: 34 million words
Annotation: PoS-tagged
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains texts written by British and Irish authors from 1710 to 1920.

The corpus is available for download from a CLARIN-D repository.

Download

The Old Bailey Corpus

Size: 134 million words
Annotation: detailed sociobiographical, pragmatic and textual annotation
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains proceedings of the Old Bailey (i.e., legal documents) from 1674 to 1913.

The corpus is available for download from the CLARIN-D repository and through the CQPConcordancer.

For the corpus manual, see Huber et al. (2016).

Concordancer

Download

Helsinki corpus of English texts

Size: 240,000 words
Licence: Oxford Text Archive licence

English (Old and Middle)

This corpus contains Biblical and fictional texts from 730 to 1710.

The corpus is available for download from the Oxford Text Archive.

Download

The York-Helsinki parsed corpus of Old English poetry (YCOEP)

Size: 71,500 words
Annotation: syntactically-parsed
Licence: Oxford Text Archive licence

English (Old)

This corpus contains poems from 730 to 1710.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Old Written Estonian

Size: 2 million tokens
Annotation: tokenised; 16th- to 18th-century texts have been tagged with contemporary Estonian, morphological and language information; 19th-century texts are unannotated
Licence: CC-BY

Estonian

This corpus covers secular and religious texts from the 16th to the 18th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Kingisepp et al. (2004).

Concordancer

Classics of Finnish Literature, Kielipankki Version

Size: 1.5 million words
Licence: EUPL v.1.1 SA

Finnish

This corpus contains literary texts from 1880 to 1949.

The corpus is available through the concordancer Korp.

Concordancer

Corpus of Old Literary Finnish

Size: 4.1 million words
Annotation: MSD-tagged, syntactically parsed
Licence: EUPL v.1.1 SA

Finnish

This corpus contains literary texts from 1543 to 1810.

The corpus is available through the concordancer Korp.

Concordancer

The Finnish Gutenberg Corpus

Size: 34.5 million words
Licence: CC-BY

Finnish

This corpus contains books published up to 1925 that are made available through the Gutenberg project.

The corpus is available through the concordancer Korp.

Concordancer

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 5.2 billion tokens
Annotation: tokenised
Licence: CC-BY-SA

Finnish

This corpus contains newspaper articles from 1840 to 2011.

The corpus is available through the concordancer Korp.

Concordancer

The Morpho-Syntactic Database of Mikael Agricola's Works

Size: 428,300 tokens
Annotation: tokenised, PoS-tagged, morphological components and syntactic function
Licence: CC-BY-ND

Finnish

This corpus contains texts from 1544 to 1551 written by the clergyman Mikael Agricola.

The corpus is available through the concordancer Korp.

Concordancer

Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version

Size: 48 texts
Licence: CC-BY-NC-ND

Finnish

This corpus contains literary texts from 1543 to 1791.

 

Partonopeus de Blois: transcriptions of all manuscripts and fragments

Size: 21,736,766 words
Annotation: no linguistic annotation
Licence: CC BY-NC-SA 3.0

French (Old)

This corpus contains transcriptions of the manuscripts and fragments of the romance Partonopeus de Blois.

The corpus is available for download from the Oxford Text Archive.

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 tokens
Annotation: tokenised, syntactically-parsed
Licence: CLARIN ACA

French (Old)

This corpus contains texts from the 9th to the 13th century.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Stein (2013).

Download

Austrian Baroque Corpus

Size: 200,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This corpus contains sermons from 1650 to 1750.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016).

Concordancer

DDR-Presseportal (GDR press portal)

 

German

This corpus contains newspaper texts from 1945 to 1994.

The corpus is available through a concordancer provided by CLARIN-D.

Concordancer

Deutsches Textarchiv (DTA)

Size: 215,168,761 tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CLARIN PUB

German

This corpus contains texts from the 17th to the 20th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Haaf and Thomas (2016).

Concordancer

Die Grenzboten (journal)

Size: 89 million tokens
Annotation: tokenised, lemmatised, PoS-tagged, normalised orthography
Licence: CC-BY-NC-SA 3.0

German

This corpus contains texts from 1842 to 1921.

The corpus is available for download from the Deutsches Textarchiv and through a concordancer.

Concordancer

Download

Dinglers Polytechnisches Journal (Polytechnical Journal of Dingler)

Size: 77.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised, normalised orthography
Licence: CC-BY-NC-SA 3.0

German

This corpus contains academic texts from 1820 to 1931.

The corpus is available for download from the Deutsches Textarchiv and through a concordancer.

Concordancer

Download

GeMi Corpus

Size: 120,000 tokens
Annotation: Lite markup, no linguistic annotation
Licence: CC-BY-NC-SA 3.0

German

This corpus contains medical writing from 1500 to 1700.

The corpus is available for download from the Oxford Text Archive.

Download

GerManC. A Historical Corpus of German Newspapers 1650-1800

Size: 700,000 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

German

This corpus contains personal letters, sermons and fictional, scholarly (i.e., humanities), scientific and legal texts from 1650 to 1800.

The corpus is available for download from the Oxford Text Archive.

Download

Mannheimer Korpus Historischer Zeitungen und Zeitschriften

Size: 3532 pages

German

This corpus contains texts from the 18th and 19th centuries.

The corpus is available for download directly through the VLO.

Download

Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus)

Size: 2.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised, normalised, morphosyntactic description
Licence: CC-BY-SA 4.0

German

This corpus contains texts from 1050 to 1350.

The corpus is available for download from the Deutsches Textarchiv and through a concordancer.

For the relevant publication, see Klein and Dipper (2016).

Concordancer

Download

B4 Historisches Predigtenkorpus zum Nachfeld

Size: 92,500 tokens
Annotation: tokenised, syntactic and discursive annotation
Licence: CLARIN ACA

German (Middle High)

This corpus contains sermons from an Upper German (Bavarian-Alemannic) dialect area.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

Concordancer

Download

B4 Ludolf

Size: 6,690 tokens
Annotation: tokenised, tagged for clause type and grammatical function
Licence: CLARIN ACA

German (Middle High)

This corpus contains texts from a journey diary from 1350.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

Concordancer

Download

Reference Corpus Middle Low German/Low Rhenish (1200-1650)

Size: 200,700 tokens
Annotation: tokenised, MSD-tagged
Licence: CC-BY

German (Middle Low)

This corpus contains texts from the 13th century to the middle of the 17th century.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

For the relevant publication, see Schröder (2014).

Concordancer

Download

SaCoCo—Saarbrücken Cookbook Corpus

Size: 436,000 tokens
Annotation: PoS-tagged using the STTS tagset, lemmatised, normalised
Licence: CC-BY-NC-SA-3.0

German

This corpus contains historical cookbook recipes from 1569 to 1800, as well as contemporary ones from 2012.

The corpus is available through the CQPweb concordancer provided by CLARIN-D.

Concordancer

OROSSIMO Corpus – History

Size: 553,000 tokens
Annotation: structural annotation (paragraph)
Licence: CC-BY

Greek

This corpus contains historic academic texts.

The corpus is available for download from the clarin:el repository.

Download

Hungarian Historical Corpus

Size: 30 million words

Hungarian

This corpus contains historical texts from the 18th century to the 2000s.

The corpus is available through a dedicated concordancer.

Concordancer

The Saga Corpus

Size: 1.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised, normalised orthography
Licence: CC-BY 4.0

Icelandic (Old)

This corpus contains Old Icelandic (Old Norse) narrative texts from the 13th to the 15th century.

The corpus is available for download from CLARIN-IS and for search through the concordancer Korp.

For the relevant publication, see Rögnvaldsson and Helgadóttir (2011).

Concordancer

Download

ChroniclItaly

Size: 16.6 million words
Annotation: unannotated
Licence: ODC Attribution License (ODC-By)

Italian

This corpus contains Italian language newspapers published in the United States between 1898 and 1920. The corpus includes seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia. The collection includes the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia.

The corpus is available for download from the repository of the University of Utrecht.

Download

LatinISE corpus (version 4)

Size: 13.3 million tokens
Annotation: sentence-segmented, PoS-tagged, lemmatised
Licence: CC BY-NC-SA 4.0

Latin

This corpus consists of Latin texts from the 2nd century B.C. to the 21st century. Non-linguistic metadata include information on genre, title, century and specific date.

The corpus is available for download from LINDAT and for search online through Sketch Engine.

For the relevant publication, see McGillivray and Kilgarriff (2015).

Concordancer

Download

Menota

Size: 1.6 million tokens
Annotation: tokenised, MSD-tagged, lemmatised
Licence: CC-BY

Old Norse

This corpus contains Medieval Nordic texts.

The corpus is available for download and through the concordancer Corpuscle.

Concordancer

Download

Chronopress

Size: 16 million tokens
Licence: CC-BY-SA

Polish

This corpus contains newspaper articles from 1945 to 1954.

The corpus is available through a dedicated concordancer.

Concordancer

Polish language of the 1960s

Size: 500,000 words
Annotation: MSD-tagged
Licence: CC-BY-NC-SA 3.0

Polish

This corpus contains essays, news articles, and scientific and literary texts from 1963 to 1967.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of biblical text in Scots / John Kirk

Size: 35,506 words
Annotation: no annotation
Licence: Oxford Text Archive licence

Scots

This corpus contains Biblical texts.

The corpus is available for download from the Oxford Text Archive.

Download

The Helsinki corpus of Older Scots : [1450-1700]

Size: 1,940,706 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

Scots

This corpus contains texts of different domains and genres (e.g., burgh records, diaries, pamphlets, scientific treatises, sermons) from 1450 to 1700.

The corpus is available for download from the Oxford Text Archive.

Download

Digital library and corpus of historical Slovene IMP 1.1

Size: 17.7 million tokens
Annotation: tokenised, lemmatised, PoS-tagged
Licence: CC-BY-SA 4.0

Slovenian

This corpus contains 658 unique texts from 1584 to 1919.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2015).

Concordancer

Download

Reference corpus of historical Slovene goo300k 1.2

Size: 300,000 tokens
Annotation: manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words
Licence: CC-BY 4.0

Slovenian

This corpus contains 89 unique texts from 1584 to 1899.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2012).

Concordancer

Download

The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 3.5 billion tokens
Annotation: tokenised
Licence: CC-BY-SA

Swedish

This corpus contains newspaper articles from 1770 to 1950.

The corpus is available through the concordancer Korp.

Concordancer

Historical Corpus of the Welsh Language 1500-1850

Size: 420,000 words

Welsh

This corpus contains 30 texts from 1500 to 1850.

The corpus is available for download from a dedicated website and through a dedicated concordancer.

 

Multilingual corpora

Corpus Language Description Availability

"PolDiLemma" Middle Polish Diachrone Lemmatised Corpus

Size: 7 million tokens
Annotation: tokenised, lemmatised
Licence: CC BY-NC-SA 4.0

Czech, German, Latin, Polish

This corpus contains political, religious and scientific texts from the 16th to the 18th century.

The corpus is available for download from the CLARIN-D repository.

Download

Medieval Charter Sections Corpus

Size: 57 chapters
Annotation: manually-tagged, named entities
Licence: CC-BY-NC-SA 4.0

Czech, Latin

This corpus contains Latin charters created in the era of John the Blind, King of Bohemia.

The corpus is available for download from LINDAT.

For the relevant publication, see Galuščáková and Neužilová (2018).

Download

Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo

Size: 4000 words
Annotation: no linguistic annotation
Licence: Oxford Text Archive licence

English (Middle), Hebrew

This corpus contains literary texts from 1100 to 1400.

The corpus is available for download from the Oxford Text Archive.

Download

Dictionary of Old English Corpus in Electronic Form (DOEC)

Annotation: no linguistic annotation
Licence: Oxford Text Archive licence

English (Old), Latin

This corpus contains 3037 texts from 600 to 1150.

The corpus is available for download from the Oxford Text Archive.

Download

The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)

Size: 1.5 million words
Annotation: syntactically-parsed
Licence: Oxford Text Archive licence

English (Old), Latin

This corpus contains fictional texts from 600 to 1150.

The corpus is available for download from the Oxford Text Archive.

Download

Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA)

Size: 128,000 words
Annotation: MSD-tagged, syntactically parsed
Licence: CLARIN RES

English, German, Latin, Old Norse, Swedish

This corpus contains texts written in the Late Old Swedish period (from 1375 to 1550).

The corpus is available for download from the repository of the University of Hamburg.

Download

The Electronic Text Corpus of Sumerian Literature. Revised edition

Size: 5,151,373 words
Annotation: Each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation.
Licence: CC-BY-NC-SA 3.0

English, Sumerian

This corpus contains transliterations and English translations of 394 Sumerian compositions from approximately 2100 to 1700 BCE.

The corpus is available for download from the Oxford Text Archive.

Download

Finnish Folk Poetry

Size: 7.1 million words
Annotation: normalised (added diacritics)
Licence: CC-BY-NC

Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic

This corpus contains poems from 1564 to 1939.

The corpus is available through the concordancer Korp.

Concordancer

Corpus of Early Modern Finnish, Kielipankki Version

Size: 8.6 million words
Annotation: no linguistic annotation
Licence: EUPL v.1.1 SA

Finnish, Russian, German, Latin

This corpus contains texts from 1809 to 1899.

The corpus is available through the concordancer Korp.

Concordancer

Aleksis Kivi Corpus (SKS)

Size: 413,700 words
Annotation: MSD-tagged, syntactically parsed
Licence: CC-BY-NC

Finnish, Swedish

This corpus contains the works by Finnish author Aleksis Kivi from 1855 to 1871.

The corpus is available through the concordancer Korp.

Concordancer

Classics Library of the National Library of Finland - Kielipankki version

Licence: CC-BY

Finnish, Swedish

This corpus will contain literary texts from 1549 to 1944.

 

The Letters of Paul Sinebrychoff, Kielipankki Version

Size: 8.6 million words
Annotation: Finnish subset: MSD-tagged, syntactically parsed; Swedish subset: no linguistic annotation
Licence: CC-BY

Finnish, Swedish

This corpus contains letters from 1895 to 1909.

The corpus is available through a dedicated online search environment.

Concordancer

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 8.7 billion words
Licence: CC-BY

Finnish, Swedish

This corpus contains newspaper articles from 1770 to 2011.

The corpus is available through the concordancer Korp.

Concordancer

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Licence: CC-BY

Finnish, Swedish

This corpus contains newspaper articles from 1771 to 1874.

Download

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920)

Size: 8.7 billion tokens
Annotation: tokenised
Licence: CLARIN ACA

Finnish, Swedish

This corpus contains newspaper articles from 1875 to 1920.

The corpus is available for download from the Language Bank of Finland.

Download

B4 Tatian Corpus of Deviating Examples 2.1

Size: 11,300 tokens
Annotation: tokenised, MSD-tagged
Licence: CC-BY

Latin, German (Old High)

This corpus contains the OHG Tatian, which is one of the largest prose texts from the Old High German period.

The corpus is available for download and through a concordancer from the repository of the University of Hamburg.

Concordancer

Download

Språkbanken's historical corpora

Size: 1.34 billion tokens
Annotation: tokenised, PoS-tagged, lemmatised, syntactically parsed, word sense (for materials more recent than 1800)
Licence: CC-BY

Swedish, German, French and others

This collection of corpora contains – among others – diachronic legal texts, Bible translations, medieval letters, digitized newspapers from the Swedish National Library and 19th century fiction from the Swedish Literature Bank.

The corpora are available through the concordancer Korp.

Concordancer

Other historical corpora

Monolingual corpora

Corpus Language Description Availability

DIAKORP v6

Size: 4 million tokens
Annotation: basic structural markup
Licence: CC-BY-NC-SA

Czech

This corpus contains texts from the 14th to the 20th century.

The corpus is available through a dedicated concordancer.

Concordancer

ARCHER Corpus

 

English

The corpus contains texts from 1600 to 1999.

The corpus is available through the CQPConcordancer.

Concordancer

ECCO-TCP

Size: 74 million tokens
Annotation: no linguistic annotation
Licence: CC-0

English

This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1700 to 1800.

The corpus is available for download from a dedicated webpage and through a dedicated concordancer.

Concordancer

Download

EEBO-TCP

Size: 766 million tokens
Annotation: no linguistic annotation
Licence: CC-0

English

This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1450 to 1750.

The corpus is available through a dedicated concordancer.

Concordancer

EVANS-TCP

Size: 766 million tokens
Annotation: no linguistic annotation
Licence: CC-0

English

This corpus contains American texts from 1640 to 1821.

The corpus is available through a dedicated concordancer.

Concordancer

Historical Corpora at Lancaster University

Annotation: tokenised, PoS-tagged, partial semantic tagging (USAS system)

English

The corpus contains texts in various domains (e.g., fiction, newspaper texts, religious texts) from 1500 on.

The corpus is available through the CQPConcordancer.

Concordancer

Frantext

Size: 300 million words
Annotation: PoS-tagged, lemmatised

French

This corpus contains texts from the 10th to the 21st century.

The corpus is available through a dedicated concordancer (restricted access).

Concordancer

Corpus of Old and Middle Hungarian court records and private correspondence

Size: 850,000 words
Annotation: tokenised, MSD-tagged, lemmatised, sociolinguistic metadata

Hungarian

This corpus contains private letters and testimonies from the 16th to the 18th century.

The corpus is available through a dedicated concordancer.

Concordancer

Old Hungarian Corpus

Size: 3 million tokens
Annotation: tokenised, partially normalised, partially MSD-tagged

Hungarian

This corpus contains texts (codices, letters) from the 12th to the 17th century.

The corpus is available for download from a dedicated webpage and through a dedicated concordancer.

Concordancer

Download

Corpus testuale del Tesoro della Lingua Italiana delle Origini

Size: 23 million tokens
Annotation: tokenised, lemmatised

Italian

This corpus contains early Italian texts before 1375.

The corpus is available through a dedicated concordancer.

Concordancer

DiaCORIS

 

Italian

This corpus contains texts from 1861 to 1945.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rossini Favretti et al. (2011).

Concordancer

M.I.DIA. (Morfologia dell'Italiano in DIAcronia)

Size: 7.5 million tokens
Annotation: tokenised
Licence: CC-BY-NC 4.0

Italian

This corpus contains texts from the 13th to the 20th century.

The corpus is available through a dedicated concordancer.

Concordancer

Corpus of the 19. century Polish (Korpus polszczyzny XIX-wiecznej)

Size: 625,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised, transliteration, transcription

Polish

This corpus contains texts from 1830 to 1918.

The corpus is available for download through a dedicated webpage.

Download

The Electronic Corpus of 17th- and 18th-century Polish Texts (Elektroniczny Korpus Tekstów Polskich z XVII i XVIII w.)

Size: 13.5 million tokens
Annotation: tokenised, partially PoS-tagged, structural annotation

Polish

This corpus contains texts from 1601 to 1772.

The corpus is available through a dedicated concordancer. A manually annotated subset is also available.

Concordancer

IMPACT GT corpus (Korpus GT projektu IMPACT)

Size: 1.5 million tokens
Annotation: transcription

Polish

This corpus contains texts from 1570 to 1756.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Bień (2012).

Concordancer

Corpus Informatizado do Português Medieval

Size: 2 million tokens
Annotation: tokenised, PoS-tagged

Portuguese

This corpus contains texts from the 9th to the 16th century.

The corpus is available through a dedicated concordancer (restricted access).

Concordancer

Parsed Corpus of Historical Portuguese

Size: 3.3 million
Annotation: tokenised, PoS-tagged (2 million), treebanked (1.2 million)

Portuguese

This corpus contains 76 texts written by authors born between 1380 and 1881.

The corpus is available for download and through a dedicated concordancer.

Concordancer

Download

Multilingual corpora

Corpus Language Description Availability

Bundesblatt/Feuille fédérale/Foglio federale

Size: 203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian)
Annotation: tokenised, syntactically-parsed

German, French, Italian

This corpus contains texts from 1849 to 2014.

The corpus is available through the CQPweb concordancer.

Concordancer

Corpus of old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500)

Size: 620,000 tokens
Annotation: tokenised

Polish, Latin

This corpus contains texts until 1500.

The corpus is available for download from a dedicated webpage.

Download

Corpus of the 16. century Polish (Korpus polszczyzny XVI wieku)

Annotation: lemmatised, transliteration

Polish, Latin

This corpus contains texts from the 16th century.

The corpus is available through a dedicated concordancer.

Concordancer

eFontes Mediae et Infimae Latinitatis Polonorum (Elektroniczny korpus polskiej łaciny średniowiecznej)

Size: 5 million tokens
Annotation: tokenised, lemmatised

Polish, Latin

This corpus contains texts from the 11th to the middle of the 16th century.

The corpus is available through a dedicated concordancer.

Concordancer

XV century New Testament translations (Piętnastowieczne przekłady Nowego Testamentu – elektroniczna konkordancja staropolska)

Size: 400,000 tokens
Annotation: tokenised

Polish, Latin

This corpus contains Biblical texts from 1380 to 1500.

This corpus is available through a dedicated concordancer.

Concordancer

Additional materials

  • Presentations on historical newspaper corpora at the CLARIN-PLUS workshop "Working with Digital Collections of Newspapers", 19-21 September 2016, Leuven, Belgium. [html]
  • Videolectures of the CLARIN-PLUS workshop. [html]

List of publications on historical corpora

[Bień 2012] Janusz Bień. 2012. Delivering the IMPACT project Polish Ground-Truth texts with Poliqarp for DjVu.

[Erjavec 2012] Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene.

[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources. 

[Galuščáková and Neužilová 2018] Petra Galuščáková and Lucie Neužilová. 2018. Low Resource Methods for Medieval Document Sections Analysis.

[Haaf and Thomas 2016] Susanne Haaf and Christian Thomas. 2016. The Historical Corpora of the German Text Archive as a basis for research into linguistic history.

[Huber et al. 2016] Magnus Huber, Magnus Nissel, Karin Puga. 2016. The Old Bailey Corpus 2.0, 1720-1913 Manual. 

[Kingisepp et al. 2004] Valve-Liivi Kingisepp, Külli Prillop, Külli Habicht. 2004. Eesti vana kirjakeele korpus: mis tehtud, mis teoksil.

[Klein and Dipper 2016] Thomas Klein and Stefanie Dipper. 2016. Handbuch zum Referenzkorpus Mittelhochdeutsch.

[McGillivray and Kilgarriff 2015] Barbara McGillivray and Adam Kilgarriff. 2015. Tools for historical corpus research, and a corpus of Latin.

[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora. 

[Rossini Favretti et al. 2011] Rema Rossini Favretti, Fabio Tamburini, Andrea Zaninello. 2011. Exploiting corpus evidence for automatic sense induction.

[Rögnvaldsson and Helgadóttir 2011] Eiríkur Rögnvaldsson and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In C. Sporleder, A.P.J. van den Bosch and K.A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.

[Rutten and van der Wal 2014] Gijsbert Rutten and Marijke van der Wal. 2014. Letters as Loot. A sociolinguistic approach to seventeenth- and eighteenth-century Dutch.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Schröder 2014] Ingrid Schröder. 2014. The Reference Corpus: New Perspectives for Middle Low German Grammar.

[Stein 2013] Achim Stein. 2013. Diachronic syntax based on constituency and dependency annotated corpora: theoretical and methodological issues.