Historical Corpora | CLARIN ERIC

The CLARIN infrastructure offers access to 76 historical corpora, covering almost all of the languages spoken in countries that are either members or observers in CLARIN ERIC. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Historical corpora in the CLARIN infrastructure

Monolingual corpora

Corpus	Language	Description	Availability
Open Richly Annotated Cuneiform Corpus, Korp Version Size: 1,600,563 tokens Annotation: tokenised, lemmatised, PoS-tagged, semantically annotated Licence: CC-BY-SA	Akkadian	This corpus contains cuneiform texts from ancient history. The texts come from the Oracc project and include collections such as the Corpus of Ancient Mesopotamian Scholarship, The Digital Corpus of Cuneiform Lexical Texts, and Royal Inscriptions of Babylonia online. The corpus is available through the concordancer Korp and for download from the repository of FIN-CLARIN.	Concordancer Download
The Diorisis Ancient Greek Corpus Size: 10.2 million words Annotation: PoS-tagged, lemmatised Licence: CC BY 4.0	Ancient Greek	This corpus consists of 820 texts spanning from the beginnings of the Ancient Greek literary tradition (Homer) to the fifth century AD. The texts are sourced from the Perseus Canonical Greek Lit Repository, "The Little Sailing" digital library, and the Bibliotheca Augustana digital library. The corpus is available for download from Figshare. For the relevant publication, see Vatri and McGillivray (2018).	Download
Greek Medieval Texts Size: 3.4 million words Licence: CC-BY	Ancient Greek	This corpus contains texts from the 4th to the 16th century. The texts belong to the following categories: religious, poetical-literary, political, and historical texts, as well as hymns and epigrams. The corpus is available for download from the clarin:el repository.	Download
Sheffield Corpus of Chinese Size: 148,876 words Annotation: no annotation Licence: CC-BY-NC-SA 3.0	Chinese	This corpus contains three texts (two non-fictional and one fictional) from the Medieval and Modern Chinese periods. The text "Zhuzi Yulei" is genre-wise similar to sermons and vernacular dialogues, and is representative of Medieval Chinese. The two other texts are the novel "Shuihu Zhuan", which is from the Ming Dynasty (1368–1644), and the novel "Rulin Waishi", which is from the Qing Dynasty (1644–1911). The corpus is available for download from the Oxford Text Archive.	Download
Brieven als buit (Letters as loot) Size: 460,000 words Annotation: lemmatised, PoS-tagged, grammatically tagged Licence: CLARIN PUB	Dutch	This corpus contains 40,000 letters from the 17th to the 19th century. These letters were sent home by sailors and others from abroad but also vice versa by those staying behind who needed to keep in touch with their loved ones. Many letters did not reach their destinations: they were taken as loot by privateers and confiscated by the High Court of Admiralty during the wars fought between the Netherlands and England. The corpus is available through a dedicated concordancer. For the relevant publication, see Rutten and van der Wal (2014).	Concordancer
Corpus Gysseling Size: 1.5 million words Annotation: PoS-tagged, lemmatised Licence: INT Licence for researchers	Dutch	This corpus contains texts from the 13th century. The texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist Maurits Gysseling. The corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.	Concordancer Download
A Corpus of English Dialogues 1560-1760 (CED) Size: 1.2 million words Annotation: no annotation Licence: Oxford Text Archive licence	English	This corpus contains dialogues from literary and didactic works from 1560 to 1760. There are five text-types in the CED. The text-types representative of constructed dialogue are drama comedy, didactic works (language manuals and other handbooks) and fiction; the text-types representative of authentic dialogue are trial proceedings and witness depositions. In addition, a small group of miscellaneous dialogic texts is included in the collection. The corpus is available for download from the Oxford Text Archive.	Download
Corpus of Early English Correspondence Sampler (CEECS) Size: 450,000 words Annotation: no annotation Licence: Oxford Text Archive licence	English	This corpus contains 1147 letters from 1418 to 1680. The corpus was created from the larger Corpus of Early English Correspondence. The corpus is available for download from the Oxford Text Archive.	Download
Corpus of Late Modern English prose / David Denison Size: 580,056 words Annotation: no annotation Licence: Oxford Text Archive licence	English	This corpus contains fictional texts from 1837 to 1926. The corpus is available for download from the Oxford Text Archive.	Download
Hansard Corpus Size: 1.6 billion tokens Annotation: tokenised, PoS-tagged, lemmatised, semantic tags	English	This corpus contains parliamentary debates from 1803 to 2005. The corpus is available through a dedicated concordancer. For the relevant publication, see Rayson et al. (2015).	Concordancer
Helsinki Corpus of Scottish Correspondence (1540-1750) Size: 500,000 tokens Annotation: tokenised Licence: CLARIN ACA	English	This corpus contains personal correspondence from 1540 to 1750. The corpus consists of transcripts of original letter manuscripts. The texts are reproduced without any modernisation or normalisation. Language-external variables such as date, region, gender, addressee, hand and script type have been coded. The writers originate from fifteen different regions of Scotland. A fifth of the correspondents in the corpus are women. The corpus is available through the concordancer Korp.	Concordancer
Older Scottish texts: the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith Size: 877,000 tokens Annotation: tokenised Licence: CC-BY-NC-SA 3.0	English	This corpus contains texts from 1450 to 1600. The corpus is available for download from the Oxford Text Archive.	Download
Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn Size: 431,013 words Licence: CC-BY-NC-SA 3.0	English	This corpus contains pamphlets of the American Revolution from 1750 to 1776. The corpus is available for download from the Oxford Text Archive.	Download
Parsed Corpus of Early English Correspondence (PCEEC) Size: 2.2 million words Annotation: tokenised, PoS-tagged, syntactically parsed Licence: Oxford Text Archive licence	English	This corpus contains correspondence from around 1410 to 1681. There are 4970 personal letters by 666 writers. The letters have been selected to be as socially representative of the literate social ranks of the time as possible. This corpus is available for download from the Oxford Text Archive.	Download
Royal Society Corpus (Version 4.0) Size: 35 million tokens Annotation: PoS-tagged using PennTreebank tagset, lemmatised, normalised Licence: CC-BY-NC-SA-4.0	English	This corpus contains articles from the Philosophical Transactions of the Royal Society of London journal from 1665 to 1869. The corpus is available for download from the CLARIN-D repository as well as through a concordancer.	Concordancer Download
The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose Size: 300,000 words Annotation: COCOA-style Licence: Oxford Text Archive licence	English	This corpus contains texts from 1761 to 1790. The corpus is available for download from the Oxford Text Archive.	Download
The Lampeter Corpus of Early Modern English Tracts Size: 50,797,916 words Annotation: no linguistic annotation Licence: CC-BY-NC-SA 3.0	English	This corpus contains tracts from 1640 to 1740. The corpus is available for download from the Oxford Text Archive.	Download
The Lancaster Newsbooks Corpus Size: 3,001,604 words Licence: CC-BY-NC-SA 3.0	English	This corpus contains two collections of English printed pamphlets, books, and newspapers from 1654 to 1655. The corpus is available for download from the Oxford Text Archive.	Download
Corpus of Historical American English - Kielipankki Korp version 2017H1 Size: 385 million tokens Annotation: tokenised Licence: CLARIN ACA	English (American)	This corpus contains texts from 1810 to 2009. Each decade has roughly the same balance of fiction, popular magazine, newspaper, and non-fiction books. The corpus is available through the concordancer Korp.	Concordancer
The Corpus of Late Modern English Texts, version 3.1 Size: 34 million words Annotation: PoS-tagged Licence: CC-BY-NC-SA 4.0	English (Late Modern)	This corpus contains texts written by British and Irish authors from 1710 to 1920. In terms of genre, the texts correspond to narrative fiction and non-fiction, drama, letters, treatises, and miscellaneous written works. The corpus is available for download from a CLARIN-D repository.	Download
The Old Bailey Corpus Size: 134 million words Annotation: detailed sociobiographical, pragmatic and textual annotation Licence: CC-BY-NC-SA 4.0	English (Late Modern)	This corpus contains proceedings of the Old Bailey (i.e., legal documents) from 1674 to 1913. The corpus is available for download from the CLARIN-D repository and through the CQPConcordancer. For the corpus manual, see Huber et al. (2016).	Concordancer Download
Helsinki corpus of English texts Size: 240,000 words Licence: Oxford Text Archive licence	English (Old and Middle)	This corpus contains religious and fictional texts from 730 to 1710. See the project page for a list of all the texts included in the corpus. The corpus is available for download from the Oxford Text Archive.	Download
The York-Helsinki parsed corpus of Old English poetry (YCOEP) Size: 71,500 words Annotation: syntactically parsed Licence: Oxford Text Archive licence	English (Old)	This corpus contains poems from 730 to 1710. The corpus contains a selection of poems taken from the Old English subpart of the Helsinki Corpus of English Texts. The corpus is available for download from the Oxford Text Archive.	Download
Corpus of Old Written Estonian Size: 2 million tokens Annotation: tokenised; 16th- to 18th-century texts have been tagged with contemporary Estonian, morphological and language information; 19th-century texts are unannotated. Licence: CC-BY	Estonian	This corpus covers secular and religious texts from the 16th to the 18th century. The corpus is available through a dedicated concordancer. For the relevant publication, see Kingisepp et al. (2004).	Concordancer
Classics of Finnish Literature, Kielipankki Version Size: 1.5 million words Licence: EUPL v.1.1 SA	Finnish	This corpus contains literary texts from 1880 to 1949. In terms of genre, the texts correspond to prose fiction, plays, poetry and aphorisms. The corpus is available through the concordancer Korp (FIN-CLARIN).	Concordancer
Corpus of Old Literary Finnish Size: 4.1 million words Annotation: MSD-tagged, syntactically parsed Licence: EUPL v.1.1 SA	Finnish	This corpus contains both literary and non-literary texts from 1543 to 1810. In terms of genre, the texts correspond to bible translations and religious texts (for instance, all of the clergyman Mikael Agricola's Finnish works), legal texts, poems, and texts concerning agriculture, nature, health, and so on. The corpus is available through the concordancer Korp.	Concordancer
The Finnish Gutenberg Corpus Size: 34.5 million words Licence: CC-BY	Finnish	This corpus contains books published up to 1925 that are made available through the Gutenberg project. The corpus is available through the concordancer Korp.	Concordancer
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 5.2 billion tokens Annotation: tokenised Licence: CC-BY-SA	Finnish	This corpus contains newspaper articles from 1840 to 2011. For a comprehensive list of newspapers included in the corpus, see here. The corpus is available through the concordancer Korp.	Concordancer
The Morpho-Syntactic Database of Mikael Agricola's Works Size: 428,300 tokens Annotation: tokenised, PoS-tagged, morphological components and syntactic function Licence: CC-BY-ND	Finnish	This corpus contains texts from 1544 to 1551 written by the clergyman Mikael Agricola. The corpus is available through the concordancer Korp.	Concordancer
Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version Size: 48 texts Licence: CC-BY-NC-ND	Finnish	This corpus contains literary texts from 1543 to 1791. This corpus complements the Corpus of Old Literary Finnish available through FIN-CLARIN.
Partonopeus de Blois: transcriptions of all manuscripts and fragments Size: 21,736,766 words Annotation: no linguistic annotation Licence: CC BY-NC-SA 3.0	French (Old)	This corpus contains transcriptions of the manuscripts and fragments of the romance Partonopeus de Blois. The corpus is available for download from the Oxford Text Archive.	Download
Syntactic Reference Corpus of Medieval French Size: 245,000 tokens Annotation: tokenised, syntactically parsed Licence: CLARIN ACA	French (Old)	This corpus contains texts from the 9th to the 13th century. The syntactic categories of the SRCMF annotation and the grammatical principles of the annotation are explained in detail in the documentation. The corpus is available for download from a dedicated webpage. For the relevant publication, see Stein (2013).	Download
Austrian Baroque Corpus Size: 200,000 tokens Annotation: tokenised, PoS-tagged, lemmatised, named entities	German	This corpus contains sermons from 1650 to 1750. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016).	Concordancer
DDR-Presseportal (GDR press portal)	German	This corpus contains newspaper texts from 1945 to 1994. The corpus is available through a concordancer provided by CLARIN-D.	Concordancer
Deutsches Textarchiv (DTA) Size: 215,168,761 tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CLARIN PUB	German	This corpus contains texts from the 17th to the 20th century. The corpus is available through a dedicated concordancer. For the relevant publication, see Haaf and Thomas (2016).	Concordancer
The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700) Size: 120,000 tokens Annotation: Lite markup, no linguistic annotation Licence: CC-BY-NC-SA 3.0	German	This corpus contains medical writing from 1500 to 1700. The texts are taken primarily from digital facsimile copies available online via the University of Würzburg’s library interface, particularly from the subcategory pertaining to gynaecology. The corpus is available for download from the Oxford Text Archive.	Download
GerManC. A Historical Corpus of German Newspapers 1650-1800 Size: 700,000 words Annotation: no annotation Licence: CC-BY-NC-SA 3.0	German	This corpus contains personal letters, sermons and fictional, scholarly (i.e., humanities), scientific and legal texts from 1650 to 1800. The corpus is available for download from the Oxford Text Archive.	Download
Mannheimer Korpus Historischer Zeitungen und Zeitschriften Size: 3532 pages	German	This corpus contains texts from the 18th and 19th centuries. The corpus is available for download directly through the VLO.	Download
Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus) Size: 2.5 million tokens Annotation: tokenised, PoS-tagged, lemmatised, normalised, morphosyntactic description Licence: CC-BY-SA 4.0	German	This corpus contains texts from 1050 to 1350. The corpus is available for download from the Deutsches Textarchiv and through a concordancer. For the relevant publication, see Klein and Dipper (2016).	Concordancer Download
B4 Historisches Predigtenkorpus zum Nachfeld Size: 92,500 tokens Annotation: tokenised, syntactic and discursive annotation Licence: CLARIN ACA	German (Middle High)	This corpus contains sermons from an Upper German (Bavarian-Alemannic) dialect area. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.	Concordancer Download
B4 Ludolf Size: 6,690 tokens Annotation: tokenised, tagged for clause type and grammatical function Licence: CLARIN ACA	German (Middle High)	This corpus contains texts from a journey diary from 1350. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.	Concordancer Download
Reference Corpus Middle Low German/Low Rhenish (1200-1650) Size: 200,700 tokens Annotation: tokenised, MSD-tagged Licence: CC-BY	German (Middle Low)	This corpus contains texts from the 13th century to the middle of the 17th century. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment. For the relevant publication, see Schröder (2014).	Concordancer Download
SaCoCo—Saarbrücken Cookbook Corpus Size: 436,000 tokens Annotation: PoS-tagged using the STTS tagset, lemmatised, normalised Licence: CC-BY-NC-SA-3.0	German	This corpus contains historical cookbook recipes from 1569 to 1800, as well as contemporary ones from 2012. The corpus is available through the CQPweb concordancer provided by CLARIN-D.	Concordancer
OROSSIMO Corpus – History Size: 553,000 tokens Annotation: structural annotation (paragraph) Licence: CC-BY	Greek	This corpus contains historic academic texts. The corpus is available for download from the clarin:el repository.	Download
Hungarian Historical Corpus Size: 30 million words	Hungarian	This corpus contains historical texts from the 18th century to the 2000s. The corpus is available through a dedicated concordancer.	Concordancer
The Saga Corpus Size: 1.5 million tokens Annotation: tokenised, PoS-tagged, lemmatised, normalised orthography Licence: CC-BY 4.0	Icelandic (Old)	This corpus contains Old Icelandic (Old Norse) narrative texts from the 13th to the 15th century. The corpus is available for download from CLARIN-IS and for search through the concordancer Korp. For the relevant publication, see Rögnvaldsson and Helgadóttir (2011).	Concordancer Download
ChroniclItaly Size: 16.6 million words Annotation: unannotated Licence: ODC Attribution License (ODC-By)	Italian	This corpus contains Italian language newspapers published in the United States between 1898 and 1920. The corpus includes seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia. The collection includes the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia. The corpus is available for download from the repository of the University of Utrecht.	Download
LatinISE corpus (version 4) Size: 13.3 million tokens Annotation: sentence-segmented, PoS-tagged, lemmatised Licence: CC BY-NC-SA 4.0	Latin	This corpus consists of Latin texts from the 2nd century B.C. to the 21st century. Non-linguistic metadata include information on genre, title, century and specific date. The corpus is available for download from LINDAT and for search online through Sketch Engine. For the relevant publication, see McGillivray and Kilgarriff (2015).	Concordancer Download
Menota Size: 1.6 million tokens Annotation: tokenised, MSD-tagged, lemmatised Licence: CC-BY	Old Norse	This corpus contains Medieval Nordic texts. The corpus is available for download and through the concordancer Corpuscle.	Concordancer Download
Chronopress Size: 16 million tokens Licence: CC-BY-SA	Polish	This corpus contains newspaper articles from 1945 to 1954. The corpus is available through a dedicated concordancer.	Concordancer
Polish language of the 1960s Size: 500,000 words Annotation: MSD-tagged Licence: CC-BY-NC-SA 3.0	Polish	This corpus contains essays, news articles, and scientific and literary texts from 1963 to 1967. The corpus is available for download from the Oxford Text Archive.	Download
Corpus of biblical text in Scots / John Kirk Size: 35,506 words Annotation: no annotation Licence: Oxford Text Archive licence	Scots	This corpus contains Biblical texts. The corpus is available for download from the Oxford Text Archive.	Download
The Helsinki corpus of Older Scots : [1450-1700] Size: 1,940,706 words Annotation: no annotation Licence: CC-BY-NC-SA 3.0	Scots	This corpus contains texts of different domains and genres (e.g., burgh records, diaries, pamphlets, scientific treatises, sermons) from 1450 to 1700. The corpus is available for download from the Oxford Text Archive.	Download
Digital library and corpus of historical Slovene IMP 1.1 Size: 17.7 million tokens Annotation: tokenised, lemmatised, PoS-tagged Licence: CC-BY-SA 4.0	Slovenian	This corpus contains 658 unique texts from 1584 to 1919. The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText. For the relevant publication, see Erjavec (2015).	Concordancer Download
Reference corpus of historical Slovene goo300k 1.2 Size: 300,000 tokens Annotation: manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words Licence: CC-BY 4.0	Slovenian	This corpus contains 89 unique texts from 1584 to 1899. The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText. For the relevant publication, see Erjavec (2012).	Concordancer Download
The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 3.5 billion tokens Annotation: tokenised Licence: CC-BY-SA	Swedish	This corpus contains newspaper articles from 1770 to 1950. The corpus is available through the concordancer Korp.	Concordancer
Historical Corpus of the Welsh Language 1500-1850 Size: 420,000 words	Welsh	This corpus contains 30 texts from 1500 to 1850. The corpus is available for download from a dedicated website and through a dedicated concordancer.
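
Each row of the table above is tab-separated, with the size, annotation and licence details fused into the first column after the corpus name. For readers who want to filter the list programmatically, the following is a minimal sketch of how one row could be split into a structured record. It is not part of the CLARIN infrastructure: the names `CorpusEntry` and `parse_row` are hypothetical, and the sketch assumes the labels "Size:", "Annotation:" and "Licence:" always appear exactly as in the table.

```python
import re
from typing import List, NamedTuple, Optional


class CorpusEntry(NamedTuple):
    """One table row (hypothetical record type, introduced for illustration)."""
    name: str
    size: Optional[str]
    annotation: Optional[str]
    licence: Optional[str]
    language: str
    description: str
    availability: List[str]


# Labels fused into the first column; the colon is required, so the word
# "Licence" inside a licence name ("INT Licence for researchers") is not split.
_FIELD = re.compile(r"\s+(Size|Annotation|Licence):\s*")


def parse_row(row: str) -> CorpusEntry:
    """Split one tab-separated table row into a CorpusEntry."""
    cells = row.rstrip("\n").split("\t")
    # Some rows have no Availability cell; pad so unpacking is always safe.
    cells += [""] * (4 - len(cells))
    head, language, description, availability = cells[:4]

    # re.split with a capture group yields [name, label, value, label, value, ...]
    parts = _FIELD.split(head)
    fields = dict(zip(parts[1::2], (value.strip() for value in parts[2::2])))

    return CorpusEntry(
        name=parts[0].strip(),
        size=fields.get("Size"),
        annotation=fields.get("Annotation"),
        licence=fields.get("Licence"),
        language=language.strip(),
        description=description.strip(),
        availability=availability.split(),
    )


if __name__ == "__main__":
    example = (
        "Corpus Gysseling Size: 1.5 million words "
        "Annotation: PoS-tagged, lemmatised "
        "Licence: INT Licence for researchers"
        "\tDutch\tThis corpus contains texts from the 13th century."
        "\tConcordancer Download"
    )
    entry = parse_row(example)
    print(entry.name, "|", entry.language, "|", entry.licence)
```

Rows that omit a field (for example, corpora listed without a licence) simply yield `None` for that attribute, so a filter such as "all Dutch corpora with a download link" can be written as a plain list comprehension over the parsed entries.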

Corpus

Language

Description

Availability

Open Richly Annotated Cuneiform Corpus, Korp Version

Size: 1,600,563 tokens
Annotation: tokenised, lemmatised, PoS-tagged, semantically annotated
Licence: CC-BY-SA

Akkadian

This corpus contains cuneiform texts from Ancient history.

The texts come from the Oracc project and include collections such as the Corpus of Ancient Mesopotamian Scholarship, The Digital Corpus of Cuneiform Lexical Texts, and Royal Inscriptions of Babylonia online.

The corpus is available through the concordancer Korp and for download from the repository of FIN-CLARIN.

Concordancer

Download

The Diorisis Ancient Greek Corpus

Size: 10.2 million words
Annotation: PoS-tagged, lemmatised
Licence: CC BY 4.0

Ancient Greek

This corpus consists of 820 texts spanning between the beginnings of the Ancient Greek literary tradition (Homer) to the fifth century AD.

The texts are sourced from the Perseus Canonical Greek Lit Repository, "The Little Sailing" digital library, and the Bibliotheca Augustana digital library.

The corpus is available for download from Figshare.

For the relevant publication, see Vatri and McGillivray (2018)

Download

Greek Medieval Texts

Size: 3.4 million words
Licence: CC-BY

Ancient Greek

This corpus contains texts from the 4th to the 16th century.

The texts belong to the following categories: religious, poetical-literary, political, and historical texts, as well as hymns and epigrams.

The corpus is available for download from the clarin:el repository.

Download

Sheffield Corpus of Chinese

Size: 148,876 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

Chinese

This corpus contains three texts (two non-fictional and one fictional) from the Medieval and Modern Chinese periods.

The text "Zhuzi Yulei is genre-wise similar to sermons and vernacular dialogues, and is representative of Medieval Chinese. The two other texts are the novel "Shuihu Zhuan", which is from the Ming Dynasty (1368–1644), and the novel "Rulin Waishi", which is from the Quing Dynasty (1644–1911).

The corpus is available for download from the Oxford Text Archive.

Download

Brieven als buit (Letters as loot)

Size: 460,000 words
Annotation: lemmatised, PoS-tagged, grammatically tagged
Licence: CLARIN PUB

Dutch

This corpus contains 40,000 letters from the 17th to the 19th century.

These letters were sent home by sailors and others from abroad but also vice versa by those staying behind who needed to keep in touch with their loved ones. Many letters did not reach their destinations: they were taken as loot by privateers and confiscated by the High Court of Admiralty during the wars fought between The Netherlands and England

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rutten and van der Wal (2014).

Concordancer

Corpus Gysseling

Size: 1.5 million words
Annotation: PoS-tagged, lemmatised
Licence: INT Licence for researchers

Dutch

This corpus contains texts from the 13th century.

The texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist Maurits Gysseling.

The corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.

Concordancer

Download

A Corpus of English Dialogues 1560-1760 (CED)

Size: 1.2 million words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains dialogues from literary and didactic works from 1560 to 1760.

There are five text-types in the CED. The text-types representative of constructed dialogue are drama comedy, didactic works (language manuals and other handbooks) and fiction; the text-types representative of authentic dialogue are trial proceedings and witness depositions. In addition, a small group of miscellaneous dialogic texts is included in the collection.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Early English Correspondence Sampler (CEECS)

Size: 450,000 words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains 1147 letters from 1418 to 1680.

The corpus was created from the larger Corpus of Early English Correspondence.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Late Modern English prose / David Denison

Size: 580,056 words
Annotation: no annotation
Licence: Oxford Text Archive licence

English

This corpus contains fictional texts from 1837 to 1926.

The corpus is available for download from the Oxford Text Archive.

Download

Hansard Corpus

Size: 1.6 billion tokens
Annotation: tokenised, PoS-tagged, lemmatised, semantic tags

English

This corpus contains parliamentary debates from 1803 to 2005.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rayson et al. (2015).

Concordancer

Helsinki Corpus of Scottish Correspondence (1540-1750)

Size: 500,000 tokens
Annotation: tokenised
Licence: CLARIN ACA

English

This corpus contains personal correspondence from 1540 to 1750.

the corpus consists of transcripts of original letter manuscripts. The texts are reproduced without any modernisation or normalisation. Language-external variables such as date, region, gender, addressee, hand and script type have been coded.

The writers originate from fifteen different regions of Scotland. A fifth of the correspondents in the corpus are women.

The corpus is available through the concordancer Korp.

Concordancer

Older Scottish texts: the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith

Size: 877,000 tokens
Annotation: tokenised
Licence: CC-BY-NC-SA 3.0

English

This corpus contains texts from 1450 to 1600.

The corpus is available for download from the Oxford Text Archive.

Download

Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn

Size: 431,013 words
Licence: CC-BY-NC-SA 3.0

English

This corpus contains pamphlets of the American Revolution from 1750 to 1776.

The corpus is available for download from the Oxford Text Archive.

Download

Parsed Corpus of Early English Correspondence (PCEEC)

Size: 2.2 million words
Annotation: tokenised, PoS-tagged, syntactically parsed
Licence: Oxford Text Archive licence

English

This corpus contains correspondence from around 1410 to 1681.

There are 4970 personal letters by 666 writers. The letters have been selected to be as socially representative of the literate social ranks of the time as possible.

This corpus is available for download from the Oxford Text Archive.

Download

Royal Society Corpus (Version 4.0)

Size: 35 million tokens
Annotation: PoS-tagged using PennTreebank tagset, lemmatised, normalised
Licence: CC-BY-NC-SA-4.0

English

This corpus contains articles from the Philosophical Transactions of the Royal Society of London journal from 1665 to 1869.

The corpus is available for download from the CLARIN-D repository as well as through a concordancer.

Concordancer

Download

The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose

Size: 300,000 words
Annotation: COCOA-style
Licence: Oxford Text Archive licence

English

This corpus contains texts from 1761 to 1790.

The corpus is available for download from the Oxford Text Archive.

Download

The Lampeter Corpus of Early Modern English Tracts

Size: 50,797,916 words
Annotation: no linguistic annotation
Licence: CC-BY-NC-SA 3.0

English

This corpus contains tracts from 1640 to 1740.

The corpus is available for download from the Oxford Text Archive.

Download

The Lancaster Newsbooks Corpus

Size: 3,001,604 words
Licence: CC-BY-NC-SA 3.0

English

This corpus contains two collections of English printed pamphlets, books, and newspapers from 1654 to 1655.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Historical American English - Kielipankki Korp version 2017H1

Size: 385 million tokens
Annotation: tokenised
Licence: CLARN ACA

English (American)

This corpus contains texts from 1810 to 2009.

Each decade has roughly the same balance of fiction, popular magazine, newspaper, and non-fiction books.

The corpus is available through the concordancer Korp.

Concordancer

The Corpus of Late Modern English Texts, version 3.1

Size: 34 million words
Annotation: PoS-tagged
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains texts written by British and Irish authors from 1710 to 1920.

In terms of genre, the texts correspond to narrative fiction and non-fiction, drama, letters, treatises, and miscellaneous written works.

The corpus is available for download from a CLARIN-D repository.

Download

The Old Bailey Corpus

Size: 134 million words
Annotation: detailed sociobiographical, pragmatic and textual annotation
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains proceedings of the Old Bailey (i.e., legal documents) from 1674 to 1913.

The corpus is available for download from the CLARIN-D repository and through the CQPConcordancer.

For the corpus manual, see Huber et al. (2016).

Concordancer

Download

Helsinki corpus of English texts

Size: 240,000 words
Licence: Oxford Text Archive licence

English (Old and Middle)

This corpus contains religious and fictional texts from 730 to 1710.

See the project page for a list of all the texts included in the corpus.

The corpus is available for download from the Oxford Text Archive.

Download

The York-Helsinki parsed corpus of Old English poetry (YCOEP)

Size: 71,500 words
Annotation: syntactically-parsed
Licence: Oxford Text Archive licence

English (Old)

This corpus contains poems from 730 to 1710.

The corpus contains a selection of poems taken from the Old English subpart of the Helsinki Corpus of English Texts.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of Old Written Estonian

Size: 2 million tokens
Annotation: tokenised; 16th- to 18th-century texts have been tagged with contemporary Estonian, morphological and language information; 19th-century texts are unannotated
Licence: CC-BY

Estonian

This corpus covers secular and religious texts from the 16th to the 18th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Kingisepp et al. (2004).

Concordancer

Classics of Finnish Literature, Kielipankki Version

Size: 1.5 million words
Licence: EUPL v.1.1 SA

Finnish

This corpus contains literary texts from 1880 to 1949.

In terms of genre, the texts correspond to prose fiction, plays, poetry and aphorisms.

The corpus is available through the concordancer Korp (FIN-CLARIN).

Concordancer

Corpus of Old Literary Finnish

Size: 4.1 million words
Annotation: MSD-tagged, syntactically parsed
Licence: EUPL v.1.1 SA

Finnish

This corpus contains both literary and non-literary texts from 1543 to 1810.

In terms of genre, the texts correspond to Bible translations and religious texts (for instance, all of the clergyman Mikael Agricola's Finnish works), legal texts, poems, and texts concerning agriculture, nature, health, and so on.

The corpus is available through the concordancer Korp.

Concordancer

The Finnish Gutenberg Corpus

Size: 34.5 million words
Licence: CC-BY

Finnish

This corpus contains books published up to 1925 that are made available through Project Gutenberg.

The corpus is available through the concordancer Korp.

Concordancer

The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 5.2 billion tokens
Annotation: tokenised
Licence: CC-BY-SA

Finnish

This corpus contains newspaper articles from 1840 to 2011.

For a comprehensive list of newspapers included in the corpus, see here.

The corpus is available through the concordancer Korp.

Concordancer

The Morpho-Syntactic Database of Mikael Agricola's Works

Size: 428,300 tokens
Annotation: tokenised, PoS-tagged, morphological components and syntactic function
Licence: CC-BY-ND

Finnish

This corpus contains texts from 1544 to 1551 written by the clergyman Mikael Agricola.

The corpus is available through the concordancer Korp.

Concordancer

Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version

Size: 48 texts
Licence: CC-BY-NC-ND

Finnish

This corpus contains literary texts from 1543 to 1791.

This corpus complements the Corpus of Old Literary Finnish available through FIN-CLARIN.

Partonopeus de Blois: transcriptions of all manuscripts and fragments

Size: 21,736,766 words
Annotation: no linguistic annotation
Licence: CC BY-NC-SA 3.0

French (Old)

This corpus contains transcriptions of the manuscripts and fragments of the romance Partonopeus de Blois.

The corpus is available for download from the Oxford Text Archive.

Download

Syntactic Reference Corpus of Medieval French

Size: 245,000 tokens
Annotation: tokenised, syntactically-parsed
Licence: CLARIN ACA

French (Old)

This corpus contains texts from the 9th to the 13th century.

The syntactic categories of the SRCMF annotation and the grammatical principles of the annotation are explained in detail in the documentation.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Stein (2013).

Download

Austrian Baroque Corpus

Size: 200,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised, named entities

German

This corpus contains sermons from 1650 to 1750.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016).

Concordancer

DDR-Presseportal (GDR press portal)

German

This corpus contains newspaper texts from 1945 to 1994.

The corpus is available through a concordancer provided by CLARIN-D.

Concordancer

Deutsches Textarchiv (DTA)

Size: 215,168,761 tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CLARIN PUB

German

This corpus contains texts from the 17th to the 20th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Haaf and Thomas (2016).

Concordancer

The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700)

Size: 120,000 tokens
Annotation: Lite markup, no linguistic annotation
Licence: CC-BY-NC-SA 3.0

German

This corpus contains medical writing from 1500 to 1700.

The texts are taken primarily from digital facsimile copies available online via the University of Würzburg’s library interface, particularly from the subcategory pertaining to gynaecology.

The corpus is available for download from the Oxford Text Archive.

Download

GerManC. A Historical Corpus of German Newspapers 1650-1800

Size: 700,000 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

German

This corpus contains personal letters, sermons and fictional, scholarly (i.e., humanities), scientific and legal texts from 1650 to 1800.

The corpus is available for download from the Oxford Text Archive.

Download

Mannheimer Korpus Historischer Zeitungen und Zeitschriften

Size: 3532 pages

German

This corpus contains texts from the 18th and 19th centuries.

The corpus is available for download directly through the VLO.

Download

Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus)

Size: 2.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised, normalised, morphosyntactic description
Licence: CC-BY-SA 4.0

German

This corpus contains texts from 1050 to 1350.

The corpus is available for download from the Deutsches Text Archiv and through a concordancer.

For the relevant publication, see Klein and Dipper (2016).

Concordancer

Download

B4 Historisches Predigtenkorpus zum Nachfeld

Size: 92,500 tokens
Annotation: tokenised, syntactic and discursive annotation
Licence: CLARIN ACA

German (Middle High)

This corpus contains sermons from an Upper German (Bavarian-Alemannic) dialect area.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

Concordancer

Download

B4 Ludolf

Size: 6,690 tokens
Annotation: tokenised, tagged for clause type and grammatical function
Licence: CLARIN ACA

German (Middle High)

This corpus contains texts from a travel diary from 1350.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

Concordancer

Download

Reference Corpus Middle Low German/Low Rhenish (1200-1650)

Size: 200,700 tokens
Annotation: tokenised, MSD-tagged
Licence: CC-BY

German (Middle Low)

This corpus contains texts from the 13th century to the middle of the 17th century.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

For the relevant publication, see Schröder (2014).

Concordancer

Download

SaCoCo—Saarbrücken Cookbook Corpus

Size: 436,000 tokens
Annotation: PoS-tagged using the STTS tagset, lemmatised, normalised
Licence: CC-BY-NC-SA-3.0

German

This corpus contains historical cookbook recipes from 1569 to 1800, as well as contemporary ones from 2012.

The corpus is available through the CQPweb concordancer provided by CLARIN-D.

Concordancer

OROSSIMO Corpus – History

Size: 553,000 tokens
Annotation: structural annotation (paragraph)
Licence: CC-BY

Greek

This corpus contains historic academic texts.

The corpus is available for download from the clarin:el repository.

Download

Hungarian Historical Corpus

Size: 30 million words

Hungarian

This corpus contains historical texts from the 18th century to the 2000s.

The corpus is available through a dedicated concordancer.

For the relevant publication, see lemma=

Concordancer

The Saga Corpus

Size: 1.5 million tokens
Annotation: tokenised, PoS-tagged, lemmatised, normalized orthography
Licence: CC-BY 4.0

Icelandic (Old)

This corpus contains Old Icelandic (Old Norse) narrative texts from the 13th to the 15th century.

The corpus is available for download from CLARIN-IS and for search through the concordancer Korp.

For the relevant publication, see Rögnvaldsson and Helgadóttir (2011).

Concordancer

Download

ChroniclItaly

Size: 16.6 million words
Annotation: unannotated
Licence: ODC Attribution License (ODC-By)

Italian

This corpus contains Italian language newspapers published in the United States between 1898 and 1920. The corpus includes seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia. The collection includes the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia.

The corpus is available for download from the repository of the University of Utrecht.

Download

LatinISE corpus (version 4)

Size: 13.3 million tokens
Annotation: sentence segmented, PoS-tagged, lemmatized
Licence: CC BY-NC-SA 4.0

Latin

This corpus consists of Latin texts from the 2nd century B.C. to the 21st century. Non-linguistic metadata include information on genre, title, century and specific date.

The corpus is available for download from LINDAT and for search online through Sketch Engine.

For the relevant publication, see McGillivray and Kilgarriff (2015).

Concordancer

Download

Menota

Size: 1.6 million tokens
Annotation: tokenised, MSD-tagged, lemmatised
Licence: CC-BY

Old Norse

This corpus contains Medieval Nordic texts.

The corpus is available for download and through the concordancer Corpuscle.

Concordancer

Download

Chronopress

Size: 16 million tokens
Licence: CC-BY-SA

Polish

This corpus contains newspaper articles from 1945 to 1954.

The corpus is available through a dedicated concordancer.

Concordancer

Polish language of the 1960s

Size: 500,000 words
Annotation: MSD-tagged
Licence: CC-BY-NC-SA 3.0

Polish

This corpus contains essays, news articles, and scientific and literary texts from 1963 to 1967.

The corpus is available for download from the Oxford Text Archive.

Download

Corpus of biblical text in Scots / John Kirk

Size: 35,506 words
Annotation: no annotation
Licence: Oxford Text Archive licence

Scots

This corpus contains Biblical texts.

The corpus is available for download from the Oxford Text Archive.

Download

The Helsinki corpus of Older Scots : [1450-1700]

Size: 1,940,706 words
Annotation: no annotation
Licence: CC-BY-NC-SA 3.0

Scots

This corpus contains texts of different domains and genres (e.g., burgh records, diaries, pamphlets, scientific treatises, sermons) from 1450 to 1700.

The corpus is available for download from the Oxford Text Archive.

Download

Digital library and corpus of historical Slovene IMP 1.1

Size: 17.7 million tokens
Annotation: tokenised, lemmatised, PoS-tagged
Licence: CC-BY-SA 4.0

Slovenian

This corpus contains 658 unique texts from 1584 to 1919.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2015).

Concordancer

Download

Reference corpus of historical Slovene goo300k 1.2

Size: 300,000 tokens
Annotation: manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words
Licence: CC-BY 4.0

Slovenian

This corpus contains 89 unique texts from 1584 to 1899.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2012).

Concordancer

Download

The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 3.5 billion tokens
Annotation: tokenised
Licence: CC-BY-SA

Swedish

This corpus contains newspaper articles from 1770 to 1950.

The corpus is available through the concordancer Korp.

Concordancer

Historical Corpus of the Welsh Language 1500-1850

Size: 420,000 words

Welsh

This corpus contains 30 texts from 1500 to 1850.

The corpus is available for download from a dedicated website and through a dedicated concordancer.

Multilingual corpora

Corpus	Language	Description	Availability
"PolDiLemma" Middle Polish Diachrone Lemmatised Corpus Size: 7 million tokens Annotation: tokenised, lemmatised Licence: CC BY-NC-SA 4.0	Czech, German, Latin, Polish	This corpus contains political, religious and scientific texts from the 16th to the 18th century. The corpus is available for download from the CLARIN-D repository.	Download
Medieval Charter Sections Corpus Size: 57 chapters Annotation: manually-tagged, named entities Licence: CC-BY-NC-SA 4.0	Czech, Latin	This corpus contains Latin charters created in the era of John the Blind, King of Bohemia. The corpus is available for download from LINDAT. For the relevant publication, see Galuščáková and Neužilová (2018).	Download
Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo Size: 4000 words Annotation: no linguistic annotation Licence: Oxford Text Archive licence	English (Middle), Hebrew	This corpus contains literary texts from 1100 to 1400. The corpus is available for download from the Oxford Text Archive.	Download
Dictionary of Old English Corpus in Electronic Form (DOEC) Annotation: no linguistic annotation Licence: Oxford Text Archive licence	English (Old), Latin	This corpus contains 3037 texts from 600 to 1150. The corpus is available for download from the Oxford Text Archive.	Download
The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE) Size: 1.5 million words Annotation: syntactically-parsed Licence: Oxford Text Archive licence	English (Old), Latin	This corpus contains fictional texts from 600 to 1150. The corpus is available for download from the Oxford Text Archive.	Download
Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA) Size: 128,000 words Annotation: MSD-tagged, syntactically parsed Licence: CLARIN RES	English, German, Latin, Old Norse, Swedish	This corpus contains texts written in the Late Old Swedish period (from 1375 to 1550). The corpus is available for download from the repository of the University of Hamburg.	Download
The Electronic Text Corpus of Sumerian Literature. Revised edition Size: 5,151,373 words Annotation: Each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation. Licence: CC-BY-NC-SA 3.0	English, Sumerian	This corpus contains transliterations and English translations of 394 Sumerian compositions from approximately 2100 to 1700 BCE. The corpus is available for download from the Oxford Text Archive.	Download
Finnish Folk Poetry Size: 7.1 million words Annotation: normalised (added diacritics) Licence: CC-BY-NC	Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic	This corpus contains poems from 1564 to 1939. The corpus is available through the concordancer Korp.	Concordancer
Corpus of Early Modern Finnish, Kielipankki Version Size: 8.6 million words Annotation: no linguistic annotation Licence: EUPL v.1.1 SA	Finnish, Russian, German, Latin	This corpus contains texts from 1809 to 1899. The corpus is available through the concordancer Korp.	Concordancer
Aleksis Kivi Corpus (SKS) Size: 413,700 words Annotation: MSD-tagged, syntactically parsed Licence: CC-BY-NC	Finnish, Swedish	This corpus contains the works by Finnish author Aleksis Kivi from 1855 to 1871. The corpus is available through the concordancer Korp.	Concordancer
Classics Library of the National Library of Finland - Kielipankki version Licence: CC-BY	Finnish, Swedish	This corpus will contain literary texts from 1549 to 1944.
The Letters of Paul Sinebrychoff, Kielipankki Version Size: 8.6 million words Annotation: Finnish subset: MSD-tagged, syntactically parsed; Swedish subset: no linguistic annotation Licence: CC-BY	Finnish, Swedish	This corpus contains letters from 1895 to 1909. The corpus is available through a dedicated online search environment.	Concordancer
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 8.7 billion words Licence: CC-BY	Finnish, Swedish	This corpus contains newspaper articles from 1770 to 2011. The corpus is available through the concordancer Korp.	Concordancer
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Licence: CC-BY	Finnish, Swedish	This corpus contains newspaper articles from 1771 to 1874.	Download
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) Size: 8.7 billion tokens Annotation: tokenised Licence: CLARIN ACA	Finnish, Swedish	This corpus contains newspaper articles from 1875 to 1920. The corpus is available for download from the Language Bank of Finland.	Download
Carniolan Provincial Assembly corpus Kranjska 1.0 Size: 10.9 million words Annotation: tokenised, MSD-tagged, lemmatised Licence: CC-BY 4.0	German, Slovenian	The corpus contains meeting proceedings of 694 sessions of the Carniolan Provincial Assembly from 1861 to 1913. The source data (scanned and OCR processed pdf documents) originally come from The Digital Library of Slovenia dLib.si and History of Slovenia - SIstory portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later on in Latin. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Language was detected on the sentence level; roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit for Slovenian and German, while Lingua was used for language detection. The documents are in the Parla-CLARIN compliant TEI XML format, with each session in one file. For the relevant publication, see Marolt et al. (2023).	Download
B4 Tatian Corpus of Deviating Examples 2.1 Size: 11,300 tokens Annotation: tokenised, MSD-tagged Licence: CC-BY	Latin, German (Old High)	This corpus contains the OHG Tatian, which is one of the largest prose texts from the Old High German period. The corpus is available for download and through a concordancer from the repository of the University of Hamburg.	Concordancer Download
Språkbanken's historical corpora Size: 1.34 billion tokens Annotation: tokenised, PoS-tagged, lemmatised, syntactically parsed, word sense (for materials more recent than 1800) Licence: CC-BY	Swedish, German, French and others	This collection of corpora contains – among others – diachronic legal texts, Bible translations, medieval letters, digitized newspapers from the Swedish National Library and 19th century fiction from the Swedish Literature Bank. The corpora are available through the concordancer Korp.	Concordancer
Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0 Size: 34,542 utterances; 578,958 sentences; 13,271,885 words; 15,403 pages Annotation: tokenised, MSD-tagged, lemmatised Licence: CC BY 4.0	Croatian, Serbian, Slovenian	This historical parliamentary corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions. The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory portal. The images were OCR processed and the results saved as pdf, docx and txt. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Lingua was used for language detection on the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.	Concordancer (noSketch) Concordancer (KonText) Download

Other historical corpora

Monolingual corpora

Corpus	Language	Description	Availability
DIAKORP v6 Size: 4 million tokens Annotation: basic structural markup Licence: CC-BY-NC-SA	Czech	This corpus contains texts from the 14th to the 20th century. The corpus is available through a dedicated concordancer.	Concordancer
ARCHER Corpus	English	The corpus contains texts from 1600 to 1999. The corpus is available through the CQPConcordancer.	Concordancer
ECCO-TCP Size: 74 million tokens Annotation: no linguistic annotation Licence: CC-0	English	This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1700 to 1800. The corpus is available for download from a dedicated webpage and through a dedicated concordancer.	Concordancer Download
EEBO-TCP Size: 766 million tokens Annotation: no linguistic annotation Licence: CC-0	English	This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1450 to 1750. The corpus is available through a dedicated concordancer.	Concordancer
EVANS-TCP Size: 766 million tokens Annotation: no linguistic annotation Licence: CC-0	English	This corpus contains American texts from 1640 to 1821. The corpus is available through a dedicated concordancer.	Concordancer
Historical Corpora at Lancaster University Annotation: tokenised, PoS-tagged, partial semantic tagging (USAS system)	English	The corpus contains texts in various domains (e.g., fiction, newspaper texts, religious texts) from 1500 on. The corpus is available through the CQPConcordancer.	Concordancer
Frantext Size: 300 million words Annotation: PoS-tagged, lemmatised	French	This corpus contains texts from the 10th to the 21st century. The corpus is available through a dedicated concordancer (restricted access).	Concordancer
Corpus of Old and Middle Hungarian court records and private correspondence Size: 850,000 words Annotation: tokenised, MSD-tagged, lemmatised, sociolinguistic metadata	Hungarian	This corpus contains private letters and testimonies from the 16th to the 18th century. The corpus is available through a dedicated concordancer.	Concordancer
Old Hungarian Corpus Size: 3 million tokens Annotation: tokenised, partially normalized, partially MSD-tagged	Hungarian	This corpus contains texts (codices, letters) from the 12th to the 17th century. The corpus is available for download from a dedicated webpage and through a dedicated concordancer.	Concordancer Download
Corpus testuale del Tesoro della Lingua Italiana delle Origini Size: 23 million tokens Annotation: tokenised, lemmatised	Italian	This corpus contains early Italian texts before 1375. The corpus is available through a dedicated concordancer.	Concordancer
DiaCORIS	Italian	This corpus contains texts from 1861 to 1945. The corpus is available through a dedicated concordancer. For the relevant publication, see Rossini Favretti et al. (2011).	Concordancer
M.I.DIA. (Morfologia dell'Italiano in DIAcronia) Size: 7.5 million tokens Annotation: tokenised Licence: CC-BY-NC 4.0	Italian	This corpus contains texts from the 13th to the 20th century. The corpus is available through a dedicated concordancer.	Concordancer
Corpus of the 19. century Polish (Korpus polszczyzny XIX-wiecznej) Size: 625,000 tokens Annotation: tokenised, PoS-tagged, lemmatised, transliteration, transcription	Polish	This corpus contains texts from 1830 to 1918. The corpus is available for download through a dedicated webpage.	Download
The Electronic Corpus of 17th- and 18th-century Polish Texts (Elektroniczny Korpus Tekstów Polskich z XVII i XVIII w.) Size: 13.5 million tokens Annotation: tokenised, partially PoS-tagged, structural annotation	Polish	This corpus contains texts from 1601 to 1772. The corpus is available through a dedicated concordancer. A manually annotated subset is available here. For the relevant publication, see Gruszczyński et al. (2021).	Concordancer
IMPACT GT corpus (Korpus GT projektu IMPACT) Size: 1.5 million tokens Annotation: transcription	Polish	This corpus contains texts from 1570 to 1756. The corpus is available through a dedicated concordancer. For the relevant publication, see Bień (2012).	Concordancer
Corpus Informatizado do Português Medieval Size: 2 million tokens Annotation: tokenised, PoS-tagged	Portuguese	This corpus contains texts from the 9th to the 16th century. The corpus is available through a dedicated concordancer (restricted access).	Concordancer
Parsed Corpus of Historical Portuguese Size: 3.3 million Annotation: tokenised, PoS-tagged (2 million), treebanked (1.2 million)	Portuguese	This corpus contains 76 texts written by authors born between 1380 and 1881. The corpus is available for download and through a dedicated concordancer.	Concordancer Download

Multilingual corpora

Corpus	Language	Description	Availability
Bundesblatt/Feuille fédérale/Foglio federale Size: 203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian) Annotation: tokenised, syntactically-parsed	German, French, Italian	This corpus contains texts from 1849 to 2014. The corpus is available through the CQPWeb concordancer.	Concordancer
Corpus of old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500) Size: 620,000 tokens Annotation: tokenised	Polish, Latin	This corpus contains texts until 1500. The corpus is available for download from a dedicated webpage.	Download
Corpus of the 16. century Polish (Korpus polszczyzny XVI wieku) Annotation: lemmatised, transliteration	Polish, Latin	This corpus contains texts from the 16th century. The corpus is available through a dedicated concordancer.	Concordancer
eFontes Mediae et Infimae Latinitatis Polonorum (Elektroniczny korpus polskiej łaciny średniowiecznej) Size: 5 million tokens Annotation: tokenised, lemmatised	Polish, Latin	This corpus contains texts from the 11th to the middle of the 16th century. The corpus is available through a dedicated concordancer.	Concordancer
XV century New Testament translations (Piętnastowieczne przekłady Nowego Testamentu – elektroniczna konkordancja staropolska) Size: 400,000 tokens Annotation: tokenised	Polish, Latin	This corpus contains Biblical texts from 1380 to 1500. This corpus is available through a dedicated concordancer.	Concordancer

Additional Materials

Presentations on historical newspaper corpora at the CLARIN-PLUS workshop 'Working with Digital Collections of Newspapers', 19-21 September 2016, Leuven, Belgium. [html]
Videolectures of the CLARIN-PLUS workshop. [html]

List of Publications on Historical Corpora

[Bień 2012] Janusz Bień. 2012. Delivering the IMPACT project Polish Ground-Truth texts with Poliqarp for DjVu.

[Erjavec 2012] Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene.

[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources.

[Galuščáková and Neužilová 2018] Petra Galuščáková and Lucie Neužilová. 2018. Low Resource Methods for Medieval Document Sections Analysis.

[Gruszczyński et al. 2021] Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska, Witold Kieraś, Emanuel Modrzejewski, Aleksandra Wieczorek, and Marcin Woliński. 2021. The Electronic Corpus of 17th- and 18th-century Polish Texts.

[Haaf and Thomas 2016] Susanne Haaf and Christian Thomas. 2016. The Historical Corpora of the German Text Archive as a basis for research into linguistic history.

[Huber et al. 2016] Magnus Huber, Magnus Nissel, and Karin Puga. 2016. The Old Bailey Corpus 2.0, 1720-1913 Manual.

[Kingisepp et al. 2004] Valve-Liivi Kingisepp, Külli Prillop, and Külli Habicht. 2004. Eesti vana kirjakeele korpus: mis tehtud, mis teoksil.

[Klein and Dipper 2016] Thomas Klein and Stefanie Dipper. 2016. Handbuch zum Referenzkorpus Mittelhochdeutsch.

[McGillivray and Kilgarriff 2015] Barbara McGillivray and Adam Kilgarriff. 2015. Tools for historical corpus research, and a corpus of Latin.

[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, and Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora.

[Rossini Favretti et al. 2011] Rema Rossini Favretti, Fabio Tamburini, and Andrea Zaninello. 2011. Exploiting corpus evidence for automatic sense induction.

[Rögnvaldsson and Helgadóttir 2011] Eiríkur Rögnvaldsson and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In C. Sporleder, A.P.J. van den Bosch and K.A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.

[Rutten and van der Wal 2014] Gijsbert Rutten and Marijke van der Wal. 2014. Letters as Loot. A sociolinguistic approach to seventeenth- and eighteenth-century Dutch.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, and Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Schröder 2014] Ingrid Schröder. 2014. The Reference Corpus: New Perspectives for Middle Low German Grammar.

[Stein 2013] Achim Stein. 2013. Diachronic syntax based on constituency and dependency annotated corpora: theoretical and methodological issues.

[Vatri and McGillivray 2018] Alessandro Vatri and Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus.