The CLARIN infrastructure offers access to 76 historical corpora, covering almost all of the languages spoken in countries that are either members or observers in CLARIN ERIC. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.
We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Historical corpora in the CLARIN infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Open Richly Annotated Cuneiform Corpus, Korp Version Size: 1,600,563 tokens |
Akkadian |
This corpus contains cuneiform texts from ancient history. The texts come from the Oracc project and include collections such as the Corpus of Ancient Mesopotamian Scholarship, The Digital Corpus of Cuneiform Lexical Texts, and Royal Inscriptions of Babylonia online. The corpus is available through the concordancer Korp and for download from the repository of FIN-CLARIN. |
|
The Diorisis Ancient Greek Corpus Size: 10.2 million words |
Ancient Greek |
This corpus consists of 820 texts spanning from the beginnings of the Ancient Greek literary tradition (Homer) to the fifth century AD. The texts are sourced from the Perseus Canonical Greek Lit Repository, "The Little Sailing" digital library, and the Bibliotheca Augustana digital library. The corpus is available for download from Figshare. For the relevant publication, see Vatri and McGillivray (2018). |
Download |
Size: 3.4 million words |
Ancient Greek |
This corpus contains texts from the 4th to the 16th century. The texts belong to the following categories: religious, poetical-literary, political, and historical texts, as well as hymns and epigrams. The corpus is available for download from the clarin:el repository. |
Download |
Size: 148,876 words |
Chinese |
This corpus contains three texts (two non-fictional and one fictional) from the Medieval and Modern Chinese periods. The text "Zhuzi Yulei" is similar in genre to sermons and vernacular dialogues, and is representative of Medieval Chinese. The two other texts are the novel "Shuihu Zhuan", which is from the Ming Dynasty (1368–1644), and the novel "Rulin Waishi", which is from the Qing Dynasty (1644–1911). The corpus is available for download from the Oxford Text Archive. |
Download |
Brieven als buit (Letters as loot) Size: 460,000 words |
Dutch |
This corpus contains 40,000 letters from the 17th to the 19th century. These letters were sent home by sailors and others abroad, and also in the other direction by those who stayed behind and needed to keep in touch with their loved ones. Many letters did not reach their destinations: they were taken as loot by privateers and confiscated by the High Court of Admiralty during the wars fought between the Netherlands and England. The corpus is available through a dedicated concordancer. For the relevant publication, see Rutten and van der Wal (2014). |
Concordancer |
Size: 1.5 million words |
Dutch |
This corpus contains texts from the 13th century. The texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist Maurits Gysseling. The corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer. |
|
A Corpus of English Dialogues 1560-1760 (CED) Size: 1.2 million words |
English |
This corpus contains dialogues from literary and didactic works from 1560 to 1760. There are five text-types in the CED. The text-types representative of constructed dialogue are drama comedy, didactic works (language manuals and other handbooks) and fiction; the text-types representative of authentic dialogue are trial proceedings and witness depositions. In addition, a small group of miscellaneous dialogic texts is included in the collection. The corpus is available for download from the Oxford Text Archive. |
Download |
Corpus of Early English Correspondence Sampler (CEECS) Size: 450,000 words |
English |
This corpus contains 1147 letters from 1418 to 1680. The corpus was created from the larger Corpus of Early English Correspondence. The corpus is available for download from the Oxford Text Archive. |
Download |
Corpus of Late Modern English prose / David Denison Size: 580,056 words |
English |
This corpus contains fictional texts from 1837 to 1926. The corpus is available for download from the Oxford Text Archive. |
Download |
Size: 1.6 billion tokens |
English |
This corpus contains parliamentary debates from 1803 to 2005. The corpus is available through a dedicated concordancer. For the relevant publication, see Rayson et al. (2015). |
Concordancer |
Helsinki Corpus of Scottish Correspondence (1540-1750) Size: 500,000 tokens |
English |
This corpus contains personal correspondence from 1540 to 1750. The corpus consists of transcripts of original letter manuscripts. The texts are reproduced without any modernisation or normalisation. Language-external variables such as date, region, gender, addressee, hand and script type have been coded. The writers originate from fifteen different regions of Scotland. A fifth of the correspondents in the corpus are women. The corpus is available through the concordancer Korp. |
Concordancer |
Older Scottish texts: the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith Size: 877,000 tokens |
English |
This corpus contains texts from 1450 to 1600. The corpus is available for download from the Oxford Text Archive. |
Download |
Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn Size: 431,013 words |
English |
This corpus contains pamphlets of the American Revolution from 1750 to 1776. The corpus is available for download from the Oxford Text Archive. |
Download |
Parsed Corpus of Early English Correspondence (PCEEC) Size: 2.2 million words |
English |
This corpus contains correspondence from around 1410 to 1681. There are 4970 personal letters by 666 writers. The letters have been selected to be as socially representative of the literate social ranks of the time as possible. This corpus is available for download from the Oxford Text Archive. |
Download |
Royal Society Corpus (Version 4.0) Size: 35 million tokens |
English |
This corpus contains articles from the Philosophical Transactions of the Royal Society of London journal from 1665 to 1869. The corpus is available for download from the CLARIN-D repository as well as through a concordancer. |
|
The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose Size: 300,000 words |
English |
This corpus contains texts from 1761 to 1790. The corpus is available for download from the Oxford Text Archive. |
Download |
The Lampeter Corpus of Early Modern English Tracts Size: 50,797,916 words |
English |
This corpus contains tracts from 1640 to 1740. The corpus is available for download from the Oxford Text Archive. |
Download |
The Lancaster Newsbooks Corpus Size: 3,001,604 words |
English |
This corpus contains two collections of English printed pamphlets, books, and newspapers from 1654 to 1655. The corpus is available for download from the Oxford Text Archive. |
Download |
Corpus of Historical American English - Kielipankki Korp version 2017H1 Size: 385 million tokens |
English (American) |
This corpus contains texts from 1810 to 2009. Each decade has roughly the same balance of fiction, popular magazine, newspaper, and non-fiction books. The corpus is available through the concordancer Korp. |
Concordancer |
The Corpus of Late Modern English Texts, version 3.1 Size: 34 million words |
English (Late Modern) |
This corpus contains texts written by British and Irish authors from 1710 to 1920. In terms of genre, the texts correspond to narrative fiction and non-fiction, drama, letters, treatises, and miscellaneous written works. The corpus is available for download from a CLARIN-D repository. |
Download |
Size: 134 million words |
English (Late Modern) |
This corpus contains proceedings of the Old Bailey (i.e., legal documents) from 1674 to 1913. The corpus is available for download from the CLARIN-D repository and through the CQPConcordancer. For the corpus manual, see Huber et al. (2016). |
|
Helsinki corpus of English texts Size: 240,000 words |
English (Old and Middle) |
This corpus contains religious and fictional texts from 730 to 1710. See the project page for a list of all the texts included in the corpus. The corpus is available for download from the Oxford Text Archive. |
Download |
The York-Helsinki parsed corpus of Old English poetry (YCOEP) Size: 71,500 words |
English (Old) |
This corpus contains poems from 730 to 1710. The corpus contains a selection of poems taken from the Old English subpart of the Helsinki Corpus of English Texts. The corpus is available for download from the Oxford Text Archive. |
Download |
Corpus of Old Written Estonian Size: 2 million tokens |
Estonian |
This corpus covers secular and religious texts from the 16th to the 18th century. The corpus is available through a dedicated concordancer. For the relevant publication, see Kingisepp et al. (2004). |
Concordancer |
Classics of Finnish Literature, Kielipankki Version Size: 1.5 million words |
Finnish |
This corpus contains literary texts from 1880 to 1949. In terms of genre, the texts correspond to prose fiction, plays, poetry and aphorisms. The corpus is available through the concordancer Korp (FIN-CLARIN). |
Concordancer |
Corpus of Old Literary Finnish Size: 4.1 million words |
Finnish |
This corpus contains both literary and non-literary texts from 1543 to 1810. In terms of genre, the texts correspond to bible translations and religious texts (for instance, all of the clergyman Mikael Agricola's Finnish works), legal texts, poems, and texts concerning agriculture, nature, health, and so on. The corpus is available through the concordancer Korp. |
Concordancer |
Size: 34.5 million words |
Finnish |
This corpus contains books published up to 1925 that are made available through the Gutenberg project. The corpus is available through the concordancer Korp. |
Concordancer |
Size: 5.2 billion tokens |
Finnish |
This corpus contains newspaper articles from 1840 to 2011. For a comprehensive list of newspapers included in the corpus, see here. The corpus is available through the concordancer Korp. |
Concordancer |
The Morpho-Syntactic Database of Mikael Agricola's Works Size: 428,300 tokens |
Finnish |
This corpus contains texts from 1544 to 1551 written by the clergyman Mikael Agricola. The corpus is available through the concordancer Korp. |
Concordancer |
Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version Size: 48 texts |
Finnish |
This corpus contains literary texts from 1543 to 1791. This corpus complements the Corpus of Old Literary Finnish available through FIN-CLARIN. |
|
Partonopeus de Blois: transcriptions of all manuscripts and fragments Size: 21,736,766 words |
French (Old) |
This corpus contains transcriptions of the manuscripts and fragments of the romance Partonopeus de Blois. The corpus is available for download from the Oxford Text Archive. |
Download |
Syntactic Reference Corpus of Medieval French Size: 245,000 tokens |
French (Old) |
This corpus contains texts from the 9th to the 13th century. The syntactic categories of the SRCMF annotation and the grammatical principles of the annotation are explained in detail in the documentation. The corpus is available for download from a dedicated webpage. For the relevant publication, see Stein (2013). |
Download |
Size: 200,000 tokens |
German |
This corpus contains sermons from 1650 to 1750. The corpus is available through a dedicated concordancer. For the relevant publication, see Resch et al. (2016). |
Concordancer |
DDR-Presseportal (GDR press portal)
|
German |
This corpus contains newspaper texts from 1945 to 1994. The corpus is available through a concordancer provided by CLARIN-D. |
Concordancer |
Size: 215,168,761 tokens |
German |
This corpus contains texts from the 17th to the 20th century. The corpus is available through a dedicated concordancer. For the relevant publication, see Haaf and Thomas (2016). |
Concordancer |
The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700) Size: 120,000 tokens |
German |
This corpus contains medical writing from 1500 to 1700. The texts are taken primarily from digital facsimile copies available online via the University of Würzburg’s library interface, particularly from the subcategory pertaining to gynaecology. The corpus is available for download from the Oxford Text Archive. |
Download |
GerManC. A Historical Corpus of German Newspapers 1650-1800 Size: 700,000 words |
German |
This corpus contains personal letters, sermons and fictional, scholarly (i.e., humanities), scientific and legal texts from 1650 to 1800. The corpus is available for download from the Oxford Text Archive. |
Download |
Mannheimer Korpus Historischer Zeitungen und Zeitschriften Size: 3532 pages |
German |
This corpus contains texts from the 18th and 19th centuries. The corpus is available for download directly through the VLO. |
Download |
Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus) Size: 2.5 million tokens |
German |
This corpus contains texts from 1050 to 1350. The corpus is available for download from the Deutsches Text Archiv and through a concordancer. For the relevant publication, see Klein and Dipper (2016). |
|
B4 Historisches Predigtenkorpus zum Nachfeld Size: 92,500 tokens |
German (Middle High) |
This corpus contains sermons from an Upper German (Bavarian-Alemannic) dialect area. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment. |
|
Size: 6,690 tokens |
German (Middle High) |
This corpus contains texts from a journey diary from 1350. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment. |
|
Reference Corpus Middle Low German/Low Rhenish (1200-1650) Size: 200,700 tokens |
German (Middle Low) |
This corpus contains texts from the 13th century to the middle of the 17th century. The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment. For the relevant publication, see Schröder (2014). |
|
SaCoCo—Saarbrücken Cookbook Corpus Size: 436,000 tokens |
German |
This corpus contains historical cookbook recipes from 1569 to 1800, as well as contemporary ones from 2012. The corpus is available through the CQPweb concordancer provided by CLARIN-D. |
Concordancer |
Size: 553,000 tokens |
Greek |
This corpus contains historic academic texts. The corpus is available for download from the clarin:el repository. |
Download |
Size: 30 million words |
Hungarian |
This corpus contains historical texts from the 18th century to the 2000s. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 1.5 million tokens |
Icelandic (Old) |
This corpus contains Old Icelandic (Old Norse) narrative texts from the 13th to the 15th century. The corpus is available for download from CLARIN-IS and for search through the concordancer Korp. For the relevant publication, see Rögnvaldsson and Helgadóttir (2011). |
|
Size: 16.6 million words |
Italian |
This corpus contains Italian language newspapers published in the United States between 1898 and 1920. The corpus includes seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia. The collection includes the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia. The corpus is available for download from the repository of the University of Utrecht. |
Download |
Size: 13.3 million tokens |
Latin |
This corpus consists of Latin texts from the 2nd century B.C. to the 21st century. Non-linguistic metadata include information on genre, title, century and specific date. The corpus is available for download from LINDAT and for online search through Sketch Engine. For the relevant publication, see McGillivray and Kilgarriff (2015). |
|
Size: 1.6 million tokens |
Old Norse |
This corpus contains Medieval Nordic texts. The corpus is available for download and through the concordancer Corpuscle. |
|
Size: 16 million tokens |
Polish |
This corpus contains newspaper articles from 1945 to 1954. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 500,000 words |
Polish |
This corpus contains essays, news articles, and scientific and literary texts from 1963 to 1967. The corpus is available for download from the Oxford Text Archive. |
Download |
Corpus of biblical text in Scots / John Kirk Size: 35,506 words |
Scots |
This corpus contains Biblical texts. The corpus is available for download from the Oxford Text Archive. |
Download |
The Helsinki corpus of Older Scots : [1450-1700] Size: 1,940,706 words |
Scots |
This corpus contains texts of different domains and genres (e.g., burgh records, diaries, pamphlets, scientific treatises, sermons) from 1450 to 1700. The corpus is available for download from the Oxford Text Archive. |
Download |
Digital library and corpus of historical Slovene IMP 1.1 Size: 17.7 million tokens |
Slovenian |
This corpus contains 658 unique texts from 1584 to 1919. The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText. For the relevant publication, see Erjavec (2015). |
|
Reference corpus of historical Slovene goo300k 1.2 Size: 300,000 tokens |
Slovenian |
This corpus contains 89 unique texts from 1584 to 1899. The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText. For the relevant publication, see Erjavec (2012). |
|
Size: 3.5 billion tokens |
Swedish |
This corpus contains newspaper articles from 1770 to 1950. The corpus is available through the concordancer Korp. |
Concordancer |
Historical Corpus of the Welsh Language 1500-1850 Size: 420,000 words |
Welsh |
This corpus contains 30 texts from 1500 to 1850. The corpus is available for download from a dedicated website and through a dedicated concordancer. |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
"PolDiLemma" Middle Polish Diachrone Lemmatised Corpus Size: 7 million tokens |
Czech, German, Latin, Polish |
This corpus contains political, religious and scientific texts from the 16th to the 18th century. The corpus is available for download from the CLARIN-D repository. |
Download |
Medieval Charter Sections Corpus Size: 57 chapters |
Czech, Latin |
This corpus contains Latin charters created in the era of John the Blind, King of Bohemia. The corpus is available for download from LINDAT. For the relevant publication, see Galuščáková and Neužilová (2018). |
Download |
Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo Size: 4000 words |
English (Middle), Hebrew |
This corpus contains literary texts from 1100 to 1400. The corpus is available for download from the Oxford Text Archive. |
Download |
Dictionary of Old English Corpus in Electronic Form (DOEC) Annotation: no linguistic annotation |
English (Old), Latin |
This corpus contains 3037 texts from 600 to 1150. The corpus is available for download from the Oxford Text Archive. |
Download |
The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE) Size: 1.5 million words |
English (Old), Latin |
This corpus contains fictional texts from 600 to 1150. The corpus is available for download from the Oxford Text Archive. |
Download |
Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA) Size: 128,000 words |
English, German, Latin, Old Norse, Swedish |
This corpus contains texts written in the Late Old Swedish period (from 1375 to 1550). The corpus is available for download from the repository of the University of Hamburg. |
Download |
The Electronic Text Corpus of Sumerian Literature. Revised edition Size: 5,151,373 words |
English, Sumerian |
This corpus contains transliterations and English translations of 394 Sumerian compositions from approximately 2100 to 1700 BCE. The corpus is available for download from the Oxford Text Archive. |
Download |
Size: 7.1 million words |
Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic |
This corpus contains poems from 1564 to 1939. The corpus is available through the concordancer Korp. |
Concordancer |
Corpus of Early Modern Finnish, Kielipankki Version Size: 8.6 million words |
Finnish, Russian, German, Latin |
This corpus contains texts from 1809 to 1899. The corpus is available through the concordancer Korp. |
Concordancer |
Size: 413,700 words |
Finnish, Swedish |
This corpus contains the works by Finnish author Aleksis Kivi from 1855 to 1871. The corpus is available through the concordancer Korp. |
Concordancer |
Classics Library of the National Library of Finland - Kielipankki version Licence: CC-BY |
Finnish, Swedish | This corpus will contain literary texts from 1549 to 1944. | |
The Letters of Paul Sinebrychoff, Kielipankki Version Size: 8.6 million words |
Finnish, Swedish |
This corpus contains letters from 1895 to 1909. The corpus is available through a dedicated online search environment. |
Concordancer |
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Size: 8.7 billion words |
Finnish, Swedish |
This corpus contains newspaper articles from 1770 to 2011. The corpus is available through the concordancer Korp. |
Concordancer |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Licence: CC-BY |
Finnish, Swedish | This corpus contains newspaper articles from 1771 to 1874. | Download |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) Size: 8.7 billion tokens |
Finnish, Swedish |
This corpus contains newspaper articles from 1875 to 1920. The corpus is available for download from the Language Bank of Finland. |
Download |
Carniolan Provincial Assembly corpus Kranjska 1.0 Size: 10.9 million words |
German, Slovenian |
The corpus contains meeting proceedings of 694 sessions of the Carniolan Provincial Assembly from 1861 to 1913. The source data (scanned and OCR-processed PDF documents) originally come from The Digital Library of Slovenia dLib.si and History of Slovenia - SIstory portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later in the Latin script. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Language was detected at the sentence level: roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit for Slovenian and German, while Lingua was used for language detection. The documents are in the Parla-CLARIN-compliant TEI XML format, with each session in one file. For the relevant publication, see Marolt et al. (2023). |
Download |
B4 Tatian Corpus of Deviating Examples 2.1 Size: 11,300 tokens |
Latin, German (Old High) |
This corpus contains the OHG Tatian, which is one of the largest prose texts from the Old High German period. The corpus is available for download and through a concordancer from the repository of the University of Hamburg. |
|
Språkbanken's historical corpora Size: 1.34 billion tokens |
Swedish, German, French and others |
This collection of corpora contains – among others – diachronic legal texts, Bible translations, medieval letters, digitized newspapers from the Swedish National Library and 19th century fiction from the Swedish Literature Bank. The corpora are available through the concordancer Korp. |
Concordancer |
Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0 Size: 34,542 utterances; 578,958 sentences; 13,271,885 words; 15,403 pages |
Croatian, Serbian, Slovenian |
This historical parliamentary corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions. The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory portal. The images were OCR processed and the results saved as pdf, docx and txt. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Lingua was used for language detection on the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers. |
Other historical corpora
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 4 million tokens |
Czech |
This corpus contains texts from the 14th to the 20th century. The corpus is available through a dedicated concordancer. |
Concordancer |
|
English |
The corpus contains texts from 1600 to 1999. The corpus is available through the CQPConcordancer. |
Concordancer |
Size: 74 million tokens |
English |
This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1700 to 1800. The corpus is available for download from a dedicated webpage and through a dedicated concordancer. |
|
Size: 766 million tokens |
English |
This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1450 to 1750. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 766 million tokens |
English |
This corpus contains American texts from 1640 to 1821. The corpus is available through a dedicated concordancer. |
Concordancer |
Historical Corpora at Lancaster University Annotation: tokenised, PoS-tagged, partial semantic tagging (USAS system) |
English |
The corpus contains texts in various domains (e.g., fiction, newspaper texts, religious texts) from 1500 on. The corpus is available through the CQPConcordancer. |
Concordancer |
Size: 300 million words |
French |
This corpus contains texts from the 10th to the 21st century. The corpus is available through a dedicated concordancer (restricted access). |
Concordancer |
Corpus of Old and Middle Hungarian court records and private correspondence Size: 850,000 words |
Hungarian |
This corpus contains private letters and testimonies from the 16th to the 18th century. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 3 million tokens |
Hungarian |
This corpus contains texts (codices, letters) from the 12th to the 17th century. The corpus is available for download from a dedicated webpage and through a dedicated concordancer. |
|
Corpus testuale del Tesoro della Lingua Italiana delle Origini Size: 23 million tokens |
Italian |
This corpus contains early Italian texts before 1375. The corpus is available through a dedicated concordancer. |
Concordancer |
|
Italian |
This corpus contains texts from 1861 to 1945. The corpus is available through a dedicated concordancer. For the relevant publication, see Rossini Favretti et al. (2011). |
Concordancer |
M.I.DIA. (Morfologia dell'Italiano in DIAcronia) Size: 7.5 million tokens |
Italian |
This corpus contains texts from the 13th to the 20th century. The corpus is available through a dedicated concordancer. |
Concordancer |
Corpus of the 19. century Polish (Korpus polszczyzny XIX-wiecznej) Size: 625,000 tokens |
Polish |
This corpus contains texts from 1830 to 1918. The corpus is available for download through a dedicated webpage. |
Download |
Size: 13.5 million tokens |
Polish |
This corpus contains texts from 1601 to 1772. The corpus is available through a dedicated concordancer. A manually annotated subset is available here. For the relevant publication, see Gruszczyński et al. (2021). |
Concordancer |
IMPACT GT corpus (Korpus GT projektu IMPACT) Size: 1.5 million tokens |
Polish |
This corpus contains texts from 1570 to 1756. The corpus is available through a dedicated concordancer. For the relevant publication, see Bień (2012). |
Concordancer |
Corpus Informatizado do Português Medieval Size: 2 million tokens |
Portuguese |
This corpus contains texts from the 9th to the 16th century. The corpus is available through a dedicated concordancer (restricted access). |
Concordancer |
Parsed Corpus of Historical Portuguese Size: 3.3 million |
Portuguese |
This corpus contains 76 texts written by authors born between 1380 and 1881. The corpus is available for download and through a dedicated concordancer. |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Bundesblatt/Feuille fédérale/Foglio federale Size: 203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian) |
German, French, Italian |
This corpus contains texts from 1849 to 2014. The corpus is available through the CQPWeb concordancer. |
Concordancer |
Corpus of old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500) Size: 620,000 tokens |
Polish, Latin |
This corpus contains texts until 1500. The corpus is available for download from a dedicated webpage. |
Download |
Corpus of the 16. century Polish (Korpus polszczyzny XVI wieku) Annotation: lemmatised, transliteration |
Polish, Latin |
This corpus contains texts from the 16th century. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 5 million tokens |
Polish, Latin |
This corpus contains texts from the 11th to the middle of the 16th century. The corpus is available through a dedicated concordancer. |
Concordancer |
Size: 400,000 tokens |
Polish, Latin |
This corpus contains Biblical texts from 1380 to 1500. This corpus is available through a dedicated concordancer. |
Concordancer |
Additional Materials
- Presentations on historical newspaper corpora at the CLARIN-PLUS workshop 'Working with Digital Collections of Newspapers', 19-21 September 2016, Leuven, Belgium. [html]
- Videolectures of the CLARIN-PLUS workshop. [html]
List of Publications on Historical Corpora
[Bień 2012] Janusz Bień. 2012. Delivering the IMPACT project Polish Ground-Truth texts with Poliqarp for DjVu.
[Erjavec 2012] Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene.
[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources.
[Galuščáková and Neužilová 2018] Petra Galuščáková and Lucie Neužilová. 2018. Low Resource Methods for Medieval Document Sections Analysis.
[Gruszczyński et al. 2021] Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska, Witold Kieraś, Emanuel Modrzejewski, Aleksandra Wieczorek, and Marcin Woliński. 2021. The Electronic Corpus of 17th- and 18th-century Polish Texts.
[Haaf and Thomas 2016] Susanne Haaf and Christian Thomas. 2016. The Historical Corpora of the German Text Archive as a basis for research into linguistic history.
[Huber et al. 2016] Magnus Huber, Magnus Nissel, Karin Puga. 2016. The Old Bailey Corpus 2.0, 1720-1913 Manual.
[Kingisepp et al. 2004] Valve-Liivi Kingisepp, Külli Prillop, Külli Habicht. 2004. EESTI VANA KIRJAKEELE KORPUS: MIS TEHTUD, MIS TEOKSIL.
[Klein and Dipper 2016] Thomas Klein and Stefanie Dipper. 2016. Handbuch zum Referenzkorpus Mittelhochdeutsch.
[McGillivray and Kilgarriff 2015] Barbara McGillivray and Adam Kilgarriff. 2015. Tools for historical corpus research, and a corpus of Latin.
[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora.
[Rossini Favretti et al. 2011] Rema Rossini Favretti, Fabio Tamburini, Andrea Zaninello. 2011. Exploiting corpus evidence for automatic sense induction.
[Rögnvaldsson and Helgadóttir 2011] Eiríkur Rögnvaldsson and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In C. Sporleder, A.P.J. van den Bosch and K.A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.
[Rutten and van der Wal 2014] Gijsbert Rutten and Marijke van der Wal. 2014. Letters as Loot. A sociolinguistic approach to seventeenth- and eighteenth-century Dutch.
[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.
[Schröder 2014] Ingrid Schröder. 2014. The Reference Corpus: New Perspectives for Middle Low German Grammar.
[Stein 2013] Achim Stein. 2013. Diachronic syntax based on constituency and dependency annotated corpora: theoretical and methodological issues.
[Vatri and McGillivray 2018] Alessandro Vatri and Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus.