Literary Corpora | CLARIN ERIC

Literary corpora comprise poetry and fictional prose texts, such as novels, short stories and plays. They bring together the collected works of a single author or representative texts from a specific literary period. Since literary corpora are often available through powerful concordancers, they are especially well suited for a quantitative and qualitative approach to comparative literary analysis, within or across different genres and historical periods.

The CLARIN infrastructure gives access to 45 literary corpora. The majority of the corpora are monolingual and cover 16 languages (Croatian, Danish, English, Estonian, Finnish, French, Greek, Latvian, North Saami, Norwegian, Polish, Portuguese, Spanish, Slovenian, Sumerian and Swedish).

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Literary corpora in the CLARIN infrastructure

Monolingual corpora

Corpus	Language	Description	Availability
One-million Corpus of Croatian Literary Language Size: 1 million tokens	Croatian	The corpus is listed in the LINDAT repository.
Johannes V. Jensen Corpus Size: 1,760,093 words, 8,489 pages Annotation: unannotated Licence: CC BY-SA 4.0	Danish	This corpus presents the collected works of the Danish author Johannes V. Jensen. The corpus is available for download from CLARIN-DK and for online browsing through a dedicated concordancer.	Browse Download
Complete Corpus of Anglo-Saxon Poetry Annotation: none	English (Old)	This corpus is available for online browsing through an external interface.	Browse
York-Helsinki Parsed Corpus of Old English Poetry Size: 71,490 words Annotation: MSD-tagged, syntactically parsed Licence: Restricted	English (Old)	This corpus contains a selection of poetic texts (71,490 words) from the Old English Section of the Helsinki Corpus of English Texts. The corpus is available for download from the Oxford Text Archive.	Download
Collection of older original Estonian-language works of fiction Size: 173 texts Licence: CLARIN ACA	Estonian	This corpus collects older Estonian literary texts published on "Kreutzwald's Century: the Estonian Cultural History Web". The electronically republished books, included in the collection, are based on the first editions of works by more important Estonian authors, published in 1854-1944. The corpus is available for online browsing through an external interface.	Browse
Corpus of Estonian fiction Size: 5,768,504 words Licence: CLARIN ACA - NC	Estonian	This corpus contains texts from 1990 onwards. The corpus is available for download from CELR.	Download
Estonian Runic Songs' Database Size: 92,134 texts Licence: CLARIN ACA	Estonian	These are the oldest text recordings of Estonian runic songs (the text recordings were created in the 19th century and in the first decades of the 20th century). In addition to the runic songs, the database also has songs of transitional form and end-rhymed songs (about 6000). The corpus is available for online browsing through an external interface.	Browse
Classics of English and American Literature in Finnish (CEAL) Size: 3 novels, 484,010 tokens Annotation: MSD-tagged, syntactically parsed Licence: CLARIN RES + NC	Finnish	This corpus contains Finnish translations of the following three texts: Jane Austen: Ylpeys ja ennakkoluulo (Pride and Prejudice), translated by Kersti Juva, Teos 2013; Henry James: Washingtonin aukio (Washington Square), translated by Kersti Juva, Otava 2003; Charles Dickens: Kolea talo (Bleak House), translated by Kersti Juva, Tammi, 2006. The corpus is available for online browsing through Korp in two versions: Version 1 (Sentences and Paragraphs in the Original Order) and Version 2 (Scrambled Paragraphs).	Browse (original) Browse (scrambled)
Classics of Finnish Literature, Kielipankki Version Size: 1,500,000 words Annotation: syntactically parsed (TDT alpha), named entities (FiNER), MSD-tagged, lemmatized Licence: EUPL v.1.1 SA	Finnish	This corpus contains prose fiction, plays, poetry and aphorisms (some originally written in Swedish) by established Finnish authors, published from the 1880s to 1949. The corpus is available for online browsing through Korp.	Browse
Corpus of Early Literary Finnish	Finnish	The corpus of Early Modern Finnish contains Finnish-language works in various fields published during the 19th century, annual issues of the oldest periodicals and newspapers, almanac and decree texts, and some dictionaries. An effort has been made to include the earliest, most important and (based on the number of reprints, for example) most widely distributed works. The selection of publications has also been made with a view to achieving the widest possible thematic coverage, although more works originally written in Finnish have been included than translations. Translations have been alphabetised by the name of their translator, seasonal publications by their title, and other works by their author. The Finnish translations of unknown authors are in the Anonymous folder, the texts of unknown authors in the Other folder. The materials cover the period between Old and Modern Finnish and a little beyond. The earliest book dates from 1809, the latest from 1891, but there are texts of the regulations right up to the end of the century. However, most of the material is from 1810-1880. This later material can also be found in the Classics corpus.
Corpus of Finnish Literary Classics Size: 1,456,658 words	Finnish	This corpus contains works by established Finnish fiction writers from the 1880s to the 1930s. There are different types of prose and plays, as well as lyrics and aphorisms. This corpus is available for online browsing through an external interface.	Browse
Corpus of Old Literary Finnish Size: 3,428,618 words	Finnish	This corpus contains various works published during the Swedish rule (from the 16th century to about 1810), extensive manuscripts from that period (most of which were later printed), as well as individual almanac and decree texts, sermons and poetry. This corpus is available for online browsing through an external interface.	Browse
Finnish Corpus (Literature) (UHLCS) Size: 68,425 words Annotation: tagged Licence: CLARIN RES	Finnish	This corpus contains samples of Finnish literature published by the WSOY publishing company in the 1990s. The corpus is available online through FIN-CLARIN.	Browse
The Finnish Gutenberg Corpus Size: 34,487,420 words Licence: CC-BY	Finnish	This corpus contains Finnish books made available by the Gutenberg project. The texts have not been linguistically annotated. The corpus is available for online browsing through Korp.	Browse
The Morpho-Syntactic Database of Mikael Agricola's Works Size: 83,678 sentences; 428,314 tokens; 38,308 words Annotation: MSD-tagged, syntactically parsed Licence: CC-BY-ND	Finnish	This corpus contains the Finnish parts of Mikael Agricola’s works (Abckiria, Rukouskiria, Se Wsi testamenti, Käsikiria, Messu, Piina, Psaltari, Veisut, Profeetat). The corpus is available for online browsing through Korp.	Browse
République-Bastille (1948-1949) Size: 37,965 words Licence: CC-BY	French	This corpus contains République-Bastille, a novel by Melpo Axioti. This French text is of particular linguistic interest since it is a text written in a language other than the author's mother tongue and is suited for research on bilingualism and self-translation. It would be worth measuring the naturalness of the language with computational tools, for example. The corpus is available for download from clarin:el.	Download
Cultural Thesaurus of the Greek Language Size: 1 million tokens Annotation: semantic Licence: proprietary	Greek	This corpus contains prose, poetry, drama, and essays from the 18th century onwards. The corpus is available for online browsing through a dedicated webpage.	Browse
Greek Medieval Texts Size: 3,419,553 words Licence: CC-BY-NC	Greek (Ancient), Greek (Modern)	This corpus of medieval texts contains written material covering the period from the 4th to the 16th century A.D. The texts can be classified into the following categories: religious, poetical-literary, political-historical, hymns, epigrams. The corpus is available for download from clarin:el.	Download
Latvian literature classics	Latvian	This corpus presents classics from the end of the 19th century to the beginning of the 20th century.
North Saami Corpus (Literature) (UHLCS) Size: 17,830 words Licence: CLARIN RES +NC +NORED +PLAN	North Sami	This corpus contains Kerttu Vuolab's novel Cheppari cháráhus. The corpus is available for online browsing through the TAITO shell.	Browse
NorGramBank – Fiction in Norwegian Bokmål Size: 26,903,637 words; 2,469,916 sentences Annotation: syntactically parsed Licence: CLARIN ACA	Norwegian (Bokmal)	This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.	Browse
NorGramBank children’s fiction in Norwegian Bokmål Size: 4,111,213 words; 389,564 sentences Annotation: syntactically parsed Licence: CLARIN ACA	Norwegian (Bokmal)	This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.	Browse
NorGramBank children's fiction in Norwegian Nynorsk Size: 1,043,260 words; 106,434 sentences Annotation: syntactically parsed Licence: CLARIN ACA	Norwegian (Nynorsk)	This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.	Browse
NorGramBank fiction in Norwegian Nynorsk Size: 2,884,376 words; 260,285 sentences Annotation: syntactically parsed Licence: CLARIN ACA	Norwegian (Nynorsk)	This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.	Browse
1000 Novels Corpus Size: 1000 texts Licence: CC-BY 4.0	Polish	This corpus is available for download from CLARIN-PL.	Download
1000PLUS Novels Corpus (1.0) Size: 1000 texts; 17,352,826 words Licence: CC-BY-SA 3.0	Polish	This corpus is available for download from CLARIN-PL.	Download
Late 19th- and Early 20th-Century Polish Novels Licence: CC-BY 3.0	Polish	This corpus is available for download from CLARIN-PL.	Download
POE: Microcorpus of 20th century Polish poetry Annotation: unannotated Licence: plWordNet	Polish	This corpus is available for download from CLARIN-PL.	Download
LT Corpus Size: 1,781,083 words Annotation: PoS-tagged, lemmatized Licence: CLARIN RES	Portuguese	This corpus contains 70 copyright-free classics (61 Portugal and 9 Brazil) published before 1940. The corpus is available for download from PORTULAN.	Download
Corpus of longer narrative Slovenian prose KDSP 1.0 Size: 262 texts, 11 million words, 14 million tokens Annotation: MSD-tagged (MULTEXT-East & UD), lemmatised, annotated with author and text metadata Licence: CC-BY 4.0	Slovenian	This corpus contains 262 longer texts of older Slovenian narrative prose. The texts were published between 1836 and 1918 and are at least 20,000 words long. The texts have bibliographical metadata (author name, title, year of publication, length) and are classified according to the decade of publication, length, text type, text subtype, theme, and level of canonicity (texts by authors who were included in school textbooks after 1980 and/or in the Collected writings of Slovenian poets and writers are marked with a high degree of canonicity). The author metadata include gender, occupation, and years of birth and death. The corpus texts come from three digital sources, and each text is marked for its source: Wikisource (145 texts), the ELTeC corpus (96 texts), and the dLib digital library (21 texts). The corpus is provided in two variants, one containing running text and the other with added linguistic analyses. These comprise tokens, sentences, lemmas, MULTEXT-East morphosyntactic descriptions and Universal Dependencies morphological features. The linguistic annotation was performed with the CLASSLA program. The source format of the corpus is XML, with two derived formats also available: one is plain text, and the other is vertical files, as used by concordancers such as CWB. The corpus is available for download from CLARIN.SI as well as for online browsing through the noSketchEngine and KonText concordancers.	Browse (noSketchEngine) Browse (KonText) Download
The corpus of older Slovenian narrative prose PriLit 1.0 Size: 43 texts; 1,275,209 tokens Annotation: word modernisation, lemmatisation, syntactic annotation (Universal Dependencies) Licence: CC BY 4.0	Slovenian	This corpus contains texts of older Slovenian narrative prose by 12 authors. The corpus is available for download from the CLARIN.SI repository.	Download
Banco de Datos de Once Novelas Españolas 1951—1971 (SOL) (2014-10-08) Size: 1,267,391 tokens; 69,270 sentences Annotation: sentence scrambled Licence: CC-BY 4.0	Spanish	This corpus is available for download from SWE-CLARIN and for online browsing through Korp.	Browse Download
Electronic corpus of 15th-century Castilian cancionero manuscripts	Spanish	This is a lyric corpus of 15th century cancioneros. The corpus is available for online browsing through an external interface.	Browse
Electronic text corpus of Sumerian literature (ETCSL) Size: 400 literary compositions	Sumerian	This corpus presents a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia and date to the late third and early second millennia BCE. The corpus is available for online browsing through an external interface.	Browse
August Strindberg's novels (2017-10-16) Size: 4,309,037 tokens; 321,759 sentences Annotation: sentence scrambling Licence: CC-BY 4.0	Swedish	This corpus presents the collected works of August Strindberg. The corpus is available for download from SWE-CLARIN and for online browsing through Korp.	Browse Download
Bonnier novels I (1976/77) (2017-10-04) Size: 6,578,675 tokens; 462,625 sentences Annotation: sentence scrambling Licence: CC-BY 4.0	Swedish	This corpus presents 69 Bonnier novels from 1976-77. The corpus is available for download from SWE-CLARIN and for online browsing through Korp.	Browse Download
Bonnier novels II (1980/81) (2017-03-17) Size: 4,304,271 tokens; 298,361 sentences Annotation: sentence scrambling Licence: CC-BY 4.0	Swedish	This corpus presents 60 Bonnier novels from 1980-81. The corpus is available for download from SWE-CLARIN and for online browsing through Korp.	Browse Download

Multilingual corpora

Corpus	Language	Description	Availability
MULTEXT-East "1984" annotated corpus 4.0 Size: 12 texts; 79,718 sentences; 1,064,424 words Annotation: sentence-alignment, MSD tagging Licence: CC BY-NC SA 4.0	Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian	This is a parallel corpus of George Orwell's 1984 and its translations. The corpus is available for download from CLARIN.SI.	Download
Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo Size: 4,000 words Licence: Oxford Text Archive Licence	English (Middle), Hebrew	This corpus contains literary texts from 1100 to 1400. The corpus is available for download from the Oxford Text Archive.	Download
Finnish Folk Poetry Size: 7.1 million words Annotation: unannotated Licence: CC-BY-NC	Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic	This corpus contains poems from 1564 to 1939. The corpus is available for online browsing through Korp.	Browse
ParFin 2016, Finnish-Russian Parallel Corpus of Literary Texts Size: 2,044,172 tokens Annotation: MSD-tagged, syntactically parsed Licence: CLARIN RES +NC +INF +ND	Finnish, Russian	This corpus contains Finnish literary texts from 1990-2010 and their translations into Russian aligned at sentence level. The corpus is available for online browsing through Korp.	Browse
ParRus 2016, Russian-Finnish Parallel Corpus of Literary Texts Size: 5,900,000 tokens Annotation: MSD-tagged, syntactically parsed Licence: CLARIN RES +NC +INF +ND	Finnish, Russian	This corpus contains Russian literary texts (classical literature & 20th century) and their translations into Finnish aligned at paragraph level. The corpus is available for online browsing through Korp.	Browse
Aleksis Kivi Corpus (SKS) Size: 413,735 words Annotation: MSD-tagged, syntactically parsed Licence: CC-BY-NC	Finnish, Swedish	This corpus contains all the known letters, manuscripts and published works by Finnish author Aleksis Kivi (1834–1872). Most of the texts were written in Finnish while some of the letters and manuscripts are in Swedish. The time coverage of the texts: 1855-1871. The corpus is available for online browsing through Korp.	Browse
Classics Library of the National Library of Finland - Kielipankki version Licence: CC-BY	Finnish, Swedish	This corpus contains literary texts from 1549 to 1944. The corpus is available for online browsing through FIN-CLARIN.	Browse
aformes Size: 376,250 words Licence: CC-BY-NC	Greek (Modern), English	This corpus contains fiction texts from a journal of undergraduate creative writing at the Faculty of English Language and Literature. The corpus is available for download from clarin:el.	Download