Literary corpora

Introduction

Literary corpora comprise poetry and fictional prose texts, such as novels, short stories and plays. They bring together the collected works of a single author or representative works from a specific literary period. Because literary corpora are often available through powerful concordancers, they are especially well suited to quantitative and qualitative approaches to comparative literary analysis, within or across genres and historical periods.

The CLARIN infrastructure gives access to 44 literary corpora. Most of the corpora are monolingual; together they cover 16 languages (Croatian, Danish, English, Estonian, Finnish, French, Greek, Latvian, North Saami, Norwegian, Polish, Portuguese, Slovenian, Spanish, Sumerian and Swedish).

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 23 July 2021.

Literary corpora in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

One-million Corpus of Croatian Literary Language

Size: 1 million tokens

Croatian

The corpus is listed in the LINDAT repository.

Johannes V. Jensen Corpus

Size: 1,760,093 words, 8,489 pages
Annotation: unannotated
Licence: CC BY-SA 4.0

Danish

This corpus presents the collected works of the Danish author Johannes V. Jensen.

The corpus is available for download from CLARIN-DK and for online browsing through a dedicated concordancer.

Browse

Download

Complete Corpus of Anglo-Saxon Poetry

Annotation: unannotated

English (Old)

This corpus is available for online browsing through an external interface.

Browse

York-Helsinki Parsed Corpus of Old English Poetry

Size: 71,490 words
Annotation: MSD-tagged, syntactically parsed
Licence: Restricted

English (Old)

This corpus contains a selection of poetic texts (71,490 words) from the Old English Section of the Helsinki Corpus of English Texts.

The corpus is available for download from the Oxford Text Archive.

Download

Collection of older original Estonian-language works of fiction

Size: 173 texts
Licence: CLARIN ACA

Estonian

This corpus collects older Estonian literary texts published on "Kreutzwald's Century: the Estonian Cultural History Web". The electronically republished books included in the collection are based on the first editions of works by major Estonian authors, published between 1854 and 1944.

The corpus is available for online browsing through an external interface.

Browse

Corpus of Estonian fiction

Size: 5,768,504 words
Licence: CLARIN ACA - NC

Estonian

This corpus contains texts from 1990 onwards.

The corpus is available for download from CELR.

Download

Estonian Runic Songs' Database

Size: 92,134 texts
Licence: CLARIN ACA

Estonian

This database contains the oldest text recordings of Estonian runic songs, created in the 19th century and in the first decades of the 20th century. In addition to the runic songs, it includes songs of transitional form and end-rhymed songs (about 6,000).

The corpus is available for online browsing through an external interface.

Browse

Classics of English and American Literature in Finnish (CEAL)

Size: 3 novels, 484,010 tokens
Annotation: MSD-tagged, syntactically parsed
Licence: CLARIN RES + NC

Finnish

This corpus contains Finnish translations of the following three texts: Jane Austen: Ylpeys ja ennakkoluulo (Pride and Prejudice), translated by Kersti Juva, Teos 2013; Henry James: Washingtonin aukio (Washington Square), translated by Kersti Juva, Otava 2003; Charles Dickens: Kolea talo (Bleak House), translated by Kersti Juva, Tammi 2006.

The corpus is available for online browsing through Korp in two versions: Version 1 (Sentences and Paragraphs in the Original Order) and Version 2 (Scrambled Paragraphs).

Browse (original)

Browse (scrambled)

Classics of Finnish Literature, Kielipankki Version

Size: 1,500,000 words
Annotation: syntactically parsed (TDT alpha), named entities (FiNER), MSD-tagged, lemmatized
Licence: EUPL v.1.1 SA

Finnish

This corpus contains prose fiction, plays, poetry and aphorisms (some originally written in Swedish) by established Finnish authors, published from the 1880s to 1949.

The corpus is available for online browsing through Korp.

Browse

Finnish Corpus (Literature) (UHLCS)

Size: 68,425 words
Annotation: tagged
Licence: CLARIN RES

Finnish

This corpus contains samples of Finnish literature published by the WSOY publishing company in the 1990s.

The corpus is available online through FIN-CLARIN.

Browse

Corpus of Old Literary Finnish

Size: 3,428,618 words

Finnish

This corpus contains various works published during the Swedish rule (from the 16th century to about 1810), extensive manuscripts from that period (most of which were later printed), as well as individual almanac and decree texts, sermons and poetry.

This corpus is available for online browsing through an external interface.

Browse

Corpus of Early Literary Finnish

Finnish

Corpus of Finnish Literary Classics

Size: 1,456,658 words

Finnish

This corpus contains works by established Finnish fiction writers from the 1880s to the 1930s. There are different types of prose and plays, as well as lyrics and aphorisms.

This corpus is available for online browsing through an external interface.

Browse

The Finnish Gutenberg Corpus

Size: 34,487,420 words
Licence: CC-BY

Finnish

This corpus contains Finnish books made available by the Gutenberg project. The texts have not been linguistically annotated.

The corpus is available for online browsing through Korp.

Browse

The Morpho-Syntactic Database of Mikael Agricola's Works

Size: 83,678 sentences; 428,314 tokens; 38,308 words
Annotation: MSD-tagged, syntactically parsed
Licence: CC-BY-ND

Finnish

This corpus contains the Finnish parts of Mikael Agricola’s works (Abckiria, Rukouskiria, Se Wsi testamenti, Käsikiria, Messu, Piina, Psaltari, Veisut, Profeetat).

The corpus is available for online browsing through Korp.

Browse

République-Bastille (1948-1949)

Size: 37,965 words
Licence: CC-BY

French

This corpus contains République-Bastille, a novel by Melpo Axioti. The French text is of particular linguistic interest because it was written in a language other than the author's mother tongue, which makes it well suited to research on bilingualism and self-translation. For example, the naturalness of its language could be measured with computational tools.

The corpus is available for download from clarin:el.

Download

Cultural Thesaurus of the Greek Language

Size: 1 million tokens
Annotation: semantic
Licence: proprietary

Greek

This corpus contains prose, poetry, drama, and essays from the 18th century onwards.

The corpus is available for online browsing through a dedicated webpage.

Browse

Greek Medieval Texts

Size: 3,419,553 words
Licence: CC-BY-NC

Greek (Ancient), Greek (Modern)

This corpus contains medieval written material covering the period from the 4th to the 16th century AD. The texts fall into the following categories: religious, poetical-literary, political-historical, hymns and epigrams.

The corpus is available for download from clarin:el.

Download

Latvian literature classics

Latvian

This corpus presents classics from the end of the 19th century to the beginning of the 20th century.

North Saami Corpus (Literature) (UHLCS)

Size: 17,830 words
Licence: CLARIN RES +NC +NORED +PLAN

North Saami

This corpus contains Kerttu Vuolab's novel Cheppari cháráhus.

The corpus is available for online browsing through the TAITO shell.

Browse

NorGramBank – Fiction in Norwegian Bokmål

Size: 26,903,637 words; 2,469,916 sentences
Annotation: syntactically parsed
Licence: CLARIN ACA

Norwegian (Bokmål)

This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.

Browse

NorGramBank children’s fiction in Norwegian Bokmål

Size: 4,111,213 words; 389,564 sentences
Annotation: syntactically parsed
Licence: CLARIN ACA

Norwegian (Bokmål)

This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.

Browse

NorGramBank children's fiction in Norwegian Nynorsk

Size: 1,043,260 words; 106,434 sentences
Annotation: syntactically parsed
Licence: CLARIN ACA

Norwegian (Nynorsk)

This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.

Browse

NorGramBank fiction in Norwegian Nynorsk

Size: 2,884,376 words; 260,285 sentences
Annotation: syntactically parsed
Licence: CLARIN ACA

Norwegian (Nynorsk)

This corpus, which is based on OCR data from the National Library of Norway, is available for online browsing through INESS.

Browse

1000 Novels Corpus

Size: 1000 texts
Licence: CC-BY 4.0

Polish

This corpus is available for download from CLARIN-PL.

Download

1000PLUS Novels Corpus (1.0)

Size: 1000 texts; 17,352,826 words
Licence: CC-BY-SA 3.0

Polish

This corpus is available for download from CLARIN-PL.

Download

Late 19th- and Early 20th-Century Polish Novels

Licence: CC-BY 3.0

Polish

This corpus is available for download from CLARIN-PL.

Download

POE: Microcorpus of 20th century Polish poetry

Annotation: unannotated
Licence: plWordNet

Polish

This corpus is available for download from CLARIN-PL.

Download

LT Corpus

Size: 1,781,083 words
Annotation: PoS-tagged, lemmatized
Licence: CLARIN RES

Portuguese

This corpus contains 70 copyright-free classics (61 from Portugal and 9 from Brazil) published before 1940.

The corpus is available for download from PORTULAN.

Download

The corpus of older Slovenian narrative prose PriLit 1.0

Size: 43 texts; 1,275,209 tokens
Annotation: word modernisation, lemmatisation, syntactic annotation (Universal Dependencies)
Licence: CC BY 4.0

Slovenian

This corpus contains texts of older Slovenian narrative prose by 12 authors.

The corpus is available for download from the CLARIN.SI repository.

Download

Banco de Datos de Once Novelas Españolas 1951—1971 (SOL) (2014-10-08)

Size: 1,267,391 tokens; 69,270 sentences
Annotation: sentence scrambling
Licence: CC-BY 4.0

Spanish

This corpus is available for download from SWE-CLARIN and for online browsing through Korp.

Browse

Download

Electronic corpus of 15th-century Castilian cancionero manuscripts

Spanish

This corpus contains lyric poetry from 15th-century cancioneros.

The corpus is available for online browsing through an external interface.

Browse

Electronic text corpus of Sumerian literature (ETCSL)

Size: nearly 400 literary compositions

Sumerian

This corpus presents a selection of nearly 400 literary compositions recorded on sources which come from ancient Mesopotamia and date to the late third and early second millennia BCE.

The corpus is available for online browsing through an external interface.

Browse

August Strindberg's novels (2017-10-16)

Size: 4,309,037 tokens; 321,759 sentences
Annotation: sentence scrambling
Licence: CC-BY 4.0

Swedish

This corpus presents the collected works of August Strindberg.

The corpus is available for download from SWE-CLARIN and for online browsing through Korp.

Browse

Download

Bonnier novels I (1976/77) (2017-10-04)

Size: 6,578,675 tokens; 462,625 sentences
Annotation: sentence scrambling
Licence: CC-BY 4.0

Swedish

This corpus presents 69 Bonnier novels from 1976-77.

The corpus is available for download from SWE-CLARIN and for online browsing through Korp.

Browse

Download

Bonnier novels II (1980/81) (2017-03-17)

Size: 4,304,271 tokens; 298,361 sentences
Annotation: sentence scrambling
Licence: CC-BY 4.0

Swedish

This corpus presents 60 Bonnier novels from 1980-81.

The corpus is available for download from SWE-CLARIN and for online browsing through Korp.

Browse

Download

Multilingual corpora

Corpus Language Description Availability

MULTEXT-East "1984" annotated corpus 4.0

Size: 12 texts; 79,718 sentences; 1,064,424 words
Annotation: sentence-alignment, MSD tagging
Licence: CC BY-NC-SA 4.0

Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovenian

This is a parallel corpus of George Orwell's 1984 and its translations.

The corpus is available for download from CLARIN.SI.

Download

Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo

Size: 4,000 words
Licence: Oxford Text Archive Licence

English (Middle), Hebrew

This corpus contains literary texts from 1100 to 1400.

The corpus is available for download from the Oxford Text Archive.

Download

Finnish Folk Poetry

Size: 7.1 million words
Annotation: unannotated
Licence: CC-BY-NC

Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic

This corpus contains poems from 1564 to 1939.

The corpus is available for online browsing through Korp.

Browse

ParFin 2016, Finnish-Russian Parallel Corpus of Literary Texts

Size: 2,044,172 tokens
Annotation: MSD-tagged, syntactically parsed
Licence: CLARIN RES +NC +INF +ND

Finnish, Russian

This corpus contains Finnish literary texts from 1990-2010 and their translations into Russian, aligned at the sentence level.

The corpus is available for online browsing through Korp.

Browse

ParRus 2016, Russian-Finnish Parallel Corpus of Literary Texts

Size: 5,900,000 tokens
Annotation: MSD-tagged, syntactically parsed
Licence: CLARIN RES +NC +INF +ND

Finnish, Russian

This corpus contains Russian literary texts (classical and 20th-century literature) and their translations into Finnish, aligned at the paragraph level.

The corpus is available for online browsing through Korp.

Browse

Aleksis Kivi Corpus (SKS)

Size: 413,735 words
Annotation: MSD-tagged, syntactically parsed
Licence: CC-BY-NC

Finnish, Swedish

This corpus contains all the known letters, manuscripts and published works by the Finnish author Aleksis Kivi (1834–1872). Most of the texts were written in Finnish, while some of the letters and manuscripts are in Swedish. The texts date from 1855 to 1871.

The corpus is available for online browsing through Korp.

Browse

Classics Library of the National Library of Finland - Kielipankki version

Licence: CC-BY

Finnish, Swedish

This corpus contains literary texts from 1549 to 1944.

The corpus is available for online browsing through FIN-CLARIN.

Browse

aformes

Size: 376,250 words
Licence: CC-BY-NC

Greek (Modern), English

This corpus contains fiction texts from a journal of undergraduate creative writing at the Faculty of English Language and Literature.

The corpus is available for download from clarin:el.

Download