Parallel corpora

Introduction

Parallel corpora are central to translation studies and contrastive linguistics. Many parallel corpora are accessible through easy-to-use concordancers, which considerably facilitate the study of interlinguistic phenomena. Such corpora are also a rich source of materials for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.

Parallel corpora are our largest resource family: the CLARIN infrastructure provides access to 86 parallel corpora, most of which are available for download from national repositories as well as through concordancers such as Korp, Corpuscle, and KonText. There are 47 bilingual corpora in the CLARIN infrastructure, mostly covering European language pairs but also including non-European languages such as Hindi, Tamil, and Vietnamese. The remaining 39 corpora are multilingual, 5 of which contain texts in more than 50 languages. Almost half of the corpora are sentence-aligned, which allows for easy comparative research.
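To illustrate why sentence alignment makes comparative research easy, here is a minimal sketch of a bilingual concordance lookup. It assumes a corpus has already been loaded as a list of (source, target) sentence pairs; the function name `concordance` and the toy sentences are invented for illustration and do not come from any corpus listed here or from the API of Korp, Corpuscle, or KonText.

```python
import re


def concordance(pairs, pattern):
    """Return the aligned (source, target) pairs whose source sentence matches pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if rx.search(src)]


# Toy sentence-aligned data (hypothetical, not taken from any listed corpus).
pairs = [
    ("The house is small.", "Das Haus ist klein."),
    ("I like tea.", "Ich mag Tee."),
    ("Our house has a garden.", "Unser Haus hat einen Garten."),
]

# Find every source sentence containing the word "house" and show its translation.
hits = concordance(pairs, r"\bhouse\b")
for src, tgt in hits:
    print(f"{src}\t{tgt}")
```

Because each source sentence is stored with its translation, a single search on one side returns the corresponding text on the other side, which is the basic operation behind a parallel concordancer.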

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 30 August 2021.

Parallel corpora in the CLARIN infrastructure

Bilingual corpora

Corpus Language Description Availability

Amharic-English bilingual corpus

Size: 500 million tokens
Annotation: tokenised
Licence: ELRA END USER/ELRA VAR

Amharic-English

This corpus contains legal texts and news articles.

The corpus is available for download from the ELRA catalogue.

Download

The Catalan-Spanish Parallel Corpus

Size: 100 million tokens
Annotation: tokenised, sentence-aligned
Licence: ELRA END USER/ELRA VAR

Catalan-Spanish (Castilian)

This corpus contains newspaper articles.

The corpus is available for download from the ELRA catalogue.

Download

Croatian-English parallel corpus hrenWaC 2.0

Size: 99,001 sentence pairs
Annotation: sentence-aligned
Licence: CLARIN.SI User License for Internet Corpora

Croatian-English

This corpus contains texts crawled from top-level Croatian .hr domains.

The corpus is available for download from the CLARIN.SI repository.

Download

Czech and English abstracts of ÚFAL papers

Size: 200,000 tokens
Annotation: tokenised, document-aligned
Licence: CC-BY

Czech-English

This corpus contains abstracts of ÚFAL papers.

The corpus is available for download from the LINDAT repository.

Download

Czech-English Manual Word Alignment

Size: 113,000 tokens
Annotation: tokenised, word-aligned
Licence: CC-BY

Czech-English

This corpus contains texts from e-books, Reader’s Digest, the Kačenka magazine, Acquis Communautaire, the Project Syndicate and the PCEDT project.

The corpus is available for download from the LINDAT repository.

Download

CzEng 1.6

Size: 206.4 million tokens
Annotation: tokenised, sentence-aligned
Licence: CC-BY

Czech-English

This corpus is bidirectional, with original texts in English and Czech and accompanying translations.

The corpus is available for download from a dedicated website.

For the relevant publication, see Bojar et al. (2016).

Download

Czech-Slovak Parallel Corpus

Size: 5.7 million sentences
Annotation: automatic morphological annotation
Licence: CC-BY

Czech-Slovak

This corpus contains legal texts (Acquis), parliamentary debates (from the Europarl corpus), articles from the Official Journal of the European Union, and texts from the OPUS corpus.

The corpus is available for download from the LINDAT repository.

Download

Tourism English-Croatian Parallel Corpus 2.0

Size: 140,000 tokens
Annotation: tokenised, sentence-aligned
Licence: CLARIN.SI User Licence for Internet Corpora

English-Croatian

This corpus contains texts from tourist websites.

The corpus is available for download from the CLARIN.SI repository.

Download

Kačenka

Size: 3.3 million tokens
Annotation: tokenised

English-Czech

This corpus contains fictional texts.

 

English-Czech Corpus from Wikipedia

Size: 7.5 million tokens
Annotation: tokenised, sentence-aligned
Licence: CC-BY

English-Czech

This corpus contains Wikipedia articles.

The corpus is available for download from the LINDAT repository.

Download

Text Corpus - EMEL

Size: 43,000 tokens
Annotation: tokenised
Licence: CC-BY

English-French

This corpus contains NLP conference papers.

The corpus is available for download from the CLARIN:el repository.

Download

QTLP English-Greek Corpus for the MEDICAL domain

Size: 62,452 sentence pairs
Annotation: sentence aligned
Licence: MS-NC-NoReD

English-Greek

This corpus contains medical texts.

For access, contact the resource managers.

QTLP English-Greek Corpus for the AUTOMOTIVE domain

Size: 2,946 sentence pairs
Annotation: sentence aligned
Licence: MS-NC-NoReD

English-Greek

This corpus contains texts from the automotive domain.

For access, contact the resource managers.

Interlingual Perspectives

Size: 18 articles
Licence: CC-BY

English-Greek

This corpus contains research articles.

The corpus is available for download from the CLARIN:el repository.

Download

aformes

Size: 376,250 tokens
Annotation: tokenised
Licence: CC-BY

English-Greek

This corpus contains articles from a journal of undergraduate creative writing at an English department in Greece.

The corpus is available for download from the CLARIN:el repository.

Download

HindEnCorp 0.5

Size: 132,300 sentences
Annotation: sentence-aligned
Licence: CC-BY

English-Hindi

This corpus contains TED talks, news articles, Wikipedia articles, etc.

The corpus is available for download from LINDAT and can be queried through KonText.

Concordancer

Download

The English-Slovak Parallel corpus

Annotation: automatic morphological annotation
Licence: CC-BY NC-SA 3.0

English-Slovak

This corpus contains legal texts (Acquis), parliamentary debates (from the Europarl corpus), articles from the Official Journal of the European Union, and texts from the OPUS corpus.

The corpus is available for download from LINDAT.

Download

English-Luganda Parallel Corpus

Size: 150 sentences
Annotation: word-aligned

English-Luganda

This corpus contains Biblical scripture.

The corpus is available for download from a dedicated webpage.

Download

The English-Nepali Parallel Corpus

Size: 1.2 million tokens
Annotation: tokenised, partially sentence-aligned
Licence: ELRA END USER

English-Nepali

This corpus contains texts on national development. 

The corpus is available for download from the ELRA catalogue.

Download

English-Persian parallel Corpus

Size: 3.5 million tokens
Annotation: tokenised, sentence-aligned
Licence: ELRA END USER/ELRA VAR

English-Persian

This corpus contains literary, medical, political, proverbial, religious and scientific texts.

The corpus is available for download from the ELRA catalogue.

Download

UP/TAP annotated by the OpenNLP Part-of-Speech Tagger (Portuguese) and OpenNLP Part-of-Speech Tagger (English)

Size: 31,849 sentences
Annotation: PoS-tagged, sentence aligned
Licence: CC-BY

English-Portuguese

This parallel corpus contains texts extracted from the TAP UP magazine.

The corpus is available for download from the CLARIN:EL repository.

Download

The Corpus of Free Trade Agreement

Size: 3 million tokens
Annotation: tokenised
Licence: CLARIN ACA

English-Spanish

This corpus contains texts on the Free Trade Agreement.

The corpus is available through the concordancer Corpuscle.

Concordancer

ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) – set 1

Size: 7.6 million tokens
Annotation: PoS-tagged, sentence-aligned
Licence: CC-BY-NC-SA 4.0

English-Spanish (Castilian)

This corpus is a collection of XML metatextually tagged corpora containing speeches from three European chambers (the European Parliament, the British House of Commons, and the Spanish Congreso de los Diputados). It is a bilingual, bidirectional written corpus.

For the relevant publication, see Zanettin (2012).

Download

EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

Size: 169,871 sentences
Annotation: sentence-aligned
Licence: CC-BY

English-Tamil

This corpus contains news articles and texts related to film.

The corpus is available for download from LINDAT.

Download

English-Urdu Religious Parallel Corpus

Size: 14,371 sentences
Annotation: tokenised, sentence-aligned
Licence: CC-BY

English-Urdu

This corpus contains religious texts (the Bible and the Quran).

The corpus is available for download from LINDAT.

Download

English-Vietnamese Parallel Corpus

Size: 500,000 sentence pairs
Annotation: sentence-aligned
Licence: ELRA END USER/ELRA VAR

English-Vietnamese

This corpus contains newspaper and online news articles and texts from books and dictionaries from 2000 to 2007.

The corpus is available for download from the ELRA catalogue.

Download

Estonian-English parallel corpus

Size: 307,000 sentences
Annotation: sentence-aligned
Licence: CLARIN ACA

Estonian-English

This corpus contains Estonian laws and their translations into English and EU legislation translated into Estonian.

The corpus is available for download from a dedicated webpage.

Download

Finnish-English parallel corpus fienWaC 1.0

Size: 2.9 million tokens
Annotation: tokenised, sentence-aligned
Licence: CLARIN.SI User License for Internet Corpora

Finnish-English

This corpus contains texts crawled from top-level Finnish .fi domains.

The corpus is available for download from the CLARIN.SI repository.

Download

ParFin

Size: 360,000 tokens
Annotation: tokenised, sentence-aligned
Licence: CLARIN RES

Finnish-Russian

This corpus contains literary texts from 1990 to 2010.

The corpus is available through the concordancer Korp.

Concordancer

The KOTUS Finnish-Swedish Parallel Corpus

Size: 4.3 million tokens
Annotation: tokenised, sentence-aligned
Licence: CC-BY

Finnish-Swedish

This corpus contains corporate press releases, surveys, reports, laws and regulations, as well as governmental proposals from 1993 to 2004.

The corpus is available for download from FIN-CLARIN and through the concordancer Korp.

Concordancer

Download

Parallel corpus newsletters IFT FR-GR

Licence: CC-BY

French-Greek

This corpus contains IFT newsletters.

The corpus is available for download from the CLARIN:el repository.

Download

FREL

Size: 701,401 tokens
Annotation: tokenised
Licence: under negotiation

French-Greek

This corpus contains literary texts.

 

GeFRePaC - German French Reciprocal Parallel Corpus

Size: 30 million tokens
Annotation: tokenised, sentence- and partially word-aligned
Licence: ELRA END USER

German-French

This corpus contains texts from the European Union CELEX Database. The corpus is bidirectional.

The corpus is available for download from the ELRA catalogue.

Download

QTLP German-Greek Corpus for the MEDICAL domain

Size: 2,752 pairs of sentences
Annotation: sentence aligned
Licence: MS-NC-NoReD

German-Greek

This corpus contains medical texts.

For access, contact the resource managers.

Greek-Bulgarian Bul-TM parallel corpus

Size: 10 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

Greek-Bulgarian

This corpus contains societal and political texts.

The corpus is available for download through the CLARIN:el repository.

Download

European Parliament Proceedings Parallel Corpus 1996-2011, parallel corpus Greek-English

Size: 1.2 million sentences
Annotation: sentence-aligned
Licence: CC-ZERO

Greek-English

This corpus contains debates of the European Parliament from 1996 to 2011.

The corpus is available for download from the CLARIN:el repository.

Download

INTERA Corpus - the Greek-English part

Size: 4 million tokens
Annotation: sentence aligned
Licence: CC-BY

Greek-English

This corpus contains texts from the law, education, environment, tourism and health domains.

The corpus is available for download from the CLARIN:el repository.

Download

ParIce

Size: 3,589,000 sentence pairs
Annotation: tokenised, PoS-tagged, sentence-aligned, word-aligned
Licence: CC-BY 4.0

Icelandic-English

This corpus contains Icelandic and English texts from 11 different sources.

The corpus is available for download from CLARIN-IS and for search through the concordancer Korp.

For the relevant publication, see Barkarson and Steingrímsson (2019).

Concordancer

Download

The Norwegian-Spanish Parallel Corpus

Size: 6 million tokens
Annotation: tokenised
Licence: CLARIN ACA

Norwegian-Spanish

This corpus contains fictional and non-fictional texts from 2000 to 2009.

The corpus is available through the concordancer Corpuscle and for download in the CLARINO repository.

Concordancer

Download

The Polish-Lithuanian Parallel Corpus

Licence: IS PAS

Polish-Lithuanian

The corpus is available for download from the CLARIN-PL repository.

Download

COMPARA : Portuguese - English parallel translation corpus

Annotation: sentence-aligned
Licence: CC-BY

Portuguese-English

This corpus contains fictional texts and academic, newspaper and tourist articles.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Frankenberg Garcia and Santos (2003).

Concordancer

QTLP Portuguese-Greek Corpus for the MEDICAL domain

Size: 62,608 sentence pairs
Annotation: sentence aligned
Licence: MS-NC-NoReD

Portuguese-Greek

This corpus contains medical texts.

For access, contact the resource managers.

QTLP Portuguese-Greek Corpus for the AUTOMOTIVE domain

Size: 59,297 sentence pairs
Annotation: sentence aligned
Licence: MS-NC-NoReD

Portuguese-Greek

This corpus contains texts from the automotive domain.

For access, contact the resource managers.

ParRus

Size: 5.9 million tokens
Annotation: tokenised, paragraph-aligned
Licence: CLARIN RES

Russian-Finnish

This corpus contains texts from classical and 20th-century literature.

The corpus is available through the concordancer Korp.

Concordancer

Serbian-English parallel corpus srenWaC 1.0

Size: 23.1 million tokens
Annotation: tokenised
Licence: CLARIN.SI User License for Internet Corpora

Serbian-English

This corpus contains texts crawled from top-level Serbian .rs domains.

The corpus is available for download from the CLARIN.SI repository.

Download

Slovene-English parallel corpus slenWaC 1.0

Size: 718,315 tokens
Annotation: tokenised, sentence-aligned
Licence: CLARIN.SI User License for Internet Corpora

Slovenian-English

This corpus contains texts crawled from top-level Slovenian .si domains.

The corpus is available for download from the CLARIN.SI repository.

Download

Multilingual corpora

Corpus Language Description Availability

Tatoeba

Size: 12 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

117 languages

This corpus contains texts from the Tatoeba website.

The corpus is available for download from the CLARIN:el repository.

Download

Parallel Bible Corpus

 

Approx. 100 languages

This corpus contains historical and contemporary translations of the Bible.

 

A parallel corpus of KDE4 localization files (v.2)

Size: 60 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

92 languages

This corpus contains KDE4 localization files.

The corpus is available for download from the CLARIN:el repository.

Download

OpenSubtitles2011

Size: 8.31 billion tokens
Annotation: tokenised, sentence and word aligned
Licence: Open For Reuse With Restrictions 

54 languages

This corpus contains subtitles from the OpenSubtitles website.

The corpus is available for download from the CLARIN:el repository.

Download

EAC Translation Memory

Size: 320,000 tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

50 languages

This corpus contains law documents and texts related to education and culture.

The corpus is available for download through the CLARIN:el repository.

Download

Parallel Global Voices

Size: 174,629 documents
Annotation: sentence aligned
Licence: CC-BY

Approx. 50 languages

This corpus contains texts crawled from the Global Voices webpage.

The corpus is available for download from a dedicated webpage.

Download

InterCorp

Size: 1.5 billion tokens
Annotation: sentence aligned
Licence: proprietary

40 languages

The corpus consists of two main parts: manually aligned fiction and a number of collections, namely political commentaries published by Project Syndicate and VoxEurop, EU legal texts from the Acquis Communautaire corpus, proceedings of the European Parliament from the Europarl corpus, film subtitles from the Open Subtitles database, and the Bible.

The corpus is available primarily through the KonText concordancer. For research purposes, tailor-made linguistic data derived from the InterCorp corpus can be provided upon request. The contact e-mail is cnk@korpus.cz.

For the relevant publication, see Čermák and Rosen (2012).

Concordancer

DGT-TM-2016

Size: 373 million tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

Approx. 30 languages

This corpus contains texts from the European Legislation.

The corpus is available for download from the CLARIN:el repository.

Download

PELCRA multilingual parallel corpora

Size: 143 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

25 languages

This corpus contains texts from the CORDIS and RAPID websites, and the press releases of the European Parliament and the European Southern Observatory.

The corpus is available for download from the CLARIN:EL repository.

Download

DGT-Acquis

Annotation: sentence aligned
Licence: Open For Reuse With Restrictions

23 languages

This corpus contains articles from the Official Journal of the European Union from 2004 to 2011.

The corpus is available for download from the CLARIN:el repository.

Download

JRC-Acquis Multilingual Parallel Corpus

Size: 1 billion tokens
Annotation: tokenised, sentence aligned
Licence: Usage Conditions

22 languages

This corpus contains legislative and legal texts from the Acquis Communautaire from various periods beginning in the 1950s.

The corpus is available for download from the webpage of the European Commission.

For the relevant publication, see Steinberger et al. (2014).

Download

A parallel corpus collected from the European Constitution

Size: 3 million tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

21 languages

This corpus contains European Constitution documents.

The corpus is available for download through the CLARIN:el repository.

Download

Europarl Parallel Corpus

Size: 650,000 tokens
Annotation: tokenised, sentence aligned
Licence: CC-ZERO

21 languages

This corpus contains debates of the European Parliament from 1996 to 2011.

The corpus is available for download from the corpus webpage.

Download

ECDC Translation Memory

Size: 320,000 tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

Approx. 20 languages

This corpus contains texts from the public health domain.

The corpus is available for download from the CLARIN:el repository.

Download

EMEA Corpus

Size: 31 million tokens
Annotation: sentence aligned
Licence: Open For Reuse With Restrictions

Approx. 20 languages

This corpus contains documents of the European Medicines Agency.

The corpus is available for download from the CLARIN:el repository.

Download

DGT-Translation Memory

Size: 10.1 million tokens
Annotation: tokenised
Licence: Open For Reuse With Restrictions

Approx. 20 languages

This corpus contains legislative texts of the European Legislation.

The corpus is available for download from the CLARIN:el repository.

Download

European Central Bank parallel corpus

Size: 757 million tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

19 languages

This corpus contains texts from the European Central Bank.

The corpus is available for download from the CLARIN:el repository.

Download

Opus, Helsinki Korp Version

Size: 2.7 billion tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

16 languages

This is a multilingual variant of the OPUS corpus that contains texts in the following languages: Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Russian, Swedish, Spanish, and Turkish.

The corpus is available through the concordancer Korp.

Concordancer

MULTEXT-East "1984" annotated corpus 4.0

Size: 1.06 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

11 languages

This corpus contains George Orwell’s 1984 original novel in English and its translations into the following languages: Bulgarian, Czech, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, and Slovenian.

The corpus is available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2012).

Download

ParaCrawl Corpus version 1.0

Licence: CC Zero

11 languages

This corpus contains webcrawled data in the following languages: Czech, Dutch, English, Estonian, Finnish, French, German, Italian, Latvian, Polish, Portuguese, Romanian, Russian, and Spanish.

The corpus is available for download from LINDAT. Additionally, the 2.0 version of the corpus, which includes six new languages (Irish, Croatian, Maltese, Lithuanian, Hungarian, and Estonian), can be downloaded from the corpus's dedicated website.

Download

MLCC Multilingual and Parallel Corpora

Size: 10.2 million tokens
Annotation: tokenised
Licence: ELRA END USER

9 languages

This corpus contains articles from the Official Journal of the European Communities from 1986 to 1994 in the following languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese, and Spanish.

The corpus is available for download from the ELRA catalogue.

Download

SETimes

Size: 43 million tokens
Annotation: partially sentence aligned
Licence: CC-BY

9 languages

This corpus contains texts from the setimes.com website.

The corpus is available for download from the CLARIN:EL repository.

For the relevant publication, see Tyers and Alperen (2010).

Download

ACCURAT balanced test corpus for under resourced languages

Size: 4,608 sentences
Annotation: sentence aligned
Licence: CC-BY

7 languages

This corpus contains texts in Greek, Slovenian, Romanian, Latvian, Estonian, Croatian, and Lithuanian.

The corpus is available for download from the CLARIN:el repository.

Download

The CLUVI parallel corpus

Size: 23 million tokens
Annotation: tokenised
Licence: CC-BY

6 languages

This corpus contains fictional, legal, scientific, computational, and administrative texts from 2003 to 2012 in the following language combinations: English-Galician, Galician-Spanish, French-Galician, English-Galician-French-Spanish, and Spanish-Catalan-Basque.

The corpus is available for download from a dedicated webpage.

Download

Europarl QTLeap WSD/NED corpus

Size: 52 million tokens
Annotation: tokenised, WSD, NER, CR-tagged
Licence: CC-BY

6 languages

This corpus contains debates of the European Parliament in the following language pairs: Bulgarian-English, Czech-English, Portuguese-English, Spanish-English, and Basque-English.

The corpus is available for download from LINDAT.

Download

MultiJur: Multilingual Parallel Corpus of Legal Texts

Size: 1.2 million tokens
Annotation: paragraph aligned
Licence: CLARIN PUB

5 languages

This corpus contains international conventions and treaties in the following languages: English, Russian, German, Finnish, and Swedish.

The corpus is available through the concordancer Korp.

Concordancer

GLOSSOLOGIA

Licence: CC-BY

4 languages

This corpus contains articles from Glossologia, a journal of general and historical Greek linguistics, in French, Greek, English, and German.

The corpus is available for download from the CLARIN:el repository.

Download

MULCOLD - Multilingual Corpus of Legal Documents

Size: 1.2 million tokens
Annotation: tokenised, paragraph aligned, PoS-tagged, lemmatized
Licence: CC-BY

4 languages

This corpus contains international conventions and treaties in Russian, English, Swedish, and Finnish.

The corpus is available through the concordancer Korp.

Concordancer

SPC - Stockholm Parallel Corpora

Size: 1.32 million tokens
Annotation: tokenised, sentence aligned
Licence: Open For Reuse With Restrictions

4 languages

This corpus contains legal texts in English, Afrikaans, Chinese, and Greek.

The corpus is available for download from the CLARIN:el repository.

Download

Civitas Gentium

Size: 31 articles
Licence: CC-BY

3 languages

This corpus contains scientific papers and book reviews in English, Greek, and French.

The corpus is available for download from the CLARIN:el repository.

Download

CRATER 2 Corpus

Size: 4 million tokens
Annotation: tokenised, morphosyntactically tagged
Licence: ELRA END USER/ELRA VAR

3 languages

This corpus contains texts from the telecommunications domain.

The corpus is available for download from the ELRA catalogue.

Download

CsEnVi Pairwise Parallel Corpora

Size: 31 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

3 languages

This corpus contains TED talks and subtitles from the CLUVI corpus in Vietnamese, Czech, and English.

The corpus is available for download from LINDAT.

Download

The DPC – Dutch Parallel Corpus

Size: 10.8 million tokens
Annotation: tokenised, sentence aligned
Licence: CLARIN ACA

3 languages

This corpus contains fictional, journalistic, instructive and administrative texts in English, Dutch, and French.

The corpus is available for download (after registration) from the Dutch Language Institute.

For the relevant publication, see Macken et al. (2007).

Download

EuroParl-UdS

Annotation: sentence aligned
Licence: CC-BY-NC-SA 4.0

3 languages

The corpus contains parliamentary debates of the European Parliament. A subset is a parallel corpus for the following language combinations: English-German and English-Spanish.

The corpus is available for download from a CLARIN-D repository.

Download

European Parliament Interpretation Corpus (EPIC)

Size: 177,000 tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: ELRA END USER

3 languages

This corpus contains debates of the European Parliament in Italian, English, and Spanish, with translations in all possible combinations.

The corpus is available for download from the ELRA catalogue.

Download

MUSA Multilingual Multimodal Corpus

Size: 1.2 million words
Annotation: subtitle alignment
Licence: Academic

3 languages

This parallel multimodal corpus contains material in English, Greek, and French.

The corpus is distributed by CLARIN:EL.

 

PANACEA English-French and English-Greek parallel corpus

Licence: ELRA END USER

3 languages

This corpus contains environmental and legislative texts in English and their French and Greek translations.

The corpus is available for download from the ELRA catalogue.

Download

Polish-Bulgarian-Russian Parallel Corpus

Size: 55 texts
Licence: IS PAS corpora license

3 languages

This corpus is available for download from the CLARIN-PL repository.

Download

UMC 0.1: Czech-Russian-English Multilingual Corpus

Size: 1.8 million tokens
Annotation: tokenised, sentence aligned
Licence: CC-BY

3 languages

This corpus contains news articles and commentaries in Czech, Russian, and English from the Project Syndicate website from 1995 to 2008.

The corpus is available for download from LINDAT and through the concordancer Korp.

Concordancer

Download

Other parallel corpora

Bilingual corpora

Corpus Language Description Availability

ParCor - A Parallel Pronoun-Coreference Corpus

Annotation: pronoun coreference

English-German

This corpus contains TED talks and EU Bookshop publications.

The corpus is available for download from the OPUS webpage.

For the relevant publication, see Guillou et al. (2014).

Download

Parallel English-Irish corpus of legal texts

Annotation: sentence aligned 

English-Irish

This corpus contains legal texts.

The corpus is available through a dedicated concordancer.

Concordancer

The NAACL 2003 English-Romanian corpus

Size: 1.6 million tokens
Licence: MS-BY-NC-ND

English-Romanian

The corpus contains texts from 2003.

For access, contact the resource managers.

The English-Swedish Parallel Corpus

Size: 3.5 million tokens
Annotation: tokenised, paragraph aligned

English-Swedish

This corpus contains fictional and non-fictional texts. It is bidirectional.

For access, contact the resource managers.

Estonian Open Parallel Corpus 2012. Estonian-English

Size: 2.5 million tokens
Annotation: tokenised
Licence: CC-BY

Estonian-English

This corpus contains Biblical and legal texts.

The corpus is available for download from META-SHARE.

Download

SzegedParalell: angol-magyar párhuzamos korpusz (English-Hungarian parallel corpus)

 

English-Hungarian

This corpus contains literary texts and texts on the European Union.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Tóth et al. (2002).

Download

The TRIS corpus

Size: 1.76 million tokens
Annotation: tokenised, sentence-aligned

German-Spanish

This corpus contains texts from the European Commission from 1997 to 2010.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Parra Escartín (2012).

Download

LILA parallel corpus

Size: 8 million tokens
Annotation: tokenised, sentence-aligned

Lithuanian-Latvian

This corpus contains fictional and non-fictional texts from 1991 to 2012. It is bidirectional.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Utka et al. (2012).

Concordancer

Manually aligned CES Polish-English parallel corpus

Size: 1.4 million tokens
Annotation: tokenised, sentence-aligned
Licence: CC-BY

Polish-English

This corpus contains CES reports.

The corpus is available for download from a dedicated webpage.

Download

Slovak-English Parallel Corpus

Size: 556 million tokens
Annotation: tokenised, sentence-aligned
Licence: proprietary

Slovak-English

This corpus contains texts from language books. It is bidirectional.

The corpus is available through a dedicated concordancer.

Concordancer

Multilingual corpora

Corpus Language Description Availability

OPUS corpus

Size: A great many subcorpora
Annotation: sentence-aligned
Licence: CC-BY

Approx. 100 languages

This corpus contains various subcorpora that compile texts from a great number of domains, such as literary texts, political documents, subtitles, UN documents, and the debates of the European Parliament.

The corpus is available for download from a dedicated webpage and through a dedicated concordancer.

For the relevant publication, see Tiedemann (2009).

Concordancer

Download

Bulgarian-X language Parallel Corpus2

Size: 1.2 billion tokens
Annotation: tokenised 
Licence: CC-BY

50 languages

This corpus is a part of the Bulgarian National Corpus.

The corpus is available through a dedicated concordancer.

Concordancer

EUbookshop

Size: 3.5 billion tokens
Annotation: tokenised, sentence-aligned

48 languages

This corpus contains texts from EU law books and related publications.

The corpus is available for download from the OPUS webpage.

For the relevant publication, see Skadiņš et al. (2014).

Download

PELCRA multilingual parallel corpora

Size: 143 million tokens
Annotation: tokenised, sentence-aligned
Licence: CC-BY

25 languages

This corpus contains texts from the CORDIS and RAPID websites, and the press releases of the European Parliament and the European Southern Observatory.

The corpus is available for download from .

Download

TED-Parallel-Corpus

Size: 300,000 sentences

11 languages

This corpus contains TED talks in English and translations into the following languages: Arabic, Simplified Chinese, Traditional Chinese, Dutch, French, German, Hebrew, Italian, Japanese, Korean, and Russian.

The corpus is available for download from GitHub.

Download

SETimes

Size: 43 million tokens
Annotation: partially sentence-aligned
Licence: CC-BY

10 languages

This corpus contains texts from the setimes.com website.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Tyers and Alperen (2010).

Download

The United Nations Parallel Corpus

Size: 335 million tokens
Annotation: tokenised

6 languages

This corpus contains the official records and other parliamentary documents of the United Nations that are in the public domain in the following languages: English, Russian, Spanish, French, Chinese, and Arabic.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Ziemski et al. (2016).

Download

μtopia

Size: 1.5 million tokens
Annotation: tokenised

6 languages

This corpus contains tweets and blogposts in the following language pairs: English-Mandarin, English-Arabic, English-Russian, English-Korean, and English-Japanese.

The corpus is available for download from a dedicated webpage.

Download

QTLeap Corpus V1.2

Size: 140,000 tokens
Annotation: sentence-aligned
Licence: CC-BY

5 languages

This corpus contains texts related to computer and IT troubleshooting for the following language pairs: Bulgarian-English, Czech-English, Portuguese-English, Spanish-English, and Basque-English.

The corpus is available for download from META-SHARE under the CC-BY licence.

Download

Parallel Wiki

Licence: CC-BY

4 languages

This corpus contains Wikipedia texts in the following language pairs: English-German, English-Romanian, and English-Spanish.

For access, contact the resource managers.

QTLeap News Corpus

Size: 1,104 sentences
Annotation: sentence-aligned
Licence: CC-BY

4 languages

This corpus contains news articles in the following language pairs: English-Czech, English-German, and English-Spanish.

For access, contact the resource managers.

Scielo corpus

 

4 languages

This corpus contains scientific articles from the Scielo database in the following language pairs: English-French, English-Spanish, and English-Portuguese.

For access, contact the resource managers.

MultiUN: Multilingual UN Parallel Text 2000–2009

Size: 1 billion tokens
Annotation: tokenised, sentence-aligned

3 languages

This corpus contains texts from the United Nations website from 2000 to 2009 in the following language pairs: Spanish-Chinese, Chinese-Spanish, French-Chinese, and Chinese-French.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Eisele and Chen (2010).

Download

REVEAL-THIS Corpus

Size: 325,000 words
Licence: under negotiation

3 languages

This is a multilingual corpus of English, French and Greek.

For access, contact the resource managers.

REVISTA PESQUISA FAPESP PARALLEL CORPORA

Size: 150,000 sentences
Annotation: sentence- and word-aligned

3 languages

This corpus contains texts from the Brazilian magazine REVISTA PESQUISA FAPESP in the following language pairs: Portuguese-English and Portuguese-Spanish.

The corpus is available for download from the corpus webpage.

Download

Publications on the parallel corpora

[Barkarson and Steingrímsson 2019] Starkaður Barkarson and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus.

[Bojar et al. 2016] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered.

[Čermák and Rosen 2012] František Čermák and Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3): 411–427.

[Eisele and Chen 2010] Andreas Eisele, Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nations Documents.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Frankenberg-Garcia and Santos 2003] Ana Frankenberg-Garcia and Diana Santos. 2003. Introducing COMPARA, the Portuguese-English parallel corpus.

[Guillou et al. 2014] Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, Bonnie Webber. 2014. ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical .

[Macken et al. 2007] Lieve Macken, Julia Trushkina, Hans Paulussen, Lidia Rura, Piet Desmet, Willy Vandeweghe. 2007. Dutch Parallel Corpus: A Multilingual Annotated Corpus.

[Parra Escartín 2012] Carla Parra Escartín. 2012. Design and compilation of a specialized Spanish-German parallel corpus.

[Skadiņš et al. 2014] Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus.

[Steinberger et al. 2014] Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, Signe Gilbro. 2014. An overview of the European Union's highly multilingual parallel corpora.

[Tiedemann 2009] Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. 

[Tóth et al. 2002] Krisztina Tóth, Richárd Farkas, András Kocsor. 2002. Sentence Alignment of Hungarian-English parallel corpora using a hybrid algorithm.

[Tyers and Alperen 2010] Francis M. Tyers and Murat Serdar Alperen. 2010. South-Eastern European Times: A parallel corpus of Balkan languages. 

[Utka et al. 2012] Andrius Utka, Kristine Levane-Petrova, Agne Bielinskiene, Jolanta Kovalevskaite, Erika Rimkute, Daira Vevere. 2012. Lithuanian-Latvian-Lithuanian Parallel Corpus.

[Ziemski et al. 2016] Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0.