Parallel Corpora | CLARIN ERIC

Parallel corpora are central to translation studies and contrastive linguistics. Many of the parallel corpora are accessible through easy-to-use concordancers, which considerably facilitate the study of interlinguistic phenomena. Such corpora are also a rich source of materials for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.

The parallel corpora are our largest resource family, as the CLARIN infrastructure provides access to 82 parallel corpora, the majority of which are available for download from national repositories as well as through concordancers such as Korp, Corpuscle, and KonText. There are 40 bilingual corpora in the CLARIN infrastructure, mostly containing European language pairs but also non-European languages such as Hindi, Tamil, and Vietnamese. 41 corpora are multilingual, with 5 containing texts in more than 50 languages. Almost half of the corpora are sentence-aligned, which allows for easy comparative research.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email at resource-families [at] clarin.eu.

Parallel Corpora in the CLARIN Infrastructure

Bilingual Corpora

Corpus	Language	Description	Availability
Croatian-English parallel corpus hrenWaC 2.0 Size: 1.6 million sentences, 55 million words Annotation: sentence-aligned Licence: CLARIN.SI User License for Internet Corpora	Croatian-English	This corpus contains texts crawled from top-level Croatian .hr domains. The corpus was built with Spidextor, a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 80% and on the word level around 84%. The corpus is available for download from the CLARIN.SI repository.	Download
Czech and English abstracts of ÚFAL papers Size: 1556 entries, 12,000 sentences, 200,000 words Annotation: tokenised, document-aligned Licence: CC-BY	Czech-English	This corpus contains abstracts of papers published by authors from the Institute of Formal and Applied Linguistics, Charles University, as reported in the institute's system Biblio. No filtering was performed, except for removing entries missing the Czech or English abstract, and replacing newline and tabulator characters with spaces. The corpus is available for download from the LINDAT repository.	Download
Czech-English Manual Word Alignment Size: 113,000 tokens, 2500 sentences Annotation: tokenised, word-aligned (manually) Licence: CC-BY	Czech-English	This corpus contains texts from e-books, Reader’s Digest, the Kačenka magazine, Acquis Communautaire, the Project Syndicate and the PCEDT project. The corpus is available for download from the LINDAT repository.	Download
CzEng 2.0 Size: 702 million tokens Annotation: tokenised, sentence-aligned Licence: CC-BY	Czech-English	This corpus is bidirectional, with original texts in English and Czech and accompanying translations. CzEng 2.0 is composed of authentic and synthetic parallel data. The authentic part contains filtered CzEng 1.6 and six additional resources: Europarl, Paracrawl, Common Crawl, News Commentary, Tilde MODEL, Wiki Titles, WikiMatrix, which was downloaded from WMT 2020. The corpus is available for download from a dedicated website. For the relevant publication, see Bojar et al. (2016).	Download
Czech-Slovak Parallel Corpus Size: 5.7 million sentences Annotation: automatic morphological annotation Licence: CC-BY	Czech-Slovak	This corpus contains legal texts (Acquis), parliamentary debates (from the Europarl corpus), articles from the Official Journal of the European Union, and texts from the OPUS corpus. The corpus is available for download from the LINDAT repository.	Download
Tourism English-Croatian Parallel Corpus 2.0 Size: 140,000 tokens Annotation: tokenised, sentence-aligned Licence: CLARIN.SI User Licence for Internet Corpora	English-Croatian	This corpus contains automatically crawled texts from 25 tourist websites. The corpus is available for download from the CLARIN.SI repository.	Download
English-Czech Corpus from Wikipedia Size: 7.5 million tokens Annotation: tokenised, sentence-aligned Licence: CC-BY	English-Czech	This corpus contains Wikipedia articles translated from English into Czech. The corpus is available for download from the LINDAT repository.	Download
Kacenka Size: 3.3 million tokens Annotation: tokenised	English-Czech	This corpus contains both fictional and non-fictional texts. Most of the English texts for KACENKA were retrieved from Internet resources. The rest (and nearly all the Czech texts) had to be scanned from fiction books (e.g., Czech translations of The Jungle Book by Rudyard Kipling, Lucky Jim by Kingsley Amis, and Sons and Lovers by D.H. Lawrence, among others) with the use of the OCR programme ProLector 1.2.
Text Corpus - EMEL Size: 43,000 tokens Annotation: tokenised Licence: CC-BY	English-French	This corpus contains NLP conference papers. The corpus is available for download from the CLARIN:el repository.	Download
aformes Size: 376,250 tokens Annotation: tokenised Licence: CC-BY	English-Greek	This corpus contains articles from a journal of undergraduate creative writing at an English department in Greece. The corpus is available for download from the CLARIN:el repository.	Download
Interlingual Perspectives Size: 18 articles Licence: CC-BY	English-Greek	This corpus contains research articles published from 2010 onwards focusing on the interaction of Greek with other languages through translation. The corpus is available for download from the CLARIN:el repository.	Download
QTLP English-Greek Corpus for the AUTOMOTIVE domain Size: 2,946 sentence pairs Annotation: sentence aligned Licence: MS-NC-NoReD	English-Greek	This corpus contains automatically detected pairs of parallel documents that were acquired from the web (i.e. from multilingual sites which contain content in the targeted languages and domain). The majority of the crawled sites were: i) websites of automobile manufacturers and ii) websites of companies that produce car accessories or car parts.	For access, contact the resource managers.
QTLP English-Greek Corpus for the MEDICAL domain Size: 62,452 sentence pairs Annotation: sentence aligned Licence: MS-NC-NoReD	English-Greek	This corpus contains automatically detected pairs of parallel documents that were acquired from the web (i.e. from multilingual sites which contain content in the targeted languages and domain). The majority of the crawled sites were: i) websites that contain abstracts of scientific papers and ii) websites of organizations from the public or private sector that are related to medical/health services (e.g. medical centers, institutes, hospitals, etc.).	For access, contact the resource managers.
HindEnCorp 0.5 Size: 132,300 sentences Annotation: sentence-aligned Licence: CC-BY	English-Hindi	This corpus contains TED talks, news articles, Wikipedia articles, etc. The corpus is available for download from LINDAT and can be queried through KonText.	Concordancer Download
English-Luganda Parallel Corpus Size: 150 sentences Annotation: word-aligned	English-Luganda	This corpus contains Biblical scripture (150 manually annotated sentences from the Gospel of Luke, 1:1 to 3:18). The English text is from the King James Bible, whereas the Luganda text is taken from the online Luganda Bible. The corpus is available for download from a dedicated webpage.	Download
UP/TAP annotated by the OpenNLP Part-of-Speech Tagger (Portuguese) and OpenNLP Part-of-Speech Tagger (English) Size: 31,849 sentences Annotation: PoS-tagged, sentence aligned Licence: CC-BY	English-Portuguese	This parallel corpus contains texts extracted from the TAP UP magazine. The corpus is available for download from the CLARIN:EL repository.	Download
The English-Slovak Parallel corpus Annotation: automatic morphological annotation Licence: CC-BY NC-SA 3.0	English-Slovak	This corpus contains legal texts (Acquis), parliamentary debates (from the Europarl corpus), articles from the Official Journal of the European Union, and texts from the OPUS corpus. The corpus is available for download from the LINDAT repository.	Download
The Corpus of Free Trade Agreement Size: 3 million tokens Annotation: tokenised Licence: CLARIN ACA	English-Spanish	This corpus contains texts on the Free Trade Agreement. The corpus is available through the concordancer Corpuscle.	Concordancer
EnTam: An English-Tamil Parallel Corpus (EnTam v2.0) Size: 169,871 sentences Annotation: sentence-aligned Licence: CC-BY	English-Tamil	This corpus contains news articles and texts related to film. The corpus is available for download from LINDAT.	Download
English-Urdu Religious Parallel Corpus Size: 14,371 sentences Annotation: tokenised, sentence-aligned Licence: CC-BY	English-Urdu	This corpus contains religious texts (the Bible and the Quran). The corpus is available for download from LINDAT.	Download
Estonian-English parallel corpus Size: 307,000 sentences Annotation: sentence-aligned Licence: CLARIN ACA	Estonian-English	This corpus contains Estonian laws and their translations into English and EU legislation translated into Estonian. The corpus is available for download from a dedicated webpage.	Download
Finnish-English parallel corpus fienWaC 1.0 Size: 2.9 million tokens Annotation: tokenised, sentence-aligned Licence: CLARIN.SI User License for Internet Corpora	Finnish-English	This corpus contains texts crawled from top-level Finnish .fi domains. The corpus is available for download from the CLARIN.SI repository.	Download
ParFin Size: 360,000 tokens Annotation: tokenised, sentence-aligned Licence: CLARIN RES	Finnish-Russian	This corpus contains literary texts from 1990 to 2010. The corpus is available through the concordancer Korp.	Concordancer
The KOTUS Finnish-Swedish Parallel Corpus Size: 4.3 million tokens Annotation: tokenised, sentence-aligned Licence: CC-BY	Finnish-Swedish	This corpus contains corporate press releases, surveys, reports, laws and regulations, as well as governmental proposals from 1993 to 2004. The corpus is available for download from FIN-CLARIN and through the concordancer Korp.	Concordancer Download
FREL Size: 701,401 tokens Annotation: tokenised Licence: under negotiation	French-Greek	This corpus contains literary texts translated from French to Greek.
Parallel corpus newsletters IFT FR-GR Licence: CC-BY	French-Greek	This corpus contains IFT newsletters. The corpus is available for download from the CLARIN:el repository.	Download
QTLP German-Greek Corpus for the MEDICAL domain Size: 2,752 pairs of sentences Annotation: sentence aligned Licence: MS-NC-NoReD	German-Greek	This corpus contains medical texts. Almost all of the documents were acquired from the official site of the European Union.	For access, contact the resource managers.
Greek-Bulgarian Bul-TM parallel corpus Size: 10 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	Greek-Bulgarian	This corpus contains societal and political texts. The corpus is available for download through the CLARIN:el repository.	Download
European Parliament Proceedings Parallel Corpus 1996-2011, parallel corpus Greek-English Size: 1.2 million sentences Annotation: sentence-aligned Licence: CC-ZERO	Greek-English	This corpus contains debates of the European Parliament from 1996 to 2011. The corpus is available for download from the CLARIN:el repository.	Download
INTERA Corpus - the Greek-English part Size: 4 million tokens Annotation: sentence aligned Licence: CC-BY	Greek-English	This corpus contains texts from the law, education, environment, tourism and health domains. The corpus is available for download from the CLARIN:el repository.	Download
ParIce Size: 3,589,000 sentence pairs Annotation: tokenised, PoS-tagged, sentence-aligned, word-aligned Licence: CC-BY 4.0	Icelandic-English	This corpus contains Icelandic and English texts from 11 different sources. The corpus is available for download from CLARIN-IS and for search through the concordancer Korp. For the relevant publication, see Barkarson and Steingrímsson (2019)	Concordancer Download
Parallel corpus MaCoCu Annotation: annotated with extensive metadata Licence: CC0-No Rights Reserved	Multilingual	These corpora contain web texts and were built by crawling national internet top-level domains (specified below) and by dynamically extending the crawl to other domains. The entire crawling process was carried out by the MaCoCu crawler. Websites containing documents in both target languages were identified and processed using the tool Bitextor. Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicate paragraphs, as well as documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer and Bicleaner AI were used for fixing, cleaning, and deduplicating the final version of the corpus. The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. When relevant, in each format, the texts are separated based on the script into two files: a Latin and a Cyrillic subcorpus. TMX is an XML-based format and TXT is a tab-separated format. They both consist of pairs of source and target segments (one or several sentences) and additional metadata. 
The following metadata is included in both sentence-level formats: - source and target document URL; - paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, which means “paragraph 35 out of 77, sentence 1 out of 3”); - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, given as a score between 0 and 1); - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1); - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments; - translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified by using a probabilistic model, which also determines if the translation has been produced by a machine-translation system; - a DSI class (“dsi”): information on whether the segment is connected to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal), defined by the Connecting Europe Facility; - English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier) was identified on document and domain level. Furthermore, the sentence-level TXT format provides additional metadata: - web domain of the text; - source and target document title; - the date when the original file was retrieved; - the original type of the file (e.g., “html”), from which the sentence was extracted; - paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool); - information on whether the sentence is a heading or not in the original document. 
The document-level TXT format provides pairs of documents identified as containing parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American). Compared with the previous version, the corpora in version 2.0 have more accurate metadata on the languages of the texts, which was achieved by using Google's Compact Language Detector 2 (CLD2), a high-performance language detector supporting many languages. Other tools used for web corpora creation and curation have been updated as well, resulting in an even cleaner corpus. The new version also provides additional metadata, such as the position of the sentence in the paragraph and document, and information on whether the sentence is related to a DSI. Moreover, the corpus is now also provided in a document-level format. The ALBANIAN-ENGLISH parallel corpus MaCoCu-sq-en 1.0 was built by crawling the “.al” internet top-level domain in 2022. The BOSNIAN-ENGLISH parallel corpus MaCoCu-bs-en 1.0 was built by crawling the “.ba” internet top-level domain in 2021 and 2022. The BULGARIAN-ENGLISH parallel corpus MaCoCu-bg-en 2.0 was built by crawling the “.bg” and “.бг” internet top-level domains in 2021. The CATALAN-ENGLISH parallel corpus MaCoCu-ca-en 1.0 was built by crawling the “.cat”, “.es”, “.ad”, “.fr”, “.it” and “.eu” internet top-level domains in 2022. The CROATIAN-ENGLISH parallel corpus MaCoCu-hr-en 2.0 was built by crawling the “.hr” internet top-level domain in 2021 and 2022. 
The GREEK-ENGLISH parallel corpus MaCoCu-el-en 1.0 was built by crawling the “.gr”, “.ελ”, “.cy” and “.eu” internet top-level domains in 2023. The ICELANDIC-ENGLISH parallel corpus MaCoCu-is-en 2.0 was built by crawling the “.is” internet top-level domain in 2021. The MACEDONIAN-ENGLISH parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021. The MALTESE-ENGLISH parallel corpus MaCoCu-mt-en 2.0 was built by crawling the “.mt” internet top-level domain in 2021. The MONTENEGRIN-ENGLISH parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022. The SERBIAN-ENGLISH parallel corpus MaCoCu-sr-en 1.0 was built by crawling the “.rs” and “.срб” internet top-level domains in 2021 and 2022. The SLOVENE-ENGLISH parallel corpus MaCoCu-sl-en 2.0 was built by crawling the “.si” internet top-level domain in 2021 and 2022. The TURKISH-ENGLISH parallel corpus MaCoCu-tr-en 2.0 was built by crawling the “.tr” and “.cy” internet top-level domains in 2021. The UKRAINIAN-ENGLISH parallel corpus MaCoCu-uk-en 1.0 was built by crawling the “.ua” and “.укр” internet top-level domains in 2022. The corpora are available for download from the Slovenian repository CLARIN.SI. For the relevant publication, see Bañón et al. (2022).	Download (Albanian-English) Download (Bosnian-English) Download (Bulgarian-English) Download (Catalan-English) Download (Croatian-English) Download (Modern Greek-English) Download (Icelandic-English) Download (Macedonian-English) Download (Maltese-English) Download (Montenegrin-English) Download (Serbian-English) Download (Slovenian-English) Download (Turkish-English) Download (Ukrainian-English)
The Norwegian-Spanish Parallel Corpus Size: 6 million tokens Annotation: tokenised Licence: CLARIN ACA	Norwegian-Spanish	This corpus contains fictional and non-fictional texts from 2000 to 2009. The corpus is available through the concordancer Corpuscle and for download in the CLARINO repository.	Concordancer Download
The Polish-Lithuanian Parallel Corpus Licence: IS PAS	Polish-Lithuanian	The corpus is available for download from the CLARIN-PL repository.	Download
COMPARA : Portuguese - English parallel translation corpus Annotation: sentence-aligned Licence: CC-BY	Portuguese-English	This corpus contains fictional texts and academic, newspaper and tourist articles. The corpus is available through a dedicated concordancer. For the relevant publication, see Frankenberg Garcia and Santos (2003).	Concordancer
QTLP Portuguese-Greek Corpus for the AUTOMOTIVE domain Size: 59,297 sentence pairs Annotation: sentence aligned Licence: MS-NC-NoReD	Portuguese-Greek	This corpus contains automatically detected pairs of parallel documents that were acquired from the web (i.e. from multilingual sites which contain content in the targeted languages and domain). The majority of the crawled sites were: i) websites of automobile manufacturers and ii) websites of companies that produce car accessories or car parts.	For access, contact the resource managers.
QTLP Portuguese-Greek Corpus for the MEDICAL domain Size: 62,608 sentence pairs Annotation: sentence aligned Licence: MS-NC-NoReD	Portuguese-Greek	This corpus contains medical texts. Almost all of the documents were acquired from the official site of the European Union.	For access, contact the resource managers.
ParRus Size: 5.9 million tokens Annotation: tokenised, paragraph-aligned Licence: CLARIN RES	Russian-Finnish	This corpus contains texts from classical and 20th century literature. The corpus is available through the concordancer Korp.	Concordancer
Serbian-English parallel corpus srenWaC 1.0 Size: 23.1 million tokens Annotation: tokenised Licence: CLARIN.SI User License for Internet Corpora	Serbian-English	This corpus contains texts crawled from top-level Serbian .rs domains. The corpus was built with Spidextor, a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext, given the evaluation results on other languages, can be estimated at 74% on the sentence level and 76% on the word level. The corpus is available for download from the CLARIN.SI repository.	Download
Slovene-English parallel corpus slenWaC 1.0 Size: 718,315 tokens Annotation: tokenised, sentence-aligned Licence: CLARIN.SI User License for Internet Corpora	Slovenian-English	This corpus contains texts crawled from top-level Slovenian .si domains. The corpus was built with Spidextor, a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 67% and on the word level around 68%. The corpus is available for download from the CLARIN.SI repository.	Download
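
The MaCoCu entry in the table above describes two machine-readable details: a paragraph ID such as “p35:77s1/3” (paragraph 35 out of 77, sentence 1 out of 3) and parallel documents stored in base64 in the document-level TXT format. The following is a minimal Python sketch of how such values might be read. The names ParagraphId, parse_paragraph_id and decode_document are hypothetical helpers introduced here for illustration, and the UTF-8 encoding of the decoded documents is an assumption, not something stated in the corpus description.

```python
import base64
import re
from typing import NamedTuple


class ParagraphId(NamedTuple):
    """Position of a sentence, as read from an ID such as 'p35:77s1/3'."""
    paragraph: int               # index of the paragraph in the document
    paragraphs_in_document: int  # total number of paragraphs in the document
    sentence: int                # index of the sentence in the paragraph
    sentences_in_paragraph: int  # total number of sentences in the paragraph


# Pattern inferred from the single example given in the description.
_PARAGRAPH_ID = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")


def parse_paragraph_id(value: str) -> ParagraphId:
    """Split a paragraph ID into its four numeric components."""
    match = _PARAGRAPH_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised paragraph ID: {value!r}")
    return ParagraphId(*(int(group) for group in match.groups()))


def decode_document(encoded: str) -> str:
    """Decode a base64-encoded document (UTF-8 is assumed here)."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(parse_paragraph_id("p35:77s1/3"))
    sample = base64.b64encode("Dobar dan".encode("utf-8")).decode("ascii")
    print(decode_document(sample))
```

IDs that deviate from the documented example raise a ValueError rather than being silently accepted, since the exact ID grammar should be checked against the corpus files themselves.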

Multilingual Corpora

Corpus	Language	Description	Availability
Tatoeba Size: 12 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	117 languages	This corpus contains texts from the Tatoeba website. The corpus is available for download from the CLARIN:el repository.	Download
Parallel Bible Corpus	Approx. 100 languages	This corpus contains historical and contemporary translations of the Bible.
A parallel corpus of KDE4 localization files (v.2) Size: 60 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	92 languages	This corpus contains KDE4 localization files. The corpus is available for download from the CLARIN:el repository.	Download
OpenSubtitles2011 Size: 8.31 billion tokens Annotation: tokenised, sentence and word aligned Licence: Open For Reuse With Restrictions	54 languages	This corpus contains subtitles from the OpenSubtitles website. The corpus is available for download from the CLARIN:el repository.	Download
EAC Translation Memory Size: 320,000 tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	50 languages	This corpus contains law documents and texts related to education and culture. The corpus is available for download through the CLARIN:el repository.	Download
Parallel Global Voices Size: 174,629 documents Annotation: sentence aligned Licence: CC-BY	Approx. 50 languages	This corpus contains texts crawled from the Global Voices webpage. The corpus is available for download from a dedicated webpage.	Download
InterCorp Size: 1.5 billion tokens Annotation: sentence aligned Licence: proprietary	40 languages	The corpus consists of two main parts: manually aligned fiction and a number of collections: political commentaries published by Project Syndicate and VoxEurop, EU legal texts from the Acquis Communautaire corpus, proceedings of the European Parliament from the Europarl corpus, film subtitles from the Open Subtitles database, and the Bible. The corpus is available primarily through the KonText concordancer. For research purposes, tailor-made linguistic data derived from the InterCorp corpus can be provided upon request. The contact e-mail is cnk [at] korpus.cz. For the relevant publication, see Čermák and Rosen (2012).	Concordancer
DGT-TM-2016 Size: 373 million tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	Approx. 30 languages	This corpus contains texts from the European Legislation. The corpus is available for download from the CLARIN:el repository.	Download
PELCRA multilingual parallel corpora Size: 143 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	25 languages	This corpus contains texts from the CORDIC and RAPID websites, and the press releases of the European Parliament and the European Southern Observatory. The corpus is available for download from the CLARIN:EL repository.	Download
DGT-Acquis Annotation: sentence aligned Licence: Open For Reuse With Restrictions	23 languages	This corpus contains articles from the Official Journal of the European Union from 2004 to 2011. The corpus is available for download from the CLARIN:el repository.	Download
JRC-Acquis Multilingual Parallel Corpus Size: 1 billion tokens Annotation: tokenised, sentence aligned Licence: Usage Conditions	22 languages	This corpus contains legislative and legal texts from the Acquis Communautaire from various periods beginning in the 1950s. The corpus is available for download from the webpage of the European Commission. For the relevant publication, see Steinberger et al. (2014).	Download
A parallel corpus collected from the European Constitution Size: 3 million tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	21 languages	This corpus contains European Constitution documents. The corpus is available for download through the CLARIN:el repository.	Download
Europarl Parallel Corpus Size: 650,000 tokens Annotation: tokenised, sentence aligned Licence: CC-ZERO	21 languages	This corpus contains debates of the European Parliament from 1996 to 2011. The corpus is available for download from the corpus webpage.	Download
ECDC Translation Memory Size: 320,000 tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	Approx. 20 languages	This corpus contains texts from the public health domain. The corpus is available for download from the CLARIN:el repository.	Download
EMEA Corpus Size: 31 million tokens Annotation: sentence aligned Licence: Open For Reuse With Restrictions	Approx. 20 languages	This corpus contains documents of the European Medicines Agency. The corpus is available for download from the CLARIN:el repository.	Download
DGT-Translation Memory Size: 10.1 million tokens Annotation: tokenised Licence: Open For Reuse With Restrictions	Approx. 20 languages	This corpus contains legislative texts of the European Legislation. The corpus is available for download from the CLARIN:el repository.	Download
European Central Bank parallel corpus Size: 757 million tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	19 languages	This corpus contains texts from the European Central Bank. The corpus is available for download from the CLARIN:el repository.	Download
Opus, Helsinki Korp Version Size: 2.7 billion tokens Annotation: tokenised, sentence aligned Licence: CC-BY	16 languages	This is a multilingual variant of the OPUS corpus that contains texts in the following languages: Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Russian, Swedish, Spanish, and Turkish. The corpus is available through the concordancer Korp.	Concordancer
MULTEXT-East "1984" annotated corpus 4.0 Size: 1.06 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	11 languages	This corpus contains George Orwell’s 1984 original novel in English and its translations into the following languages: Bulgarian, Czech, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, and Slovenian. The corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2012).	Download
ParaCrawl Corpus version 1.0 Licence: CC Zero	11 languages	This corpus contains webcrawled data in the following languages: Czech, Dutch, English, Estonian, Finnish, French, German, Italian, Latvian, Polish, Portuguese, Romanian, Russian, and Spanish. The corpus is available for download from LINDAT. Additionally, the 2.0 version of the corpus, which includes six new languages (Irish, Croatian, Maltese, Lithuanian, Hungarian, and Estonian), can be downloaded from the corpus's dedicated website.	Download
MLCC Multilingual and Parallel Corpora Size: 10.2 million tokens Annotation: tokenised Licence: ELRA END USER	9 languages	This corpus contains articles from the Official Journal of the European Communities from 1986 to 1994 in the following languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese, and Spanish. The corpus is available for download from the ELRA catalogue.	Download
SETimes Size: 43 million tokens Annotation: partially sentence aligned Licence: CC-BY	9 languages	This corpus contains texts from the setimes.com website. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Tyers and Alperen (2010)	Download
ACCURAT balanced test corpus for under resourced languages Size: 4,608 sentences Annotation: sentence aligned Licence: CC-BY	7 languages	This corpus contains texts in Greek, Slovenian, Romanian, Latvian, Estonian, Croatian, and Lithuanian. The corpus is available for download from the CLARIN:el repository.	Download
UFAL Parallel Corpus of North Levantine 1.0 Size: 844,200 sentences; 6.2 million words Annotation: sentence aligned Licence: CC BY-NC-SA 4.0	6 languages	This corpus contains multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic. The sentences were selected from the OpenSubtitles2018 corpus and manually translated into North Levantine Arabic. The corpus is available for download from LINDAT.	Download
Europarl QTLeap WSD/NED corpus Size: 52 million tokens Annotation: tokenised, WSD, NER, CR-tagged Licence: CC-BY	6 languages	This corpus contains debates of the European Parliament in the following language pairs: Bulgarian-English, Czech-English, Portuguese-English, Spanish-English, and Basque-English. The corpus is available for download from LINDAT.	Download
MultiJur: Multilingual Parallel Corpus of Legal Texts Size: 1.2 million tokens Annotation: paragraph aligned Licence: CLARIN PUB	5 languages	This corpus contains international conventions and treaties in the following languages: English, Russian, German, Finnish, and Swedish. The corpus is available through the concordancer Korp.	Concordancer
GLOSSOLOGIA Licence: CC-BY	4 languages	This corpus contains articles from Glossologia, a journal of general and historical Greek linguistics, in French, Greek, English, and German. The corpus is available for download from the CLARIN:el repository.	Download
MULCOLD - Multilingual Corpus of Legal Documents Size: 1.2 million tokens Annotation: tokenised, paragraph aligned, PoS-tagged, lemmatized Licence: CC-BY	4 languages	This corpus contains international conventions and treaties in Russian, English, Swedish, and Finnish. The corpus is available through the concordancer Korp.	Concordancer
SPC - Stockholm Parallel Corpora Size: 1.32 million tokens Annotation: tokenised, sentence aligned Licence: Open For Reuse With Restrictions	4 languages	This corpus contains legal texts in English, Afrikaans, Chinese, and Greek. The corpus is available for download from the CLARIN:el repository.	Download
Civitas Gentium Size: 31 articles Licence: CC-BY	3 languages	This corpus contains scientific papers and book reviews in English, Greek, and French. The corpus is available for download from the CLARIN:el repository.	Download
CRATER 2 Corpus Size: 4 million tokens Annotation: tokenised, morphosyntactically tagged Licence: ELRA END USER/ELRA VAR	3 languages	This corpus contains texts from the telecommunications domain. The corpus is available for download from the ELRA catalogue.	Download
CsEnVi Pairwise Parallel Corpora Size: 31 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	3 languages	This corpus contains TED talks and subtitles from the CLUVI corpus in Vietnamese, Czech, and English. The corpus is available for download from LINDAT.	Download
The DPC – Dutch Parallel Corpus Size: 10.8 million tokens Annotation: tokenised, sentence aligned Licence: CLARIN ACA	3 languages	This corpus contains fictional, journalistic, instructive and administrative texts in English, Dutch, and French. The corpus is available for download (after registration) from the Dutch Language Institute. For the relevant publication, see Macken et al. (2007).	Download
EuroParl-UdS Annotation: sentence aligned Licence: CC-BY-NC-SA 4.0	3 languages	The corpus contains parliamentary debates of the European Parliament. A subset is a parallel corpus for the following language combinations: English-German and English-Spanish. The corpus is available for download from a CLARIN-D repository.	Download
European Parliament Interpretation Corpus (EPIC) Size: 177,000 tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: ELRA END USER	3 languages	This corpus contains debates of the European Parliament in Italian, English, and Spanish, with translations in all possible combinations. The corpus is available for download from the ELRA catalogue.	Download
EPIC-UdS Size: 350,000 tokens, 20,000 sentences Annotation: tokenised, PoS-tagged, syntactically parsed, speech phenomena Licence: CC BY-NC-SA 4.0	3 languages	This is a parallel and comparable corpus of speeches held in the European Parliament; the corpus follows the European Parliament Interpreting Corpora tradition of the EPIC and EPICG corpora. It contains original speeches from 2008 to 2013 by English, German, and Spanish native speakers and their interpretation (English to and from German; Spanish to English). All transcripts in the corpus are based on videos of the European Parliament Proceedings published by the European Parliament. Annotation includes typical characteristics of spoken language such as false starts, hesitations and truncated words. To obtain better results for source-target alignment as well as sentence parsing the transcripts were segmented using a main clause approach: compound sentences were segmented separately. For the second version of the corpus, the transcripts were processed clause by clause with the spaCy tools; the data is encoded in CoNLL-U and provides universal PoS tags, fine-grained language-specific PoS tags as well as Universal Dependency syntactic relations. All data was enriched with relevant metadata such as source language, name of original speaker, speech timing, mode of delivery and delivery rate. The corpus is available for download from CLARIN-D (Saarland University B-centre). For the relevant publication, see Przybyl et al. (2022)	Download
MUSA Multilingual Multimodal Corpus Size: 1.2 million words Annotation: subtitle alignment Licence: Academic	3 languages	This parallel multimodal corpus contains material in English, Greek, and French. The corpus is distributed by CLARIN:EL.
PANACEA English-French and English-Greek parallel corpus Licence: ELRA END USER	3 languages	This corpus contains environmental and legislative texts in English and their French and Greek translations. The corpus is available for download from the ELRA catalogue.	Download
Polish-Bulgarian-Russian Parallel Corpus Size: 55 texts Licence: IS PAS corpora license	3 languages	This corpus is available for download from the CLARIN PL repository.	Download
UMC 0.1: Czech-Russian-English Multilingual Corpus Size: 1.8 million tokens Annotation: tokenised, sentence aligned Licence: CC-BY	3 languages	This corpus contains news articles and commentaries in Czech, Russian, and English from the Project Syndicate website from 1995 to 2008. The corpus is available for download from LINDAT and through the concordancer Korp.	Concordancer Download
Parallel sense-annotated corpus ELEXIS-WSD 1.1 Size: 345,092 tokens Annotation: tokenised, PoS-tagged, lemmatised, annotated for senses Licence: CC BY-SA 4.0	10 languages	This corpus is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. This corpus is available for download from the CLARIN.SI repository. For the relevant publication, see Martelli et al. (2021)	Download

Other Parallel Corpora

Bilingual Corpora

Corpus	Language	Description	Availability
ParCor - A Parallel Pronoun-Coreference Corpus Annotation: pronoun coreference	English-German	This corpus contains TED talks and EU Bookshop publications. The corpus is available for download from the OPUS webpage. For the relevant publication, see Guillou et al. (2014).	Download
Parallel English-Irish corpus of legal texts Annotation: sentence aligned	English-Irish	This corpus contains legal texts. The corpus is available through a dedicated concordancer.	Concordancer
The NAACL 2003 English-Romanian corpus Size: 1.6 million tokens Licence: MS-BY-NC-ND	English-Romanian	The corpus contains texts from 2003.	For access, contact the resource managers.
The English-Swedish Parallel Corpus Size: 3.5 million tokens Annotation: tokenised, paragraph aligned	English-Swedish	This corpus contains fictional and non-fictional texts. It is bidirectional. The corpus is not available.
Estonian Open Parallel Corpus 2012. Estonian-English Size: 2.5 million tokens Annotation: tokenised Licence: CC-BY	Estonian-English	This corpus contains Biblical and legal texts. The corpus is available for download from META-SHARE.	Download
SzegedParalell: angol-magyar párhuzamos korpusz	English-Hungarian	This corpus contains literary texts and texts on the European Union. The corpus is available for download from a dedicated webpage. For the relevant publication, see Tóth et al. (2002)	Download
PaGeS Size: Main part: 38 million tokens; 1.1 million bisegments (alignments). Supplements: 80 million tokens Annotation: sentence aligned, PoS-tagged, lemmatised Licence: Terms of Use	German-Spanish	This corpus is comprised of two major parts: the core corpus and the supplements. The core corpus is comprised of original texts in German and Spanish and their respective translations, as well as a small percentage (approx. 6%) of German and Spanish texts translated from a third language. The core corpus includes samples from 178 works of fiction (novels and short stories) as well as samples from non-fiction (essays and popular texts). The texts have been manually verified at different levels and the automatic alignment of the bisegments, performed by LF-Aligner, has been manually reviewed. The German texts have been lemmatized and PoS-tagged with Treetagger (part of the PoS taggers and lemmatizers Resource Family) and the Spanish texts with Freeling. The tags of both have been mapped to the Universal PoS tags. The supplements include so far: Europarl v7, a corpus that collects the proceedings (Verbatim reports) of the European Parliament from 1996 to 2011 (also part of the Parliamentary Corpora Resource Family); and Ted-Talks (part of this family), a corpus that collects the German and Spanish translations of the transcriptions of Ted-Talks from 2006 to 2020. The corpus is available for online browsing via a dedicated interface. For the relevant publication, see Doval et al. (2018)	Browse
The TRIS corpus Size: 1.76 million tokens Annotation: tokenised, sentence-aligned	German-Spanish	This corpus contains texts from the European Commission from 1997 to 2010. The corpus is available for download from a dedicated webpage. For the relevant publication, see Parra Escartín (2012).	Download
LILA parallel corpus Size: 8 million tokens Annotation: tokenised, sentence-aligned	Lithuanian-Latvian	This corpus contains fictional and non-fictional texts from 1991 to 2012. It is bidirectional. The corpus is available through a dedicated concordancer. For the relevant publication, see Utka et al. (2012).	Concordancer
Manually aligned CES Polish-English parallel corpus Size: 1.4 million tokens Annotation: tokenised, sentence-aligned Licence: CC-BY	Polish-English	This corpus contains CES reports. The corpus is available for download from a dedicated webpage.	Download
Slovak-English Parallel Corpus Size: 556 million tokens Annotation: tokenised, sentence-aligned Licence: proprietary	Slovak-English	This corpus contains texts from language books. It is bidirectional. The corpus is available through a dedicated concordancer.	Concordancer

Multilingual Corpora

Corpus	Language	Description	Availability
OPUS corpus Size: A great many subcorpora Annotation: sentence-aligned Licence: CC-BY	Approx. 100 languages	This corpus contains various subcorpora that compile texts from a great number of domains, such as literary texts, political documents, subtitles, UN documents, and the debates of the European Parliament. The corpus is available for download from a dedicated webpage and through a dedicated concordancer. For the relevant publication, see Tiedemann (2009)	Concordancer Download
Bulgarian-X language Parallel Corpus2 Size: 1.2 billion tokens Annotation: tokenised Licence: CC-BY	50 languages	This corpus is a part of the Bulgarian National Corpus. The corpus is available through a dedicated concordancer.	Concordancer
EUbookshop Size: 3.5 billion tokens Annotation: tokenised, sentence-aligned	48 languages	This corpus contains texts from EU law books and related publications. The corpus is available for download from the OPUS webpage. For the relevant publication, see Skadiņš et al. (2014)	Download
PELCRA multilingual parallel corpora Size: 143 million tokens Annotation: tokenised, sentence-aligned Licence: CC-BY	25 languages	This corpus contains texts from the CORDIS and RAPID websites, and the press releases of the European Parliament and the European Southern Observatory. The corpus is available for download.	Download
TED-Parallel-Corpus Size: 300,000 sentences	11 languages	This corpus contains TED talks in English and translations into the following languages: Arabic, Simplified Chinese, Traditional Chinese, Dutch, French, German, Hebrew, Italian, Japanese, Korean, and Russian. The corpus is available for download from GitHub.	Download
SETimes Size: 43 million tokens Annotation: partially sentence aligned Licence: CC-BY	10 languages	This corpus contains texts from the setimes.com website. The corpus is available for download from a dedicated webpage. For the relevant publication, see Tyers and Alperen (2010).	Download
The United Nations Parallel Corpus Size: 335 million tokens Annotation: tokenised	6 languages	This corpus contains the official records and other parliamentary documents of the United Nations that are in the public domain in the following languages: English, Russian, Spanish, French, Chinese, and Arabic. The corpus is available for download from a dedicated webpage. For the relevant publication, see Ziemski et al. (2016).	Download
μtopia Size: 1.5 million tokens Annotation: tokenised	6 languages	This corpus contains tweets and blogposts in the following language pairs: English-Mandarin, English-Arabic, English-Russian, English-Korean, and English-Japanese. The corpus is available for download from a dedicated webpage.	Download
QTLeap Corpus V1.2 Size: 140,000 tokens Annotation: sentence-aligned Licence: CC-BY	5 languages	This corpus contains texts related to computer and IT troubleshooting for the following language pairs: Bulgarian-English, Czech-English, Portuguese-English, Spanish-English, and Basque-English. The corpus is available for download from META-SHARE under the CC-BY license.	Download
Parallel Wiki Licence: CC-BY	4 languages	This corpus contains Wikipedia texts in the following language pairs: English-German, English-Romanian, and English-Spanish.	For access, contact the resource managers.
QTLeap News Corpus Size: 1,104 sentences Annotation: sentence-aligned Licence: CC-BY	4 languages	This corpus contains news articles in the following language pairs: English-Czech, English-German and English-Spanish.	For access, contact the resource managers.
Scielo corpus	4 languages	This corpus contains scientific articles from the Scielo database in the following language pairs: English-French, English-Spanish, and English-Portuguese.	For access, contact the resource managers.
MultiUN: Multilingual UN Parallel Text 2000—2009 Size: 1 billion tokens Annotation: tokenised, sentence-aligned	3 languages	This corpus contains texts from the United Nations website from 2000 to 2009 in the following language pairs: Spanish-Chinese, Chinese-Spanish, French-Chinese, and Chinese-French. The corpus is available for download from a dedicated webpage. For the relevant publication, see Eisele and Chen (2010).	Download
REVEAL-THIS Corpus Size: 325,000 words Licence: under negotiation	3 languages	This is a multilingual corpus of English, French and Greek.	For access, contact the resource managers.

Publications on the Parallel Corpora

[Barkarson and Steingrímsson 2019] Starkaður Barkarson and Steinþór Steingrímsson. 2019. Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus.

[Bojar et al. 2016] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered.

[Čermák and Rosen 2012] František Čermák and Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3): 411–427.

[Doval et al. 2018] Irene Doval, Santiago Fernández Lanza, Tomás Jiménez Juliá, Elsa Liste Lamas, and Barbara Lübke. 2018. Corpus PaGeS: A multifunctional resource for language learning, translation and cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New Resources and applications, 103–121.

[Eisele and Chen 2010] Andreas Eisele, Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nations Documents.

[Erjavec 2012] Tomaž Erjavec. 2012. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Frankenberg-Garcia and Santos 2003] Ana Frankenberg-Garcia and Diana Santos. 2003. Introducing COMPARA, the Portuguese-English parallel corpus.

[Guillou et al. 2014] Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, Bonnie Webber. 2014. ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical .

[Macken et al. 2007] Lieve Macken, Julia Trushkina, Hans Paulussen, Lidia Rura, Piet Desmet, Willy Vandeweghe. 2007. Dutch Parallel Corpus: A Multilingual Annotated Corpus.

[Parra Escartín 2012] Carla Parra Escartín. 2012. Design and compilation of a specialized Spanish-German parallel corpus.

[Skadiņš et al. 2014] Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis, Daiga Deksne. 2014. Billions of Parallel Words for Free: Building and Using the EU Bookshop Corpus.

[Steinberger et al. 2014] Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski, Signe Gilbro. 2014. An overview of the European Union's highly multilingual parallel corpora.

[Tiedemann 2009] Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces.

[Tóth et al. 2002] Krisztina Tóth, Richárd Farkas, András Kocsor. 2002. Sentence Alignment of Hungarian-English parallel corpora using a hybrid algorithm.

[Tyers and Alperen 2010] Francis M. Tyers and Murat Serdar Alperen. 2010. South-Eastern European Times: A parallel corpus of Balkan languages.

[Utka et al. 2012] Andrius Utka, Kristine Levane-Petrova, Agne Bielinskiene, Jolanta Kovalevskaite, Erika Rimkute, Daira Vevere. 2012. Lithuanian-Latvian-Lithuanian Parallel Corpus.

[Ziemski et al. 2016] Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0.