Corpora of Academic Texts

Corpora of academic texts contain scholarly writing: research papers, essays, and abstracts published in academic journals, conference proceedings, and edited volumes; theses written by undergraduate and graduate students; and scientific monographs.

The CLARIN ERIC infrastructure gives access to 24 corpora of academic texts, 2 of which are multilingual and 22 monolingual. The available corpora contain scholarly texts in the following 11 languages: Czech, English, Estonian, Finnish, French, German, Greek, Russian, Slovenian, Spanish, and Swedish. More than 15 different scholarly disciplines are represented, with the most prominent being linguistics, computer science, economics, and medicine. The majority of the corpora are richly tagged and are available under public licences.

We first provide an overview of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

Corpora of academic texts in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

Czech Sociological Review

Size: 3 million words 
Licence: MIT

Czech

This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format.

The corpus is available for download from the LINDAT repository.

Download

ACL Anthology Reference Corpus

Size: 75 million tokens 
Annotation: PoS-tagged, lemmatised, author/text metadata 
Licence: CC BY SA

English

This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format.

The corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website.

For the relevant publication, see Bird et al. 2008

Concordancer

Download

English Scientific Text Corpus

Size: 35 million tokens 
Annotation: PoS-tagged, lemmatised, author/text metadata, document structure 
Licence: restricted

English

This corpus contains journal articles in the following disciplines:

 

  • computer science,
  • computational linguistics,
  • informatics,
  • digital construction,
  • microelectronics,
  • linguistics,
  • biology,
  • mechanical engineering, and
  • electrical engineering.

 

The articles were published in the 1970s, 1980s, and 2000s.

The corpus is available for online querying through CQPWeb (CLARIN-D distribution).

For the relevant publication, see Degaetano-Ortlieb et al. 2013

Concordancer

GENIA corpus

Size: 437,000 words 
Annotation: PoS-tagged, syntactically parsed, annotated for terms, events, semantic relations and coreference; text metadata 
Licence: free but unspecified

English

This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB.

The corpus is available for download from PORTULAN.

For the relevant publication, see Su et al. 2008

Download

UH's English E-thesis corpus

Size: 200 million tokens 
Annotation: PoS-tagged, syntactically parsed 
Licence: CC BY

English

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

The Royal Society Corpus

Size: 32 million tokens 
Annotation: PoS-tagged, lemmatised, normalised, author and document metadata 
Licence: CC BY

English (late and early modern)

This corpus contains journal articles published in Philosophical Transactions of the Royal Society of London between 1665 and 1869.

The corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland.

For the relevant publication, see Kermes et al. 2016

Concordancer

Download

Corpus of Estonian scientific texts

Size: 5 million words 
Licence: CLARIN ACA-NC

Estonian

This corpus contains scientific articles and PhD theses. The corpus data are in the P5 format.

Download

UH's Finnish E-thesis corpus

Size: 12.5 million tokens 
Annotation: PoS-tagged, lemmatised 
Licence: CC BY

Finnish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Chambers-Le Baron Corpus of Research Articles

Size: 1 million words 
Annotation: No annotation 
Licence: Oxford Text Archive licence (academic use)

French

This corpus contains research papers in the following disciplines:

 

  • media/culture,
  • literature,
  • linguistics and language learning,
  • social anthropology,
  • law, economics,
  • sociology and social sciences,
  • philosophy,
  • history, and
  • communication.

 

The research papers were published between 1998 and 2006. This is a plain text corpus.

The corpus is available for download from the Oxford Text Archive.

Download

UH's French E-thesis corpus

Size: 580,000 tokens 
Licence: CC BY

French

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

UH's German E-thesis corpus

Size: 560,000 tokens 
Annotation: No annotation 
Licence: CC BY

German

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Modern Greek Dialects: scientific papers

Size: 113,000 words 
Licence: CC-BY-SA

Greek

This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus.

The corpus is available for download from the CLARIN:EL repository.

Download

OROSSIMO Corpus

Size: 2.5 million tokens 
Annotation: marked for term candidates, "mixed structural annotation" 
Licence: CC-BY

Greek

This corpus contains academic texts in the following disciplines:

  • social sciences,
  • computer science,
  • economics,
  • linguistics,
  • photography,
  • law,
  • engineering,
  • history,
  • astronomy,
  • earth sciences and geology,
  • medicine and health, and
  • biology.

 

The corpus is encoded in XML.

The corpus is available for download from the CLARIN:EL repository.

For the relevant publication, see Mantzari et al. 1999

Download

The Language of Literature and the Language of Translation (collected scientific papers)

Size: 48,300 words 
Licence: CC-BY-SA

Greek

This corpus contains journal articles in literary and translation studies. This is a plain text corpus.

The corpus is available for download from the CLARIN:EL repository.

Download

UH's Russian E-thesis corpus

Size: 1.1 million words 
Annotation: No annotation 
Licence: CC BY

Russian

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Corpus of Academic Slovene KAS 2.0

Size: 1.5 billion tokens 
Annotation: MSD-tagged, lemmatised, marked for bilingual and monolingual term candidates 
Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0

Slovenian

This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the format.

The corpus is available for download from CLARIN.SI. Version 1.0 is also available for online querying through noSketch Engine and KonText (CLARIN.SI distribution).

For the relevant publication, see Erjavec et al. 2021

Download

UH's Spanish E-thesis corpus

Size: 2.3 million tokens 
Annotation: No annotation 
Licence: CC BY

Spanish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Academic texts - humanities

Size: 14.5 million tokens 
Licence: CC BY

Swedish

This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.

The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).

Concordancer

Download

Academic texts - social science

Size: 10.8 million tokens 
Annotation: sentence segmentation 
Licence: CC BY

Swedish

This corpus contains academic texts from social sciences disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.

The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).

Concordancer

Download

UH's Swedish E-thesis corpus

Size: 105 million tokens 
Licence: CC BY

Swedish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Corpus of Slovene linguistic scientific writing JezKor

Size: 9.3 million tokens 
Annotation: PoS-tagged (UD), MSD-tagged (UD & MULTEXT-East), lemmatised, annotated for named entities and author/text metadata 
Licence: CC BY

Slovenian

This corpus contains a collection of linguistic scientific writing in Slovenian. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". Note that the texts were obtained directly from PDFs, so they contain various types of noise.

The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. It is distributed in CoNLL-U and vertical file formats, one file for each text. Text metadata consist of the author(s), title, and year of publication.
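As a rough illustration of working with the CoNLL-U distribution, the following Python sketch reads CoNLL-U lines using only the standard library and counts lemmas by Universal Dependencies part-of-speech tag. It relies on the standard ten-column CoNLL-U layout. The function names and the sample sentence are invented for this example and are not taken from the corpus. The sketch also assumes that the XPOS column carries the MULTEXT-East tag, which should be checked against the actual files.

```python
from collections import Counter

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():            # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):        # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue                    # skip malformed lines (e.g. PDF noise)
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:                        # file may lack a trailing blank line
        yield sentence


def lemma_frequencies(sentences, upos=None):
    """Count lemmas, optionally restricted to one UD part-of-speech tag."""
    counts = Counter()
    for sent in sentences:
        for tok in sent:
            if upos is None or tok["upos"] == upos:
                counts[tok["lemma"]] += 1
    return counts


# Made-up sample in CoNLL-U layout; the tags are illustrative only.
SAMPLE_ROWS = [
    ("1", "Korpus", "korpus", "NOUN", "Ncmsn", "_", "2", "nsubj", "_", "_"),
    ("2", "vsebuje", "vsebovati", "VERB", "Vmpr3s", "_", "0", "root", "_", "_"),
    ("3", "besedila", "besedilo", "NOUN", "Ncnpa", "_", "2", "obj", "_", "SpaceAfter=No"),
    ("4", ".", ".", "PUNCT", "Z", "_", "2", "punct", "_", "_"),
]
SAMPLE = ("# text = Korpus vsebuje besedila.\n"
          + "\n".join("\t".join(row) for row in SAMPLE_ROWS)
          + "\n\n")

sentences = list(read_conllu(SAMPLE.splitlines()))
noun_counts = lemma_frequencies(sentences, upos="NOUN")
```

For a real file, the same functions could be applied to an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the per-text CoNLL-U files.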

The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0

Size: 326 million tokens 
Annotation: PoS-tagged (UD), MSD-tagged (UD & MULTEXT-East), lemmatised, annotated for named entities and author/text metadata 
Licence: CC BY-SA

Slovenian

This corpus contains a large collection of scientific writing in Slovenian gathered from the Open Science Slovenia portal. It consists of over 150 thousand monographs, articles, diploma, master's and doctoral theses, advanced textbooks, reviews, etc., mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. The texts are accompanied by metadata: author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to the SICRIS classification), keywords, and CERIF and UDC codes. Note that the texts were obtained directly from PDFs, so they can contain various types of character noise.

The texts are linguistically annotated with the CLASSLA pipeline at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. The corpus is distributed in CoNLL-U and vertical file formats, one file for each text. The text metadata are given as a TSV file.

Note that similar, but older and smaller, corpora exist: KAS 2.0 and KAS 1.0. These contain only theses and only up to 2018, but they are cleaner and have more metadata. The repository also archives a number of KAS-derived datasets; please search for "KAS" to find them.

The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.

Concordancer (noSketchEngine)

Multilingual corpora

Corpus Language Description Availability

Czech and English abstracts of ÚFAL papers

Size: 2 million words 
Annotation: document aligned 
Licence: CC BY

Czech, English

This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. The corpus data are in the TSV format.

The corpus is available for download from the LINDAT repository.

Download

The KIAP corpus

Size: 3.9 million tokens 
Annotation: PoS-tagged 
Licence: CC-BY 4.0

English, French, Norwegian

This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003.

The corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution).

Concordancer

Corpora outside the infrastructure

Monolingual corpora

Corpus Language Description Availability

Academic Corpus

Size: 3.5 million words

English

This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology.

This corpus is not available.

 

Reading Academic Text corpus

Licence: restricted

English

This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML.

The corpus is not publicly available: access is currently restricted to staff and researchers at the University of Reading and is possible only on site. Researchers from outside the University can, however, use the corpus under a Research Attachment arrangement.

 

Corpus of academic Lithuanian

Size: 9 million words 
Annotation: no linguistic annotation

Lithuanian

This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master’s and PhD theses from the following disciplines:

 

  • humanities (architecture, fine art studies, ethnology, folklore studies, philosophy, linguistics, literary theory, librarianship, history, theology),
  • social sciences (law, political science, economics, psychology, education, management),

  • physical sciences (mathematics, astronomy, physics, chemistry, geography, geology and mineralogy, informatics),
  • biomedical sciences (medicine, dental surgery, biology, botany, agronomy, animal husbandry, pharmacy, veterinary science, forestry studies), and
  • technological sciences (energy studies, chemical technology, materials science, mechanics, metrology, building construction, transport technology, agricultural and environmental sciences, management and informatics).

The materials were published between 1999 and 2009. The corpus is encoded in TEI 5.

 

The corpus is available for online querying through a dedicated website.

For the relevant publication, see Usonienė and Linkevičienė (2009)

Concordancer

Multilingual corpora

Corpus Language Description Availability

MuchMore Springer Bilingual Corpus

Size: 1 million tokens 
Annotation: PoS/MSD-tagged, phrase chunking, semantic class and relations, document structure 
Licence: free but unspecified

English, German

This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML.

The corpus is available for download from a dedicated website.

Download

Scientext corpus

Size: 20 million words 
Licence: CC BY

French, English

This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences.

The corpus is available for online querying through a dedicated webpage.

Concordancer

Corpus of Romanian Academic Genres – ROGER (bilingual, student papers)

Size: 3.3 million words 
Licence: CC BY-NC-ND

Romanian, English

The corpus contains academic papers from eight disciplines, written by Romanian students in Romanian (their native language) and in English as a second language (L2).

The corpus was collected over a three-year period (2018–2021) with the help of 27 collaborators from nine Romanian universities.

The corpus is available for online querying through a dedicated platform developed at the CODHUS research centre from the West University of Timisoara.

For the relevant publication, see Striletchi et al. (2022)

Concordancer

Spanish-English Research Article Corpus

Size: 5.7 million words

Spanish, English

This corpus contains journal articles published between 2000 and 2010.

The corpus is unavailable.

 

Related Publications

[Bird et al. 2008]  Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), edited by Nicoletta Calzolari, 1755–1759.

[Degaetano-Ortlieb et al. 2013] Stefania Degaetano-Ortlieb, Hannah Kermes, Ekaterina Lapshinova-Koltunski, and Elke Teich. 2013. SciTex – A Diachronic Corpus for Analyzing the Development of Scientific Registers. In New Methods in Historical Corpus Linguistics, edited by Paul Bennett et al.

[Erjavec et al. 2021] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS Corpus of Slovenian Academic Writing. Language Resources and Evaluation.

[Kermes et al. 2016] Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From Uncharted Data to Corpus. In Proceedings of LREC 2016, edited by Nicoletta Calzolari.

[Mantzari et al. 1999] Elena Mantzari, Maria Gavrilidou, Penny Labropoulou, and George Carayannis. 1999. Collection of digital terminological resources: methodology and results. In Proceedings of the 2nd Conference on Greek Language and Terminology.

[Parodi 2010] Giovanni Parodi. 2010. Academic and Professional genre variation across four disciplines: exploring the PUCB-2006 corpus of written Spanish. Linguagem em (Dis)curso, 10 (3): 535–567.

[Striletchi et al. 2022] Cosmin Strilețchi, Mădălina Chitez, and Karla Csürös. 2022. Building Roger: Technical Challenges While Developing a Bilingual Corpus Management and Query Platform. In Proceedings of the 17th International Conference on Software Technologies - ICSOFT.

[Su et al. 2008] Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. 2008. Coreference resolution in biomedical texts: a machine learning approach. In Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, edited by Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann.

[Usonienė and Linkevičienė 2009] Aurelija Usonienė and Jolė Linkevičienė. 2009. Lietuvių mokslo kalbos tekstynas ir specialioji leksika. Lituanistica, 55 (3–4): 133–143.