Corpora of academic texts

Introduction

Corpora of academic texts contain scholarly writing: research papers, essays, and abstracts published in academic journals, conference proceedings, and edited volumes; theses written by undergraduate and graduate students; and scientific monographs.

The CLARIN ERIC infrastructure gives access to 22 corpora of academic texts, 2 of which are multilingual and 20 monolingual. The available corpora contain scholarly texts in the following 11 languages: Czech, English, Estonian, Finnish, French, German, Greek, Russian, Slovenian, Spanish, and Swedish. More than 15 different scholarly disciplines are represented, with the most prominent being linguistics, computer science, economics, and medicine. The majority of the corpora are richly tagged and are available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

To comment, suggest changes to the existing content, or propose new corpora for inclusion, send us an email.

This website was last updated on 27 July 2021.

Corpora of academic texts in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

Czech Sociological Review

Size: 3 million words
Licence: MIT

Czech

This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format.

The corpus is available for download from the LINDAT repository.

Download

ACL Anthology Reference Corpus

Size: 75 million tokens
Annotation: PoS-tagged, lemmatised, author/text metadata
Licence: CC BY SA

English

This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format.

The corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website.

For the relevant publication, see Bird et al. 2008

Concordancer

Download

English Scientific Text Corpus

Size: 35 million tokens
Annotation: PoS-tagged, lemmatised, author/text metadata, document structure
Licence: restricted

English

This corpus contains journal articles in the following disciplines:

 

  • computer science,
  • computational linguistics,
  • informatics,
  • digital construction,
  • microelectronics,
  • linguistics,
  • biology,
  • mechanical engineering, and
  • electrical engineering.

 

The articles were published in the 1970s, 1980s and the 2000s.

The corpus is available for online querying through CQPweb (CLARIN-D distribution).

For the relevant publication, see Degaetano-Ortlieb et al. 2013

Concordancer

GENIA corpus

Size: 437,000 words
Annotation: PoS-tagged, syntactically parsed, annotated for terms, events, semantic relations and coreference; text metadata
Licence: free but unspecified

English

This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB.

The corpus is available for download from PORTULAN.

For the relevant publication, see Su et al. 2008

Download

UH's English E-thesis corpus

Size: 200 million tokens
Annotation: PoS-tagged, syntactically parsed
Licence: CC BY

English

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

The Royal Society Corpus

Size: 32 million tokens
Annotation: PoS-tagged, lemmatised, normalised, author and document metadata
Licence: CC BY

English (late and early modern)

This corpus contains journal articles published in Philosophical Transactions of the Royal Society of London between 1665 and 1869.

The corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland.

For the relevant publication, see Kermes et al. 2016

Concordancer

Download

Corpus of Estonian scientific texts

Size: 5 million words
Licence: CLARIN ACA-NC

Estonian

This corpus contains scientific articles and PhD theses. The corpus data are in the TEI P5 format.

Download

UH's Finnish E-thesis corpus

Size: 12.5 million tokens
Annotation: PoS-tagged, lemmatised
Licence: CC BY

Finnish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Chambers-Le Baron Corpus of Research Articles

Size: 1 million words
Annotation: No annotation
Licence: Oxford Text Archive licence (academic use)

French

This corpus contains research papers in the following disciplines:

 

  • media/culture,
  • literature,
  • linguistics and language learning,
  • social anthropology,
  • law,
  • economics,
  • sociology and social sciences,
  • philosophy,
  • history, and
  • communication.

 

The research papers were published between 1998 and 2006. This is a plain text corpus.

The corpus is available for download from the Oxford Text Archive.

Download

UH's French E-thesis corpus

Size: 580,000 tokens
Licence: CC BY

French

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

UH's German E-thesis corpus

Size: 560,000 tokens
Annotation: No annotation
Licence: CC BY

German

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Modern Greek Dialects: scientific papers

Size: 113,000 words
Licence: CC-BY-SA

Greek

This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus.

The corpus is available for download from the CLARIN:EL repository.

Download

OROSSIMO Corpus

Size: 2.5 million tokens
Annotation: marked for term candidates, "mixed structural annotation"
Licence: CC-BY

Greek

This corpus contains academic texts in the following disciplines:

  • social sciences,
  • computer science,
  • economics,
  • linguistics,
  • photography,
  • law,
  • engineering,
  • history,
  • astronomy,
  • earth sciences and geology,
  • medicine and health, and
  • biology.

 

The corpus is encoded in XML.

The corpus is available for download from the CLARIN:EL repository.

For the relevant publication, see Mantzari et al. 1999

Download

The Language of Literature and the Language of Translation (collected scientific papers)

Size: 48,300 words
Licence: CC-BY-SA

Greek

This corpus contains journal articles in literary and translation studies. This is a plain text corpus.

The corpus is available for download from the CLARIN:EL repository.

Download

UH's Russian E-thesis corpus

Size: 1.1 million words
Annotation: No annotation
Licence: CC BY

Russian

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Corpus of Academic Slovene KAS 1.0

Size: 1.7 billion tokens
Annotation: MSD-tagged, lemmatised, marked for bilingual and monolingual term candidates
Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0

Slovenian

This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the format.

The corpus is available for download from CLARIN.SI and for online querying through noSketch Engine and KonText (CLARIN.SI distribution).

For the relevant publication, see Erjavec et al. 2020

Concordancer

Download

UH's Spanish E-thesis corpus

Size: 2.3 million tokens
Annotation: No annotation
Licence: CC BY

Spanish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Academic texts - humanities

Size: 14.5 million tokens
Licence: CC BY

Swedish

This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.

The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).

Concordancer

Download

Academic texts - social science

Size: 10.8 million tokens
Annotation: sentence segmentation
Licence: CC BY

Swedish

This corpus contains academic texts from social sciences disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text.

The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).

Concordancer

Download

UH's Swedish E-thesis corpus

Size: 105 million tokens
Licence: CC BY

Swedish

This corpus contains MA and PhD theses published between 1999 and 2016.

The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).

Concordancer

Multilingual corpora

Corpus Language Description Availability

Czech and English abstracts of ÚFAL papers

Size: 2 million words
Annotation: document aligned
Licence: CC BY

Czech, English

This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. The corpus data are in the TSV format.

The corpus is available for download from the LINDAT repository.

Download

The KIAP corpus

Size: 3.9 million tokens
Annotation: PoS-tagged
Licence: CC-BY 4.0

English, French, Norwegian

This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003.

The corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution).

Concordancer

Corpora outside the infrastructure

Monolingual corpora

Corpus Language Description Availability

Academic Corpus PUCV-2006

Size: 59 million words
Annotation: PoS-tagged

Spanish

This corpus contains academic texts extracted from dictionaries, didactic guidelines, disciplinary texts, lectures, regulations, reports, research articles, tests, and textbooks in the following disciplines: psychology, social work, construction engineering, industrial chemistry.

The corpus is not available.

For the relevant publication, see Parodi 2010

 

Academic Corpus

Size: 3.5 million words

English

This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology.

This corpus is not available.

 

Reading Academic Text corpus

Licence: restricted

English

This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML.

The corpus is not publicly available: access is currently restricted to staff and researchers at the University of Reading and is possible only on-site. People outside the University can, however, use the corpus under a Research Attachment arrangement.

 

Corpus of academic Lithuanian

Size: 9 million words
Annotation: no linguistic annotation

Lithuanian

This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master’s and PhD theses from the following disciplines:

 

  • humanities (architecture, fine art studies, ethnology, folklore studies, philosophy, linguistics, literary theory, librarianship, history, theology),
  • social sciences (law, political science, economics, psychology, education, management),
  • physical sciences (mathematics, astronomy, physics, chemistry, geography, geology and mineralogy, informatics),
  • biomedical sciences (medicine, dental surgery, biology, botany, agronomy, animal husbandry, pharmacy, veterinary science, forestry studies), and
  • technological sciences (energy studies, chemical technology, materials science, mechanics, metrology, building construction, transport technology, agricultural and environmental sciences, management and informatics).

The materials were published between 1999 and 2009. The corpus is encoded in TEI P5.

The corpus is available for online querying through a dedicated website.

For the relevant publication, see Usonienė and Linkevičienė 2009

Concordancer

Multilingual corpora

Corpus Language Description Availability

MuchMore Springer Bilingual Corpus

Size: 1 million tokens
Annotation: PoS/MSD-tagged, phrase chunking, semantic class and relations, document structure
Licence: free but unspecified

English, German

This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML.

The corpus is available for download from a dedicated website.

Download

Scientext corpus

Size: 20 million words
Licence: CC BY

French, English

This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences.

The corpus is available for online querying through a dedicated webpage.

Concordancer

Spanish-English Research Article Corpus

Size: 5.7 million words

Spanish, English

This corpus contains journal articles published between 2000 and 2010.

The corpus is unavailable.

 

Related publications

[Bird et al. 2008]  Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), edited by Nicoletta Calzolari, 1755–1759.

[Degaetano-Ortlieb et al. 2013] Stefania Degaetano-Ortlieb, Hannah Kermes, Ekaterina Lapshinova-Koltunski, and Elke Teich. 2013. SciTex – A Diachronic Corpus for Analyzing the Development of Scientific Registers. In New Methods in Historical Corpus Linguistics, edited by Paul Bennett et al.

[Erjavec et al. 2020] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2020. The KAS Corpus of Slovenian Academic Writing. Language Resources and Evaluation.

[Kermes et al. 2016] Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From Uncharted Data to Corpus. In Proceedings of LREC 2016, edited by Nicoletta Calzolari.

[Mantzari et al. 1999] Elena Mantzari, Maria Gavrilidou, Penny Labropoulou, and George Carayannis. 1999. Collection of digital terminological resources: methodology and results. In Proceedings of the 2nd Conference on Greek Language and Terminology.

[Parodi 2010] Giovanni Parodi. 2010. Academic and professional genre variation across four disciplines: exploring the PUCV-2006 corpus of written Spanish. Linguagem em (Dis)curso, 10 (3): 535–567.

[Su et al. 2008] Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. 2008. Coreference resolution in biomedical texts: a machine learning approach. In Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, edited by Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann.

[Usonienė and Linkevičienė 2009] Aurelija Usonienė and Jolė Linkevičienė. 2009. Lietuvių mokslo kalbos tekstynas ir specialioji leksika. Lituanistica, 55 (3–4): 133–143.