Corpora of academic texts contain scholarly writing, such as research papers, essays and abstracts published in academic journals, conference proceedings, and edited volumes, theses written by students at undergraduate and graduate levels, and scientific monographs.
The CLARIN ERIC infrastructure gives access to 24 corpora of academic texts, 2 of which are multilingual and 22 monolingual. The available corpora contain scholarly texts in the following 11 languages: Czech, English, Estonian, Finnish, French, German, Greek, Russian, Slovenian, Spanish, and Swedish. More than 15 different scholarly disciplines are represented, with the most prominent being linguistics, computer science, economics, and medicine. The majority of the corpora are richly tagged and are available under public licences.
We first provide an overview of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.
For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.
Corpora of academic texts in the CLARIN infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 3 million words | Czech | This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository. | Download |
ACL Anthology Reference Corpus Size: 75 million tokens | English | This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format. The corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website. For the relevant publication, see Bird et al. 2008. | |
English Scientific Text Corpus Size: 35 million tokens | English | This corpus contains journal articles in several disciplines. The articles were published in the 1970s, 1980s and the 2000s. The corpus is available for online querying through CQPweb (CLARIN-D distribution). For the relevant publication, see Degaetano-Ortlieb et al. 2013. | Concordancer |
Size: 437,000 words | English | This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB. The corpus is available for download from PORTULAN. For the relevant publication, see Su et al. 2008. | Download |
Size: 200 million tokens | English | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Size: 32 million tokens | English (late and early modern) | This corpus contains journal articles published in Philosophical Transactions of the Royal Society of London between 1665 and 1869. The corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland. For the relevant publication, see Kermes et al. 2016. | |
Corpus of Estonian scientific texts Size: 5 million words | Estonian | This corpus contains scientific articles and PhD theses. The corpus data are in the P5 format. | Download |
Size: 12.5 million tokens | Finnish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Chambers-Le Baron Corpus of Research Articles Size: 1 million words | French | This corpus contains research papers in several disciplines. The research papers were published between 1998 and 2006. This is a plain text corpus. The corpus is available for download from the Oxford Text Archive. | Download |
Size: 580,000 tokens | French | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Size: 560,000 tokens | German | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Modern Greek Dialects: scientific papers Size: 113,000 words | Greek | This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository. | Download |
Size: 2.5 million tokens | Greek | This corpus contains academic texts in disciplines including the social sciences. The corpus is encoded in XML. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Mantzari et al. 1999. | Download |
The Language of Literature and the Language of Translation (collected scientific papers) Size: 48,300 words | Greek | This corpus contains journal articles in literary and translation studies. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository. | Download |
Size: 1.1 million words | Russian | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Corpus of Academic Slovene KAS 2.0 Size: 1.5 billion tokens | Slovenian | This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the format. The corpus is available for download from CLARIN.SI. Version 1.0 is also available for online querying through noSketch Engine and KonText (CLARIN.SI distribution). For the relevant publication, see Erjavec et al. 2021. | Download |
Size: 2.3 million tokens | Spanish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Size: 14.5 million tokens | Swedish | This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution). | |
Academic texts - social science Size: 10.8 million tokens | Swedish | This corpus contains academic texts from social sciences disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution). | |
Size: 105 million tokens | Swedish | This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution). | Concordancer |
Corpus of Slovene linguistic scientific writing JezKor Size: 9.3 million tokens | Slovenian | This corpus contains a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". Note that the texts were obtained directly from PDFs, so they contain various types of noise. The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. It is distributed in CoNLL-U and vertical file formats, one file for each text. Text metadata consists of the author(s), title, and year of publication. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers. | |
Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0 Size: 326 million tokens | Slovenian | This corpus contains a large collection of scientific writing in the Slovenian language gathered from the Open Science Slovenia portal. It consists of over 150 thousand monographs, articles, diploma, master's, and doctoral theses, advanced textbooks, reviews, etc., mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. Texts are accompanied by metadata, i.e. author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to SICRIS classification), keywords, and CERIF and UDC codes. The texts were obtained directly from PDFs, so they can contain various types of character noise. The texts are linguistically annotated with the CLASSLA pipeline at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech and morphological features, and named entities. The corpus is distributed in CoNLL-U and vertical file formats, one file for each text. The text metadata is given as a TSV file. Note that there are similar but older and smaller corpora, KAS 2.0 and KAS 1.0. These contain only theses and only up to 2018, but are cleaner and have more metadata. The repository also archives a number of KAS-derived datasets; please search for "KAS" to find them. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers. | Concordancer (noSketchEngine) |
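The CoNLL-U distribution mentioned for JezKor and OSS 1.0 can be processed with a few lines of Python. The following is a minimal sketch and not part of any corpus documentation: it assumes standard 10-column, tab-separated CoNLL-U lines, and the `Token` type, the `read_conllu` and `make_row` helpers, and the English sample sentences are hypothetical names and data introduced only for illustration. Lines that do not have 10 columns are skipped, which is one simple way of tolerating noise of the kind the PDF extraction may leave behind.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str  # Universal Dependencies part-of-speech tag (column 4)
    xpos: str  # language-specific tag column (column 5)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata line
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a standard token line, skip it
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword-token range or empty node
        sentence.append(Token(cols[1], cols[2], cols[3], cols[4]))
    if sentence:
        yield sentence


def make_row(index: int, form: str, lemma: str, upos: str) -> str:
    """Build one 10-column token line, leaving unused columns as '_'."""
    return "\t".join([str(index), form, lemma, upos] + ["_"] * 6)


# Hypothetical two-sentence sample, not taken from any corpus.
SAMPLE = "\n".join([
    "# sent_id = demo.1",
    make_row(1, "Corpora", "corpus", "NOUN"),
    make_row(2, "grow", "grow", "VERB"),
    make_row(3, ".", ".", "PUNCT"),
    "",
    "# sent_id = demo.2",
    make_row(1, "The", "the", "DET"),
    make_row(2, "corpus", "corpus", "NOUN"),
    make_row(3, "is", "be", "AUX"),
    make_row(4, "large", "large", "ADJ"),
    make_row(5, ".", ".", "PUNCT"),
])

sentences = list(read_conllu(SAMPLE.splitlines()))

# Count lemmas, ignoring punctuation.
lemma_counts = Counter(
    token.lemma
    for sentence in sentences
    for token in sentence
    if token.upos != "PUNCT"
)
print(len(sentences), lemma_counts.most_common(1))
```

For a downloaded file, the same function can be given an open file handle (for example one opened with `encoding="utf-8"`) instead of `SAMPLE.splitlines()`, since it only needs an iterable of lines.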
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Czech and English abstracts of ÚFAL papers Size: 2 million words | Czech, English | This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository. | Download |
Size: 3.9 million tokens | English, French, Norwegian | This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003. The corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution). | Concordancer |
Corpora outside the infrastructure
Monolingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 3.5 million words | English | This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology. This corpus is not available. | |
Licence: restricted | English | This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML. The corpus is not publicly available: access is currently restricted to staff and researchers at the University of Reading, and only on site. However, people outside the University can use the corpus under a Research Attachment arrangement. | |
Size: 9 million words | Lithuanian | This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master's and PhD theses from several disciplines. The materials were published between 1999 and 2009. The corpus is encoded in TEI 5. The corpus is available for online querying through a dedicated website. For the relevant publication, see Usonienė and Linkevičienė (2009). | Concordancer |
Multilingual corpora
Corpus | Language | Description | Availability |
---|---|---|---|
MuchMore Springer Bilingual Corpus Size: 1 million tokens | English, German | This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML. The corpus is available for download from a dedicated website. | Download |
Size: 20 million words | French, English | This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences. The corpus is available for online querying through a dedicated webpage. | Concordancer |
Corpus of Romanian Academic Genres – ROGER (bilingual, student papers) Size: 3.3 million words | Romanian, English | The corpus contains academic papers from eight disciplines, written by Romanian students in their native Romanian and in English as a second language (L2). The corpus was collected over a three-year period (2018–2021) with the help of 27 collaborators from nine Romanian universities. The corpus is available for online querying through a dedicated platform developed at the CODHUS research centre of the West University of Timisoara. For the relevant publication, see Striletchi et al. (2022). | Concordancer |
Spanish-English Research Article Corpus Size: 5.7 million words | Spanish, English | This corpus contains journal articles published between 2000 and 2010. The corpus is unavailable. | |
Related Publications
[Bird et al. 2008] Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), edited by Nicoletta Calzolari, 1755–1759.
[Degaetano-Ortlieb et al. 2013] Stefania Degaetano-Ortlieb, Hannah Kermes, Ekaterina Lapshinova-Koltunski, and Elke Teich. 2013. SciTex – A Diachronic Corpus for Analyzing the Development of Scientific Registers. In New Methods in Historical Corpus Linguistics, edited by Paul Bennett et al.
[Erjavec et al. 2021] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS Corpus of Slovenian Academic Writing. Language Resources and Evaluation.
[Kermes et al. 2016] Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From Uncharted Data to Corpus. In Proceedings of LREC 2016, edited by Nicoletta Calzolari.
[Mantzari et al. 1999] Elena Mantzari, Maria Gavrilidou, Penny Labropoulou, and George Carayannis. 1999. Collection of digital terminological resources: methodology and results. In Proceedings of the 2nd Conference on Greek Language and Terminology.
[Parodi 2010] Giovanni Parodi. 2010. Academic and professional genre variation across four disciplines: exploring the PUCB-2006 corpus of written Spanish. Linguagem em (Dis)curso, 10 (3): 535–567.
[Striletchi et al. 2022] Cosmin Strilețchi, Mădălina Chitez, and Karla Csürös. 2022. Building Roger: Technical Challenges While Developing a Bilingual Corpus Management and Query Platform. In Proceedings of the 17th International Conference on Software Technologies - ICSOFT.
[Su et al. 2008] Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. 2008. Coreference resolution in biomedical texts: a machine learning approach. In Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, edited by Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann.
[Usonienė and Linkevičienė 2009] Aurelija Usonienė and Jolė Linkevičienė. 2009. Lietuvių mokslo kalbos tekstynas ir specialioji leksika. Lituanistica, 55 (3–4): 133–143.