Corpora of Academic Texts

Corpora of academic texts contain scholarly writing: research papers, essays, and abstracts published in academic journals, conference proceedings, and edited volumes; theses written by undergraduate and graduate students; and scientific monographs.

The CLARIN ERIC infrastructure gives access to 24 corpora of academic texts, 2 of which are multilingual and 22 monolingual. The available corpora contain scholarly texts in the following 11 languages: Czech, English, Estonian, Finnish, French, German, Greek, Russian, Slovenian, Spanish, and Swedish. More than 15 different scholarly disciplines are represented, with the most prominent being linguistics, computer science, economics, and medicine. The majority of the corpora are richly tagged and are available under public licences.

We first provide an overview of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Corpora of academic texts in the CLARIN infrastructure

Monolingual corpora

Corpus	Language	Description	Availability
Czech Sociological Review Size: 3 million words Licence: MIT	Czech	This corpus contains research papers in sociology published between 1993 and 2016. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository.	Download
ACL Anthology Reference Corpus Size: 75 million tokens Annotation: PoS-tagged, lemmatised, author/text metadata Licence: CC BY SA	English	This corpus contains research papers in computational linguistics published between 1979 and 2015. The corpus data are in the XML format. The corpus is available for online querying through the Sketch Engine (log-in required) and for download from a dedicated website. For the relevant publication, see Bird et al. 2008	Concordancer Download
English Scientific Text Corpus Size: 35 million tokens Annotation: PoS-tagged, lemmatised, author/text metadata, document structure Licence: restricted	English	This corpus contains journal articles in the following disciplines: computer science, computational linguistics, informatics, digital construction, microelectronics, linguistics, biology, mechanical engineering, and electrical engineering. The articles were published in the 1970s, 1980s, and 2000s. The corpus is available for online querying through CQPweb (CLARIN-D distribution). For the relevant publication, see Degaetano-Ortlieb et al. 2013	Concordancer
GENIA corpus Size: 437,000 words Annotation: PoS-tagged, syntactically parsed, annotated for terms, events, semantic relations and coreference; text metadata Licence: free but unspecified	English	This corpus contains journal paper abstracts in biomedicine. The corpus data are in various formats, e.g., PTB. The corpus is available for download from PORTULAN. For the relevant publication, see Su et al. 2008	Download
UH's English E-thesis corpus Size: 200 million tokens Annotation: PoS-tagged, syntactically parsed Licence: CC BY	English	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
The Royal Society Corpus Size: 32 million tokens Annotation: PoS-tagged, lemmatised, normalised, author and document metadata Licence: CC BY	English (late and early modern)	This corpus contains journal articles published in Philosophical Transactions of the Royal Society of London between 1665 and 1869. The corpus is available for online querying through CQPweb and for download from the CLARIN-D repository of the University of Saarland. For the relevant publication, see Kermes et al. 2016	Concordancer Download
Corpus of Estonian scientific texts Size: 5 million words Licence: CLARIN ACA-NC	Estonian	This corpus contains scientific articles and PhD theses. The corpus data are in the P5 format.	Download
UH's Finnish E-thesis corpus Size: 12.5 million tokens Annotation: PoS-tagged, lemmatised Licence: CC BY	Finnish	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
Chambers-Le Baron Corpus of Research Articles Size: 1 million words Annotation: No annotation Licence: Oxford Text Archive licence (academic use)	French	This corpus contains research papers in the following disciplines: media/culture, literature, linguistics and language learning, social anthropology, law, economics, sociology and social sciences, philosophy, history, and communication. The research papers were published between 1998 and 2006. This is a plain text corpus. The corpus is available for download from the Oxford Text Archive.	Download
UH's French E-thesis corpus Size: 580,000 tokens Licence: CC BY	French	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
UH's German E-thesis corpus Size: 560,000 tokens Annotation: No annotation Licence: CC BY	German	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
Modern Greek Dialects: scientific papers Size: 113,000 words Licence: CC-BY-SA	Greek	This corpus contains scientific texts in linguistics and dialectology. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository.	Download
OROSSIMO Corpus Size: 2.5 million tokens Annotation: marked for term candidates, "mixed structural annotation" Licence: CC-BY	Greek	This corpus contains academic texts in the following disciplines: social sciences, computer science, economics, linguistics, photography, law, engineering, history, astronomy, earth sciences and geology, medicine and health, and biology. The corpus is encoded in XML. The corpus is available for download from the CLARIN:EL repository. For the relevant publication, see Mantzari et al. 1999	Download
The Language of Literature and the Language of Translation (collected scientific papers) Size: 48,300 words Licence: CC-BY-SA	Greek	This corpus contains journal articles in literary and translation studies. This is a plain text corpus. The corpus is available for download from the CLARIN:EL repository.	Download
UH's Russian E-thesis corpus Size: 1.1 million words Annotation: No annotation Licence: CC BY	Russian	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
Corpus of Academic Slovene KAS 2.0 Size: 1.5 billion tokens Annotation: MSD-tagged, lemmatised, marked for bilingual and monolingual term candidates Licence: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0	Slovenian	This corpus contains BA, MA, and PhD theses in humanities, social sciences, and natural sciences published between 2000 and 2018. The corpus data are in the format. The corpus is available for download from CLARIN.SI. Version 1.0 is also available for online querying through noSketch Engine and KonText (CLARIN.SI distribution). For the relevant publication, see Erjavec et al. 2021	Download
UH's Spanish E-thesis corpus Size: 2.3 million tokens Annotation: No annotation Licence: CC BY	Spanish	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
Academic texts - humanities Size: 14.5 million tokens Licence: CC BY	Swedish	This corpus contains academic texts from humanities disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).	Concordancer Download
Academic texts - social science Size: 10.8 million tokens Annotation: sentence segmentation Licence: CC BY	Swedish	This corpus contains academic texts from social sciences disciplines published between 1997 and 2012. The corpus data are in the XML format and plain text. The corpus is available for download from the SWECLARIN repository and for online querying through the concordancer Korp (SWECLARIN distribution).	Concordancer Download
UH's Swedish E-thesis corpus Size: 105 million tokens Licence: CC BY	Swedish	This corpus contains MA and PhD theses published between 1999 and 2016. The corpus is available for online querying through the concordancer Korp (FIN-CLARIN distribution).	Concordancer
Corpus of Slovene linguistic scientific writing JezKor Size: 9.3 million tokens Annotation: PoS-tagged (UD), MSD-tagged (UD & MULTEXT-East), lemmatised, annotated for named entities and author/text metadata Licence: CC BY	Slovenian	This corpus contains a collection of linguistic scientific writing in Slovenian. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". The texts were extracted directly from PDFs, so they contain various types of noise. The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. It is distributed in the CoNLL-U and vertical file formats, one file for each text. Text metadata consist of the author(s), title, and year of publication. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.	Concordancer (noSketchEngine) Download
Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0 Size: 326 million tokens Annotation: PoS-tagged (UD), MSD-tagged (UD & MULTEXT-East), lemmatised, annotated for named entities and author/text metadata Licence: CC BY-SA	Slovenian	This corpus contains a large collection of scientific writing in Slovenian gathered from the Open Science Slovenia portal. It consists of over 150 thousand monographs, articles, diploma, master's and doctoral theses, advanced textbooks, reviews, etc., mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. Texts are accompanied by metadata, i.e. author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to the SICRIS classification), keywords, and CERIF and UDC codes. The texts were extracted directly from PDFs, so they can contain various types of character noise. The texts are linguistically annotated with the CLASSLA pipeline at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. The corpus is distributed in the CoNLL-U and vertical file formats, one file for each text. The text metadata are given as a TSV file. Note that there are similar but older and smaller corpora, KAS 2.0 and KAS 1.0. These contain only theses and only up to 2018, but are cleaner and have more metadata. The repository also archives a number of KAS-derived datasets; please search for "KAS" to find them. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.	Concordancer (noSketchEngine)
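Several of the corpora above (for example JezKor and OSS 1.0) are distributed as CoNLL-U files, one file per text. As a rough illustration of how such a file can be read once downloaded, the following Python sketch parses CoNLL-U text with the standard library only and counts lemmas. The function names and the three-token sample are invented for illustration and are not taken from any of the corpora; the only thing assumed about the data is the standard ten-column CoNLL-U layout.

```python
from collections import Counter

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# sent_id = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip malformed lines, e.g. noise left over from PDF extraction.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_frequencies(text, upos=None):
    """Count lemmas, optionally restricted to one UD part-of-speech tag."""
    counts = Counter()
    for sentence in read_conllu(text):
        for token in sentence:
            if upos is None or token["upos"] == upos:
                counts[token["lemma"]] += 1
    return counts


# Invented sample, not taken from any corpus listed here.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tJeziki\tjezik\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tživijo\tživeti\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(lemma_frequencies(SAMPLE, upos="NOUN"))
```

For a real file, the text would be read with `open(path, encoding="utf-8").read()` and passed to the same functions. Dropping lines that do not have exactly ten columns is one simple way to tolerate the extraction noise mentioned in the corpus descriptions; a dedicated CoNLL-U library may be a better choice for anything beyond quick counts.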

Multilingual corpora

Corpus	Language	Description	Availability
Czech and English abstracts of ÚFAL papers Size: 2 million words Annotation: document aligned Licence: CC BY	Czech, English	This parallel corpus contains research paper abstracts in formal and applied linguistics. For each publication, the authors were obliged to provide both the original abstract in Czech or English, and its translation into English or Czech, respectively. The corpus data are in the TSV format. The corpus is available for download from the LINDAT repository.	Download
The KIAP corpus Size: 3.9 million tokens Annotation: PoS-tagged Licence: CC-BY 4.0	English, French, Norwegian	This comparable corpus contains research articles in economics, linguistics, and medicine published between 1992 and 2003. The corpus is available for online browsing through the concordancer Corpuscle (CLARINO distribution).	Concordancer

Corpora outside the infrastructure

Monolingual corpora

Corpus	Language	Description	Availability
Academic Corpus Size: 3.5 million words	English	This corpus contains journal articles, book chapters, course workbooks, laboratory manuals, and course notes from the following disciplines: arts, commerce, law, and biology. This corpus is not available.
Reading Academic Text corpus Licence: restricted	English	This corpus contains PhD theses from the following disciplines: agriculture, psychology, food science, technology, meteorology, and history. The data are encoded in ASCII and HTML. The corpus is not publicly available: access is currently restricted to staff and researchers at the University of Reading and is possible only on site. However, people outside the University can use the corpus under a Research Attachment arrangement.
Corpus of academic Lithuanian Size: 9 million words Annotation: no linguistic annotation	Lithuanian	This corpus contains textbooks, scientific monographs, journal articles, abstracts, forewords, research reports, and master’s and PhD theses from the following disciplines: humanities (architecture, fine art studies, ethnology, folklore studies, philosophy, linguistics, literary theory, librarianship, history, theology), social sciences (law, political science, economics, psychology, education, management), physical sciences (mathematics, astronomy, physics, chemistry, geography, geology and mineralogy, informatics), biomedical sciences (medicine, dental surgery, biology, botany, agronomy, animal husbandry, pharmacy, veterinary science, forestry studies), and technological sciences (energy studies, chemical technology, materials science, mechanics, metrology, building construction, transport technology, agricultural and environmental sciences, management and informatics). The materials were published between 1999 and 2009. The corpus is encoded in TEI 5. The corpus is available for online querying through a dedicated website. For the relevant publication, see Usonienė and Linkevičienė (2009)	Concordancer

Multilingual corpora

Corpus	Language	Description	Availability
MuchMore Springer Bilingual Corpus Size: 1 million tokens Annotation: PoS/MSD-tagged, phrase chunking, semantic class and relations, document structure Licence: free but unspecified	English, German	This corpus contains journal paper abstracts from medical disciplines. The corpus is encoded in MuchMore XML. The corpus is available for download from a dedicated website.	Download
Scientext corpus Size: 20 million words Licence: CC BY	French, English	This corpus contains scientific texts and argumentative essays in humanities, experimental sciences, and applied/technical sciences. The corpus is available for online querying through a dedicated webpage.	Concordancer
Corpus of Romanian Academic Genres – ROGER (bilingual, student papers) Size: 3.3 million words Licence: CC BY-NC-ND	Romanian, English	This corpus contains academic papers from eight disciplines, written by Romanian students in their native Romanian and in L2 English. The corpus was collected over a three-year period (2018–2021) with the help of 27 collaborators from nine Romanian universities. The corpus is available for online querying through a dedicated platform developed at the CODHUS research centre at the West University of Timisoara. For the relevant publication, see Striletchi et al. (2022)	Concordancer
Spanish-English Research Article Corpus Size: 5.7 million words	Spanish, English	This corpus contains journal articles published between 2000 and 2010. The corpus is unavailable.

Related Publications

[Bird et al. 2008] Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), edited by Nicoletta Calzolari, 1755–1759.

[Degaetano-Ortlieb et al. 2013] Stefania Degaetano-Ortlieb, Hannah Kermes, Ekaterina Lapshinova-Koltunski, and Elke Teich. 2013. SciTex – A Diachronic Corpus for Analyzing the Development of Scientific Registers. In New Method in Historical Corpus Linguistics, edited by Paul Bennett et al.

[Erjavec et al. 2021] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS Corpus of Slovenian Academic Writing. Language Resources and Evaluation.

[Kermes et al. 2016] Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen, and Elke Teich. 2016. The Royal Society Corpus: From Uncharted Data to Corpus. In Proceedings of LREC 2016, edited by Nicoletta Calzolari.

[Mantzari et al. 1999] Elena Mantzari, Maria Gavrilidou, Penny Labropoulou, and George Carayannis. 1999. Collection of digital terminological resources: methodology and results. In Proceedings of the 2nd Conference on Greek Language and Terminology.

[Parodi 2010] Giovanni Parodi. 2010. Academic and Professional genre variation across four disciplines: exploring the PUCB-2006 corpus of written Spanish. Linguagem em (Dis) curso, 10 (3): 535–567.

[Striletchi et al. 2022] Cosmin Strilețchi, Mădălina Chitez, and Karla Csürös. 2022. Building Roger: Technical Challenges While Developing a Bilingual Corpus Management and Query Platform. In Proceedings of the 17th International Conference on Software Technologies - ICSOFT.

[Su et al. 2008] Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. 2008. Coreference resolution in biomedical texts: a machine learning approach. In Ontologies and Text Mining for Life Sciences: Current Status and Future Perspectives, edited by Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann.

[Usonienė and Linkevičienė 2009] Aurelija Usonienė and Jolė Linkevičienė. 2009. Lietuvių mokslo kalbos tekstynas ir specialioji leksika. Lituanistica, 55 (3–4): 133–143.