CLARIN Resource Families: Reference Corpora

Submitted by Linda Stokman on 14 April 2021

The CLARIN Resource Families initiative provides a user-friendly overview of the language resources available in the CLARIN infrastructure for researchers in the digital humanities, social sciences and human language technologies.

This month, CLARIN highlights reference corpora. According to the linguist Geoffrey Leech (2002), a reference corpus "is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora or corpora of computer-mediated communication, CMC) in that they aim to be comprehensive in the genres they include, typically sampling a diverse set of primarily written genres.

The CLARIN infrastructure offers access to 30 reference corpora for 21 languages.

See the overview