Wordlists

Wordlists are lexical resources which only provide alphabetical or frequency-based lexical inventories. There are 58 wordlists in the CLARIN infrastructure. About half (33) of the wordlists are monolingual, accounting for 10 languages (Dutch, Estonian, Finnish, German, Greek, Maltese, Ngbugu, Slovenian, Spanish, Swedish), while the other half (25) include a variety of both bilingual and multilingual language combinations (e.g., English-Greek, French-English-Spanish). In the vast majority of cases, the wordlists can be directly downloaded from the national repositories or queried through easy-to-use online search environments.

For comments, changes of the existing content or inclusion of new resources, send us an email at resource-families [at] clarin.eu.

 

Wordlists in the CLARIN Infrastructure

Monolingual Resources

Corpus Language Description Availability

INT Historical Word List

Size: 500,000 word forms
Licence: other

Dutch

This wordlist includes historical lexemes for the period between 1550 and 1970. The resource is available for download from the Dutch Language Institute (INT).

For the relevant publication, see de Does and Depuydt (2012)

Download

Neologisms Online v3

Size: 19,000 words and expressions
Licence: CLARIN PUB

Dutch This wordlist of neologisms is available for online browsing through the Dutch Language Institute (INT). Download

Estonian Frequency Dictionary (ver. 2.0)

Size: 997,934 word forms
Licence: CLARIN PUB

Estonian This is a frequency list available for download (CELR distribution) and for online browsing.

Browse

Download

Names of Countries

Licence: CLARIN ACA

Estonian This is a wordlist that is based on the Estonian orthography of foreign place names. The resource is available for online browsing. Browse

The Conceptual File of Estonian Lexis of the Institute of Estonian Language

Licence: CLARIN ACA

Estonian This is a controlled vocabulary of several more or less related concepts (e.g., gardening, haymaking, weather, fishing, religion). The resource is available for online browsing. Browse

Finnish Verbal Colorative Constructions

Size: 61,617 words
Licence: CC-BY

Finnish This is a wordlist that contains Finnish verbal “colorative” (i.e., stylistically marked) constructions. The resource is available for download through FIN-CLARIN. Download

Frequencies of Early Modern Finnish Words

Size: 4,862,190 words
Licence: EUPL

Finnish This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN. Browse

Frequencies of Old Literary Finnish Words

Size: 3,425,382 words
Licence: EUPL

Finnish This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN. Browse

Frequency Lexicon of the Finnish Newspaper Language

Size: 9,996 words
Licence: CC-BY NC ND 1.0

Finnish This is a frequency lexicon available online through FIN-CLARIN. Browse

Frequency List of Written Finnish Word Forms

Size: 17,604 lemmas; 1,339,787 word forms
Licence: EUPL

Finnish This is a frequency lexicon of Finnish word forms that appear in the Finnish Parole text corpus. The resource is available online through FIN-CLARIN. Browse

Modern Finnish Word List

Size: 94,110 entries
Annotation: MSD-tags, lemmas
Licence: GNU LGPL, EUPL v.1.1, CC-BY SA 3.0

Finnish This is a wordlist of contemporary general vocabulary that is available for download through FIN-CLARIN. Download

Psycholinguistic Descriptives

Size: 2.5 billion words
Licence: CC-BY 4.0

Finnish This is a frequency wordlist, accompanied by a query tool, for obtaining commonly used psycholinguistic descriptives for Finnish words, as well as word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. The resource is available for download from FIN-CLARIN. Download

Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose

Licence: CC-BY 4.0

Finnish This is a frequency list of part-of-speech n-grams appearing in the corpus Classics of English and American Literature, the English-Finnish parallel corpus and the corpus of Translated Finnish. The resource is available for download from FIN-CLARIN. Download

The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Finnish This is a frequency list that contains sets of unigrams, bigrams and trigrams extracted from a newspaper corpus. The resource is available for download from FIN-CLARIN. Download

Deutscher Wortschatz

Size: 5.8 million types
Annotation: synonymy, examples of use

German This resource provides a list of annotated words taken from the deu_newscrawl_2011 corpus. The resource is available for online browsing through CLARIN-D/University of Leipzig. Download

KELLY word-list Greek

Size: 7,385 entries
Licence: CC-BY-NC

Greek This wordlist is useful for learning and teaching Greek as a foreign/second language. The words are classified according to the CEFR language levels. The resource is available for download from clarin:el. Download

Frequency lists for Icelandic 23.06

Licence: CC BY 4.0

Icelandic

This is a frequency list for three Icelandic corpora: the Icelandic Parsed Historical Corpus, the Tagged Icelandic Corpus, and the Icelandic Gigaword Corpus.

The wordlist is available for download from the CLARIN-IS repository.

Download

The Icelandic Academic Word List (v. 1.0)

Size: 2,294 words

Icelandic

This is a frequency list from MÍNO, which is a language corpus of academic vocabulary.

The wordlist is available for download from the CLARIN-IS consortium.

Download

Word frequency list from the Icelandic Corpus for Academic Words (v. 1.0)

Size: 10,313 words

Icelandic

This is a frequency list from MÍNO, which is a language corpus of academic vocabulary.

The wordlist is available for download from the CLARIN-IS consortium.

Download

Maltese Fiction Wordlist

Size: 41,251 tokens
Annotation: frequency
Licence: MS-NC-No ReD

Maltese This is a wordlist compiled from 32 works of fiction. The resource is available for download from CLARIN PORTULAN. Download

Maltese Wordlist

Size: 824,839 words
Licence: LGPL

Maltese This is a wordlist used for spell-checking. It is available for download from CLARIN PORTULAN. Download

Ngbugu digital wordlist: Archival form

Size: 204 words
Licence: CLARIN PUB

Ngbugu This is a wordlist used in language documentation, phonetics and lexicography. The resource is available for download from Ortolang. Download

LCM-PL

Size: 10,000 entries
Licence: CC BY SA 3.0

Polish This is a wordlist that lists the frequencies and abstraction levels of verbs. The resource is available for download from the CLARIN-PL repository. Download

Frequency list of textbook vocabulary by level of education in elementary and secondary schools

Size: 11,906 words
Annotation: lemma, frequency
Licence: CC-BY-NC-SA 4.0

Slovenian

The dataset contains a list of 11,906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grade 1 to 9) and secondary school (Year 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens), and consists of 127 textbooks from 16 different subjects.

The purpose of the dataset is to facilitate research into vocabulary use at different levels of education, and to enable comparative studies of student language reception and production in Slovene.

This resource is available for download from the CLARIN.SI repository.

Download

Gos corpus n-grams 2.0

Size: 2,598,153 n-grams
Annotation: frequency
Licence: CC-BY-SA 4.0

Slovenian This is a list of n-grams extracted from the Gos corpus of spoken Slovene. The resource is available for download from CLARIN.SI. Download

IMP corpus n-grams 2.0

Size: 34,668,696 n-grams
Annotation: frequency
Licence: CC-BY-SA 4.0

Slovenian This is a list of n-grams extracted from the IMP corpus of historical Slovene. The resource is available for download from CLARIN.SI. Download

Janes corpus n-grams 1.0

Size: 351,029,703 n-grams
Annotation: frequency
Licence: CC-BY-SA 4.0

Slovenian This is a list of n-grams extracted from the Janes corpus of Slovenian user-generated content version 1.0. The resource is available for download from CLARIN.SI. Download

Kres corpus n-grams 2.0

Size: 211,104,769 n-grams
Annotation: frequency
Licence: CC-BY-SA 4.0

Slovenian This is a list of n-grams extracted from the Kres corpus of written Slovenian. The resource is available for download from CLARIN.SI. Download

Lexical functions of Spanish verb-noun collocations

Size: 1,000 verb-noun pairs
Annotation: lexicological classifications (free-word combinations, errors), semantic information
Licence: CLARIN PUB

Spanish This is a wordlist that consists of the 1,000 most frequent verb-noun pairs extracted automatically from the Spanish Web Corpus, together with their classifications (collocation vs. free word combination). The resource is available for download from Ortolang. Download

Idioms from the NEO lexicon DB

Size: 4,928 entries
Licence: CC-BY 4.0

Swedish This is a wordlist of idioms with explanations extracted from the database for the dictionary Nationalencyklopediens ordbok. The resource can be downloaded from SWE-CLARIN. Download

Kelly (2017-10-16)

Size: 10,510 entries
Licence: CC-BY 4.0

Swedish This is a list of keywords for language learning for young and adults alike. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland

Licence: CC-BY 4.0

Swedish This frequency list contains sets of unigrams, bigrams and trigrams extracted from a corpus compiled by the University of Helsinki from the digitized newspapers from the National Library of Finland. The resource is available for download from FIN-CLARIN. Download

Vocation list (2015-01-10)

Size: 13,833 entries
Licence: CC-BY 4.0

Swedish This is a wordlist of vocations in Swedish. The resource can be downloaded from the SWE-CLARIN repository. Download

Multilingual Resources

Corpus Language Description Availability

Multilingual Flashcards with 4,000 Most Common Icelandic Words (IceFlash4K)

Size: 4,000 entries
Licence: CC BY 4.0

Chinese, English, Icelandic, Polish, Ukrainian

This wordlist contains common Icelandic words translated into four languages: English, Chinese, Polish and Ukrainian.

The wordlist is available for download from the CLARIN-IS repository.

For the relevant publication, see Xindan and Ingason (2021)

Download

Topics of library and information science

Size: 732 words
Licence: CC-BY-NC

English, Greek This is a word list of terms from the domain of library and information science. The resource is available for download from clarin:el. Download

Vocabulaire d'archéologie

Size: 4,431 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of archaeology available for download from ORTOLANG. Download

Vocabulaire d'art et archéologie

Size: 1,960 entries
Annotation: preferred forms, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from fine arts and archaeology available for download from ORTOLANG. Download

Vocabulaire de géographie de l'Amérique du Nord

Size: 4,232 entries
Annotation: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a thesaurus of terms from the geography of North America available for download from ORTOLANG. Download

Vocabulaire de Nutrition artificielle

Size: 2,500 entries
Annotation: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of nutrition available for download from ORTOLANG. Download

Vocabulaire de Pathologies humaines

Size: 5,001-10,000 entries
Annotation: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of medicine (pathological diseases) available for download from ORTOLANG. Download

Vocabulaire de philosophie

Size: 4,435 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of philosophical terms available for download from ORTOLANG. Download

Vocabulaire de préhistoire et protohistoire

Size: 3,093 entries
Annotation: hierarchical relationship, preferred forms, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of prehistory and protohistory available for download from ORTOLANG. Download

Vocabulaire de Psychopathologie

Size: 575 terms
Annotation: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of psychopathology available for download from ORTOLANG. Download

Vocabulaire de Sciences de l'éducation

Size: 2,681 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of education available for download from ORTOLANG. Download

Vocabulaire de Sciences du langage

Size: 6,142 entries
Annotation: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of linguistics available for download from ORTOLANG. Download

Vocabulaire de sociologie

Size: 5,277 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of sociology available for download from ORTOLANG. Download

Vocabulaire de Transferts de chaleur

Size: 1,462 entries
Annotation: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of thermodynamic terms available for download from ORTOLANG. Download

Vocabulaire de Transfusion sanguine

Size: 2,000 entries
Annotation: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of medicine (related to blood transfusion) available for download from ORTOLANG. Download

Vocabulaire d'ethnologie

Size: 9,517 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of ethnology available for download from ORTOLANG. Download

Vocabulaire d'histoire des sciences et des techniques

Size: 3,766 entries
Annotation: contextual information, usage examples, preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of technical sciences available for download from ORTOLANG. Download

Vocabulaire d'histoire et sciences de la littérature

Size: 11,065 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of literary studies available for download from ORTOLANG. Download

Vocabulaire d'Histoire et sciences des religions

Size: 4,581 entries
Annotation: preferred form, synonym shape
Licence: CC-BY 4.0

French, English This is a controlled vocabulary of expressions from the domain of philosophy and religion available for download from ORTOLANG. Download

Vocabulaire de sciences de la Terre

Size: 19,707 entries
Annotation: hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English, Spanish This is a controlled vocabulary of expressions from the domain of geology available for download from ORTOLANG. Download

Vocabulaire d'Electronique et électro-optique

Size: 4,456 entries
Annotation: associative relationship, hierarchical relationship, preferred form, synonym shape
Licence: CC-BY 4.0

French, English, Spanish This is a controlled vocabulary of expressions from the domain of electronics available for download from ORTOLANG. Download

Λεξικό Γλωσσολογικών όρων: Γερμανικά – Ελληνικά - Αγγλικά (lexicon of linguistic terms: DE-EL-EN)

Size: 2,000 words

German, Greek, English This is a wordlist of linguistic terms that is available for download from clarin:el. Download

Labial vibrants in Mangbetu: Archival form

Licence: CC-BY

Mangbetu, French, English This is a wordlist of lexical items that exemplify occurrences of bilabial trills and labiodental flaps. The resource is available for download from ORTOLANG. Download

JRC-Names - a multilingual named entity resource

Annotation: spelling varieties of names
Licence: Open for Reuse with Restrictions

Slovenian, Swedish, Bulgarian, English, Greek, Estonian, Spanish, Castilian, Czech, German, Danish, French, Finnish, Italian, Hungarian, Latvian, Lithuanian, Maltese, Dutch, Flemish, Portuguese, Polish, Slovak, Romanian This is a wordlist of named entities (person and organisation names). The resource is available for download from clarin:el. Download

Swedish words, LEXIN

Size: 29,111 entries
Licence: CC-BY 4.0

Swedish, Albanian, Bosnian, English, Finnish, Modern Greek, Croatian, Iranian Persian, Russian, Serbian, Somali, Spanish, Turkish This is a word list to be used by immigrants to Sweden. The resource can be downloaded from the SWE-CLARIN repository. Download