Wordlists are lexical resources that provide only alphabetical or frequency-based lexical inventories. There are 58 wordlists in the CLARIN infrastructure. More than half (33) of the wordlists are monolingual, covering 12 languages (Dutch, Estonian, Finnish, German, Greek, Icelandic, Maltese, Ngbugu, Polish, Slovenian, Spanish, Swedish), while the remaining 25 cover a variety of bilingual and multilingual language combinations (e.g., English-Greek, French-English-Spanish). In the vast majority of cases, the wordlists can be downloaded directly from the national repositories or queried through easy-to-use online search environments.
For comments, changes to the existing content or inclusion of new resources, send us an email at resource-families [at] clarin.eu.
Wordlists in the CLARIN Infrastructure
Monolingual Resources
Corpus | Language | Description | Availability |
---|---|---|---|
Size: 500,000 word forms |
Dutch | This wordlist includes historical lexemes for the period between 1550 and 1970. The resource is available for download from the Dutch Language Institute (INT). For the relevant publication, see de Does and Depuydt (2012). | Download |
Size: 19,000 words and expressions |
Dutch | This wordlist of neologisms is available for online browsing at the Dutch Language Institute (INT). | Download |
Estonian Frequency Dictionary (ver. 2.0) Size: 997,934 word forms |
Estonian | This is a frequency list available for download from (CELR distribution) and for online browsing. | |
Licence: CLARIN ACA |
Estonian | This is a wordlist that is based on the Estonian orthography of foreign place names. The resource is available for online browsing. | Browse |
The Conceptual File of Estonian Lexis of the Institute of Estonian Language Licence: CLARIN ACA |
Estonian | This is a controlled vocabulary of several more or less related concepts (e.g., gardening, haymaking, weather, fishing, religion). The resource is available for online browsing. | Browse |
Finnish Verbal Colorative Constructions Size: 61,617 words |
Finnish | This is a wordlist that contains Finnish verbal “colorative” (i.e., stylistically marked) constructions. The resource is available for download through FIN-CLARIN. | Download |
Frequencies of Early Modern Finnish Words Size: 4,862,190 words |
Finnish | This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN. | Browse |
Frequencies of Old Literary Finnish Words Size: 3,425,382 words |
Finnish | This is a frequency lexicon that consists of words from the Old Literary Finnish text corpus. The resource is available online through FIN-CLARIN. | Browse |
Frequency Lexicon of the Finnish Newspaper Language Size: 9,996 words |
Finnish | This is a frequency lexicon available online through FIN-CLARIN. | Browse |
Frequency List of Written Finnish Word Forms Size: 17,604 lemmas; 1,339,787 word forms |
Finnish | This is a frequency lexicon of Finnish word forms that appear in the Finnish Parole text corpus. The resource is available online through FIN-CLARIN. | Browse |
Size: 94,110 entries |
Finnish | This is a wordlist of contemporary general vocabulary that is available for download through FIN-CLARIN. | Download |
Size: 2.5 billion words |
Finnish | This is a frequency wordlist, accompanied by a query tool, for obtaining commonly used psycholinguistic descriptors for Finnish words, as well as surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. The resource is available for download from FIN-CLARIN. | Download |
Relative frequencies of part-of-speech n-grams in native and translated Finnish literary prose Licence: CC-BY 4.0 |
Finnish | This is a frequency list of n-grams appearing in the corpus Classics of English and American Literature, English-Finnish parallel corpus and the corpus of Translated Finnish. The resource is available for download from FIN-CLARIN. | Download |
Licence: CC-BY 4.0 |
Finnish | This is a frequency list that contains sets of unigrams, bigrams and trigrams extracted from a newspaper corpus. The resource is available for download from FIN-CLARIN. | Download |
Size: 5.8 million types |
German | This resource provides a list of annotated words taken from the deu_newscrawl_2011 corpus. The resource is available for online browsing through CLARIN-D/University of Leipzig. | Download |
Size: 7,385 entries |
Greek | This wordlist is useful for learning and teaching Greek as a foreign/second language. The words are classified according to the CEFR language levels. The resource is available for download from clarin:el. | Download |
Frequency lists for Icelandic 23.06 Licence: CC BY 4.0 |
Icelandic | This is a frequency list for three Icelandic corpora: the Icelandic Parsed Historical Corpus, the Tagged Icelandic Corpus, and the Icelandic Gigaword Corpus. The wordlist is available for download from the CLARIN-IS repository. | Download |
The Icelandic Academic Word List (v. 1.0) Size: 2,294 words |
Icelandic | This is a frequency list from MÍNO, which is a language corpus of academic vocabulary. The wordlist is available for download from the CLARIN-IS consortium. | Download |
Word frequency list from the Icelandic Corpus for Academic Words (v. 1.0) Size: 10,313 words |
Icelandic | This is a frequency list from MÍNO, which is a language corpus of academic vocabulary. The wordlist is available for download from the CLARIN-IS consortium. | Download |
Size: 41,251 tokens |
Maltese | This is a wordlist drawn from 32 works of fiction, available for download from CLARIN PORTULAN. | Download |
Size: 824,839 words |
Maltese | This is a wordlist used for spell-checking. It is available for download from CLARIN PORTULAN. | Download |
Ngbugu digital wordlist: Archival form Size: 204 words |
Ngbugu | This is a wordlist used in language documentation, phonetics and lexicography. The resource is available for download from Ortolang. | Download |
Size: 10,000 entries |
Polish | This is a wordlist that lists the frequencies and abstract levels of verbs. The resource is available for download from the CLARIN-PL repository. | Download |
Frequency list of textbook vocabulary by level of education in elementary and secondary schools Size: 11,906 words |
Slovenian | The dataset contains a list of 11,906 words (lemmas with part-of-speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grade 1 to 9) and secondary school (Year 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects. The purpose of the dataset is to facilitate research into vocabulary use at different levels of education, and to enable comparative studies of student language reception and production in Slovene. This resource is available for download from the CLARIN.SI repository. | Download |
Size: 2,598,153 n-grams |
Slovenian | This is a list of n-grams extracted from the Gos corpus of spoken Slovene. The resource is available for download from CLARIN.SI. | Download |
Size: 34,668,696 n-grams |
Slovenian | This is a list of n-grams extracted from the IMP corpus of historical Slovene. The resource is available for download from CLARIN.SI. | Download |
Size: 351,029,703 n-grams |
Slovenian | This is a list of n-grams extracted from the Janes corpus of Slovenian user-generated content, version 1.0. The resource is available for download from CLARIN.SI. | Download |
Size: 211,104,769 n-grams |
Slovenian | This is a list of n-grams extracted from the Kres corpus of written Slovenian. The resource is available for download from CLARIN.SI. | Download |
Lexical functions of Spanish verb-noun collocations Size: 1,000 verb-noun pairs |
Spanish | This is a wordlist that consists of the 1,000 most frequent verb-noun pairs extracted automatically from the Spanish Web Corpus, with classifications (collocation vs. free word combination). The resource is available for download from Ortolang. | Download |
Idioms from the NEO lexicon DB Size: 4,928 entries |
Swedish | This is a wordlist of idioms with explanations extracted from the database for the dictionary Nationalencyklopediens ordbok. The resource can be downloaded from SWE-CLARIN. | Download |
Size: 10,510 entries |
Swedish | This is a list of keywords for Language Learning for Young and adults alike. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP. | |
Licence: CC-BY 4.0 |
Swedish | This frequency list contains sets of unigrams, bigrams and trigrams extracted from a corpus compiled by the University of Helsinki from the digitized newspapers from the National Library of Finland. The resource is available for download from FIN-CLARIN. | Download |
Size: 13,833 entries |
Swedish | This is a wordlist of vocations in Swedish. The resource can be downloaded from the SWE-CLARIN repository. | Download |
Multilingual Resources
Corpus | Language | Description | Availability |
---|---|---|---|
Multilingual Flashcards with 4,000 Most Common Icelandic Words (IceFlash4K) Size: 4,000 entries |
Chinese, English, Icelandic, Polish, Ukrainian | This wordlist contains common Icelandic words with equivalents in four languages: English, Chinese, Polish and Ukrainian. The wordlist is available for download from the CLARIN-IS repository. For the relevant publication, see Xindan and Ingason (2021). | Download |
Topics of library and information science Size: 732 words |
English, Greek | This is a wordlist of terms from the domain of library and information science. The resource is available for download from clarin:el. | Download |
Size: 4,431 entries |
French, English | This is a controlled vocabulary of expressions from the domain of archaeology available for download from ORTOLANG. | Download |
Vocabulaire d'art et archéologie Size: 1,960 entries |
French, English | This is a controlled vocabulary of expressions from fine arts and archaeology available for download from ORTOLANG. | Download |
Vocabulaire de géographie de l'Amérique du Nord Size: 4,232 entries |
French, English | This is a thesaurus of terms from the geography of North America available for download from ORTOLANG. | Download |
Vocabulaire de Nutrition artificielle Size: 2,500 entries |
French, English | This is a controlled vocabulary of expressions from the domain of nutrition available for download from ORTOLANG. | Download |
Vocabulaire de Pathologies humaines Size: 5,001-10,000 entries |
French, English | This is a controlled vocabulary of expressions from the domain of medicine (pathological diseases) available for download from ORTOLANG. | Download |
Size: 4,435 entries |
French, English | This is a controlled vocabulary of expressions from the domain of philosophy available for download from ORTOLANG. | Download |
Vocabulaire de préhistoire et protohistoire Size: 3,093 entries |
French, English | This is a controlled vocabulary of expressions from the domain of prehistory and protohistory available for download from ORTOLANG. | Download |
Vocabulaire de Psychopathologie Size: 575 terms |
French, English | This is a controlled vocabulary of expressions from the domain of psychopathology available for download from ORTOLANG. | Download |
Vocabulaire de Sciences de l'éducation Size: 2,681 entries |
French, English | This is a controlled vocabulary of expressions from the domain of education available for download from ORTOLANG. | Download |
Vocabulaire de Sciences du langage Size: 6,142 entries |
French, English | This is a controlled vocabulary of expressions from the domain of linguistics available for download from ORTOLANG. | Download |
Size: 5,277 entries |
French, English | This is a controlled vocabulary of expressions from the domain of sociology available for download from ORTOLANG. | Download |
Vocabulaire de Transferts de chaleur Size: 1,462 entries |
French, English | This is a controlled vocabulary of expressions from the domain of thermodynamics available for download from ORTOLANG. | Download |
Vocabulaire de Transfusion sanguine Size: 2,000 entries |
French, English | This is a controlled vocabulary of expressions from the domain of medicine (related to blood transfusion) available for download from ORTOLANG. | Download |
Size: 9,517 entries |
French, English | This is a controlled vocabulary of expressions from the domain of ethnology available for download from ORTOLANG. | Download |
Vocabulaire d'histoire des sciences et des techniques Size: 3,766 entries |
French, English | This is a controlled vocabulary of expressions from the domain of technical sciences available for download from ORTOLANG. | Download |
Vocabulaire d'histoire et sciences de la littérature Size: 11,065 entries |
French, English | This is a controlled vocabulary of expressions from the domain of literary studies available for download from ORTOLANG. | Download |
Vocabulaire d'Histoire et sciences des religions Size: 4,581 entries |
French, English | This is a controlled vocabulary of expressions from the domain of philosophy and religion available for download from ORTOLANG. | Download |
Vocabulaire de sciences de la Terre Size: 19,707 entries |
French, English, Spanish | This is a controlled vocabulary of expressions from the domain of geology available for download from ORTOLANG. | Download |
Vocabulaire d'Electronique et électro-optique Size: 4,456 entries |
French, English, Spanish | This is a controlled vocabulary of expressions from the domain of electronics available for download from ORTOLANG. | Download |
Λεξικό Γλωσσολογικών όρων: Γερμανικά – Ελληνικά - Αγγλικά (lexicon of linguistic terms: DE-EL-EN) Size: 2,000 words |
German, Greek, English | This is a wordlist of linguistic terms that is available for download from clarin:el. | Download |
Labial vibrants in Mangbetu: Archival form Licence: CC-BY |
Mangbetu, French, English | This is a wordlist of lexical items that exemplify occurrences of bilabial trills and labiodental flaps. The resource is available for download from ORTOLANG. | Download |
JRC-Names - a multilingual named entity resource Annotation: spelling varieties of names |
Slovenian, Swedish, Bulgarian, English, Greek, Estonian, Spanish, Castilian, Czech, German, Danish, French, Finnish, Italian, Hungarian, Latvian, Lithuanian, Maltese, Dutch, Flemish, Portuguese, Polish, Slovak, Romanian | This is a wordlist of named entities (person and organisation names). The resource is available for download from clarin:el. | Download |
Size: 29,111 entries |
Swedish, Albanian, Bosnian, English, Finnish, Modern Greek, Croatian, Iranian Persian, Russian, Serbian, Somali, Spanish, Turkish | This is a wordlist intended for use by immigrants to Sweden. The resource can be downloaded from the SWE-CLARIN repository. | Download |