Lexica are primarily used in applications. They typically contain an extensive lexical inventory with specific linguistic information (e.g., morphosyntax, sentiment). There are 83 lexica in the CLARIN infrastructure. Most (67) of the lexica are monolingual, accounting for 17 languages (Arabic, Croatian, Czech, Danish, Dutch, English, Estonian, French, Icelandic, Italian, Greek, Maltese, Polish, Portuguese, Serbian, Slovenian, and Swedish). The rest (16) are multilingual and include a variety of language combinations. In the vast majority of cases, the lexica can be directly downloaded from the national repositories or queried through easy-to-use online search environments.
For comments, changes to the existing content or inclusion of new resources, send us an email at resource-families [at] clarin.eu.
Lexica in the CLARIN Infrastructure
Monolingual Resources
Lexicon | Language | Description | Availability |
---|---|---|---|
A machine-readable dictionary of Egyptian Arabic Size: 2,418 entries |
Arabic (Egyptian) | This lexicon presents a more comprehensive version of A machine-readable glossary of Egyptian Arabic. The resource is available for download from ARCHE. | Download |
A machine-readable glossary of Egyptian Arabic Size: 2,204 entries |
Arabic (Egyptian) | This lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE. | Download |
Automatically constructed multiword lexicon hrMWELex v0.5 Size: 43,730 entries |
Croatian |
This is a lexicon of multiword expressions available for download from CLARIN.SI. For the relevant publication, see Ljubešić et al. (2015) |
Download |
Inflectional lexicon hrLex 1.3 Size: 6,427,709 items, 164,206 entries |
Croatian |
This is a large inflectional lexicon where each entry consists of a (wordform, lemma, MSD, MSD features, UPOS, morphological features, frequency, per-million frequency) 8-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the hrWaC v2.2 corpus. The MSD tagset follows the MULTEXT-East V6 tagset for the Serbo-Croatian macro-language. The UPOS and morphological features follow the UD v2 specifications. The resource is available for download from CLARIN.SI. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Word embeddings CLARIN.SI-embed.hr 1.0 Size: 3,147,352 entries |
Croatian | This lexicon contains word embeddings extracted from the Croatian web corpus hrWaC and a 400-million-token collection of newspaper texts. The resource is available for download from CLARIN.SI. | Download |
Size: 1,027,832 entries |
Czech | This is a lexicon of derivational relations (both compounding and inflections). The resource is available for download and online browsing through LINDAT. | |
Size: 124,259,099 lexical types |
Czech | This is a morphological lexicon available for download from LINDAT. | Download |
Size: 1,621 entries |
Czech | This is a lexicon of single-word paraphrases of Czech verbal multiword expressions. The resource is available for download from LINDAT. | Download |
Size: 7,121 entries, 11,933 frames |
Czech |
This is a valency lexicon linked to several Czech corpora (PDT, PCEDT Cz side, PDTSC, Faust). The resource is available for download and online browsing through LINDAT. For the relevant publication, see Urešová (2011) |
|
Size: 2,722 entries, 6,711 units, 6,711 frames, 4,586 words |
Czech |
This is a valency lexicon available for download and online browsing through LINDAT. For the relevant publication, see Lopatková et al. (2017) |
|
STO morphology (v2) - LMF format Size: 87,209 entries |
Danish | This morphological lexicon is available for download from the CLARIN-DK repository. It is also available in the .csv format. | Download |
Size: 84,159 entries |
Danish | This syntactic lexicon is available for download from the CLARIN-DK repository. | Download |
|
Dutch | This is a lexicon that comprises all lemmas from the Basilex Corpus. The Basilex Corpus is an annotated collection of texts written for children in elementary school. The resource is available for download from the Dutch Language Institute (INT). | Download |
Licence: other |
Dutch | This is a lexicon that comprises all lemmas from the Basiscript Corpus. The Basiscript Corpus is an annotated collection of texts written by children in elementary school. The resource is available for download from the Dutch Language Institute (INT). | Download |
Size: 213,000 lemmas |
Dutch | This is a lexicon of words and word forms available for download from the Dutch Language Institute (INT). | Download |
Size: 220,000 entries, 600,000 word forms, 77,000 multi-word expressions, 26,000 multi-word lemmas |
Dutch |
This is a lexical database that consists of a one-word lexicon and a multi-word lexicon. This lexicon is available for download from the Dutch Language Institute (INT). |
Download |
Dutch Electronic Lexicon of Multiword Expressions Size: 5,000 expressions |
Dutch | This is a lexicon of multiword expressions available for download from the Dutch Language Institute (INT). | Download |
Size: 20,000 entries |
Dutch | This morphosyntactic lexicon is available for download from the Dutch Language Institute (INT). | Download |
Reference Lexicon for Belgian-Dutch (RBBN) Size: 4,000 words and expressions |
Dutch | This lexicon, which contains words and expressions typical of Dutch spoken in Belgium, is available for download from the Dutch Language Institute (INT). | Download |
Size: 50,000 lemmas |
Dutch | This is a corpus-based monolingual lexicon available for download from the Dutch Language Institute (INT). | Download |
Size: over 2.2 million entries (over 3.3 million semantic relations) |
English | This is a large-scale, wide-coverage computational lexicon covering the biomedical domain. The resource is unavailable for download or online browsing, but can be accessed by contacting the resource manager. | |
Size: 4,337 entries, 7,148 frames |
English | This is a valency lexicon linked to the English side of the PCEDT corpus (WSJ corpus). The resource is available for download from LINDAT and for online browsing. | |
The Database of Estonian Multi-Word Expressions Size: 12,500 words |
Estonian | This is a collection of lexica that contain multi-word expressions consisting of a verb and a particle or a verb and its complements. The resource is available for download from META-SHARE (CELR distribution) and for online browsing through a dedicated website. | |
Size: 96,027 entries |
French | This is a morphological lexicon available for download from ORTOLANG. | Download |
Size: 8,000 entries |
French |
This is a verb-valency lexicon. The lexicon specifies certain selectional restrictions, possible term manifestations (pronominal, phrasal), and whether the valency frames can be used in various passive constructions, as well as references to other valency frames for the same infinitive. The resource is available for download from ORTOLANG. |
Download |
Size: 595,000,000 inflected forms |
French | This is a morphological lexicon available for download from ORTOLANG. | Download |
Size: 159,261 entries |
French | This is a morphological lexicon available for download from ORTOLANG. | Download |
Size: 8,800 entries |
French | This is a morphosyntactic lexicon available for download from ORTOLANG. | Download |
ILSP PsychoLinguistic Resource Size: 217,664 entries |
Greek |
This is a lexicon for psycholinguistic research. The resource is available for download from clarin:el. For the relevant publication, see Protopapas et al. (2010) |
Download |
Database of Modern Icelandic Inflections Size: 305,000 lemmas; 6.5 million inflectional forms; 48,000 non-standard word forms |
Icelandic |
This is a morphological lexicon created for use in language technology (LT), as a reference for the general public in Iceland, and for use in research on the Icelandic language. The term Modern Icelandic here refers to contemporary Icelandic, i.e. late 20th and 21st century usage. The lexicon is available for download and online browsing through CLARIN-IS. For the relevant publication, see Bjarnadóttir (2012) |
|
Size: 2,342,120 items |
Italian | This is a morphological lexicon. The resource is available for download from LINDAT. | Download |
Size: 3,510 entries |
Italian | This is a morphological lexicon. The resource is available for download from LINDAT. | Download |
OpeNER Sentiment Lexicon Italian - LMF Size: 24,293 entries |
Italian | This is a sentiment lexicon available for download from ILC4CLARIN. | Download |
Size: 37,406 syntactic units |
Italian | This is a morphological lexicon available for download from ILC4CLARIN. | Download |
Size: 39,242 entries |
Maltese | This is a speech lexicon that is useful for building speech-to-text systems. It is available for download from CLARIN PORTULAN. | Download |
Emotional Annotations Dictionary Size: 178,514 elements |
Polish | This is a lexicon with emotional annotation extracted from Polish Wordnet. The resource is available for download from the CLARIN-PL repository. | Download |
Extended dictionary of named entities NELexicon connected with Linked Open Data Size: 103,585 entries |
Polish | This lexicon contains Polish named entities connected with terminology from available resources within Linked Open Data (e.g. WordNet, DBPedia, Wikipedia, etc.). The resource is available for download from the CLARIN-PL repository. | Download |
Size: 56,500 lexical units |
Polish | This is a lexicon of multiword expressions available for download from CLARIN.PL. | Download |
Size: 18,236 entries |
Polish | This is a lexicon of verb valency that is available for download from the CLARIN-PL repository. | Download |
LEX-MWE-PT: Word Combination in Portuguese Language Size: 1,198 entries, 12,753 multiword units |
Portuguese | This is a lexicon of multiword expressions. The resource is available for download from CLARIN PORTULAN. | Download |
Size: 208 words |
Portuguese | This is a lexicon of abbreviations. The resource is available for download from CLARIN PORTULAN. | Download |
Size: 17,572 words |
Portuguese | This lexicon provides distributional semantic representations of Portuguese words. The dataset is available for download from GitHub. | Download |
LX-Rare Word Similarity Dataset Size: 2,034 words |
Portuguese | This is a word-similarity lexicon available for download from CLARIN PORTULAN. | Download |
Size: 1,998 words |
Portuguese | This is a word-similarity lexicon. The resource is available for download from CLARIN PORTULAN. | Download |
Size: 2,631 words |
Portuguese | This is a manually compiled exhaustive list of closed-class words in Portuguese. The resource is available for download from CLARIN PORTULAN. | Download |
Size: 706 words |
Portuguese | This is a word-similarity lexicon. The resource is available for download from CLARIN PORTULAN. | Download |
Multifunctional Computational Lexicon of Contemporary Portuguese Size: 26,443 entries |
Portuguese | This is a frequency lexicon suitable for NLP-specific purposes (information extraction, lemmatization, PoS tagging). The resource is available for download (CLARIN PORTULAN distribution). | Download |
Size: 20,000 entries |
Portuguese | This is a morphosyntactic lexicon available for download from CLARIN PORTULAN. | Download |
Size: 27,374 words |
Portuguese | This is a lexicon that provides psycholinguistic and cognitive information that is useful to select stimulus materials for experiments and/or training vocabularies. The resource is available for download from CLARIN PORTULAN. | Download |
Size: 10,438 entries |
Portuguese | This semantic lexicon is available for download from CLARIN PORTULAN. | Download |
Automatically constructed multiword lexicon srMWELex v0.5 Size: 22,290 entries |
Serbian | This is a lexicon of multiword expressions available for download from CLARIN.SI. | Download |
Inflectional lexicon srLex 1.3 Size: 6,905,941 items, 169,328 entries |
Serbian |
This is a large inflectional lexicon where each entry consists of a (wordform, lemma, MSD, MSD features, UPOS, morphological features, frequency, per-million frequency) 8-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the hrWaC v2.2 corpus. The MSD tagset follows the MULTEXT-East V6 tagset for the Serbo-Croatian macro-language. The UPOS and morphological features follow the UD v2 specifications. The resource is available for download from CLARIN.SI. For the relevant publication, see Ljubešić et al. (2016) |
Download |
Word embeddings CLARIN.SI-embed.sr 1.0 Size: 1,480,566 entries |
Serbian | This lexicon contains word embeddings from the srWaC web corpus. The resource is available for download from CLARIN.SI. | Download |
Automatically constructed multiword lexicon slMWELex v0.5 Size: 47,579 entries |
Slovenian | This is a lexicon of multiword expressions available for download from CLARIN.SI. | Download |
Automatically stress labelled morphological lexicon Sloleks 1.2, version 1.1 Size: 100,805 entries, 2,774,745 words |
Slovenian |
This is an extended version of the morphological lexicon Sloleks 1.2 with added information about the stress of each word form. The resource is available for download from CLARIN.SI. For the relevant publication, see Krsnik and Robnik Šikonja (2017) |
Download |
Beseda Corpus Lemmatisation Lexicon Size: 3,228,127 entries |
Slovenian | This lexicon contains inflected open class words from the Dictionary of Standard Slovenian that are augmented by wordforms, their part of speech tags and their lemmas used during the PoS tagging and lemmatization of the Beseda corpus. The resource is available for download from CLARIN.SI and for online browsing. | |
Collocation lexicon of Slovene academic discourse Aleks Size: 463 entries |
Slovenian |
This is a lexicon of entries typical for general Slovene academic discourse. The entries include typical context examples (collocations and examples of use) taken from KAS, a corpus of Slovene academic texts (see also the Academic corpora resource family), i.e. a morphosyntactically tagged synchronous and monolingual corpus, containing more than 1.5 billion words. The resource is available for download from CLARIN.SI. |
Download |
Lexicon of historical Slovene imp25k 1.1 Size: 28,034 entries |
Slovenian |
This is a morphological lexicon available for download from CLARIN.SI and for online browsing through a dedicated environment. For the relevant publication, see Erjavec (2015) |
|
Morphological lexicon Sloleks 2.0 Size: 100,805 entries |
Slovenian |
This is a reference morphological lexicon of the Slovenian language developed to be used in NLP applications and language manuals. The resource is available for download from CLARIN.SI and for online browsing. For the relevant publication, see Dobrovoljc et al. (2017) |
|
Slovene sentiment lexicon JOB 1.0 Size: 25,524 entries |
Slovenian |
This is a lexicon of sentiment labels available for download from the CLARIN.SI repository. For the relevant publication, see Bučar et al. (2018) |
Download |
Slovene sentiment lexicon KSS 1.1 Size: 90,620 lexica |
Slovenian | This is a lexicon of sentiment labels available for download from the CLARIN.SI repository. | Download |
Word embeddings CLARIN.SI-embed.sl 1.0 Size: 4,560,444 entries |
Slovenian | This is a lexicon of word embeddings that is available for download from CLARIN.SI. | Download |
Old Swedish morphology (2017-10-16) Size: 41,958 entries |
Swedish | This is a glossary of Old Swedish that is available for download from the SWE-CLARIN repository and can be queried online through KARP. | |
Size: 24,523 entries |
Swedish | This is a lexicon for language technologies which offers access to syntactic information and is connected to SALDO senses. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP. | |
SALDO's morphology (2017-10-16) Size: 128,036 entries |
Swedish | This is a semantic and morphological lexicon for language technologies. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP. | |
Size: 11,624 entries |
Swedish | This is a semantic lexicon that is available for download from the SWE-CLARIN repository and can be queried online through KARP. | |
CORDEX inflectional lookup data 1.0 Size: 111,660 lemmas |
Slovenian |
This lexicon consists of a pickled dictionary of 111,660 lemmas and maps these lemmas to their corresponding word forms. This inflectional data lookup module serves as an optional component within the cordex library that significantly improves the quality of the results. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic description, relevant features (custom features extracted from morphosyntactic descriptions with the help of the Conversion utilities tool), and its frequency within the Gigafida 2.0 corpus, or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g., to find the most frequently used word form of the lemma "centralen" in the singular, i.e. "centralni"). This resource is available for download from the CLARIN.SI repository. |
Download |
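The lookup described for the CORDEX inflectional data (pick the most frequent word form of a lemma that satisfies a filtering condition) can be sketched as follows. This is a minimal illustration, not the cordex library's own code: the `WordForm` class, the `LOOKUP` dictionary layout, the feature keys, and the frequency values are all assumptions made for the example, and the MSD tags are illustrative only. The real resource is a pickled dictionary whose exact structure is documented with the download.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WordForm:
    form: str        # inflected word form
    msd: str         # MULTEXT-East morphosyntactic description (illustrative tags)
    features: dict   # features derived from the MSD (hypothetical keys)
    frequency: int   # corpus frequency (invented values, not Gigafida counts)


# Hypothetical stand-in for the pickled dictionary: lemma -> list of word forms.
LOOKUP = {
    "centralen": [
        WordForm("centralen", "Agpmsnn", {"number": "singular"}, 120),
        WordForm("centralni", "Agpmsny", {"number": "singular"}, 950),
        WordForm("centralnih", "Agpmpg", {"number": "plural"}, 400),
    ],
}


def most_frequent_form(
    lemma: str,
    condition: Callable[[WordForm], bool],
    lookup: dict = LOOKUP,
) -> Optional[str]:
    """Return the most frequent form of `lemma` that passes `condition`, or None."""
    candidates = [wf for wf in lookup.get(lemma, []) if condition(wf)]
    if not candidates:
        return None
    return max(candidates, key=lambda wf: wf.frequency).form


# Most frequent singular form of the lemma "centralen" in the toy data above.
best = most_frequent_form(
    "centralen", lambda wf: wf.features.get("number") == "singular"
)
print(best)
```

With the invented frequencies above, the singular form with the highest count is selected, mirroring the "centralen" example in the table.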
Multilingual Resources
Lexicon | Language | Description | Availability |
---|---|---|---|
Concreteness and imageability lexicon MEGA.HR-Crossling Size: 7,237,589 entries |
77 languages |
These lexica contain concreteness and imageability predictions for 77 languages. They are available for download from CLARIN.SI. For the relevant publication, see Ljubešić et al. (2018) |
Download |
Size: 751 entries (emojis) |
Albanian, Bulgarian, English, German, Hungarian, Polish, Portuguese, Russian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swedish |
This is a lexicon of emojis available for download from CLARIN.SI and for online browsing through a dedicated environment. For the relevant publication, see Kralj Novak et al. (2015) |
|
Size: 37,000 entries |
Arabic, Dutch | This is a bilingual lexicon that is suitable for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). | Download |
MULTEXT-East free lexicons 4.0 Size: 3,665,864 entries |
Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian |
These are morphological lexica available for download from the CLARIN.SI repository. For the relevant publication, see Erjavec (2011) |
Download |
Size: 200 classes, 3,525 entries |
Czech, English | This is a valency lexicon linked to PDT-Vallex, EngVallex and external resources, such as FrameNet, VerbNet, WordNet, etc. The resource is available for download and online browsing through LINDAT. | |
Size: 20,835 pairs (verb senses) |
Czech, English |
This is a valency lexicon linked to the parallel PCEDT corpus. The resource is available for download and online browsing through LINDAT. For the relevant publication, see Fučíková et al. (2016) |
|
Size: 683 verb senses; 22,677 annotated corpus lines |
Croatian, English | This is a manually created verb sense lexicon. | |
The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene Size: 14,182 entries |
Croatian, Dutch, Slovenian |
This lexicon contains manual translations of the NRC Emotion Lexicon, which encodes the sentiment of a word (positive, negative) and its emotion association (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) for Croatian, Dutch and Slovene with a binary schema. Manual translations were produced by inspecting and correcting the automatic translations from English provided with the original lexicon. While translations of all 14,182 entries are provided for Slovene and Croatian, only translations of the 6,468 entries that have any sentiment or emotion associated with the word are given for Dutch. The resource is available for download from the CLARIN.SI repository. |
Download |
Size: 37,000 entries |
Dutch, Arabic | This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). | Download |
Size: 46,000 entries |
Dutch, Danish | This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). | Download |
Size: 50,000 entries |
Dutch, Indonesian | This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). | Download |
Size: 231,516 entries |
English, Spanish, Castilian, Bulgarian, Basque, Dutch, Flemish, Czech, Portuguese | This lexicon is used for the automatic translation of specific IT domain expressions and is available for download from CLARIN PORTULAN. | Download |
MULTEXT-East non-commercial lexicons 4.0 Size: 2,288,228 entries |
Macedonian, Persian, Polish, Russian, Serbian | These are morphological lexica available for download from the CLARIN.SI repository. | Download |
A machine-readable Persian-English dictionary Size: 1,892 entries |
Persian-English | This bilingual lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE. | Download |
A machine-readable Persian-English glossary of verbs Size: 429 entries |
Persian-English | This lexicon of single-word verbs in Modern Persian is available for download from ARCHE. | Download |
Database of the Western South Slavic Verb HyperVerb -- Derivation Size: 8,300 entries |
Bosnian, Croatian, Serbian, Serbo-Croatian, Slovenian |
This lexicon contains the 3,000 most frequent Slovenian and the 5,300 most frequent BCS verbs, all of which are coded for a number of properties related to verb derivation. The database is a table where each verb is given a row of its own. The coded properties are organized in columns. Verbs in the database are coded for the following properties: root information, whether or not the verb has prefixes and the identity of the included prefix(es), whether or not the verb has suffixes and the identity of the included suffix(es), etc. All coded properties are explained in the accompanying PDF file. This resource is available for download from the CLARIN.SI repository. For the relevant publication, see Milosavljević et al. (2023) |
Download |
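The binary schema described for the LiLaH Emotion Lexicon (two sentiment labels and eight emotion labels per word, each either present or absent) can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary layout, the helper names `make_entry` and `has_association`, and the placeholder words are hypothetical and do not reproduce the distributed file format or any real lexicon entries.

```python
# Label inventory as listed in the lexicon description.
SENTIMENTS = ("positive", "negative")
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")
LABELS = SENTIMENTS + EMOTIONS


def make_entry(word: str, active: set) -> dict:
    """Build one entry with a 0/1 flag for every label (hypothetical layout)."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {"word": word, **{label: int(label in active) for label in LABELS}}


def has_association(entry: dict) -> bool:
    """True if at least one sentiment or emotion flag is set."""
    return any(entry[label] for label in LABELS)


# Placeholder entries; the words and their labels are invented.
entries = [
    make_entry("example_word_1", {"positive", "joy"}),
    make_entry("example_word_2", set()),
    make_entry("example_word_3", {"negative", "fear", "sadness"}),
]

# Keep only entries with some association, analogous to the subset
# that the description says was translated for Dutch.
associated = [e["word"] for e in entries if has_association(e)]
print(associated)
```

Filtering on "any flag set" corresponds to the distinction the description draws between the full 14,182-entry list and the 6,468 entries that carry a sentiment or emotion association.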
Publications
[Bjarnadóttir 2012] Kristín Bjarnadóttir. 2012. The Database of Modern Icelandic Inflection (Beygingarlýsing íslensks nútímamáls).
[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.
[Erjavec 2011] Tomaž Erjavec. 2011. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.
[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources.
[Fučíková et al. 2016] Eva Fučíková, Jan Hajič, and Zdeňka Urešová. 2016. Joint search in a bilingual valency lexicon and an annotated corpus.
[Dobrovoljc et al. 2017] Kaja Dobrovoljc, Simon Krek, and Tomaž Erjavec. 2017. The Sloleks Morphological Lexicon and its Future Development.
[Kralj Novak et al. 2015] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of Emojis.
[Krsnik and Robnik Šikonja 2017] Luka Krsnik and Marko Robnik Šikonja. 2017. Napovedovanje naglasa slovenskih besed z metodami strojnega učenja.
[Ljubešić et al. 2015] Nikola Ljubešić, Kaja Dobrovoljc, and Darja Fišer. 2015. MWELex – MWE Lexica of Croatian, Slovene and Serbian Extracted from Parsed Corpora.
[Ljubešić et al. 2016] Nikola Ljubešić, Filip Klubička, Željko Agić, and Ivo-Pavao Jazbec. 2016. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian.
[Ljubešić et al. 2018] Nikola Ljubešić, Darja Fišer, and Anita Peti-Stantić. 2018. Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings.
[Lopatková et al. 2017] Markéta Lopatková et al. 2017. Valenční slovník českých sloves VALLEX.
[Protopapas et al. 2010] Athanassios Protopapas, Marina Tzakosta, Aimilios Chalamandaris, and Pirros Tsiakoulis. 2010. IPLR: an online resource for Greek word-level and sublexical information.
[Urešová 2011] Zdeňka Urešová. 2011. Valenční slovník Pražského závislostního korpusu (PDT-Vallex).
[Úlfarsdóttir 2014] Thórdís Úlfarsdóttir. 2014. ISLEX – a Multilingual Web Dictionary.