
Lexica

Lexica are primarily used in applications. They typically contain an extensive lexical inventory with specific linguistic information (e.g., morphosyntax, sentiment). There are 83 lexica in the CLARIN infrastructure. Most (67) of the lexica are monolingual, accounting for 17 languages (Arabic, Croatian, Czech, Danish, Dutch, English, Estonian, French, Icelandic, Italian, Greek, Maltese, Polish, Portuguese, Serbian, Slovenian, and Swedish). The rest (16) are multilingual and include a variety of language combinations. In the vast majority of cases, the lexica can be downloaded directly from the national repositories or queried through easy-to-use online search environments.

For comments, changes to the existing content, or the inclusion of new resources, send us an email at resource-families [at] clarin.eu.

 

Lexica in the CLARIN Infrastructure

Monolingual Resources

Lexicon Language Description Availability

A machine-readable dictionary of Egyptian Arabic

Size: 2,418 entries
Annotation: basic morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Arabic (Egyptian) This lexicon presents a more comprehensive version of A machine-readable glossary of Egyptian Arabic. The resource is available for download from ARCHE. Download

A machine-readable glossary of Egyptian Arabic

Size: 2,204 entries
Annotation: basic morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Arabic (Egyptian) This lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE. Download

Automatically constructed multiword lexicon hrMWELex v0.5

Size: 43,730 entries
Annotation: multi-word expressions
Licence: CC-BY 4.0

Croatian

This is a lexicon of multiword expressions available for download from CLARIN.SI.

For the relevant publication, see Ljubešić et al. (2015)

Download

Inflectional lexicon hrLex 1.3

Size: 6,427,709 items, 164,206 entries
Annotation: wordform, lemma, MSD, UPOS
Licence: CC-BY 4.0

Croatian

This is a large inflectional lexicon where each entry consists of a (wordform, lemma, MSD, MSD features, UPOS, morphological features, frequency, per-million frequency) 8-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the hrWaC v2.2 corpus. The MSD tagset follows the MULTEXT-East V6 tagset for the Serbo-Croatian macro-language. The UPOS and morphological features follow the UD v2 specifications.

The resource is available for download from CLARIN.SI.

For the relevant publication, see Ljubešić et al. (2016)

Download
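As an illustration of the 8-tuple layout described for this lexicon, the following minimal sketch parses one entry into a named record. It assumes a tab-separated line with the fields in the order listed in the description; the field names, the `parse_entry` helper, and the sample values (including the frequencies) are hypothetical and are not taken from the hrLex distribution.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One inflectional-lexicon entry (field names are illustrative)."""
    wordform: str
    lemma: str
    msd: str
    msd_features: str
    upos: str
    morph_features: str
    frequency: int
    per_million: float


def parse_entry(line: str) -> LexEntry:
    """Split an assumed tab-separated line into the eight fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    wordform, lemma, msd, msd_feats, upos, feats, freq, pm = fields
    return LexEntry(wordform, lemma, msd, msd_feats, upos, feats,
                    int(freq), float(pm))


# Invented sample line; the values are placeholders, not real lexicon data.
sample = ("kuće\tkuća\tNcfsg\tType=common|Gender=feminine\tNOUN\t"
          "Case=Gen|Gender=Fem|Number=Sing\t1200\t0.86")
entry = parse_entry(sample)
print(entry.lemma, entry.upos, entry.frequency)
```

Keeping the two frequency fields as numbers rather than strings makes it straightforward to sort or filter entries by corpus frequency.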

Word embeddings CLARIN.SI-embed.hr 1.0

Size: 3,147,352 entries
Annotation: PoS-tags, lemmas
Licence: CC-BY 4.0

Croatian This lexicon contains word embeddings extracted from the Croatian web corpus hrWaC and a 400-million-token collection of newspaper texts. The resource is available for download from CLARIN.SI. Download

DeriNet 1.6

Size: 1,027,832 entries
Licence: CC-BY-NC-SA 3.0

Czech This is a lexicon of derivational relations (both compounding and inflections). The resource is available for download and online browsing through LINDAT.

Browse

Download

MorfFlex CZ

Size: 124,259,099 lexical types
Annotation: MSD-tags, derivational, semantic, NER information
Licence: CC-BY-NC-SA 3.0

Czech This is a morphological lexicon available for download from LINDAT. Download

ParaDi 2.0

Size: 1,621 entries
Annotation: MSD-tags, syntactic/semantic features
Licence: CC-BY 4.0

Czech This is a lexicon of single-word paraphrases of Czech verbal multiword expressions. The resource is available for download from LINDAT. Download

PDT-Vallex

Size: 7,121 entries, 11,933 frames
Annotation: verb, adjective and noun valency
Licence: CC-BY-NC-SA 4.0

Czech

This is a valency lexicon linked to several Czech corpora (PDT, PCEDT Cz side, PDTSC, Faust). The resource is available for download and online browsing through LINDAT.

For the relevant publication, see Urešová (2011)

Browse

Download

VALLEX 3.0

Size: 2,722 entries, 6,711 units, 6,711 frames, 4,586 words
Annotation: verb senses (characterized by glosses and examples)
Licence: CC-BY-NC-SA 4.0

Czech

This is a valency lexicon available for download and online browsing through LINDAT.

For the relevant publication, see Lopatková et al. (2017)

Browse

Download

STO morphology (v2) - LMF format

Size: 87,209 entries
Licence: CC BY-SA 4.0

Danish This morphological lexicon is available for download from the CLARIN-DK repository. It is also available in the .csv format. Download

STO syntax (v2) - LMF format

Size: 84,159 entries
Licence: CC BY-SA 4.0

Danish This syntactic lexicon is available for download from the CLARIN-DK repository. Download

Basilex Lexicon

 

Dutch This is a lexicon that comprises all lemmas from the Basilex Corpus. The Basilex Corpus is an annotated collection of texts written for children in elementary school. The resource is available for download from the Dutch Language Institute (INT). Download

Basiscript Lexicon

Licence: other

Dutch This is a lexicon that comprises all lemmas from the Basiscript Corpus. The Basiscript Corpus is an annotated collection of texts written by children in elementary school. The resource is available for download from the Dutch Language Institute (INT). Download

CombiLex

Size: 213,000 lemmas
Annotation: lemmas and word forms
Licence: other

Dutch This is a lexicon of words and word forms available for download from the Dutch Language Institute (INT). Download

e-Lex

Size: 220,000 entries, 600,000 word forms, 77,000 multi-word expressions, 26,000 multi-word lemmas
Annotation: MSD-tags, syntactical and phonological information, partially semantically annotated
Licence: other

Dutch

This is a lexical database that consists of a one-word lexicon and a multi-word lexicon.

This lexicon is available for download from the Dutch Language Institute (INT).

Download

Dutch Electronic Lexicon of Multiword Expressions

Size: 5,000 expressions
Licence: other

Dutch This is a lexicon of multiword expressions available for download from the Dutch Language Institute (INT). Download

PAROLE Lexicon

Size: 20,000 entries
Annotation: MSD-tags and syntactic complementation patterns
Licence: other

Dutch This morphosyntactic lexicon is available for download from the Dutch Language Institute (INT). Download

Reference Lexicon for Belgian-Dutch (RBBN)

Size: 4,000 words and expressions
Licence: other

Dutch This lexicon, which contains words and expressions typical of Dutch as spoken in Belgium, is available for download from the Dutch Language Institute (INT). Download

Reference Lexicon for Dutch

Size: 50,000 lemmas
Annotation: dialectal information
Licence: other

Dutch This is a corpus-based monolingual lexicon available for download from the Dutch Language Institute (INT). Download

BioLexicon

Size: over 2.2 million entries (over 3.3 million semantic relations)
Licence: ELRA END USER

English This is a large-scale, wide-coverage computational lexicon covering the biomedical domain. The resource is unavailable for download or online browsing, but can be accessed by contacting the resource manager.  

EngVallex

Size: 4,337 entries, 7,148 frames
Annotation: verb valency
Licence: CC-BY-NC-SA 4.0

English This is a valency lexicon linked to the English side of the PCEDT corpus (WSJ corpus). The resource is available for download from LINDAT and for online browsing.

Browse

Download

The Database of Estonian Multi-Word Expressions

Size: 12,500 words
Licence: proprietary

Estonian This is a collection of lexica that contain multi-word expressions consisting of a verb and a particle or a verb and its complements. The resource is available for download from META-SHARE (CELR distribution) and for online browsing through a dedicated website.

Browse

Download

Démonette

Size: 96,027 entries
Annotation: MSD-tags (GRACE format), semantic types
Licence: CC-BY 4.0

French This is a morphological lexicon available for download from ORTOLANG. Download

Dicovalence

Size: 8,000 entries
Annotation: c- and s-selectional restrictions
Licence: GNU Lesser General Public License (LGPL)

French

This is a verb-valency lexicon.

The lexicon specifies certain selectional restrictions, possible term manifestations (pronominal, phrasal), and whether the valency frames can be used in various passive constructions, as well as references to other valency frames for the same infinitive. The resource is available for download from ORTOLANG.

Download

MarsaLex

Size: 595,000,000 inflected forms
Licence: CC-BY 4.0

French This is a morphological lexicon available for download from ORTOLANG. Download

Morphalou

Size: 159,261 entries
Annotation: spelling, phonetics, mood, tense, MSD-tags, spelling variant, feminine variation, pronominal
Licence: GNU Lesser General Public License (LGPL)

French This is a morphological lexicon available for download from ORTOLANG. Download

VfrLPL

Size: 8,800 entries
Annotation: conjugation forms, phonetic forms, use frequencies
Licence: restricted

French This is a morphosyntactic lexicon available for download from ORTOLANG. Download

ILSP PsychoLinguistic Resource

Size: 217,664 entries
Annotation: phonetic transcription, frequency of usage
Licence: CC-BY-NC-SA

Greek

This is a lexicon for psycholinguistic research. The resource is available for download from clarin:el.

For the relevant publication, see Protopapas et al. (2010)

Download

Database of Modern Icelandic Inflections

Size: 305,000 lemmas; 6.5 million inflectional forms; 48,000 non-standard word forms
Annotation: MSD-tags
Licence: CC BY-SA 4.0

Icelandic

This is a morphological lexicon created for use in language technology (LT), as a reference for the general public in Iceland, and for use in research on the Icelandic language. The term Modern Icelandic here refers to contemporary Icelandic, i.e. late 20th and 21st century usage.

The lexicon is available for download and online browsing through CLARIN-IS.

For the relevant publication, see Bjarnadóttir (2012)

Browse

Download

Italian Content Words v3

Size: 2,342,120 items
Licence: CC-BY-NC-SA 4.0

Italian This is a morphological lexicon. The resource is available for download from LINDAT. Download

Italian Function Words v3

Size: 3,510 entries
Licence: CC-BY-NC-SA 4.0

Italian This is a morphological lexicon. The resource is available for download from LINDAT. Download

OpeNER Sentiment Lexicon Italian - LMF

Size: 24,293 entries
Annotation: positive/negative/neutral polarity
Licence: CC-BY 4.0

Italian This is a sentiment lexicon available for download from ILC4CLARIN. Download

PAROLE-SIMPLE-CLIPS

Size: 37,406 syntactic units
Licence: CC-BY-SA 4.0

Italian This is a morphological lexicon available for download from ILC4CLARIN. Download

Maltese Speech Engine Lexicon

Size: 39,242 entries
Annotation: PoS-tags, orthographic transcription, phonetic forms, syllables, stress position
Licence: MS-BY-NC-SA

Maltese This is a speech lexicon that is useful for building speech-to-text systems. It is available for download from CLARIN PORTULAN. Download

Emotional Annotations Dictionary

Size: 178,514 elements
Licence: CC-BY 4.0

Polish This is a lexicon with emotional annotation extracted from Polish Wordnet. The resource is available for download from the CLARIN-PL repository. Download

Extended dictionary of named entities NELexicon connected with Linked Open Data

Size: 103,585 entries
Licence: GNU LGPL 3.0

Polish This lexicon contains Polish named entities connected with terminology from available resources within Linked Open Data (e.g. WordNet, DBPedia, Wikipedia, etc.). The resource is available for download from the CLARIN-PL repository. Download

MWELexicon 1.1

Size: 56,500 lexical units
Annotation: syntactic behaviour
Licence: plWordNet

Polish This is a lexicon of multiword expressions available for download from the CLARIN-PL repository. Download

Walenty (2018-06-29)

Size: 18,236 entries
Licence: CC-BY-SA 4.0

Polish This is a lexicon of verb valency that is available for download from the CLARIN-PL repository. Download

LEX-MWE-PT: Word Combination in Portuguese Language

Size: 1,198 entries, 12,753 multiword units
Annotation: lemmas
Licence: MS NC-NoReD-ND

Portuguese This is a lexicon of multiword expressions. The resource is available for download from CLARIN PORTULAN. Download

LX-Abbreviations

Size: 208 words
Annotation: MSD-tags
Licence: MS NC-NoReD-ND

Portuguese This is a lexicon of abbreviations. The resource is available for download from CLARIN PORTULAN. Download

LX-DSemVectors

Size: 17,572 words
Annotation: word embeddings
Licence: MS NC-NoReD-ND

Portuguese This lexicon provides distributional semantic representations of Portuguese words. The dataset is available for download from GitHub. Download

LX-Rare Word Similarity Dataset

Size: 2,034 words
Annotation: synonyms
Licence: MS NC-NoReD-ND

Portuguese This is a word-similarity lexicon available for download from CLARIN PORTULAN. Download

LX-SimLex-999

Size: 1,998 words
Annotation: MSD-tags, linguistic standardness
Licence: MS NC-NoReD-ND

Portuguese This is a word-similarity lexicon. The resource is available for download from CLARIN PORTULAN. Download

LX-StopWords

Size: 2,631 words
Annotation: MSD-tags, MWEs
Licence: MS NC-NoReD-ND

Portuguese This is a manually compiled exhaustive list of closed-class words in Portuguese. The resource is available for download from CLARIN PORTULAN. Download

LX-WordSim-353

Size: 706 words
Annotation: synonyms, antonyms, identical, hypernym-hyponym, sibling terms, meronym-holonym
Licence: MS NC-NoReD-ND

Portuguese This is a word-similarity lexicon. The resource is available for download from CLARIN PORTULAN. Download

Multifunctional Computational Lexicon of Contemporary Portuguese

Size: 26,443 entries
Annotation: lemmas, MWEs, PoS-tags
Licence: CC-BY-SA

Portuguese This is a frequency lexicon suitable for NLP-specific purposes (information extraction, lemmatization, PoS tagging). The resource is available for download from CLARIN PORTULAN. Download

PAROLE Portuguese Lexicon

Size: 20,000 entries
Annotation: MSD tags, lemma
Licence: ELRA EVALUATION

Portuguese This is a morphosyntactic lexicon available for download from CLARIN PORTULAN. Download

Porlex

Size: 27,374 words
Annotation: orthographic and phonological/phonetic transcriptions, MSD-tags, and frequency information
Licence: MS NC-NoReD-ND

Portuguese This is a lexicon that provides psycholinguistic and cognitive information that is useful to select stimulus materials for experiments and/or training vocabularies. The resource is available for download from CLARIN PORTULAN. Download

Simple Portuguese Lexicon

Size: 10,438 entries
Annotation: qualia structure, semantic relations (hyponymy, synonymy, etc.)
Licence: MS-BY-NC-SA

Portuguese This semantic lexicon is available for download from CLARIN PORTULAN. Download

Automatically constructed multiword lexicon srMWELex v0.5

Size: 22,290 entries
Annotation: MWEs
Licence: CC-BY 4.0

Serbian This is a lexicon of multiword expressions available for download from CLARIN.SI. Download

Inflectional lexicon srLex 1.3

Size: 6,905,941 items, 169,328 entries
Annotation: wordform, lemma, MSD
Licence: CC-BY 4.0

Serbian

This is a large inflectional lexicon where each entry consists of a (wordform, lemma, MSD, MSD features, UPOS, morphological features, frequency, per-million frequency) 8-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the srWaC corpus. The MSD tagset follows the MULTEXT-East V6 tagset for the Serbo-Croatian macro-language. The UPOS and morphological features follow the UD v2 specifications.

The resource is available for download from CLARIN.SI.

For the relevant publication, see Ljubešić et al. (2016)

Download

Word embeddings CLARIN.SI-embed.sr 1.0

Size: 1,480,566 entries
Annotation: PoS-tags, lemmas
Licence: CC-BY 4.0

Serbian This lexicon contains word embeddings from the srWaC web corpus. The resource is available for download from CLARIN.SI. Download

Automatically constructed multiword lexicon slMWELex v0.5

Size: 47,579 entries
Annotation: MWEs
Licence: CC-BY 4.0

Slovenian This is a lexicon of multiword expressions available for download from CLARIN.SI. Download

Automatically stress labelled morphological lexicon Sloleks 1.2, version 1.1

Size: 100,805 entries, 2,774,745 words
Annotation: wordforms, PoS-tags, lemmas, frequency, prosody
Licence: CC-BY-NC-SA 4.0

Slovenian

This is an extended version of the morphological lexicon Sloleks 1.2 with added information about the stress of each word form. The resource is available for download from CLARIN.SI.

For the relevant publication, see Krsnik and Robnik Šikonja (2017)

Download

Beseda Corpus Lemmatisation Lexicon

Size: 3,228,127 entries
Annotation: wordforms, PoS-tags, lemmas, frequency
Licence: CC-BY 4.0

Slovenian This lexicon contains inflected open class words from the Dictionary of Standard Slovenian that are augmented by wordforms, their part of speech tags and their lemmas used during the PoS tagging and lemmatization of the Beseda corpus. The resource is available for download from CLARIN.SI and for online browsing.

Browse

Download

Collocation lexicon of Slovene academic discourse Aleks

Size: 463 entries
Annotation: collocations
Licence: CC-BY 4.0

Slovenian

This is a lexicon of entries typical for general Slovene academic discourse. The entries include typical context examples (collocations and examples of use) taken from KAS, a corpus of Slovene academic texts (see also the Academic corpora resource family). KAS is a morphosyntactically tagged, synchronic, monolingual corpus containing more than 1.5 billion words.

The resource is available for download from CLARIN.SI.

Download

Lexicon of historical Slovene imp25k 1.1

Size: 28,034 entries
Annotation: MSD-tags, lemmas, etymological glosses
Licence: CC-BY 4.0

Slovenian

This is a morphological lexicon available for download from CLARIN.SI and for online browsing through a dedicated environment.

For the relevant publication, see Erjavec (2015)

Browse

Download

Morphological lexicon Sloleks 2.0

Size: 100,805 entries
Annotation: wordforms, PoS-tags, lemmas, frequency, phonology
Licence: CC-BY-NC-SA 4.0

Slovenian

This is a reference morphological lexicon of the Slovenian language developed to be used in NLP applications and language manuals. The resource is available for download from CLARIN.SI and for online browsing.

For the relevant publication, see Dobrovoljc et al. (2017)

Browse

Download

Slovene sentiment lexicon JOB 1.0

Size: 25,524 entries
Annotation: sentiment tags
Licence: CC-BY-SA 4.0

Slovenian

This is a lexicon of sentiment labels available for download from the CLARIN.SI repository.

For the relevant publication, see Bučar et al. (2018)

Download

Slovene sentiment lexicon KSS 1.1

Size: 90,620 lexica
Annotation: lemmas, sentiment tags
Licence: CC-BY 4.0

Slovenian This is a lexicon of sentiment labels available for download from the CLARIN.SI repository. Download

Word embeddings CLARIN.SI-embed.sl 1.0

Size: 4,560,444 entries
Annotation: PoS-tags, lemmas
Licence: CC-BY 4.0

Slovenian This is a lexicon of word embeddings that is available for download from CLARIN.SI. Download

Old Swedish morphology (2017-10-16)

Size: 41,958 entries
Licence: CC-BY 4.0

Swedish This is a glossary of Old Swedish that is available for download from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

Parole+ (2017-10-16)

Size: 24,523 entries
Licence: CC-BY 4.0

Swedish This is a lexicon for language technologies which offers access to syntactic information and is connected to SALDO senses. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

SALDO's morphology (2017-10-16)

Size: 128,036 entries
Licence: CC-BY 4.0

Swedish This is a semantic and morphological lexicon for language technologies. The resource can be downloaded from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

Simple lexicon

Size: 11,624 entries
Licence: CC-BY 4.0

Swedish This is a semantic lexicon that is available for download from the SWE-CLARIN repository and can be queried online through KARP.

Browse

Download

CORDEX inflectional lookup data 1.0

Size: 111,660 lemmas
Annotation: MSD-tagged, lemmatised, frequency
Licence: CC-BY-NC-SA 4.0

Slovenian

This lexicon consists of a pickled dictionary that maps 111,660 lemmas to their corresponding word forms. This inflectional data lookup module serves as an optional component within the cordex library that significantly improves the quality of the results.

Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic description, relevant features (custom features extracted from morphosyntactic descriptions with the help of the Conversion utilities tool), and its frequency within the Gigafida 2.0 corpus, or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. finding the most frequent singular word form of the lemma "centralen", i.e. "centralni").

This resource is available for download from the CLARIN.SI repository.

Download
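The selection step described for this resource can be sketched as follows. This is a minimal sketch, not the cordex implementation: the dictionary layout (a lemma mapped to a list of word form, MSD, features, and frequency tuples), the `most_frequent_form` helper, the feature names, and all sample values are assumptions made for illustration, and the actual pickled structure may differ.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: lemma -> list of (word form, MSD, features, frequency).
Form = Tuple[str, str, Dict[str, str], int]

# Invented sample data; the MSDs and frequencies are placeholders.
lookup: Dict[str, List[Form]] = {
    "centralen": [
        ("centralni", "Agpmsny", {"number": "singular"}, 900),
        ("centralen", "Agpmsnn", {"number": "singular"}, 150),
        ("centralnih", "Agpmpg", {"number": "plural"}, 400),
    ],
}


def most_frequent_form(lemma: str, **required: str) -> Optional[str]:
    """Return the most frequent word form whose features match `required`."""
    candidates = [
        form for form in lookup.get(lemma, [])
        if all(form[2].get(key) == value for key, value in required.items())
    ]
    if not candidates:
        return None
    # Pick the candidate with the highest corpus frequency.
    return max(candidates, key=lambda form: form[3])[0]


print(most_frequent_form("centralen", number="singular"))  # centralni
```

A real pickled file would be loaded with `pickle.load`, which should only be used on files obtained from a trusted source, since unpickling can execute arbitrary code.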

Multilingual Resources

Lexicon Language Description Availability

Concreteness and imageability lexicon MEGA.HR-Crossling

Size: 7,237,589 entries
Annotation: concreteness prediction, imageability prediction
Licence: CC-BY-SA 4.0

77 languages

These lexica contain concreteness and imageability predictions for 77 languages. They are available for download from CLARIN.SI.

For the relevant publication, see Ljubešić et al. (2018)

Download

Emoji Sentiment Ranking 1.0

Size: 751 entries (emojis)
Annotation: sentiment labels
Licence: CC-BY-SA 4.0

Albanian, Bulgarian, English, German, Hungarian, Polish, Portuguese, Russian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swedish

This is a lexicon of emojis available for download from CLARIN.SI and for online browsing through a dedicated environment.

For the relevant publication, see Kralj Novak et al. (2015)

Browse

Download

OMBI Dutch-Arabic

Size: 37,000 entries
Licence: other

Arabic, Dutch This is a bilingual lexicon that is suitable for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). Download

MULTEXT-East free lexicons 4.0

Size: 3,665,864 entries
Annotation: MSD-tags, lemmas
Licence: CC-BY-SA 4.0

Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian

These are morphological lexica available for download from the CLARIN.SI repository.

For the relevant publication, see Erjavec (2011)

Download

CzEngClass 0.2

Size: 200 classes, 3,525 entries
Annotation: valency and synonymy
Licence: CC-BY-NC-SA 4.0

Czech, English This is a valency lexicon linked to PDT-Vallex, EngVallex and external resources, such as FrameNet, VerbNet, WordNet, etc. The resource is available for download and online browsing through LINDAT.

Browse

Download

CzEngVallex

Size: 20,835 pairs (verb senses)
Annotation: verb valency
Licence: CC-BY-NC-SA 4.0

Czech, English

This is a valency lexicon linked to the parallel PCEDT corpus. The resource is available for download and online browsing through LINDAT.

For the relevant publication, see Fučíková et al. (2016)

Browse

Download

CroaTPAS

Size: 683 verb senses; 22,677 annotated corpus lines

Croatian, English This is a manually created verb sense lexicon.  

The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene

Size: 14,182 entries
Annotation: word sentiment
Licence: CC-BY-NC-SA 4.0

Croatian, Dutch, Slovenian

This lexicon contains manual translations of the NRC Emotion Lexicon into Croatian, Dutch and Slovene. Using a binary schema, it encodes the sentiment of a word (positive, negative) and its emotion association (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Manual translations were produced by inspecting and correcting the automatic translations from English provided with the original lexicon. While translations of all 14,182 entries are provided for Slovene and Croatian, Dutch translations are given only for the 6,468 entries that have a sentiment or emotion associated with the word.

The resource is available for download from the CLARIN.SI repository.

Download
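To make the binary schema concrete, the sketch below represents one entry as a set of 0/1 flags over the two sentiment and eight emotion categories named in the description. The dictionary layout, the helper names, and the flag values of the sample entry are hypothetical; they do not reproduce the lexicon's actual file format or data.

```python
from typing import Dict, List

SENTIMENTS = ["positive", "negative"]
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# Hypothetical entry: each category is flagged 1 (associated) or 0 (not).
entry: Dict[str, int] = {
    "positive": 0, "negative": 1,
    "anger": 1, "anticipation": 0, "disgust": 0, "fear": 1,
    "joy": 0, "sadness": 0, "surprise": 0, "trust": 0,
}


def active_labels(flags: Dict[str, int]) -> List[str]:
    """List the categories flagged with 1, in a fixed order."""
    return [label for label in SENTIMENTS + EMOTIONS
            if flags.get(label, 0) == 1]


def has_any_association(flags: Dict[str, int]) -> bool:
    """True if at least one sentiment or emotion flag is set."""
    return bool(active_labels(flags))


print(active_labels(entry))        # ['negative', 'anger', 'fear']
print(has_any_association(entry))  # True
```

A check like `has_any_association` corresponds to the criterion mentioned for Dutch, where only entries with at least one sentiment or emotion association were translated.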

OMBI Arabic-Dutch

Size: 37,000 entries
Licence: other

Dutch, Arabic This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). Download

OMBI Dutch-Danish

Size: 46,000 entries
Licence: other

Dutch, Danish This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). Download

OMBI Dutch-Indonesian

Size: 50,000 entries
Licence: other

Dutch, Indonesian This is a bilingual lexicon for language technology applications such as automatic translation, e-learning, multilingual information retrieval, etc. The resource is available for download from the Dutch Language Institute (INT). Download

QTLeap specialized lexicons

Size: 231,516 entries
Licence: CC-BY

English, Spanish, Bulgarian, Basque, Dutch, Czech, Portuguese This lexicon is used for the automatic translation of specific IT domain expressions and is available for download from CLARIN PORTULAN. Download

MULTEXT-East non-commercial lexicons 4.0

Size: 2,288,228 entries
Annotation: MSD-tags, lemmas
Licence: CC-BY-NC 4.0

Macedonian, Persian, Polish, Russian, Serbian These are morphological lexica available for download from the CLARIN.SI repository. Download

A machine-readable Persian-English dictionary

Size: 1,892 entries
Annotation: morphological information, usage examples
Licence: CC-BY-NC-SA 3.0

Persian-English This bilingual lexicon has been compiled for comparative as well as didactic purposes in the on-going VICAV project. The resource is available for download from ARCHE. Download

A machine-readable Persian-English glossary of verbs

Size: 429 entries
Annotation: basic morphological information
Licence: CC-BY-NC-SA 3.0

Persian-English This lexicon of single-word verbs in Modern Persian is available for download from ARCHE. Download

Database of the Western South Slavic Verb HyperVerb -- Derivation

Size: 8,300 entries
Annotation: root, prefix and suffix information for verb forms
Licence: CC-BY-SA 4.0

Bosnian, Croatian, Serbian, Serbo-Croatian, Slovenian

This lexicon contains the 3,000 most frequent Slovenian verbs and the 5,300 most frequent BCS (Bosnian/Croatian/Serbian) verbs, all coded for a number of properties related to verb derivation.

The database is a table in which each verb has a row of its own and the coded properties are organized in columns. Verbs are coded for the following properties, among others: root information, whether the verb has prefixes and which prefix(es) it includes, and whether the verb has suffixes and which suffix(es) it includes. All coded properties are explained in the accompanying PDF file.

This resource is available for download from the CLARIN.SI repository.

For the relevant publication, see Milosavljević et al. (2023)

Download
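The one-row-per-verb layout described for this database lends itself to simple column-based queries. The sketch below assumes a comma-separated export; the column names, the sample rows, and their codings are invented for illustration, and the real properties are those documented in the accompanying PDF file.

```python
import csv
import io

# Hypothetical columns and codings, held in memory instead of a real file.
table = io.StringIO(
    "verb,language,root,prefix,suffix\n"
    "pisati,Slovenian,pis,,a\n"
    "napisati,Slovenian,pis,na,a\n"
    "prepisati,Slovenian,pis,pre,a\n"
)

# Each row becomes a dict keyed by column name.
rows = list(csv.DictReader(table))

# Select the verbs whose prefix column is non-empty.
prefixed = [row["verb"] for row in rows if row["prefix"]]
print(prefixed)  # ['napisati', 'prepisati']
```

Because every property sits in its own column, further filters (for example by root or by suffix) follow the same pattern.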

Publications

[Bjarnadóttir 2012] Kristín Bjarnadóttir. 2012. The Database of Modern Icelandic Inflection (Beygingarlýsing íslensks nútímamáls).

[Bučar et al. 2018] Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018. Annotated news corpora and a lexicon for sentiment analysis in Slovene.

[Erjavec 2011] Tomaž Erjavec. 2011. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages.

[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources.

[Fučíková et al. 2016] Eva Fučíková, Jan Hajič, and Zdeňka Urešová. 2016. Joint search in a bilingual valency lexicon and an annotated corpus.

[Dobrovoljc et al. 2017] Kaja Dobrovoljc, Simon Krek, and Tomaž Erjavec. 2017. The Sloleks Morphological Lexicon and its Future Development.

[Kralj Novak et al. 2015] Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of Emojis.

[Krsnik and Robnik Šikonja 2017] Luka Krsnik and Marko Robnik Šikonja. 2017. Napovedovanje naglasa slovenskih besed z metodami strojnega učenja.

[Ljubešić et al. 2015] Nikola Ljubešić, Kaja Dobrovoljc, and Darja Fišer. 2015. MWELex – MWE Lexica of Croatian, Slovene and Serbian Extracted from Parsed Corpora.

[Ljubešić et al. 2016] Nikola Ljubešić, Filip Klubička, Željko Agić, and Ivo-Pavao Jazbec. 2016. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian.

[Ljubešić et al. 2018] Nikola Ljubešić, Darja Fišer, and Anita Peti-Stantić. 2018. Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings.

[Lopatková et al. 2017] Markéta Lopatková et al. 2017. Valenční slovník českých sloves VALLEX.

[Protopapas et al. 2010] Athanassios Protopapas, Marina Tzakosta, Aimilios Chalamandaris, and Pirros Tsiakoulis. 2010. IPLR: an online resource for Greek word-level and sublexical information.

[Urešová 2011] Zdeňka Urešová. 2011. Valenční slovník Pražského závislostního korpusu (PDT-Vallex). 

[Úlfarsdóttir 2014] Thórdís Úlfarsdóttir. 2014. ISLEX – a Multilingual Web Dictionary.