Tour de CLARIN: CLARIN-LT introduces Colloc

Submitted by Jakob Lenardič on 13 September 2018

Blog post written by Tomas Krilavičius and Jolanta Kovalevskaitė

Colloc is an experimental tool for the automatic identification of multiword expressions (MWEs). Multiword expressions (or multiword units) are fixed word combinations that differ in nature: some are semantically non-compositional, i.e. their overall meaning differs from the sum of their individual parts (idioms or phraseologisms), whereas others are semantically transparent but subject to usage-based co-occurrence restrictions (collocations). The tool, developed by a team of researchers working at the CLARIN-LT centre at Vytautas Magnus University and the Baltic Institute of Advanced Technology, covers the whole process of MWE identification and can also be used to develop new MWE identification methods.

The experimental prototype includes all the steps of linguistic analysis, namely:

Text preprocessing
POS tagging
N-gram generation and calculation of their statistical properties
Calculation of Lexical Association Measures (LAMs)
Word embedding generation
MWE identification using:
- Filtering (gazetteers, dictionaries)
- Application of LAMs
- Application of Machine Learning
- Hybrid methods
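
To make the n-gram and LAM steps above more concrete, here is a minimal sketch in Python. It is not Colloc's actual code: the function names are hypothetical, the input is assumed to be an already tokenized text, and pointwise mutual information (PMI) and the Dice coefficient are used only as two widely known examples of lexical association measures. The post does not say which measures Colloc itself applies.

```python
import math
from collections import Counter


def bigram_association(tokens, min_count=2):
    """Score adjacent word pairs with two common association measures.

    Returns a dict mapping each bigram to (count, pmi, dice).
    Illustrative only; this is an assumption, not Colloc's implementation.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), c12 in bigrams.items():
        # Rare pairs give unreliable estimates, so skip them.
        if c12 < min_count:
            continue
        p12 = c12 / n_bigrams
        p1 = unigrams[w1] / n_tokens
        p2 = unigrams[w2] / n_tokens
        # PMI: how much more often the pair occurs than chance predicts.
        pmi = math.log2(p12 / (p1 * p2))
        # Dice: share of the two words' occurrences that are joint.
        dice = 2 * c12 / (unigrams[w1] + unigrams[w2])
        scores[(w1, w2)] = (c12, pmi, dice)
    return scores


def rank_candidates(tokens, min_count=2, top=5):
    """Return the highest-PMI bigrams as MWE candidates."""
    scores = bigram_association(tokens, min_count)
    return sorted(scores, key=lambda b: scores[b][1], reverse=True)[:top]


if __name__ == "__main__":
    text = "new york is big and new york is old but the city is new"
    for bigram in rank_candidates(text.split()):
        print(bigram)
```

In a fuller pipeline, candidates ranked this way would typically be filtered further, for example against dictionaries or by a trained classifier, as in the filtering, machine learning and hybrid steps listed above.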

The basic user version of Colloc, which is cloud-based and will be available soon, currently supports only Lithuanian and was trained on a 70-million-word corpus collected from the Lithuanian news portal delfi.lt. The tool was trained using GloVe word vectors and employs artificial neural networks. It is designed to be user-friendly: researchers only have to upload the text file whose multiword expressions they want analyzed (as in the image below), and the tool returns the annotated document.

Colloc Upload Window

It is important to have a tool that can extract MWE candidates from particular texts, since this opens up more possibilities not only for terminological and lexicographic perspectives on language analysis, but also for different areas of applied linguistics, such as language learning. The tool will help linguists perform deeper analyses of language and investigate its compositionality, idiomaticity and dynamics. Language technology specialists will be able to use Colloc to improve automatic text analysis, machine translation and information extraction tools, and to make chatbots more human-like.

Further information will be available at http://mwe.lt/en_US/.
The source is accessible at https://bitbucket.org/ievabumb/mwe_tagger_prototype/src.
The tool and related corpora will be available as part of the CLARIN-LT centre (and CLARIN) resources.
The development of the tool is funded by the Lithuanian Research Council through the Pastovu project (www.mwe.lt).
