Natural Language Processing
Tour de CLARIN: Interview with Sidsel Boldsen
Sidsel Boldsen is a PhD student in Natural Language Processing (NLP) and digital humanities, with a special interest in historical languages and linguistic knowledge representation. She has successfully collaborated with the Danish CLARIN K-Centre DANSK.
Xenophobia on Greek Twitter during and after the Financial Crisis
plWordNet 3.0 – Słowosieć 3.0
plWordNet is a lexico-semantic network which reflects the lexical system of the Polish language. plWN currently contains 178 000 nouns, verbs, adjectives, and adverbs, 259 000 word senses, and over 600 000 relations and 240 000 inter-lingual relations between lexical units. It is now the largest wordnet in the world and is still growing.
Senses in plWordNet are interconnected by relations. In the resulting network, each word is defined implicitly in reference to other words. For example, samochód 'car' is a kind of pojazd drogowy 'road vehicle'; it is a whole consisting of silnik 'engine', spryskiwacz 'windscreen washer', podwozie 'chassis' and so on; its close counterpart is the colloquial fura 'wheels'.
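The idea of defining a word through its relations can be illustrated with a minimal sketch. It stores the samochód example above as labelled edges in a small graph. The class `LexicalNetwork` and the relation labels `hypernym`, `has_part` and `colloquial_counterpart` are assumptions made for illustration only; they are not plWordNet's actual data model, relation inventory or API.

```python
class LexicalNetwork:
    """Toy relation graph: source word -> relation label -> set of target words."""

    def __init__(self):
        self._edges = {}

    def add(self, source, relation, target):
        # Create the nested containers on first use, then record the edge.
        self._edges.setdefault(source, {}).setdefault(relation, set()).add(target)

    def related(self, source, relation):
        # Return the targets in sorted order; unknown words or relations give [].
        return sorted(self._edges.get(source, {}).get(relation, set()))

    def describe(self, source):
        # The implicit "definition" of a word: all of its outgoing relations.
        return {rel: sorted(targets)
                for rel, targets in self._edges.get(source, {}).items()}


# Hypothetical relation labels; the words come from the example in the text.
net = LexicalNetwork()
net.add("samochód", "hypernym", "pojazd drogowy")
for part in ("silnik", "spryskiwacz", "podwozie"):
    net.add("samochód", "has_part", part)
net.add("samochód", "colloquial_counterpart", "fura")

print(net.describe("samochód"))
```

Calling `net.describe("samochód")` gathers every relation that starts at the word, which is the sense in which a wordnet entry is defined by its neighbours rather than by a gloss.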
One of plWordNet's many applications is as a Polish-English and English-Polish dictionary, made possible by its mapping onto Princeton WordNet (the first and for many years the largest wordnet in the world). plWordNet is also an important resource in natural language processing and in artificial intelligence research. For example, it is used by Google Translate for the purposes of machine translation.
The University has made plWordNet available free of charge for all applications, including commercial ones, under a licence modelled on the Princeton WordNet licence. Users may browse plWordNet via a mobile version and via WordNetLoom-Viewer (an application for displaying plWN entries), and may also download the source files. Programmers may access plWordNet via a web service.
We provide (currently only in download version) 31 000 lexical units marked with their sentiment values: positive, negative, ambiguous or neutral.
Wroclaw University of Technology, Ministry of Science and Higher Education (Poland)
Lärka (English LARK) - Language Acquisition Reusing Korp
Lärka
Lärka - “LÄR språket via KorpusAnalys” - with its English equivalent “Lark” (Language Acquisition Reusing Korp) is the ICALL platform of Språkbanken (the Swedish Language Bank). ICALL – Intelligent Computer-Assisted Language Learning – has as its main aim to draw on the opportunities offered by language resources, such as corpora, lexicons and natural language processing (NLP) components including lemmatizers, parsers, etc., to build more sophisticated and flexible applications for language learners and students of grammatical theory.
The work on Lärka started in the project ‘Systems Architecture for ICALL’, financed by NordPlus Sprog from 2011 to 2013. Specified as a modular web-based exercise generator that reuses available annotated corpora and lexical resources, Lärka is freely available and primarily targets learners of Swedish as a second/foreign language and students of Swedish linguistics. Being web-based, Lärka has the advantages of accessibility and ease of use.
Lärka is designed as a Service Oriented Architecture based on web services. The platform comprises two main components – user interface and web services – where the web services can be reused by other applications. Web services take care of exercise generation whereas the user interface collects user input, formats the web service output, and assigns behavior to buttons and menus.
At the moment Lärka offers exercises for two target groups: students of linguistics and learners of Swedish*. Students of linguistics can train parts of speech, syntactic relations and semantic roles, whereas second language learners of Swedish can train spelling, vocabulary and inflection patterns. The available exercises share some common features, namely:
- Training context: sentence. The objective of the Lärka-based exercise generator has, from the outset, been to use real-life language examples from corpora. Possible copyright issues are avoided by using only a single-sentence context. We are actively searching for alternatives for working with full texts.
- Reference materials. Relevant articles are looked up in Wikipedia, Wiktionary and Karp, while a text-to-speech module provided by SitePal offers pronunciation of relevant words and sentences. Reference materials are shown in a separate field that can be hidden when not wanted.
- Training modes: self-study, test and timed test. The self-study mode reveals all clues (e.g. reference articles, syntactic tree structure, pronunciation, etc.) and also lets users try several answer options. In the test modes, the clues are not revealed until an answer is provided, and users cannot change their answer.
- Feedback is offered in the form of immediate correct/incorrect symbols and a result tracker where information on correct/total number of answers is shown.
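The training modes and the result tracker described in the list above can be sketched as follows. This is an illustrative model only, not Lärka's implementation: the names `Exercise` and `ResultTracker` are hypothetical, the timer of the timed test is omitted, and recording one result per submitted answer is an assumption.

```python
from dataclasses import dataclass, field

SELF_STUDY, TEST, TIMED_TEST = "self-study", "test", "timed test"


@dataclass
class Exercise:
    correct_answer: str
    mode: str
    answers: list = field(default_factory=list)

    def clues_visible(self):
        # Self-study shows clues immediately; test modes only after an answer.
        return self.mode == SELF_STUDY or bool(self.answers)

    def answer(self, choice):
        # In the test modes the first answer is final.
        if self.mode != SELF_STUDY and self.answers:
            raise ValueError("answer cannot be changed in test modes")
        self.answers.append(choice)
        return choice == self.correct_answer


@dataclass
class ResultTracker:
    correct: int = 0
    total: int = 0

    def record(self, is_correct):
        # Assumption: every submitted answer counts towards the total.
        self.total += 1
        self.correct += int(is_correct)

    def summary(self):
        return f"{self.correct}/{self.total}"


tracker = ResultTracker()
practice = Exercise(correct_answer="noun", mode=SELF_STUDY)
tracker.record(practice.answer("verb"))   # wrong, but another try is allowed
tracker.record(practice.answer("noun"))   # second attempt is correct
print(tracker.summary())
```

Keeping the mode check inside `Exercise` means the same exercise data can be presented in any of the three modes without duplicating it.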
Recently, a text assessment function has been added to Lärka, with which reading comprehension texts or learner essays can be assessed for their CEFR level, i.e. a level of language proficiency according to the Common European Framework of Reference (A1, A2, B1, B2, C1, C2).
There is ongoing work on diagnostic testing and learner modeling.
* The previous version of Lärka is being migrated to new technology, and the newer version does not yet offer all the functionality of its predecessor.
Språkbanken (UGOT), CLT (UGOT), Department of Swedish (UGOT), Lars Borin, Markus Forsberg, Jonatan Uppström, Ildikó Pilán, David Alfter