Tour de CLARIN: SWE-CLARIN presents 'Korp' tool

Submitted by karolina@clarin.eu on 10 July 2017

Blog post written by Darja Fišer and Jakob Lenardič

A concordancer is one of the key tools of a language resource provider, as it serves as the main entry point to language in context. One of the best-known and most widely used concordancers is SWE-CLARIN’s Korp. A versatile and user-friendly tool, it is the main corpus infrastructure of Språkbanken and is used extensively by the Swedish and Finnish consortia, as well as in an Estonian and a Norwegian CLARIN centre. Through Korp, researchers can access some of the consortia’s most important language resources, such as SWE-CLARIN’s Riksdagens öppna data corpus and FIN-CLARIN’s Suomi24 corpus, the latter of which was covered in a previous Tour de CLARIN blog post.

Korp has been developed by a team of about eight people at Språkbanken, University of Gothenburg, and consists of three components:

· the Korp corpus pipeline, which is used for the import, annotation and export of corpora;

· the Korp backend, which consists of a series of web services used for searching and retrieving both the corpora and their associated annotations and metadata; and

· the Korp frontend, which is the graphical user interface communicating with the backend.

The extensive corpus collection of Språkbanken, which is accessed through Korp, consists of over 400 corpora with more than 13 billion tokens and almost one billion sentences. The corpora mainly represent modern written Swedish, but also cover older stages of the language, going all the way back to the Old Swedish of the Middle Ages.

Through the Korp corpus pipeline, researchers can import and annotate their own data. A pivotal characteristic of the pipeline is its dynamic nature, which allows researchers to integrate their existing annotation into the Korp infrastructure and use it as the basis for other types of annotation. The pipeline also provides researchers with a series of automatic annotation options – tokenization, sentence splitting, links to the lexical persistent identifiers, lemmatization, compound analysis, PoS/MSD tagging, and syntactic dependency parsing.
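The idea of layering new annotation on top of existing annotation can be illustrated with a small Python sketch. This is not the Korp pipeline itself: the function names, the naive regular expressions and the two-entry toy lexicon are all assumptions made for the illustration. It only shows how a researcher's existing PoS tags could serve as the basis for a further annotation layer, here a lemma.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return [{"word": w} for w in re.findall(r"\w+|[^\w\s]", sentence)]

def add_existing_pos(tokens, tags):
    """Attach annotation the researcher already has (here: PoS tags)."""
    for token, tag in zip(tokens, tags):
        token["pos"] = tag
    return tokens

# Hypothetical toy lexicon keyed on (word form, PoS); not a Språkbanken resource.
TOY_LEXICON = {("forskningen", "NN"): "forskning", ("är", "VB"): "vara"}

def add_lemma(tokens):
    """Derive a new annotation layer (lemma) from two existing ones (word, pos)."""
    for token in tokens:
        key = (token["word"].lower(), token.get("pos"))
        # Fall back to the lowercased word form when the lexicon has no entry.
        token["lemma"] = TOY_LEXICON.get(key, token["word"].lower())
    return tokens

text = "Forskningen är viktig. Den är fri!"
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
tokens = add_existing_pos(tokens, ["NN", "VB", "JJ", "MAD"])  # invented tags
tokens = add_lemma(tokens)
for token in tokens:
    print(token)
```

Each step only reads attributes that earlier steps have written, so an imported annotation and an automatically produced one can be combined freely.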

The Korp frontend is the graphical search interface and is thus the part of Korp that researchers usually come into contact with first. It gives users the flexibility to search through the corpora with either simple queries or the CQP corpus query language. After performing a search, users find the concordances under the KWIC tab (figure 1), which also brings up a sidebar on the right-hand side showing how the relevant token is annotated. Other functions of Korp include the ordbild (word picture) tab (figure 2), which shows the most relevant syntactic collocates of a lemma or text word; the related words tab, which lists semantically related lemmas; and the statistics tab, which provides a statistical overview of the search results, either as a table with a row for every unique hit and a column for every selected corpus, or as a graph showing the frequency of one or more linguistic phenomena over time (figure 3). The Korp backend, which provides access to corpora, their annotations and their metadata, is available for download, and most of the corpora that can be searched through Korp can be downloaded from Språkbanken.
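To make the KWIC view and the statistics table more concrete, here is a minimal Python sketch over two invented toy corpora. It does not call the Korp backend, and the names `CORPORA`, `to_cqp`, `kwic` and `statistics` are assumptions made for this example. The only CQP it relies on is the basic token expression `[word = "..."]`, and it assumes a plain word without quotes or regular-expression characters.

```python
from collections import Counter

# Invented toy data: corpus name -> list of tokens (not real Korp corpora).
CORPORA = {
    "toy_news": "hon sade att han kom och att hon gick".split(),
    "toy_blog": "han skrev att forskning är viktig".split(),
}

def to_cqp(word):
    """Write a simple one-word query as a CQP token expression."""
    return f'[word = "{word}"]'

def kwic(tokens, word, width=2):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def statistics(corpora, words):
    """Build a table with one row per word form and one column per corpus."""
    counts = {name: Counter(tokens) for name, tokens in corpora.items()}
    return {w: {name: counts[name][w] for name in corpora} for w in words}

print(to_cqp("forskning"))

# Concordance lines for "hon" in one toy corpus.
for left, keyword, right in kwic(CORPORA["toy_news"], "hon"):
    print(f"{left:>10} [{keyword}] {right}")

# Absolute frequencies per corpus.
table = statistics(CORPORA, ["hon", "han"])
for word, row in table.items():
    print(word, row)
```

The sketch counts absolute frequencies only. Comparing corpora of different sizes would also call for relative frequencies, for example hits per million tokens.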

Figure 1: Concordances for the lemma "forskning" (Eng. research).

Figure 2: The word picture for the lemma "forskning" (Eng. research).

Figure 3: The trend diagram for the personal pronouns "hon" (Eng. she) and "han" (Eng. he) in the modern newspaper corpus.
