corpus

‘Voices from Ravensbrück’ Project

24 January 2022

Read about the ‘Voices from Ravensbrück’ project which aims to give voice to women survivors from Ravensbrück concentration camp and to make the data openly accessible to researchers, schools and communities.

Tour de CLARIN: Interview with Titika Dimitroulia

25 August 2021

Titika Dimitroulia is a professor of translation studies, as well as a translator and literary critic. Many of her literary corpora have benefitted from NLP:EL processing tools...

DiaCollo: collocation analysis in diachronic perspective

In the words of the famous language philosopher Ludwig Wittgenstein: "the meaning of a word is its usage in the language" (Philosophical Investigations, Part I, section 43). In other words, the meaning of a word can be revealed by the context in which it appears. An ambiguous word such as ‘bank’ can be disambiguated given its context: the ‘bank’ bounding a body of water will tend to occur together with terms like “river”, “lake”, or “slope”, while the ‘bank’ which is a financial institution will tend to occur together with expressions like “money”, “cheque”, or “go to”.

Changes in a word's meaning will therefore often be directly associated with changes in its characteristic combinations (the set of words with which it typically occurs together, its collocates). Even political, cultural, or social changes relating to a central term can be revealed and traced through its typical combinations (see the example for ‘revolution’ below).

DiaCollo is a software tool for the discovery, comparison, and interactive visualization of the typical word combinations for a user-specified target term. Characteristic word combination profiles based on various underlying text corpora can be requested for a particular time period, as well as direct comparisons between different time periods. In addition to traditional static tabular display formats, a number of intuitive interactive online visualizations for query result data are also available.
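The idea of a collocation profile per time period can be illustrated with a minimal Python sketch. It is not DiaCollo's implementation: the function name `collocation_profiles`, the toy input format (pairs of year and token list), the fixed co-occurrence window, and the choice of the log-Dice association score are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def collocation_profiles(docs, target, slice_size=10, window=3, kbest=3):
    """Return {slice_start: [(collocate, score), ...]} for a target term.

    docs is an iterable of (year, tokens) pairs (a hypothetical toy format).
    Scores use log-Dice: 14 + log2(2 * f_xy / (f_x + f_y)).
    """
    pair = defaultdict(Counter)  # slice -> collocate -> co-occurrences with target
    freq = defaultdict(Counter)  # slice -> token -> frequency
    for year, tokens in docs:
        start = year - year % slice_size  # e.g. 1793 -> 1790 for 10-year slices
        freq[start].update(tokens)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair[start][tokens[j]] += 1
    profiles = {}
    for start, counts in pair.items():
        f_x = freq[start][target]
        scored = [
            (w, 14 + math.log2(2 * f_xy / (f_x + freq[start][w])))
            for w, f_xy in counts.items()
        ]
        # Strongest association first; ties broken alphabetically.
        scored.sort(key=lambda item: (-item[1], item[0]))
        profiles[start] = scored[:kbest]
    return profiles

# Invented toy data, not taken from any real corpus.
docs = [
    (1612, ["revolution", "planets", "heavens"]),
    (1793, ["french", "revolution", "terror"]),
    (1795, ["terror", "revolution", "french", "republic"]),
]
for start, profile in sorted(collocation_profiles(docs, "revolution", window=2, kbest=2).items()):
    print(start, [word for word, _ in profile])
```

Comparing the lists for two slices side by side gives a crude version of the period-to-period comparison described above.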

A short guide on how to use DiaCollo

Visit the DiaCollo query form in a browser to query the data from the German Text Archive corpus.
Type the word Revolution in the QUERY field.
Select Cloud from the FORMAT menu. Leave the rest of the fields unchanged.
Click on the submit button (next to the QUERY field).
In the box beneath the query section, the words that typically appear with Revolution will be displayed. The window initially shows the situation in 1610. The presentation format is a word-cloud: the displayed words will differ in size and colour based on their association strengths with respect to the target word, Revolution.
Directly above the display area is a time-line beginning at 1610 and ending at 1910, divided into intervals of 10 years each. To the left of the display area is a scale of the (relative) association strengths for the displayed items for easier interpretation of the results.
Clicking on a date in the time-line (e.g. 1790) will display the typical combinations for Revolution in the corresponding decade; clicking on a word in the display area will open a window containing detailed information on that word, including a direct link to the respective underlying corpus hits. Alternatively, you can click on the play button to the left of the time-line to start an animation of the changes in typical word combinations over time. Playback speed can be altered with the vertical slider next to the play button.

You can modify the basic recipe above in various ways, for example by changing the queried time period (DATE) and/or the size of the intervals on the time-line (SLICE). You can also change the maximal number of displayed collocates (KBEST) or the mode of visual presentation (FORMAT). Additional corpora and further modes of application are also available. For instance, you can use DiaCollo to display the differences or the similarities between two different words on the basis of their typical collocates over a given time period, or to directly compare the typical collocates of a single word in two different time periods. Further details and examples can be found in the full CLARIN-D DiaCollo use-case (in German), as well as in DiaCollo's online help pages.
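The form fields mentioned above (QUERY, DATE, SLICE, KBEST, FORMAT) can also be thought of as parameters of a query URL. The following Python sketch only illustrates that idea: the base address `https://example.org/diacollo/` is a placeholder, and the lowercase parameter names and the `1750:1850` date-range notation are assumptions that should be checked against DiaCollo's online help pages before use.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption): replace it with the address of the
# DiaCollo query form you actually use.
BASE_URL = "https://example.org/diacollo/"

def build_query_url(query, date="", slice_size=10, kbest=10, fmt="cloud",
                    base_url=BASE_URL):
    """Assemble a query URL from the form fields described in the guide.

    The parameter names below mirror the field labels in lowercase; they are
    assumed for illustration, not taken from the DiaCollo documentation.
    """
    params = {
        "query": query,       # QUERY: the target term
        "date": date,         # DATE: queried time period, empty for all dates
        "slice": slice_size,  # SLICE: interval size on the time-line, in years
        "kbest": kbest,       # KBEST: maximal number of displayed collocates
        "format": fmt,        # FORMAT: mode of visual presentation
    }
    return base_url + "?" + urlencode(params)

# The basic recipe from the short guide, then a variant with a narrower
# period and coarser slices.
print(build_query_url("Revolution"))
print(build_query_url("Revolution", date="1750:1850", slice_size=25, kbest=20))
```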

Additional versions of this guide

A more detailed guide with examples in German is available in PDF format.

CLARIN Centre

Berlin-Brandenburg Academy of Sciences (BBAW)

Project leader

Bryan Jurish

Contact email

jurish@bbaw.de

Links

DiaCollo

Paper from the CLARIN2015 Conference

DiaCollo help page

The full CLARIN DiaCollo case study (in German)

Screencast of the example in the short guide (YouTube)

Acknowledgements

DiaCollo is a use case of the CLARIN-D centre in the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).

Participating projects:

German Text Archive (www.deutschestextarchiv.de)
Digital Dictionary of the German Language (www.dwds.de)
CLARIN-D centre of the BBAW (clarin.bbaw.de)

Related CLARIN-D tools and services

WebLicht web-based analysis tool
DTA::CAB historical German text analysis service

Nederlab, online laboratory for humanities research on Dutch text collections

Nederlab is a user-friendly and tool-enriched open access web interface that aims at containing all digitized texts relevant for the Dutch national heritage and the history of Dutch language and culture (c. 800 - present).

The Nederlab project aims to bring together all digitized texts relevant to Dutch national heritage and the history of Dutch language and culture (c. 800 - present) in one user-friendly and tool-enriched open access web interface, allowing scholars to simultaneously search and analyze data from texts spanning the full recorded history of the Netherlands, its language and culture. The project builds on various initiatives: for corpora, Nederlab collaborates with scientific libraries and institutions; for infrastructure, with CLARIN (and CLARIAH); for tools, with eHumanities programmes such as Catch, IMPACT and CLARIN (TICCL, frog).

Nederlab allows researchers to search and refine its content on the basis of metadata, text and several layers of annotations for this text, such as lemmata, part-of-speech tags, named entities or syntactic annotations. These enrichments are added during a preprocessing stage that also applies automatic spelling normalization. Search results can be inspected one by one, via lists or keyword-in-context concordances, but also in several aggregated forms. For example, results can be grouped simultaneously by publication date and genre and then displayed as visualisations or exported, or they can be presented as collocations. Statistics about the result set are available as well, as are frequency lists over any subcollection. Search results can be stored as virtual collections in the researcher’s personal workspace. A range of tools will be available in this workspace to analyse the collections or to compare them to each other.
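As a rough illustration of grouping search results by publication date and genre at the same time, here is a small Python sketch. It does not use Nederlab's interface or data model: the function `group_hits` and the hit format (dictionaries with `year` and `genre` keys) are hypothetical, and the sample hits are invented.

```python
from collections import Counter

def group_hits(hits, period=50):
    """Count hits per (period start, genre) pair.

    hits is an iterable of dicts with 'year' and 'genre' keys
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for hit in hits:
        start = hit["year"] - hit["year"] % period  # e.g. 1637 -> 1600
        counts[(start, hit["genre"])] += 1
    return counts

# Invented sample hits.
hits = [
    {"year": 1637, "genre": "religious"},
    {"year": 1641, "genre": "religious"},
    {"year": 1668, "genre": "drama"},
    {"year": 1702, "genre": "drama"},
]
for (start, genre), n in sorted(group_hits(hits).items()):
    print(start, genre, n)
```

The resulting table of counts is the kind of aggregate that could then be plotted or exported.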

The first version of Nederlab was launched in early 2015; it will be expanded until the end of 2017.

CLARIN Centre

Meertens Instituut in collaboration with Huygens ING and the Institute for Dutch Lexicology.

Principal Investigator

prof. dr. Hans Bennis

Project Leader

Hennie Brugman

Country

Netherlands

Language

Dutch

Contact email

hennie.brugman@meertens.knaw.nl

Links

Acknowledgements

Nederlab is financed by NWO, KNAW, CLARIAH and CLARIN-NL.

Woordenboek der Nederlandsche Taal

Search engine demonstration:

Country:

Netherlands

CLARIN Centre:

INT

Description

The integrated language bank of the Dutch Language Institute offers online access to a number of historical dictionaries, including Old, Middle and Modern Dutch, and the Frisian language.

Czech National Corpus

Search engine demonstration:

Country:

Czech Republic

CLARIN Centre:

Charles University

Description

A corpus is a collection of texts in electronic form used for linguistic research, usually provided with digital tools that allow searching and analysis. Users can use these tools to find words and collocations in their original contexts and to determine their frequency in the corpus. The Czech National Corpus (CNC) is an academic project focusing on building a large electronic corpus of mainly written Czech. The Institute of the Czech National Corpus (ICNC), Faculty of Arts, Charles University in Prague, oversees the development of the CNC, including its use in teaching, and works to advance the field of corpus linguistics.
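Finding a word in its original contexts and counting its occurrences can be sketched in a few lines of Python. This is a generic keyword-in-context (KWIC) illustration, not the search engine of the Czech National Corpus; the function name `kwic`, the whitespace tokenisation, and the English sample sentence are assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword.

    Matching is case-insensitive; width is the number of context tokens
    kept on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, tokenised naively on whitespace.
tokens = "The bank of the river was near the bank".split()
hits = kwic(tokens, "bank", width=2)
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
print("frequency:", len(hits))
```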

Das Digitale Wörterbuch der deutschen Sprache

Search engine demonstration:

Country:

Germany

CLARIN Centre:

BBAW

Description

Das Digitale Wörterbuch der deutschen Sprache (Digital Dictionary of the German Language) provides a wealth of information on the German language in its contemporary and historical forms, with more than 410,000 entries from five dictionaries, 1.8 million words in 15 corpora, and word profiles and trends based on frequencies.

Korp

Search engine demonstration:

Country:

Finland

CLARIN Centre:

CSC - IT Center for Science

Description

The Korp online resource offers the opportunity to search a wealth of language resources (mostly) in the Finnish language, from a range of time periods. The Korp software was originally developed and is actively maintained by The Swedish Language Bank.

CLARIN – the research infrastructure for language as social and cultural data