IceTaboo: Offensive Word Database with Commercial Application

Anton Karl Ingason, Agnes Sólmundsdóttir, Lilja Björk Stefánsdóttir
Submitted by Karina Berger on

The Project

The IceTaboo database is a novel resource for processing offensive words in Icelandic. Developed by a small team at the Language and Technology Lab at the University of Iceland during the summer of 2020, IceTaboo includes 2725 words that are inappropriate or offensive to at least some speakers in some contexts. The database has been designed to be used as part of the automatic proofreading software GreynirCorrect, which the lab developed in close collaboration with an industry partner in commercial software development. IceTaboo can be used to flag contextually inappropriate words when the correction software comes across them in texts, and is already being used as part of the automatic proofreading tool by an Icelandic online news website.

 

'Our lab is really focused on collaborating with industry. We really want our work to benefit the public.' 
Agnes Sólmundsdóttir

 

Methodology

The IceTaboo database consists of a list of words in Icelandic that may be considered inappropriate, taboo or loaded in use or meaning. The list includes words that are biased against certain minorities (for instance, on the basis of race, ability, gender or sexuality), words that are derogatory towards people, unnecessarily gendered or obsolete, and words that are only mildly inappropriate but can be considered politically loaded or unsuitable for children.

The database was compiled manually at the Language and Technology Lab at the University of Iceland in 2020. The project team began with extensive brainstorming sessions, followed by a systematic search on the internet. Social media platforms, especially the comment sections, were also consulted. Different spelling variations were taken into consideration, and some slang words from English and Danish were also included.

The output of this work was used to establish a classification system that groups words into categories according to their meaning, form or use. Categories include swear words, health-related words, nasty adjectives, offensive profession names, offensive words related to religion, offensive descriptions of people’s appearance, and offensive words related to sex. The system also includes a class for words with a nuanced relationship with offensiveness, such as political terms or words that have an alternative, non-offensive meaning.

Once classified, each class of words was systematically studied. Synonyms and related words were noted, and the Database of Modern Icelandic Inflection (DMII) was used to identify compounds that contain inappropriate parts. Each word is coded for part of speech and classification, and is accompanied by information about its meaning, including an explanation of why it may be considered inappropriate and in what context.
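To make the coding scheme concrete, the following is a minimal sketch of how one entry could be represented in Python. The field names, the part-of-speech tag and the category label are assumptions made for illustration; they are not the column names or labels of the released database. The word and the gist of its explanation are taken from the nurse example discussed under Outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabooEntry:
    """One database entry; all field names are hypothetical."""
    word: str         # headword
    pos: str          # part-of-speech tag
    category: str     # class assigned by the classification system
    explanation: str  # why, and in what context, the word may be inappropriate


# Word and explanation follow the article's example; the tag and the
# category label are assumed for illustration only.
entry = TabooEntry(
    word="hjúkrunarkona",
    pos="noun",
    category="profession",
    explanation="Implies that nursing is a job only done by women.",
)


def by_category(entries):
    """Group headwords by their category label."""
    groups = {}
    for e in entries:
        groups.setdefault(e.category, []).append(e.word)
    return groups


groups = by_category([entry])
```

Keeping the explanation alongside the word is what would let a downstream tool tell the writer why a word was flagged, rather than only that it was.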

 

 

Outcome

As part of the GreynirCorrect automatic proofreading software, the IceTaboo database is already being used to highlight inappropriate words at the Icelandic online news website Kjarninn. This means that the system now flags potentially inappropriate words while reporters are writing content for the website.

Sólmundsdóttir says: ‘So now, if journalists are writing a story, they might get a pop-up window. It’s part of the correction software – it corrects spelling and grammar, but now they also get a “ping” if they write a word that some readers might find inappropriate.’

 
This screenshot from the correction software interface shows how it appears to users. Here, IceTaboo has flagged the word 'hjúkrunarkona', explained why it might be inappropriate, and provided alternative suggestions. Translated, the example sentence says: 'This man is hurt. Is there a nurse here?' The flagged word literally means 'nurse-woman' or 'nursemaid'. The (translated) explanation for the term's inappropriateness says: 'An unfortunate or inappropriate choice of words, a better word would be the word "hjúkrunarfræðingur".' The suggested word 'hjúkrunarfræðingur' translates more closely to 'registered nurse', suggesting that the person is educated in this specific field. A further explanation adds: 'The word "hjúkrunarkona" can be considered to enforce certain gendered stereotypes and implies that nursing is a job only done by women.'
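The flagging behaviour described here can be approximated with a simple lexicon lookup. The sketch below is not GreynirCorrect's API: the `LEXICON` dictionary, the `flag_words` function and the input sentence are all hypothetical, and the matching is on exact surface forms only. Icelandic is highly inflected, so a real system would need to match inflected forms as well.

```python
import re

# Hypothetical in-memory lexicon: word -> (explanation, suggested alternative).
# The word and its suggestion come from the example above.
LEXICON = {
    "hjúkrunarkona": (
        "Can be considered to enforce gendered stereotypes about nursing.",
        "hjúkrunarfræðingur",
    ),
}


def flag_words(text, lexicon=LEXICON):
    """Return one annotation for each token that appears in the lexicon."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        token = match.group().lower()
        if token in lexicon:
            explanation, suggestion = lexicon[token]
            annotations.append({
                "word": match.group(),
                "start": match.start(),  # character offset of the token
                "explanation": explanation,
                "suggestion": suggestion,
            })
    return annotations


# Illustrative input only; this is not the sentence from the screenshot.
flags = flag_words("Er hjúkrunarkona hér?")
```

Each annotation carries the character offset, the explanation and the suggested alternative, which is the information an editor interface would need to underline the word and show a pop-up.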
 

The database is released under an open CC BY 4.0 licence on CLARIN. The proofreading system GreynirCorrect, which was developed by the Language and Technology Lab in collaboration with Miðeind, a leading software company in the field of linguistics and artificial intelligence, is under development in an open repository on GitHub under an MIT licence.

 

‘Our lab and the language technology community in Iceland emphasise licences that make all products easily reusable. In this case we used the Creative Commons Attribution Licence, which places almost no restrictions on normal use cases.’ 
Agnes Sólmundsdóttir

According to project leader Agnes Sólmundsdóttir, other Icelandic companies working with text have also shown interest in integrating the correction software, including IceTaboo, into their workflow: ‘It’s already running in the news website, and it will probably be running in more scenarios soon. There’s definitely interest.’

While the project team used manual annotation and focused on integrating IceTaboo into the automatic proofreading system, the database could also be used for different purposes. Apart from research focusing on inappropriateness in Icelandic, it could also be of use in the future development of systems that apply machine learning methods to automatic detection of offensive language. The words in the database can inform feature extraction steps of such systems and potentially make them more effective.
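As one possible reading of how the word list could inform feature extraction, the sketch below turns a text into per-category hit counts plus an overall ratio, which could then be passed to a classifier alongside other features. The `CATEGORY_OF` mapping, its labels and the placeholder entry are assumptions for illustration, not data from the database.

```python
import re
from collections import Counter

# Hypothetical word -> category mapping; labels and the placeholder
# entry are illustrative only.
CATEGORY_OF = {
    "hjúkrunarkona": "profession",
    "placeholder_swear": "swear",
}
CATEGORIES = sorted(set(CATEGORY_OF.values()))


def lexicon_features(text):
    """Count lexicon hits per category and the share of flagged tokens."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    counts = Counter(CATEGORY_OF[t] for t in tokens if t in CATEGORY_OF)
    features = {f"count_{c}": counts[c] for c in CATEGORIES}
    # Guard against empty input when normalising by text length.
    features["taboo_ratio"] = sum(counts.values()) / max(len(tokens), 1)
    return features


features = lexicon_features("Er hjúkrunarkona hér")
```

Counting by category rather than by individual word keeps the feature vector small and lets a model weigh, say, swear words differently from politically loaded terms.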

The detection of offensive or contextually inappropriate language could also be important for monitoring freely accessible discussion spaces on social media or for helping a user of a word processing system to avoid inappropriate expressions. For such extended applications, the database could serve as a useful first step. In addition, the classification system could be useful when trying to extend the database to other languages.


 

Views on CLARIN

'We deposited our database at CLARIN. It’s a really well-respected platform for language technology tools. Our lab and the language technology community in Iceland is really focused on making all resources publicly available. And CLARIN is a really good platform for this. It suited the project perfectly.' Agnes Sólmundsdóttir
 
'Open access availability on CLARIN and an industry-friendly licencing policy ensures that the resource is ready to be used by any software developer that shows interest.' Agnes Sólmundsdóttir

 
Contributors

Anton Karl Ingason, Associate Professor at the University of Iceland, and Director of the Language and Technology Lab

Agnes Sólmundsdóttir, Research Assistant at the Language and Technology Lab, University of Iceland, and undergraduate student (first author)

Lilja Björk Stefánsdóttir, Project Manager at the Language and Technology Lab, University of Iceland