Blog post written by Normunds Grūzītis, edited by Darja Fišer and Jakob Lenardič
The Latvian FrameNet-annotated text corpus is a balanced, multilayered corpus that shows how words are used and what they mean. In natural language processing, it is used in applications such as information extraction, machine translation, event recognition and sentiment analysis, while in linguistics it can serve as a valence dictionary that shows the combinatorial properties of vocabulary. Latvian FrameNet is being created within a larger industry-driven R&D project by the Institute of Mathematics and Computer Science at the University of Latvia (IMCS UL) and the national news agency LETA (Gruzitis et al. 2018), which relies on natural language understanding and information extraction technologies for efficient and innovative media monitoring and content production. It is the corpus with the most annotation layers in the CLARIN Latvia repository, and it is well suited to such applications because it is anchored in several cross-lingual syntactic and semantic representations:
- Universal Dependencies (Nivre et al. 2016), which provide the framework for the syntactic parsing of the corpus,
- FrameNet (Fillmore et al. 2003), a human- and machine-readable lexical inventory based on frame semantics for semantic role labelling,
- PropBank (Palmer et al. 2005), which provides basic predicate-argument relations such as thematic roles (e.g., agent, patient, recipient, theme),
- Abstract Meaning Representation (Banarescu et al. 2013), which provides graph representations of “who is doing what to whom” in a sentence, and
- auxiliary layers for named entity and coreference annotation.
Latvian FrameNet is annotated according to the latest frame inventory of Berkeley FrameNet on top of the underlying UD layer, using the CLARIN-D annotation tool WebAnno (Eckart de Castilho et al. 2016). Thus, the annotation of frames and frame elements is guided by the dependency structure of a sentence. Currently, Latvian FrameNet consists of 7,581 annotation sets (frame instances), which cover 454 different semantic frames and 834 different target verbs (lexemes). Pairing each verb with the frames it evokes yields 1,580 lexical units (LUs).
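How these four figures relate to one another can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record type `AnnotationSet`, its field names and the toy data are hypothetical and do not reflect the actual file format of the corpus. The sketch treats a lexical unit as the pairing of a target lemma with a frame, which is why the number of LUs is larger than the number of verbs.

```python
from collections import namedtuple

# Hypothetical record for one annotation set (frame instance).
# The field names are assumptions, not the corpus's real schema.
AnnotationSet = namedtuple("AnnotationSet", ["sentence_id", "target_lemma", "frame"])


def corpus_statistics(annotation_sets):
    """Count frame instances, distinct frames, distinct verbs and lexical units."""
    frames = {a.frame for a in annotation_sets}
    lemmas = {a.target_lemma for a in annotation_sets}
    # A lexical unit is a (lemma, frame) pairing, so one verb that evokes
    # two different frames contributes two lexical units.
    lexical_units = {(a.target_lemma, a.frame) for a in annotation_sets}
    return {
        "annotation_sets": len(annotation_sets),
        "frames": len(frames),
        "target_verbs": len(lemmas),
        "lexical_units": len(lexical_units),
    }


# Toy data invented for illustration only.
toy = [
    AnnotationSet(1, "aiziet", "Self_motion"),
    AnnotationSet(1, "skatīties", "Perception_active"),
    AnnotationSet(2, "aiziet", "Departing"),
    AnnotationSet(3, "aiziet", "Self_motion"),
]
stats = corpus_statistics(toy)
```

In the toy data, two verbs and three frames give three lexical units across four annotation sets.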
The figure above shows how the Latvian variant of the sentence “Jasmine goes to the window and looks outside” is annotated with frame semantic labels and relations. This sentence consists of two coordinated clauses that share the same grammatical subject. The verb aiziet (‘go’) in the first clause is labelled with the semantic frame Self_motion (triggered by this particular context), while the verb skatīties (‘look’) in the second evokes the frame Perception_active. Since the frame semantics are built on top of the underlying syntactic dependencies, the noun Jasmine is assigned the frame elements Self_mover and Perceiver_agent, each of which links it to one of the two verbs.
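The way frame labels sit on top of the dependency layer can be sketched as a small data structure. This is a minimal illustration, not the corpus's actual format. The classes `Token` and `FrameInstance`, the helper `roles_of`, and the dependency analysis of the English gloss are all assumptions made for the example. Role names are written with underscores.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # word form (English gloss used here)
    head: int     # index of the syntactic head, 0 for the root
    deprel: str   # UD dependency relation to the head


@dataclass
class FrameInstance:
    frame: str                                    # frame evoked by the target
    target: int                                   # token index of the target verb
    elements: dict = field(default_factory=dict)  # role name -> filler token index


# English gloss of the example; the dependency tree is an illustrative assumption.
tokens = [
    Token(1, "Jasmine", 2, "nsubj"),
    Token(2, "goes", 0, "root"),
    Token(3, "to", 5, "case"),
    Token(4, "the", 5, "det"),
    Token(5, "window", 2, "obl"),
    Token(6, "and", 7, "cc"),
    Token(7, "looks", 2, "conj"),
    Token(8, "outside", 7, "advmod"),
]

# Only the labels mentioned in the text are encoded.
frames = [
    FrameInstance("Self_motion", target=2, elements={"Self_mover": 1}),
    FrameInstance("Perception_active", target=7, elements={"Perceiver_agent": 1}),
]


def roles_of(token_idx, frames, tokens):
    """List (role, frame, target form) for every frame element the token fills."""
    forms = {t.idx: t.form for t in tokens}
    return [
        (role, f.frame, forms[f.target])
        for f in frames
        for role, filler in f.elements.items()
        if filler == token_idx
    ]


jasmine_roles = roles_of(1, frames, tokens)
```

Because both frame instances point at the same token index, the shared subject is stored once in the syntactic layer and receives one role per verb in the semantic layer.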
The dataset is available on GitHub. By the end of the project, CLARIN Latvia expects to double the size of the Latvian FrameNet corpus. The overall aim is to acquire a balanced and representative medium-sized multilayer corpus: around 10,000 sentences annotated at all the above-mentioned layers, including FrameNet. To ensure that the corpus is balanced not only in terms of text genres and writing styles but also in terms of LUs, a fundamental design decision is that the text unit is an isolated paragraph. Paragraphs were manually selected from a balanced 10-million-word text corpus: 60% news, 20% fiction, 7% academic texts, 6% legal texts, 5% spoken language, 2% miscellaneous. As for the LUs, the goal is to cover at least the 1,000 most frequently occurring verbs, as calculated from the 10-million-word corpus.
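The two balancing criteria can be made concrete with a short sketch: the genre shares translate into word counts for the 10-million-word source corpus, and the verb target amounts to checking which of the most frequent verbs still lack annotation. The function names, the toy frequency counts and the set of annotated verbs below are hypothetical, chosen only for illustration. The project's actual selection procedure is not described here.

```python
from collections import Counter

SOURCE_CORPUS_WORDS = 10_000_000
GENRE_SHARE_PERCENT = {
    "news": 60,
    "fiction": 20,
    "academic": 7,
    "legal": 6,
    "spoken": 5,
    "miscellaneous": 2,
}


def words_per_genre(total_words, shares):
    """Convert percentage shares into word counts, checking they sum to 100."""
    if sum(shares.values()) != 100:
        raise ValueError("genre shares must sum to 100%")
    return {genre: total_words * pct // 100 for genre, pct in shares.items()}


def top_verbs(verb_counts, n):
    """Return the n most frequent verb lemmas, most frequent first."""
    return [lemma for lemma, _ in Counter(verb_counts).most_common(n)]


def uncovered(target_verbs, annotated_lemmas):
    """Return the target verbs that have no annotated frame instance yet."""
    return [verb for verb in target_verbs if verb not in annotated_lemmas]


genre_words = words_per_genre(SOURCE_CORPUS_WORDS, GENRE_SHARE_PERCENT)

# Toy frequency counts, invented for illustration only.
toy_counts = {"būt": 900, "aiziet": 40, "skatīties": 55, "teikt": 300}
targets = top_verbs(toy_counts, 3)
missing = uncovered(targets, {"aiziet", "skatīties"})
```

With the real frequency list, `top_verbs(counts, 1000)` would give the target set, and an empty `missing` list would mean the coverage goal has been met.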
Acknowledgements
This work has received financial support from the European Regional Development Fund under the grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian).
References
Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., Schneider, N. Abstract Meaning Representation for Sembanking, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, 2013, pp. 178–186.
Eckart de Castilho, R., Mujdricza-Maydt, E., Yimam, S.M., Hartmann, S., Gurevych, I., Frank, A., Biemann, C. A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures, in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities, Osaka, Japan, 2016, pp. 76–84.
Fillmore, C.J., Johnson, C.R., Petruck, M.R.L. Background to FrameNet, International Journal of Lexicography 16(3) (2003), 235–250.
Gruzitis, N., Nespore-Berzkalne, G., Saulite, B. Creation of Latvian FrameNet based on Universal Dependencies, in: Proceedings of the International FrameNet Workshop 2018: Multilingual FrameNets and Constructicons (IFNW), Miyazaki, Japan, 2018, pp. 23–27.
Gruzitis, N., Pretkalnina, L., Saulite, B., Rituma, L., Nespore-Berzkalne, G., Znotins, A., Paikens, P. Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU, in: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan, 2018, pp. 4506–4513.
Nespore-Berzkalne, G., Saulite, B., Gruzitis, N. Latvian FrameNet: Cross-Lingual Issues, in: Human Language Technologies – The Baltic perspective: Proceedings of the Eighth International Conference Baltic HLT 2018, Tartu, Estonia, 2018, pp. 96–103.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C.D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., Zeman, D. Universal Dependencies v1: A Multilingual Treebank Collection, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), 2016, pp. 1659–1666.
Palmer, M., Gildea, D., Kingsbury, P. The Proposition Bank: An Annotated Corpus of Semantic Roles, Computational Linguistics 31(1) (2005), 71–106.