Linking Text Corpora to Lexicographical Resources Using Wikibase and OntoLex-Lemon, Artificial Intelligence Conference
Keywords: Linked Data, Wikibase, OntoLex-Lemon, Serbian, srpELTeC
Topic: Knowledge representation, reasoning, and planning; semantic networks; knowledge representation techniques

This abstract showcases a novel approach to linking textual data from a corpus to lexicographical resources, using Wikibase as a collaborative platform. Specifically, we present a data model and an initial use case focused on integrating a Serbian literary corpus (SrpELTeC), represented in NIF (NLP Interchange Format), with a Serbian dictionary that follows the OntoLex-Lemon lexicographical framework. Our work aims to combine linguistic annotations, including morphosyntactic, semantic, and philological information, with lexicon entries, contributing to a broader linguistic knowledge graph.

Wikibase, a versatile extension of MediaWiki, serves as the infrastructure behind Wikidata, one of the largest crowdsourced, queryable knowledge graphs to date (Vrandečić & Krötzsch, 2014). Our proposed model explores the synergy between textual corpora and lexical data. The use case shows how a Serbian text corpus can be linked to a Serbian dictionary by leveraging the OntoLex-Lemon model, which is increasingly used for linguistic linked data.

The SrpELTeC corpus is enriched with linguistic annotations based on the NIF Ontology (Hellmann et al., 2013), including part-of-speech tags, lemmas, and named entity recognition (NER) results. For this, we employed the BEaST tagger for Serbian (Stanković et al., 2020; Stanković et al., 2022), generating annotations such as lemmas and part-of-speech categories mapped to the OLiA (Ontologies of Linguistic Annotation) lexical categories.
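As a minimal sketch of how such a tag-to-category mapping might look, the Python fragment below turns a tagged token into an OLiA category and uses the (lemma, category) pair as a lookup key. Everything in it is an assumption made for illustration: the Universal Dependencies-style tags, the `Token` fields, the toy `LEXICON`, and the entry identifiers are hypothetical and are not taken from the SrpELTeC annotation or the SrpMD dictionary.

```python
from dataclasses import dataclass

# Namespace of the OLiA reference model.
OLIA = "http://purl.org/olia/olia.owl#"

# Hypothetical tag-to-OLiA table; the tag set actually used may differ.
POS_TO_OLIA = {
    "NOUN": OLIA + "Noun",
    "VERB": OLIA + "Verb",
    "ADJ": OLIA + "Adjective",
}


@dataclass(frozen=True)
class Token:
    """One annotated token, loosely following NIF string offsets."""
    text: str   # surface form, as in nif:anchorOf
    begin: int  # start offset, as in nif:beginIndex
    end: int    # end offset, as in nif:endIndex
    lemma: str
    pos: str


# Toy lexicon keyed by (lemma, OLiA category); the entry IDs are invented.
LEXICON = {
    ("kuća", OLIA + "Noun"): ["entry/kuca-n"],
    ("videti", OLIA + "Verb"): ["entry/videti-v"],
}


def olia_category(token):
    """Return the OLiA category IRI for a token, or None if the tag is unmapped."""
    return POS_TO_OLIA.get(token.pos)


def candidate_entries(token, lexicon):
    """Return lexicon entries whose lemma and category both match the token."""
    category = olia_category(token)
    if category is None:
        return []
    return lexicon.get((token.lemma, category), [])


# Tokens of the example sentence "Vidim kuću."
tokens = [
    Token("Vidim", 0, 5, "videti", "VERB"),
    Token("kuću", 6, 10, "kuća", "NOUN"),
]
for token in tokens:
    print(token.text, candidate_entries(token, LEXICON))
```

Keying the lookup on both lemma and category, rather than on the lemma alone, keeps homographs with different parts of speech apart.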
This mapping allows for the identification of candidate lexicon entries within the SrpMD Serbian dictionary (Krstev, 2008; Stanković et al., 2018), which is structured using the OntoLex-Lemon model (McCrae et al., 2017). The smooth integration of this model into ...