Result: Innovation in phraseomatics : DiCoP project and DiCoP-Text corpus for the enrichment of Language Models and Automatic Translation

Title:
Innovation in phraseomatics : DiCoP project and DiCoP-Text corpus for the enrichment of Language Models and Automatic Translation
Innovation en phraséomatique : projet DiCoP et DiCoP-Text pour l'enrichissement des modèles de langage et la traduction automatique
Contributors:
Lexiques, Textes, Discours, Dictionnaire - Centre Jean Pruvost (LT2D), CY Cergy Paris Université (CY), Laboratoire Ligérien de Linguistique (LLL), Bibliothèque nationale de France (BnF)-Université d'Orléans (UO)-Université de Tours (UT)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique, Image et Interaction - EA 2118 (L3I), La Rochelle Université (ULR), LT2D, Cergy Paris Université, LLL, Université d'Orléans, Kristina Štrkalj Despot, Ana Ostroški Anić, Ivana Brač, XXI EURALEX International Congress
Source:
EURALEX 2024 - 21st EURALEX International Congress: Lexicography and Semantics, pp. 227-234
Publisher Information:
CCSD, 2024.
Publication Year:
2024
Collection:
collection:SHS
collection:BNF
collection:UNIV-TOURS
collection:CNRS
collection:UNIV-ORLEANS
collection:UNIV-CERGY
collection:AO-LINGUISTIQUE
collection:LLL
collection:UNIV-ROCHELLE
collection:LT2D
collection:CY-ART-HUMANITES
collection:CY-MAISON-SHS
Original Identifier:
HAL: hal-04664240
Document Type:
Conference paper (conferenceObject)
Language:
English
ISBN:
978-953-7967-77-2
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edshal.hal.04664240v1
Database:
HAL

Further Information

This article examines advances in phraseomatics (L. Chen, 2023) and digital phraseography through the DiCoP project and its DiCoP-Text corpus, which aim to enrich language models and machine translation. The project evaluates how frequently phraseological units (PUs) are used and improves their translation in different contexts, drawing on recent research in phraseotranslation (Sułkowska, 2022) and natural language processing (NLP). It focuses on the French-Chinese and Chinese-French language pairs. For our tests, we integrated 549 PUs from the novel The Three-Body Problem by Liu Cixin. Several processes, including tokenization, identification, alignment, and annotation, were used to improve the translation of PUs. DiCoP-Text, a comprehensive database that includes newspaper articles, literary works, and textbooks, aims to enhance the performance of language models (LMs).
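
To make the tokenization and identification steps named in the abstract concrete, the following is a minimal sketch of how the frequency of PUs in a text might be counted. It is not the DiCoP pipeline: the function names `tokenize` and `count_pus`, the naive regex tokenizer, and the toy French sentences are all assumptions introduced for illustration, and none of the data comes from DiCoP-Text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def count_pus(text, pus):
    """Count how often each PU occurs as a contiguous token sequence in text.

    Hypothetical helper: matching is exact, so inflected variants of a PU
    are not found unless the text is lemmatized first.
    """
    tokens = tokenize(text)
    counts = Counter()
    for pu in pus:
        pu_tokens = tokenize(pu)
        n = len(pu_tokens)
        if n == 0:
            continue
        # Slide a window of length n over the text and compare token by token.
        counts[pu] = sum(
            1
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == pu_tokens
        )
    return counts


# Toy data invented for this sketch (not taken from the corpus).
toy_text = (
    "Il a eu un coup de foudre. Quel coup de foudre ! "
    "Elle est tombée dans les pommes."
)
toy_pus = ["coup de foudre", "tomber dans les pommes"]

counts = count_pus(toy_text, toy_pus)
for pu in toy_pus:
    print(pu, counts[pu])
```

In this toy run the citation form "tomber dans les pommes" is not matched against the inflected "tombée dans les pommes", which illustrates why identification generally needs more than exact string matching. For Chinese, which is written without spaces between words, the regex tokenizer above would have to be replaced by a dedicated word segmenter.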