Title:
MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents
Contributors:
Evaluations and Language resources Distribution Agency (ELDA); VicomTech; Information, Langue Ecrite et Signée (ILES); Laboratoire Interdisciplinaire des Sciences du Numérique (LISN); Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS); Sciences et Technologies des Langues - LISN (STL); Pangeanic - PangeaMT; European Project: A2019/1927065, MAPA
Source:
Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Language Resources (LEGAL - MDLR 2022), pp. 64-72
Publisher Information:
CCSD, 2022.
Publication Year:
2022
Collection:
collection:CNRS
collection:CENTRALESUPELEC
collection:UNIV-PARIS-SACLAY
collection:UNIVERSITE-PARIS-SACLAY
collection:LISN
collection:GS-COMPUTER-SCIENCE
collection:LISN-ILES
collection:LISN-STL
Original Identifier:
HAL: hal-03873042
Document Type:
Conference conferenceObject
Conference papers
Language:
English
Relation:
info:eu-repo/grantAgreement//A2019/1927065/EU/Multilingual Anonymisation toolkit for Public Administrations/MAPA
Rights:
info:eu-repo/semantics/OpenAccess
URL: http://creativecommons.org/licenses/by-nc/
Accession Number:
edshal.hal.03873042v1
Database:
HAL

Further Information

This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit that detects and substitutes sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. Within the project, the toolkit was developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
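
To make the "detect and substitute" step in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the MAPA toolkit or its API. It assumes that some named entity recogniser has already produced character spans with labels, and it shows only the substitution step. The names `substitute_entities` and `span_of`, the label set, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple

# (start, end, label) with end exclusive; a hypothetical span format,
# not the format used by the MAPA toolkit.
Span = Tuple[int, int, str]


def span_of(text: str, phrase: str, label: str) -> Span:
    """Stand-in for a detector: locate the first occurrence of a known phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def substitute_entities(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a bracketed label placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented sample sentence; the spans stand in for NER output.
sample = "Maria Lopez was admitted in Valencia on 3 May 2021."
detected = [
    span_of(sample, "Maria Lopez", "PERSON"),
    span_of(sample, "Valencia", "LOCATION"),
    span_of(sample, "3 May 2021", "DATE"),
]
print(substitute_entities(sample, detected))
```

In a real pipeline the `detected` list would come from a trained named entity recognition model rather than from string lookup, and the placeholder could be replaced by a surrogate value of the same entity type.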