Title:
MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents
Authors:
Contributors:
Evaluations and Language resources Distribution Agency (ELDA), VicomTech, Information, Langue Ecrite et Signée (ILES), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Sciences et Technologies des Langues - LISN (STL), Pangeanic - PangeaMT, European Project: A2019/1927065, MAPA
Source:
Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Language Resources (LEGAL - MDLR 2022). :64-72
Publisher Information:
CCSD, 2022.
Publication Year:
2022
Collection:
collection:CNRS
collection:CENTRALESUPELEC
collection:UNIV-PARIS-SACLAY
collection:UNIVERSITE-PARIS-SACLAY
collection:LISN
collection:GS-COMPUTER-SCIENCE
collection:LISN-ILES
collection:LISN-STL
Subject Terms:
pseudonymisation, de-identification, sensitive information, deep learning, BERT, NER, annotated data, anonymisation, MESH: Natural Language Processing, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing
Subject Geographic:
Original Identifier:
HAL: hal-03873042
Document Type:
Conference
conferenceObject
Conference papers
Language:
English
Relation:
info:eu-repo/grantAgreement//A2019/1927065/EU/Multilingual Anonymisation toolkit for Public Administrations/MAPA
Access URL:
Rights:
info:eu-repo/semantics/OpenAccess
URL: http://creativecommons.org/licenses/by-nc/
Accession Number:
edshal.hal.03873042v1
Database:
HAL
Further Information
This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit that detects and substitutes sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. Within the project, the toolkit was developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
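The substitution step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the MAPA toolkit's code or API: the function name `substitute_entities`, the span format `(start, end, label)`, the labels, and the sample sentence are all hypothetical, and the spans are written by hand here where a real system would obtain them from a named entity recognition model.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with `end` exclusive.
Span = Tuple[int, int, str]


def substitute_entities(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a placeholder such as [PERSON]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last span
    return "".join(pieces)


# Illustrative input; in practice the spans would come from an NER model.
text = "Maria Lopez was admitted in Valencia on 3 May 2021."
spans = [(0, 11, "PERSON"), (28, 36, "LOCATION"), (40, 50, "DATE")]
print(substitute_entities(text, spans))
```

Replacing spans in order of their start offset keeps the character positions valid while the output is assembled. A pseudonymisation variant would insert a realistic surrogate value in place of the bracketed label.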