Result: MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents

Title:

MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents

Authors:

Arranz, Victoria, Choukri, Khalid, Cuadros, Montse, García-Pablos, Aitor, Gianola, Lucie, Grouin, Cyril, Herranz, Manuel, Paroubek, Patrick, Zweigenbaum, Pierre

Contributors:

Evaluations and Language resources Distribution Agency (ELDA)
VicomTech
Information, Langue Ecrite et Signée (ILES)
Laboratoire Interdisciplinaire des Sciences du Numérique (LISN)
Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
Sciences et Technologies des Langues - LISN (STL)
Pangeanic - PangeaMT
European Project: A2019/1927065, MAPA

Source:

Joint Workshop on Legal and Ethical Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Language Resources (LEGAL - MDLR 2022), pp. 64-72

Publisher Information:

CCSD, 2022.

Publication Year:

2022

Collection:

collection:CNRS
collection:CENTRALESUPELEC
collection:UNIV-PARIS-SACLAY
collection:UNIVERSITE-PARIS-SACLAY
collection:LISN
collection:GS-COMPUTER-SCIENCE
collection:LISN-ILES
collection:LISN-STL

Subject Terms:

pseudonymisation, de-identification, sensitive information, deep learning, BERT, NER, annotated data, anonymisation, MESH: Natural Language Processing, [INFO.INFO-CL]Computer Science [cs], Computation and Language [cs.CL], [INFO.INFO-TT]Computer Science [cs], Document and Text Processing

Subject Geographic:

Marseille, France

Original Identifier:

HAL: hal-03873042

Document Type:

Conference conferenceObject
Conference papers

Language:

English

Relation:

info:eu-repo/grantAgreement//A2019/1927065/EU/Multilingual Anonymisation toolkit for Public Administrations/MAPA

Access URL:

https://hal.science/hal-03873042
https://hal.science/hal-03873042v1/document
https://hal.science/hal-03873042v1/file/Arranz_LEGAL2022.pdf

Rights:

info:eu-repo/semantics/OpenAccess
URL: http://creativecommons.org/licenses/by-nc/

Accession Number:

edshal.hal.03873042v1

Database:

HAL

Further Information

This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. In the context of the project, the toolkit has been developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
