Title:
Unsupervised Translation of Programming Languages ; Traduction Non Supervisée de Langages de Programmation
Contributors:
Tristan Cazenave; Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision (LAMSADE), Université Paris Dauphine-PSL, Université Paris Sciences et Lettres (PSL), Centre National de la Recherche Scientifique (CNRS)
Source:
https://theses.hal.science/tel-03852612 ; Neural and Evolutionary Computing [cs.NE]. Université Paris sciences et lettres, 2022. English. ⟨NNT : 2022UPSLD015⟩.
Publisher Information:
CCSD
Publication Year:
2022
Collection:
Université Paris-Dauphine: HAL
Document Type:
Dissertation (doctoral or postdoctoral thesis)
Language:
English
Relation:
NNT: 2022UPSLD015
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edsbas.BA64F75B
Database:
BASE

Further Information

A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules applied to the abstract syntax tree of the source code. Unfortunately, the resulting translations often lack readability, fail to respect the target language's conventions, and require manual modifications to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in natural language translation, their application to transcompilation has been limited by the scarcity of parallel data in this domain. In this thesis, we propose methods to train effective, fully unsupervised neural transcompilers.

Natural language translators are evaluated with metrics based on token co-occurrences between the translation and the reference. We show that such metrics do not capture the semantics of programming languages. Hence, we build and release a test set composed of 852 parallel functions, along with unit tests to check the semantic correctness of translations.

We first leverage objectives designed for natural languages to learn multilingual representations of source code, and train a model to translate using source code from open-source GitHub projects. This model outperforms rule-based methods for translating functions between C++, Java, and Python. We then develop an improved pre-training method that leads the model to learn deeper semantic representations of source code, resulting in improved performance on several tasks, including unsupervised code translation.
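The handcrafted rewrite rules mentioned above can be illustrated with a minimal sketch using Python's ast module. The rule below (replacing `a ** b` with `pow(a, b)`) is purely illustrative and not drawn from the thesis; a real transcompiler would apply many such rules across two languages:

```python
import ast

class PowToCall(ast.NodeTransformer):
    """Illustrative rewrite rule: replace `a ** b` with `pow(a, b)`."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        if isinstance(node.op, ast.Pow):
            call = ast.Call(func=ast.Name(id="pow", ctx=ast.Load()),
                            args=[node.left, node.right], keywords=[])
            return ast.copy_location(call, node)
        return node

# Parse source into an AST, apply the rule, and emit source again
# (ast.unparse requires Python 3.9+).
tree = ast.parse("y = x ** 2 + 1")
new_tree = ast.fix_missing_locations(PowToCall().visit(tree))
print(ast.unparse(new_tree))  # y = pow(x, 2) + 1
```

Because the rule operates on the syntax tree rather than on raw text, it handles nesting and precedence correctly, but as the abstract notes, the output of such rules often remains unidiomatic in the target language.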
Finally, we use automated unit tests to create examples ...
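The unit-test-based notion of semantic correctness used throughout the thesis can be sketched as follows: a translation counts as correct only if it produces the same outputs as the reference on every test input. The functions and inputs below are illustrative stand-ins, not items from the released test set:

```python
def reference_max_of_three(a, b, c):
    # Stand-in for the behavior of an original (e.g. C++) function.
    return max(a, b, c)

def translated_max_of_three(a, b, c):
    # Stand-in for a transcompiler's candidate translation.
    m = a if a > b else b
    return m if m > c else c

def semantically_equivalent(ref, hyp, test_inputs):
    """True iff the candidate matches the reference on all test inputs."""
    return all(ref(*args) == hyp(*args) for args in test_inputs)

tests = [(1, 2, 3), (3, 2, 1), (-5, -5, 0), (7, 7, 7)]
print(semantically_equivalent(reference_max_of_three,
                              translated_max_of_three, tests))  # True
```

Unlike token-overlap metrics, this check accepts any translation that is functionally equivalent to the reference, however differently it is written.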