Title:
CLAVE: A deep learning model for source code authorship verification with contrastive learning and transformer encoders.
Authors:
Álvarez-Fidalgo, David1 (AUTHOR) uo270571@uniovi.es, Ortin, Francisco1,2 (AUTHOR) ortin@uniovi.es
Source:
Information Processing & Management. May 2025, Vol. 62 Issue 3, pN.PAG-N.PAG. 1p.
Database:
Education Research Complete


Source code authorship verification involves determining whether two code fragments were written by the same programmer. It has many uses, including malware authorship analysis, copyright dispute resolution, and plagiarism detection. Source code authorship verification is challenging because it must generalize to code written by programmers not included in the training data. In this paper, we present CLAVE (Contrastive Learning for Authorship Verification with Encoder representations), a novel deep learning model for source code authorship verification that leverages contrastive learning and a Transformer Encoder-based architecture. We first pre-train CLAVE on a dataset of 270,602 Python source code files extracted from GitHub. We then fine-tune CLAVE for authorship verification using contrastive learning on Python submissions from 61,956 distinct programmers in Google Code Jam and Kick Start competitions. This approach allows the model to learn stylometric representations of source code, enabling authorship verification by comparing vector distances. CLAVE achieves an AUC of 0.9782, reduces the error of state-of-the-art source code authorship verification systems by at least 23.4%, and improves the AUC of cutting-edge source code LLMs by 21.9% to 40%. We also quantify how CLAVE's main components contribute to its AUC: pre-training (1.8%), loss function (0.2%–2.8%), input length (0.1%–0.7%), model size (0.2%), and tokenizer (0.1%–0.7%).

• CLAVE is a deep-learning model for source code authorship verification.
• A large Python dataset is used to pre-train a Transformer Encoder.
• Different customized tokenizers are specifically defined for Python.
• Contrastive learning is used to learn stylometric representations.
• CLAVE outperforms state-of-the-art systems with an AUC of 0.9782.

[ABSTRACT FROM AUTHOR]
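The abstract's core idea (learn stylometric embeddings with a contrastive objective, then verify authorship by embedding distance) can be illustrated with a minimal sketch. This is not the paper's actual model: the classic margin-based pairwise contrastive loss, the Euclidean distance metric, the `margin` and `threshold` values, and the toy embedding vectors are all illustrative assumptions; CLAVE's real embeddings come from its Transformer Encoder.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same_author, margin=1.0):
    """Margin-based pairwise contrastive loss (illustrative, not
    necessarily the paper's exact loss): pull same-author embeddings
    together, push different-author embeddings at least `margin` apart."""
    d = euclidean(u, v)
    if same_author:
        return d ** 2
    return max(0.0, margin - d) ** 2

def verify(u, v, threshold=0.5):
    """Verification decision: same author iff embeddings are close.
    The threshold would be tuned on held-out pairs in practice."""
    return euclidean(u, v) < threshold

# Toy stylometric embeddings (hypothetical values)
e1 = [0.9, 0.1]   # fragment A
e2 = [0.85, 0.15] # stylistically close fragment
e3 = [-0.2, 0.9]  # stylistically distant fragment

same = verify(e1, e2)       # close pair → treated as same author
different = verify(e1, e3)  # distant pair → treated as different authors
```

During fine-tuning, minimizing `contrastive_loss` over many labeled pairs is what shapes the embedding space so that the simple distance threshold in `verify` becomes an effective decision rule.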

Copyright of Information Processing & Management is the property of Pergamon Press - An Imprint of Elsevier Science and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)