Result: Code semantic enrichment for deep code search.
Further Information
Code search aims to retrieve, from a large-scale codebase, code snippets whose semantics match the intent of a developer's query. Code is a low-level implementation of a programming intent, whereas a query is expressed in clear, high-level semantics, which makes it difficult for DL-based approaches to learn the semantic relationship between the two. Through a large-scale empirical analysis of more than 2.2 million pairs of Java code and descriptions, we found that the semantics of code and query can be aligned by enriching code with the descriptions of other code that has a similar implementation. Based on this finding, we propose a code semantic enrichment approach for deep code search, named SemEnr. Specifically, we first enrich the semantics of all code snippets in the training and testing data: we estimate the syntactic similarity between each code snippet and the code in the training data and retrieve the most similar one for each snippet. Thereafter, the semantics of a code snippet are represented by its code tokens and the description of the retrieved most similar code. During model training, we use the attention mechanism to embed pairs of enriched code and query into a shared high-dimensional vector space. To improve the quality of the learned representations, we integrate a multi-perspective co-attention mechanism that employs Convolutional Neural Networks (CNNs) to capture local correlations between code and query. Finally, we evaluate the effectiveness of our approach through experiments on two extensively used Java datasets. The experimental results show that SemEnr achieves MRRs of 0.698 and 0.631, outperforming the best baseline, CAT (a state-of-the-art DL-based model), by 19.93% and 18.83%, respectively. In addition, we conducted a user study involving 50 real-world queries to assess SemEnr's performance; the findings suggest that SemEnr outperforms the baseline models by returning more relevant code snippets.
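The enrichment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how syntactic similarity is estimated, so Jaccard overlap of token sets is used here purely as a stand-in, and the names `tokenize`, `jaccard`, and `enrich_snippet` are hypothetical. Each snippet is paired with the description of its most similar training snippet.

```python
import re
from typing import List, Optional, Sequence, Tuple


def tokenize(code: str) -> List[str]:
    """Split code into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard overlap of two token sets (ASSUMPTION: stand-in similarity)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def enrich_snippet(
    code: str,
    training_pairs: Sequence[Tuple[str, str]],
    exclude_index: Optional[int] = None,
) -> Tuple[List[str], str]:
    """Return (code tokens, description of the most similar training code).

    training_pairs holds (code, description) pairs. exclude_index skips one
    entry, so a training snippet is not matched with itself.
    """
    tokens = tokenize(code)
    best_score, best_description = -1.0, ""
    for i, (other_code, description) in enumerate(training_pairs):
        if i == exclude_index:
            continue
        score = jaccard(tokens, tokenize(other_code))
        if score > best_score:
            best_score, best_description = score, description
    return tokens, best_description


# Hypothetical toy corpus, for illustration only.
training_pairs = [
    ("int add(int a, int b) { return a + b; }", "adds two integers"),
    ("String read(File f) { return Files.readString(f.toPath()); }",
     "reads a file into a string"),
]
tokens, description = enrich_snippet(
    "int sum(int x, int y) { return x + y; }", training_pairs
)
```

In this toy setting the enriched representation is the pair `(tokens, description)`; a real pipeline would feed both into the embedding model and would likely use an indexed retrieval engine rather than a linear scan.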
• Finding that code semantics can be enriched by incorporating the description of the most similar code.
• Proposing a code semantic enrichment approach named SemEnr for deep code search.
• Evaluating the performance of SemEnr on two existing datasets and 50 real queries.
• Making our implementation and the dataset used available.
[ABSTRACT FROM AUTHOR]
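For readers unfamiliar with the metric reported in the abstract, MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the first correct result appears. The sketch below shows the standard definition only. The function name and the convention that a query with no correct result contributes 0 are assumptions; the abstract does not state the exact evaluation protocol.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Compute MRR from 1-based ranks of the first correct result per query.

    A rank of None means no correct result was returned for that query and
    contributes 0 to the average (ASSUMPTION: a common convention).
    """
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in first_hit_ranks:
        if rank is not None:
            if rank < 1:
                raise ValueError("ranks are 1-based")
            total += 1.0 / rank
    return total / len(first_hit_ranks)


# Hypothetical ranks for four queries: hits at ranks 1, 2, and 4, one miss.
example_mrr = mean_reciprocal_rank([1, 2, None, 4])
```

An MRR close to 1 means the correct snippet is usually ranked first, so higher values indicate better retrieval.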
Copyright of Journal of Systems & Software is the property of Elsevier B.V.