Insights and Analysis of Open-source License Violation Risks in LLMs Generated Code.
The rapid development of large language models (LLMs) has significantly influenced software engineering. Pre-trained on vast amounts of code from open-source repositories, these models efficiently handle tasks such as code generation and code completion. However, much of the code in open-source repositories is governed by open-source licenses, which exposes LLMs to the risk of open-source license violations. This paper examines the license violation risks between LLM-generated code and open-source repositories. A detection framework based on code clone detection is developed to trace the source of LLM-generated code and to identify copyright infringement issues. Using this framework, 135,000 Python code fragments generated by nine mainstream code LLMs are traced to their sources in the open-source community and checked for open-source license compatibility. Three research questions are investigated empirically to explore the impact of LLM code generation on the open-source software ecosystem: (1) To what extent is LLM-generated code cloned from open-source software repositories? (2) Is there a risk of open-source license violations in LLM-generated code? (3) Is there a risk of open-source license violations in the LLM-generated code included in real open-source software? The experimental results show that, of the 43,130 and 65,900 Python code fragments longer than six lines generated from functional descriptions and method signatures, 68.5% and 60.9%, respectively, are traced to cloned open-source code segments. The CodeParrot and CodeGen series models have the highest clone ratios, while GPT-3.5-Turbo has the lowest. In addition, 92.7% of the code files generated from functional descriptions lack a license declaration. When compared against the licenses of the traced code fragments, 81.8% of the code files carry open-source license violation risks. Furthermore, among 229 LLM-generated program files collected from GitHub, 136 are traced to open-source code segments; of these, 38 are Type-1 or Type-2 clones and 30 carry open-source license violation risks. These issues have been reported to the developers as problem reports, and feedback has been received from eight developers so far. [ABSTRACT FROM AUTHOR]
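The two checks the abstract describes, clone classification and license comparison, can be illustrated with a minimal Python sketch. It is not the paper's framework: the function names (`token_stream`, `classify_clone`, `license_risk`) and the small `ALLOWED_TARGETS` table are hypothetical, the table is a deliberately simplified assumption about license compatibility, and whole fragments are compared directly rather than searched against an index of open-source code. Type-1 is taken to mean identical code apart from comments and layout, and Type-2 identical code apart from identifiers and literals.

```python
import io
import keyword
import tokenize
from typing import List, Optional

# Tokens that carry only comments or blank lines; Type-1 clones may differ in these.
_IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Structural tokens are kept but normalized, so indentation width does not matter.
_STRUCTURAL = {tokenize.NEWLINE: "<EOL>", tokenize.INDENT: "<IN>", tokenize.DEDENT: "<OUT>"}


def token_stream(source: str, abstract: bool) -> List[str]:
    """Tokenize Python source; if `abstract`, replace identifiers and literals."""
    stream = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _IGNORED:
            continue
        if tok.type in _STRUCTURAL:
            stream.append(_STRUCTURAL[tok.type])
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            stream.append("ID")   # any identifier
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            stream.append("LIT")  # any numeric or string literal
        else:
            stream.append(tok.string)
    return stream


def classify_clone(generated: str, candidate: str) -> Optional[str]:
    """Return "Type-1", "Type-2", or None for two whole code fragments."""
    if token_stream(generated, False) == token_stream(candidate, False):
        return "Type-1"
    if token_stream(generated, True) == token_stream(candidate, True):
        return "Type-2"
    return None


# Hypothetical, simplified table (an assumption for illustration only):
# source license -> declared licenses under which reuse is treated as acceptable.
ALLOWED_TARGETS = {
    "MIT": {"MIT", "Apache-2.0", "GPL-3.0-only"},
    "Apache-2.0": {"Apache-2.0", "GPL-3.0-only"},
    "GPL-3.0-only": {"GPL-3.0-only"},
}


def license_risk(declared: Optional[str], source_license: str) -> bool:
    """Flag a risk if the file declares no license or an incompatible one."""
    if declared is None:
        return True  # cloned code without any license declaration
    # Unknown source licenses are flagged conservatively.
    return declared not in ALLOWED_TARGETS.get(source_license, set())
```

Under these assumptions, a generated fragment that differs from an open-source fragment only in a trailing comment is classified as Type-1, one that differs only in variable names as Type-2, and a generated file with no license declaration is flagged whenever it is traced to licensed code.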
Copyright of International Journal of Software & Informatics is the property of Institute of Software, Chinese Academy of Sciences and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)