Result: Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence

Title:
Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence
Contributors:
York University [Toronto]
Publisher Information:
CCSD, 2025.
Publication Year:
2025
Original Identifier:
HAL: hal-05183398
Document Type:
E-resource preprint; Preprints; Working Papers
Language:
English
Rights:
info:eu-repo/semantics/OpenAccess
URL: http://creativecommons.org/licenses/by/
Accession Number:
edshal.hal.05183398v1
Database:
HAL

Further Information

37 pages + references
Since the release of early Large Language Models (LLMs) such as GPT-2 and GPT-3 around 2020, rapid advancements in LLM capabilities have significantly impacted the field of code intelligence, enabling automation across a wide range of coding tasks, including code generation, program repair, debugging, and software testing. As LLMs are widely used in coding tasks, benchmarking their capabilities in a meticulous and meaningful way is essential. In this work, we survey 142 related papers published between January 2020 and June 2025, covering 156 unique benchmarks and 32 different coding tasks, to provide a comprehensive review of coding benchmarks used to evaluate LLMs, exploring their characteristics, strengths, limitations, dataset construction, task coverage, alignment with real-world challenges, and performance metrics. Our analysis indicates that Python was the most prevalent programming language, featured in 77% of the datasets, and that GitHub was the most common data source, used in 46% of the cases. There is also an increasing trend in the publication of new benchmark datasets for code intelligence over the past three years. Prominent publication venues are machine learning conferences such as NeurIPS, ICLR, and ICML, while comparatively fewer datasets have appeared in software engineering venues such as ICSE, ASE, and MSR. In addition, most of the benchmarks focus on code generation, with 86 datasets dedicated to this task. We identify key gaps and limitations in existing benchmarks, including dataset bias, lack of evolution support, lack of standardized evaluation protocols, and insufficient coverage of real-world challenges. Finally, we present a vision for the future of LLM benchmark creation, advocating for benchmarks that are more robust, adaptable, and better aligned with practical code intelligence demands, ensuring the reliability and practicality of LLMs in coding workflows.