
Title:
Benchmarking the Performance of Generative AI Models on Fundamental Python Programming Tasks: Dataset and Evaluation Report
Publisher Information:
Zenodo
Publication Year:
2025
Collection:
Zenodo
Document Type:
Academic journal text
Language:
Indonesian
DOI:
10.5281/zenodo.15788783
Rights:
Creative Commons Attribution NonCommercial 4.0 International (CC BY-NC 4.0); https://creativecommons.org/licenses/by-nc/4.0/legalcode; Copyright © 2025 Kautsar Ramadhan
Accession Number:
edsbas.BBBA3E52
Database:
BASE

Further Information

This dataset was developed as part of an undergraduate research project by Kautsar Ramadhan (17210623) at Universitas Bina Sarana Informatika. It serves as an open benchmark for evaluating the performance of five prominent generative AI platforms in responding to fundamental Python programming questions formulated in Indonesian. The primary goal is to provide a transparent and replicable resource for assessing the effectiveness of AI models in the context of programming education, particularly for the Indonesian-speaking community. The evaluation employs a quantitative approach, using non-parametric statistical tests to ensure the validity of the results.

The AI models benchmarked are:
- ChatGPT (OpenAI's GPT-4o)
- Gemini (Google's Gemini 2.5 Pro)
- GitHub Copilot
- Meta AI
- Perplexity AI (Sonar model)

This repository contains:
- A corpus of 100 multiple-choice questions on basic Python programming, presented in Indonesian.
- The complete raw text responses generated by each of the five AI models for all questions.
- Pre-computed evaluation metrics: BLEU scores (syntactic similarity), BERTScore (semantic similarity), and response times (in milliseconds).
- Python source code for automated evaluation (illustrative sketches of these modules follow at the end of this description):
  - bleu.py: BLEU score computation module.
  - bert.py: modified BERTScore module with chunking support for long answers.
  - analisis_final.py: statistical testing script for both Permutation ANOVA (with Bonferroni correction) and Paired Bootstrap Resampling.
- Data visualizations, such as forest plots, radar charts, and bootstrap distribution plots.
- An env.yaml file for environment replication using Conda.

These components form a modular evaluation framework, allowing for automated, reproducible, and extensible analysis of AI performance. The framework can serve as a reusable toolkit for future benchmarking efforts involving generative AI models, particularly in low-resource language contexts such as Indonesian.

Associated Publications:
📝 Journal Article: Benchmarking AI Platforms in Answering Python Questions in Indonesian: ...
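To make the evaluation pipeline concrete, the sketches below illustrate how each listed module might work; they are illustrative only, not the repository's actual code. First, a minimal sentence-level BLEU computation of the kind bleu.py could perform; the function name, whitespace tokenization, and smoothing choice are assumptions.

```python
# Minimal sketch of a smoothed sentence-level BLEU computation (cf. bleu.py).
# Function name, tokenization, and smoothing method are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_for_answer(reference: str, candidate: str) -> float:
    """Score one AI answer against the reference answer with smoothed BLEU."""
    ref_tokens = [reference.lower().split()]   # BLEU expects a list of reference token lists
    cand_tokens = candidate.lower().split()
    # Smoothing avoids zero scores when higher-order n-grams are absent,
    # which is common for short multiple-choice answers.
    return sentence_bleu(ref_tokens, cand_tokens,
                         smoothing_function=SmoothingFunction().method1)

print(bleu_for_answer("fungsi print menampilkan keluaran ke layar",
                      "print menampilkan keluaran ke layar"))
```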
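Second, a sketch of BERTScore with simple chunking for long answers, in the spirit of bert.py. The chunk size, averaging strategy, and use of the default multilingual model via lang="id" are assumptions; the actual module may differ.

```python
# Sketch of BERTScore with chunking for long answers (cf. bert.py).
# Chunk size and mean-F1 aggregation are assumptions.
from bert_score import score

def chunked_bertscore(reference: str, candidate: str, max_words: int = 200) -> float:
    """Split a long candidate into word chunks, score each chunk against the
    reference, and return the mean F1."""
    words = candidate.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)] or [""]
    refs = [reference] * len(chunks)
    # lang="id" falls back to a multilingual model suitable for Indonesian text.
    _, _, f1 = score(chunks, refs, lang="id", verbose=False)
    return f1.mean().item()
```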
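Third, a sketch of the statistical testing that analisis_final.py is described as performing: a permutation test of the one-way ANOVA F statistic across the five models' per-question scores, plus a Bonferroni adjustment for pairwise follow-up tests. Array shapes, the number of permutations, and helper names are assumptions.

```python
# Sketch of Permutation ANOVA with Bonferroni correction (cf. analisis_final.py).
# n_perm and the data layout (one 1-D score array per model) are assumptions.
import numpy as np
from scipy.stats import f_oneway

def permutation_anova(groups, n_perm=10_000, seed=0):
    """Permutation p-value for the one-way ANOVA F statistic."""
    rng = np.random.default_rng(seed)
    f_obs = f_oneway(*groups).statistic
    pooled = np.concatenate(groups)
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # permute scores across models
        parts = np.split(pooled, np.cumsum(sizes)[:-1])
        if f_oneway(*parts).statistic >= f_obs:
            count += 1
    return (count + 1) / (n_perm + 1)

def bonferroni(p_values):
    """Bonferroni-adjust p-values from k pairwise follow-up comparisons."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]
```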
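Finally, a sketch of paired bootstrap resampling for comparing two models on the same 100 questions, the second test named for analisis_final.py. The number of resamples and the 95% percentile interval are assumptions.

```python
# Sketch of Paired Bootstrap Resampling over per-question scores.
# n_boot and the confidence level are assumptions.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap the mean score difference (A - B) over paired questions and
    return a 95% percentile confidence interval."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample question indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.percentile(diffs, [2.5, 97.5])
```

If the resulting interval excludes zero, the paired difference between the two models is unlikely to be due to question sampling alone; this is the kind of evidence the repository's bootstrap distribution plots visualize.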