
Title:
Benchmarking the Performance of Generative AI Models on Fundamental Python Programming Tasks: Dataset and Evaluation Report
Publisher Information:
Zenodo
Publication Year:
2025
Collection:
Zenodo
Document Type:
Academic journal text
Language:
Indonesian
DOI:
10.5281/zenodo.15788783
Rights:
Creative Commons Attribution NonCommercial 4.0 International (CC BY-NC 4.0); https://creativecommons.org/licenses/by-nc/4.0/legalcode; Copyright © 2025 Kautsar Ramadhan
Accession Number:
edsbas.BBBA3E52
Database:
BASE

Further Information

This dataset was developed as part of an undergraduate research project by Kautsar Ramadhan (17210623) at Universitas Bina Sarana Informatika. It serves as an open benchmark for evaluating the performance of five prominent generative AI platforms in responding to fundamental Python programming questions formulated in Indonesian. The primary goal is to provide a transparent and replicable resource for assessing the effectiveness of AI models in the context of programming education, particularly for the Indonesian-speaking community. The evaluation employs a quantitative approach, using non-parametric statistical tests to ensure the validity of the results.

The AI models benchmarked are:
- ChatGPT (OpenAI's GPT-4o)
- Gemini (Google's Gemini 2.5 Pro)
- GitHub Copilot
- Meta AI
- Perplexity AI (Sonar model)

This repository contains:
- A corpus of 100 multiple-choice questions on basic Python programming, presented in Indonesian.
- The complete raw text responses generated by each of the five AI models for all questions.
- Pre-computed evaluation metrics: BLEU scores (syntactic similarity), BERTScore (semantic similarity), and response times (in milliseconds).
- Python source code for automated evaluation (illustrative sketches of these modules follow at the end of this description):
  - bleu.py: BLEU score computation module.
  - bert.py: modified BERTScore module with chunking support for long answers.
  - analisis_final.py: statistical testing script for both Permutation ANOVA (with Bonferroni correction) and Paired Bootstrap Resampling.
- Data visualizations, such as forest plots, radar charts, and bootstrap distribution plots.
- An env.yaml file for environment replication using Conda.

These components form a modular evaluation framework, allowing for automated, reproducible, and extensible analysis of AI performance. The framework can serve as a reusable toolkit for future benchmarking efforts involving generative AI models, particularly in low-resource language contexts such as Indonesian.

Associated Publications:
📝 Journal Article: Benchmarking AI Platforms in Answering Python Questions in Indonesian: ...
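To make the evaluation pipeline concrete, the sketches below illustrate how each listed module might work; they are illustrative only, not the repository's actual code. First, a minimal sentence-level BLEU computation of the kind bleu.py could perform; the function name, whitespace tokenization, and smoothing choice are assumptions.

```python
# Minimal sketch of a smoothed sentence-level BLEU computation (cf. bleu.py).
# Function name, tokenization, and smoothing method are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_for_answer(reference: str, candidate: str) -> float:
    """Score one AI answer against the reference answer with smoothed BLEU."""
    ref_tokens = [reference.lower().split()]   # BLEU expects a list of reference token lists
    cand_tokens = candidate.lower().split()
    # Smoothing avoids zero scores when higher-order n-grams are absent,
    # which is common for short multiple-choice answers.
    return sentence_bleu(ref_tokens, cand_tokens,
                         smoothing_function=SmoothingFunction().method1)

print(bleu_for_answer("fungsi print menampilkan keluaran ke layar",
                      "print menampilkan keluaran ke layar"))
```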
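Second, a sketch of BERTScore with simple chunking for long answers, in the spirit of bert.py. The chunk size, averaging strategy, and use of the default multilingual model via lang="id" are assumptions; the actual module may differ.

```python
# Sketch of BERTScore with chunking for long answers (cf. bert.py).
# Chunk size and mean-F1 aggregation are assumptions.
from bert_score import score

def chunked_bertscore(reference: str, candidate: str, max_words: int = 200) -> float:
    """Split a long candidate into word chunks, score each chunk against the
    reference, and return the mean F1."""
    words = candidate.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)] or [""]
    refs = [reference] * len(chunks)
    # lang="id" falls back to a multilingual model suitable for Indonesian text.
    _, _, f1 = score(chunks, refs, lang="id", verbose=False)
    return f1.mean().item()
```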
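Third, a sketch of the statistical testing that analisis_final.py is described as performing: a permutation test of the one-way ANOVA F statistic across the five models' per-question scores, plus a Bonferroni adjustment for pairwise follow-up tests. Array shapes, the number of permutations, and helper names are assumptions.

```python
# Sketch of Permutation ANOVA with Bonferroni correction (cf. analisis_final.py).
# n_perm and the data layout (one 1-D score array per model) are assumptions.
import numpy as np
from scipy.stats import f_oneway

def permutation_anova(groups, n_perm=10_000, seed=0):
    """Permutation p-value for the one-way ANOVA F statistic."""
    rng = np.random.default_rng(seed)
    f_obs = f_oneway(*groups).statistic
    pooled = np.concatenate(groups)
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # permute scores across models
        parts = np.split(pooled, np.cumsum(sizes)[:-1])
        if f_oneway(*parts).statistic >= f_obs:
            count += 1
    return (count + 1) / (n_perm + 1)

def bonferroni(p_values):
    """Bonferroni-adjust p-values from k pairwise follow-up comparisons."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]
```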
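Finally, a sketch of paired bootstrap resampling for comparing two models on the same 100 questions, the second test named for analisis_final.py. The number of resamples and the 95% percentile interval are assumptions.

```python
# Sketch of Paired Bootstrap Resampling over per-question scores.
# n_boot and the confidence level are assumptions.
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap the mean score difference (A - B) over paired questions and
    return a 95% percentile confidence interval."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample question indices with replacement
        diffs[i] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.percentile(diffs, [2.5, 97.5])
```

If the resulting interval excludes zero, the paired difference between the two models is unlikely to be due to question sampling alone; this is the kind of evidence the repository's bootstrap distribution plots visualize.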