Treffer: Comparative Analysis of AI Models for Python Code Generation: A HumanEval Benchmark Study.
Weitere Informationen
This study conducts a comprehensive comparative analysis of six contemporary artificial intelligence models for Python code generation using the HumanEval benchmark. The evaluated models include GPT-3.5 Turbo, GPT-4 Omni, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4, and Claude Opus 4. A total of 164 Python programming problems were utilized to assess model performance through a multi-faceted methodology incorporating automated functional correctness evaluation via the Pass@1 metric, cyclomatic complexity analysis, maintainability index calculations, and lines-of-code assessment. The results indicate that Claude Sonnet 4 achieved the highest performance with a success rate of 95.1%, followed closely by Claude Opus 4 at 94.5%. Across all metrics, models developed by Anthropic Claude consistently outperformed those developed by OpenAI GPT by margins exceeding 20%. Statistical analysis further confirmed the existence of significant differences between the model families (p < 0.001). Anthropic Claude models were observed to generate more sophisticated and maintainable solutions with superior syntactic accuracy. In contrast, OpenAI GPT models tended to adopt simpler strategies but exhibited notable limitations in terms of reliability. These findings offer evidence-based insights to guide the selection of AI-powered coding assistants in professional software development contexts. [ABSTRACT FROM AUTHOR]
Copyright of Applied Sciences (2076-3417) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)