Title:
ChatGPT and Python Programming Homework
Language:
English
Authors:
Michael E. Ellis (ORCID 0000-0001-8682-3873), K. Mike Casey, Geoffrey Hill
Source:
Decision Sciences Journal of Innovative Education. 2024 22(2):74-87.
Availability:
Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Peer Reviewed:
Y
Page Count:
14
Publication Date:
2024
Document Type:
Journal Articles; Reports - Research
Education Level:
Higher Education
Postsecondary Education
DOI:
10.1111/dsji.12306
ISSN:
1540-4595
1540-4609
Entry Date:
2024
Accession Number:
EJ1420551
Database:
ERIC

ChatGPT and Python programming homework 

Large Language Model (LLM) artificial intelligence tools present a unique challenge for educators who teach programming languages. While LLMs like ChatGPT have been well documented for their ability to complete exams and create prose, there is a noticeable lack of research into their ability to solve problems using high‐level programming languages. Like many other university educators, those teaching programming courses would like to detect if students submit assignments generated by an LLM. To investigate grade performance and the likelihood of instructors identifying code generated by artificial intelligence (AI) tools, we compare code generated by students and ChatGPT for introductory Python homework assignments. Our research reveals mixed results on both counts, with ChatGPT performing like a mid‐range student on assignments and seasoned instructors struggling to detect AI‐generated code. This indicates that although AI‐generated results may not always be identifiable, they do not currently yield results approaching those of diligent students. We describe our methodology for selecting and evaluating the code examples, the results of our comparison, and the implications for future classes. We conclude with recommendations for how instructors of programming courses can mitigate student use of LLM tools as well as articulate the inherent value of preserving students' individual creativity in producing programming languages.

Keywords: analytics; course design; experiential learning; pedagogy

INTRODUCTION

ChatGPT has recently been at the forefront of discussions for educators across all levels and around the globe. Although it is not the only Large Language Model (LLM) artificial intelligence (AI) tool available, it is the one that has received the most attention as of late. Much of the discussion has surrounded its use and misuse in writing and communication‐based assignments (King & ChatGPT, [14]). While generating prose is undoubtedly a strength of these new AI tools, they can do much more. One such capability is writing programming source code. In fact, the ability to analyze, interpret, and troubleshoot source code is the first example given in OpenAI's introduction of the ChatGPT tool (OpenAI, [18]). This code‐generating capability forms the motivation for this research project. This article presents the results of an investigation into how well ChatGPT performs on homework assignments for an introductory‐level computer programming course.

This investigation is not intended to determine how well ChatGPT can perform computer programming tasks; that is well established. Instead, given the potential for misuse of these tools in academic exercises, we are interested in how students could use and, more specifically, misuse ChatGPT. Additionally, we are interested in whether instructors can differentiate the output generated by ChatGPT from that of students. The general style of students' typical computer programming homework submissions is individualized (Barros et al., [3]) and often recognizable. We hypothesize that this holds true for AI‐generated results from tools such as ChatGPT. However, this hypothesis may be born out of a naïve sense of hope necessary to protect our course‐related learning objectives.

To test our hypothesis, we compared ChatGPT's work product to that of students who completed an introductory‐level business programming course in the semester immediately before the media frenzy surrounding ChatGPT. We provided ChatGPT with prompts to generate programming source code intended to satisfy the requirements of three homework assignments taken from the beginning, middle, and end of the course. The AI‐generated files were then randomly distributed among student‐generated code files submitted for grading during the reference semester. All the anonymized code files were then graded against the assignment requirements and for the likelihood of being AI‐generated. The results are simultaneously heartening and chilling as they relate to recognizing this new form of potential cheating and ultimately preserving the integrity of our courses.

The rest of this article proceeds as follows. First, we provide background on ChatGPT and some of its uses, particularly in academics. Next, we detail the methodology employed for this study. This includes a description of the assignments used, how the student reference code was selected and prepared, and how the prompts provided to ChatGPT were created. We then provide the results obtained and discuss their implications. We conclude the article with the limitations of our study and suggestions for instructors of similar courses.

BACKGROUND

ChatGPT is a powerful Natural Language Processing tool capable of generating human‐like conversation, written communication, and, potentially, academic work. This revolutionary tool offers many opportunities to educators, students, and graduates, such as more efficient analysis of text data, advanced conversational agents, and the analysis of large datasets (Kalla & Smith, [13]). However, due to its relative newness, limited research has been conducted on the use of ChatGPT in academic environments.

A landmark study involved researchers testing ChatGPT's ability to complete law school exams and found that it performed at the level of a C+ student (Bommarito II & Katz, [4]). Researchers testing ChatGPT's mathematical prowess found that while it could correctly answer lower‐level math problems, its performance on upper‐level undergraduate and graduate material was "inconsistently bad" (Frieder et al., [7], p. 9, emphasis in original). Another study using questions from technical courses at MIT found the basic GPT‐3.5 LLM accurate in only a third of the instances tested. However, using GPT‐4 and prompt engineering techniques resulted in correct answers to all math, science, and computer science questions (Zhang et al., [26]). The mixed results seen in these early (i.e., not yet peer‐reviewed) studies show the range of capabilities exhibited by AI engines in educational settings.

A key concern of educators is whether students will use ChatGPT to violate academic integrity policies established at their institutions. Students have long engaged in academic cheating behaviors in higher education, including plagiarism, unauthorized collaboration, and buying solutions to assignments (Barbaranelli et al., [2]). In addition, a large, multicampus study found that academic dishonesty is influenced by a variety of contextual and individual factors, making it difficult to predict based on the individual, course, or assignment type (McCabe & Trevino, [17]). Other research shows that students exhibiting a proclivity to cheat in other situations would likely leverage ChatGPT to generate assignment submissions (Greitemeyer & Kastenmüller, [11]).

The preponderance of research thus far has focused on ChatGPT's performance with information recall and academic writing. However, ChatGPT is becoming increasingly popular for generating and debugging code in multiple programming languages like Java, C++, and Python. Recently, researchers evaluated ChatGPT's effectiveness in troubleshooting software bugs and found that it performed competitively with other debugging methods (Surameery & Shakor, [25]). ChatGPT's ability to accept additional information, such as expected outputs or error messages, contributed to its success. However, if the user interacting with ChatGPT does not have the requisite programming skills, ChatGPT cannot be expected to produce valid code. For example, a study focused on ChatGPT's potential role in engineering education hypothesized that the user must be able to ask the right questions because AI lacks the critical thinking and problem‐solving skills that the user must supply (Qadir, [22]).

With the previously mentioned research in mind, this study seeks to assess the ability of a tool like ChatGPT to complete basic programming tasks required of students enrolled in an introductory business programming course. Additionally, the study analyzes to what extent the user needs an understanding of programming techniques to generate a viable solution to the programming assignment or recognize if the generated output is a viable solution. Finally, the study asks how well equipped instructors are to distinguish student assignment submissions from those generated by ChatGPT.

METHODOLOGY

We used homework assignments from a recent offering of an Introduction to Python programming class to compare code produced by ChatGPT with code submitted by students from that class. The reference semester for the student submissions—fall of 2022—predates the flood of publicity surrounding ChatGPT, making it unlikely that students would have used it to prepare their code submissions.

The authors of this article have each taught this class more than once. One author created the class and original assignments, another taught the class during the reference semester, and the third was teaching it when this study began. The author who taught in the reference semester prepared the student submissions, created the ChatGPT prompts, and used ChatGPT to generate its output, while the other two authors served as graders of the submissions and rated them for the likelihood that they were produced by ChatGPT.

The assignments

In addition to evaluations like chapter quizzes and exams, the class had seven graded homework assignments during this semester. We began the semester with the presumption that students had no programming experience, so the material covered ranged from basic introductory concepts to solving more complex problems. We selected the second, fifth, and seventh homework assignments to use in this study to sample a range of the programming skills covered during the semester.

The textbook used in this class (Gaddis, [8]) introduces the "turtle graphics" (Papert et al., [20]) library in an early chapter as a way to give students tasks that they will (hopefully) find more engaging. Students can draw objects by moving the reference cursor (the "turtle") around the screen using built‐in commands. The second assignment in the course (HW2) had the student use the turtle to draw a standard traffic stop sign on a colored background. In addition to working out the logical structure to solve the problem, this task required students to set the turtle's properties to change the appearance of the drawing. The results of HW2 should look like Figure 1.

FIGURE 1 Expected result of the HW2 stop sign assignment.
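
To give a sense of the turtle commands involved, the following minimal sketch draws an octagonal sign in the spirit of HW2. It is illustrative only: the window size and border width come from the assignment description, while the background color, text placement, and the omission of the sign post are our own simplifications, not the grading key.

```python
import turtle

# Illustrative sketch of the HW2 task: an octagonal stop sign on a colored background.
screen = turtle.Screen()
screen.setup(500, 600)          # approximate window size from the assignment
screen.bgcolor("skyblue")       # any background color not used by the sign itself
screen.title("Stop Sign")

t = turtle.Turtle()
t.pensize(5)                    # roughly 5-pixel white border
t.color("white", "red")         # white outline, red fill

t.begin_fill()
for _ in range(8):              # eight equal sides make the octagon
    t.forward(80)
    t.left(45)
t.end_fill()

t.penup()
t.goto(40, 75)                  # rough center of the octagon, chosen by eye
t.color("white")
t.write("STOP", align="center", font=("Arial", 36, "bold"))
t.hideturtle()

turtle.done()
```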

The second assignment used in this study (HW5) extended the task from the HW2 assignment. HW5 required the student to create a modularized version of a drawing program to draw a shape selected by the user. It required the student to create two files to complete the assignment. A student‐created utility file contained four custom functions to handle the details of drawing the requested figure. The main program file handled user input, calling the drawing functions, and displaying the result. Where HW2 had students draw one figure in a step‐by‐step fashion, HW5 added decision‐making, looping, input validation, and function writing. Test values for input and the resulting output are shown below in Figure 2.

FIGURE 2 Test input values and the resulting output for HW5.
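
The two‐file structure HW5 asks for can be sketched as follows. The four function names are taken from the assignment prompt reproduced in the Appendix; the utility file name, the validation messages, and the drawing details are illustrative assumptions rather than the instructor's key code. Both parts are shown in a single listing for brevity; in the actual assignment the main program imports the functions from the separate utility file.

```python
# shapes_util.py -- illustrative utility module (the file name here is hypothetical)
import turtle

def setup_window():
    screen = turtle.Screen()
    screen.bgcolor("lightyellow")            # a color not used by the shapes
    screen.title("HW5 Shape Drawer")
    return screen

def setup_turtle():
    t = turtle.Turtle()
    t.pensize(3)
    return t

def draw_circle(t, x, y, radius):
    t.penup(); t.goto(x, y); t.pendown()
    t.circle(radius)

def draw_square(t, x, y, length):
    t.penup(); t.goto(x, y); t.pendown()
    for _ in range(4):
        t.forward(length)
        t.left(90)

# Main program file -- asks for the shape, validates the choice, and loops until done.
def main():
    setup_window()
    t = setup_turtle()
    again = "y"
    while again.lower() == "y":
        choice = int(input("Draw a (1) square or (2) circle? "))
        while choice not in (1, 2):          # validate the first entry, as the assignment requires
            choice = int(input("Please enter 1 or 2: "))
        x = int(input("X coordinate: "))
        y = int(input("Y coordinate: "))
        size = int(input("Size (side length or radius): "))
        if choice == 1:
            draw_square(t, x, y, size)
        else:
            draw_circle(t, x, y, size)
        again = input("Draw another shape? (y/n): ")

if __name__ == "__main__":
    main()
```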

The third homework selected was the final assignment of the semester, number seven (HW7). This assignment required students to use a Python list data structure to store and process values entered by the user. The program displays a prompt on the screen that shows the name of each day of the week in turn and allows the user to input a sales amount for that day. The program stores the daily values in a list as they are entered. The list must be passed to a function, where the weekly total and daily average are calculated using a loop. This assignment also included input validation and the formatting of output values. The daily prompts with accompanying test values and results are shown in Figure 3 below.

FIGURE 3 Daily input prompts with test values and results for HW7.
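
A compact sketch of the HW7 program appears below. The avg_sales() function name, the list‐based storage, the loop‐based totals, and the non‐negative validation rule all come from the assignment prompt reproduced in the Appendix; the exact prompt wording and output formatting are illustrative.

```python
# Illustrative sketch of HW7: weekly sales stored in a list, totals computed in a function.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def avg_sales(sales):
    """Loop through the list and return the weekly total and the daily average."""
    total = 0.0
    for amount in sales:
        total += amount
    return total, total / len(sales)

def main():
    sales = []
    for day in DAYS:
        amount = float(input(f"Enter sales for {day}: "))
        while amount < 0:                      # valid sales amounts are not negative
            amount = float(input(f"Sales cannot be negative. Re-enter {day}: "))
        sales.append(amount)
    total, average = avg_sales(sales)
    print(f"Total weekly sales:  ${total:,.2f}")
    print(f"Average daily sales: ${average:,.2f}")

if __name__ == "__main__":
    main()
```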

Code submission preparation

The code submissions were intended to represent three coding skill levels. The first skill level we wanted to represent was that of a student who was struggling with the programming concepts covered in class. This student may have understood some of the concepts but had a problem integrating them. They may also have had difficulty translating an assignment description into a completed program. This student usually realizes they need help but often does not know how to ask for it. The second level was intended to capture code produced at a mid‐range skill level. This student was more comfortable with the coding concepts presented in the class. They were able to produce solid code if given enough time to work on an assignment. They might also have had trouble with one or two coding details of an assignment while demonstrating competence on the bulk of the assignment. The final skill level represents a student who thoroughly understands the concepts. This student produced well‐structured, well‐commented code that was consistent with the assignment description.

Three students at each proficiency level were selected, for a total of nine student reference submissions for each assignment. Adding the three ChatGPT‐generated submissions per assignment brought the collection to twelve submissions per assignment, resulting in a total of 36 Python code submissions to be evaluated.

Student submissions

The primary task in selecting student submissions was to identify student work from the previous semester that represents the three described skill levels. The approach was to select students who consistently exhibited low, medium, and high skill levels on these assignments. It is important to note that because of the other assignments that contribute to a student's grade, this does not necessarily mean they did poorly in the class. It is simply an indication of their relative coding skills at the time of the assignment.

Unfortunately, only seven students (in a class of 17) consistently performed at the same level on all three programming assignments. To complete the pool of student work product, we selected code from four other students whose submissions were originally graded at the appropriate performance level. Thus, we used code from 11 students to give us the 27 student submissions needed for the analysis.

The specific programming environment used in this course varies with the instructor. Both the Spyder (Raybaut & Cordoba, [23]) integrated development environment (IDE), installed as part of the Anaconda distribution of Python (Anaconda, [1]), and Eclipse, a full‐featured IDE used in production‐level programming with many languages (Geer, [9]), have been used previously. The reference semester's offering used Jupyter Notebook as the programming environment. Jupyter Notebook is a browser‐based application that allows users to combine code, graphics, and documentation (Driscoll, [5]). The Notebook also has many characteristics that make it a good choice for beginning programmers in a business degree program (Ellis et al., [6]). However, in some cases, it requires or generates idiosyncratic code that would mark it as coming from the Notebook, readily identifying it to an experienced grader. Therefore, while converting the student code to generic Python (.py) files, any code specific to the Notebook IDE was removed or converted to a generic form across all submissions. Other properties of the submission files were cleaned as needed, including removing programmer headings and anonymizing the file names.
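
The article does not describe the exact cleaning procedure, but the kind of notebook‐specific syntax referred to here can be stripped with a short script like the hypothetical one below, which removes IPython cell magics, shell escapes, and the cell markers that a notebook‐to‐script export typically inserts.

```python
# Hypothetical cleanup sketch: strip notebook-only lines (cell magics such as
# "%matplotlib inline", shell escapes such as "!pip install ...", and "# In[n]:"
# cell markers) from a script exported from a Jupyter Notebook.
import sys

def clean_exported_script(in_path, out_path):
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            stripped = line.lstrip()
            if stripped.startswith(("%", "!")) or stripped.startswith("# In["):
                continue                      # drop notebook-specific syntax
            dst.write(line)

if __name__ == "__main__":
    clean_exported_script(sys.argv[1], sys.argv[2])
```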

ChatGPT submissions

The prompts provided to ChatGPT aimed to represent the three previously described student skill levels as closely as possible. Because the AI does not allow for any adjustment of its skill level, the skill level must be manipulated indirectly through the complexity of the prompt it is given. To do this, we adapted the low, medium, and high skill levels exhibited by the students to corresponding prompt levels we refer to as the copy prompt, the naïve prompt, and the informed prompt. These prompts are shown in the Appendix.

The prompts represent our best estimate of the kind of prompts that would be written by a student at each level. The student who chooses to copy and paste the written assignment instructions into the chat (copy prompt) will likely perform at the low level identified earlier. The student using this prompt might be trying to grasp the concepts but is struggling with them. However, we feel this student is more likely to be the least engaged student. The student who copies and pastes the assignment description may be covering for lack of understanding or simply remembers at the last minute that they must submit an assignment. This prompt represents the least‐cost path to homework results that can be submitted for a grade.

The naïve prompt represents a student who does not fully understand the material and/or the assignment. They may assume that using the copy‐and‐paste strategy will likely give poor results, but they have little understanding of the programming concepts focal to the given assignment. In this prompt, the student would attempt to interpret the underlying meaning of the assignment and provide that to ChatGPT as their prompt.

The informed prompt is associated with a student using ChatGPT to assist their learning and understanding. This student has the underlying knowledge of the focal material to correctly interpret the assignments' intent. The student is also likely able to interpret and modify the results generated by ChatGPT to fulfill the assignment's expectations.

It should be noted that we purposefully did not use any iteration or prompt engineering when creating or using these prompts. Other studies using either of these techniques (e.g., Geng et al., [10]; Zhang et al., [26]) focused on students with some programming experience, while the students we work with typically see programming code for the first time in this course. These students are unlikely to be capable of engineering their ChatGPT prompts; therefore, we chose a single iteration while assuming no prior knowledge of programming for the creation of our prompts.

The created prompts were pasted into the chat window at the ChatGPT site. The code generated was saved in .py files with generic file names consistent with those specified for the student code files. We also added the same generic header to these code files as to the student files.

With three ChatGPT‐produced code submissions for each of the three assignments, we had nine submissions produced by the AI. When added to the student submissions, we had 36 total submissions for review by the graders.

Grading the submissions

All submissions were subsequently treated as if these files were received from students as homework submissions. The graders were provided the assignment descriptions and the key code originally used to grade the student submissions during the semester in question. Graders ran the code in an IDE using Python 3.9 to test its functionality. One grader used version 3.11.3 of the IDLE ([21]) environment; the other used Spyder version 5.4.2 (2022). The graders then used the descriptions, key code, and their experience teaching the course to assign the numeric score (0 to 100) they thought was appropriate. They were also asked to estimate the likelihood that ChatGPT created each submission. In total, the graders recorded these data points: the numeric score for the assignment, an explanation for any point deductions reflected in the score, the likelihood the code was AI‐generated, and the rationale they used in arriving at that likelihood.

We also processed every submission through an AI detector tool to provide another "opinion" on how likely it was that an AI generated the code. The AI detector used was the tool created by the company that created ChatGPT (OpenAI Text Classifier, [19]). While this tool (and others currently on the market) is targeted more toward evaluating text, it can also handle code submissions. The code from each .py file was copied and pasted to the website to run the evaluation. The two files produced for HW5 were entered together. The detector rates text on a scale of "very unlikely," "unlikely," "unclear," "possibly," and "likely" to be AI‐generated, and grader responses were converted to the same scale for consistency.

RESULTS

The grading and AI evaluation results are summarized in Table 1 for the HW2 stop sign assignment, Table 2 for the HW5 modularized drawing assignment, and Table 3 for the HW7 weekly sales assignment. The HW2 assignment was the shortest of the three assignments used for this study, and because it was taken from the beginning of the course, it was the simplest. The OpenAI text classifier required a minimum of 1000 characters of input text to assess the likelihood that the text was created by an AI tool. Most of the submissions for HW2, including those produced by ChatGPT, were shorter than the minimum length. Where necessary, those results are noted with an asterisk (*) in the results tables.

TABLE 1 Results for the stop sign assignment (HW2), with true positives shaded blue and false positives shaded orange.

                                 Grade assigned                 AI detection
Skill level    HW2 author        Grader 1  Grader 2  Average    Grader 1       Grader 2       OpenAI
High skill     Student 1         100       100       100        Very unlikely  Very unlikely  Unclear
               Student 2         100       100       100        Very unlikely  Very unlikely  *
               Student 3         100       100       100        Unlikely       Very unlikely  *
Medium skill   Student 4         93        80        86.5       Unlikely       Very unlikely  Possibly
               Student 5         100       100       100        Very unlikely  Very unlikely  Unclear
               Student 6         98        95        96.5       Unlikely       Very unlikely  *
Low skill      Student 7         98        95        96.5       Possibly       Very unlikely  *
               Student 8         85        70        77.5       Possibly       Very unlikely  *
               Student 10        90        85        87.5       Very unlikely  Very unlikely  *
ChatGPT        Copy prompt       95        80        87.5       Likely         Likely         *
               Naïve prompt      85        70        77.5       Likely         Likely         *
               Informed prompt   95        70        82.5       Likely         Likely         *

TABLE 2 Results for the modularized drawing assignment (HW5), with true positives shaded blue and false positives shaded orange.

                                 Grade assigned                 AI detection
Skill level    HW5 author        Grader 1  Grader 2  Average    Grader 1       Grader 2       OpenAI
High skill     Student 1         95        100       97.5       Very unlikely  Very unlikely  Unclear
               Student 2         90        90        90         Unlikely       Very unlikely  Possibly
               Student 3         90        90        90         Very unlikely  Very unlikely  Likely
Medium skill   Student 4         75        77        76         Very unlikely  Very unlikely  Unclear
               Student 5         65        50        57.5       Possibly       Possibly       Very unlikely
               Student 6         88        97        92.5       Very unlikely  Very unlikely  Unclear
Low skill      Student 7         65        70        67.5       Very unlikely  Unlikely       Unlikely
               Student 9         95        97        96         Very unlikely  Very unlikely  Unclear
               Student 11        95        92        93.5       Very unlikely  Very unlikely  Unlikely
ChatGPT        Copy prompt       100       94        97         Unclear        Likely         Possibly
               Naïve prompt      77        50        63.5       Possibly       Likely         Likely
               Informed prompt   75        80        77.5       Unlikely       Very unlikely  Possibly

TABLE 3 Results for the weekly sales assignment (HW7), with true positives shaded blue and false positives shaded orange.

                                 Grade assigned                 AI detection
Skill level    HW7 author        Grader 1  Grader 2  Average    Grader 1       Grader 2       OpenAI
High skill     Student 1         95        90        92.5       Very unlikely  Unlikely       Unclear
               Student 2         98        95        96.5       Very unlikely  Very unlikely  Unclear
               Student 3         83        85        84         Very unlikely  Very unlikely  *
Medium skill   Student 4         87        90        88.5       Very unlikely  Very unlikely  Unclear
               Student 5         90        80        85         Very unlikely  Very unlikely  Possibly
               Student 6         98        90        94         Very unlikely  Very unlikely  Unclear
Low skill      Student 7         100       90        95         Very unlikely  Very unlikely  Unclear
               Student 8         93        85        89         Very unlikely  Very unlikely  *
               Student 9         85        70        77.5       Possibly       Unclear        *
ChatGPT        Copy prompt       73        60        66.5       Unlikely       Unlikely       Possibly
               Naïve prompt      75        60        67.5       Likely         Possibly       Likely
               Informed prompt   85        80        82.5       Unclear        Possibly       Likely

Table 1 shows that the basic programming problem of HW2 was both the easiest for students to complete and the easiest on which graders could detect AI tool use. An AI detection opinion from either the graders or the detection tool reflecting a stronger than 50/50 chance (i.e., "possibly" or "likely") was treated as an indication of suspected AI authorship. Both graders identified all three ChatGPT submissions, while student submissions were incorrectly flagged as AI‐generated just twice (both by Grader 1).

The modularized assignment (HW5) led to more mixed results, shown in Table 2. Students scored lower on this assignment, but the graders flagged only one student's submission. Because this assignment was longer than HW2, the AI detection tool could also evaluate the submissions. The tool correctly flagged each ChatGPT submission and incorrectly flagged only two students' work. Interestingly, the students incorrectly flagged by the detector (Students 2 and 3) were different from the student flagged by the graders (Student 5).

Table 3 shows the results generated from the final assignment, HW7. As with the first assignment, some code files did not meet the 1000‐character minimum for the AI detector. However, the detector did flag all three ChatGPT submissions correctly while only incorrectly flagging one student's work. The graders had more mixed results on this assignment, with only one ChatGPT submission correctly flagged by both graders.

Tables 1–3 also show the submission scoring results. The graders scored the code files based on their interpretation of the assignment description, assigning the grade they felt each author's work would earn as a homework submission in their class. The authors take an individualized approach to teaching the course, with each of us emphasizing different aspects of the assignments. Differences like the choice of IDE or emphasis on some concepts over others result in a grading standard that varies between instructors. The score averages provide the simplest way to compare the AI‐generated results to the student‐produced results and to illustrate the trends in the grading results.

An overall comparison of the grades assigned to student submissions and to ChatGPT‐generated code is shown in Figure 4 below. The plot reflects the unremarkable nature of the scores ChatGPT obtained, as shown in the previous tables.

FIGURE 4 Comparison of grades assigned to student‐generated and ChatGPT‐generated submissions.

Given the small sample size, we must use nonparametric methods to investigate the presence of statistically significant differences between our code author categories. The Kruskal–Wallis (K–W; Kruskal & Wallis, [16]) test can be used to ascertain differences between three or more groups of independently sampled, nonnormal data with a continuous dependent variable. The K–W test carries with it a few basic assumptions, all of which our data satisfy. The dependent variable (grade average) is continuous, and the independent variable (skill level) is categorical. All the observations are independent in that none of the data points are present in multiple skill‐level groups. One thing of note, shown in Figure 5 below, is that the distributions of the grades across the groups are inconsistent. This means that the results of the K–W test can only be used to interpret the relative rankings between groups; the actual group median values cannot be compared.

FIGURE 5 Distributions of grade averages across the author skill‐level groups.

The results of the K–W test, performed using SPSS (IBM SPSS Statistics for Windows, Version 27, [12]) and shown in Table 4 below, indicate that at least one of the groups is significantly different from at least one of the other groups (H(3, N = 36) = 10.444, p = 0.015). Post hoc pairwise comparisons indicate that the difference between the ChatGPT and High groups is statistically significant (p = 0.001).
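
For readers who want to reproduce the test outside SPSS, the sketch below runs SciPy's Kruskal–Wallis implementation on the per‐submission grade averages from Tables 1–3, grouped by author category. The SPSS result reported above is H = 10.444, p = 0.015; because of rounding and tie handling, SciPy should give a closely matching, though not necessarily identical, value.

```python
# Kruskal-Wallis test on the grade averages from Tables 1-3, grouped by author category.
from scipy.stats import kruskal

high    = [100, 100, 100, 97.5, 90, 90, 92.5, 96.5, 84]         # high-skill students, HW2/HW5/HW7
medium  = [86.5, 100, 96.5, 76, 57.5, 92.5, 88.5, 85, 94]       # medium-skill students
low     = [96.5, 77.5, 87.5, 67.5, 96, 93.5, 95, 89, 77.5]      # low-skill students
chatgpt = [87.5, 77.5, 82.5, 97, 63.5, 77.5, 66.5, 67.5, 82.5]  # ChatGPT prompt submissions

h_stat, p_value = kruskal(high, medium, low, chatgpt)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")   # SPSS reports H = 10.444, p = 0.015
```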

TABLE 4 Kruskal–Wallis pairwise comparisons of skill level.

Sample 1 and Sample 2   Test statistic   Std. error   Std. test statistic   Sig.    Adj. sig. (a)
gpt-low                 -7.333           4.957        -1.479                0.139   0.834
gpt-med                 -7.556           4.957        -1.524                0.127   0.765
gpt-high                -16.000          4.957        -3.228                0.001   0.007
low-med                 -0.222           4.957        -0.045                0.964   1.000
low-high                8.667            4.957        1.748                 0.080   0.483
med-high                8.444            4.957        1.703                 0.088   0.531

Note: Each row tests the null hypothesis that the Sample 1 and Sample 2 distributions are the same. Asymptotic significances (two‐sided tests) are displayed. The significance level is 0.050.

(a) Significance values have been adjusted by the Bonferroni correction for multiple tests.

DISCUSSION

The results arising from the efforts of the graders to detect the AI‐generated code were mixed. Graders had no problem spotting the AI code on HW2. The reason is apparent in Figure 6 below, which shows the result of running the code from the ChatGPT submissions. A student can see whether their drawing matches Figure 1 and adjust their code to make corrections; the AI cannot see that its code does not draw the expected picture and is, therefore, unable to produce the desired results.

FIGURE 6 Output produced by running the ChatGPT‐generated code for HW2.

The mixed results for HW5 are more interesting. The detection tool tagged five submissions as possibly or likely being AI‐written, including the three submissions that were. Grader results were mixed as well. Grader 1 had suspicions about two prompts but completely missed the informed prompt. Grader 2 felt it so unlikely to be AI‐generated that he did not even bother to record a reason why he came to that conclusion. Grader 1 noted this reasoning behind his "unlikely" label: "Shows lack of understanding of intent and functionality of the assignment and displays lack of programming technique, yet syntactically correct, suspicious but easily within expectations of a lost student copy/pasting from classroom examples with no understanding." Once we know that was a ChatGPT‐generated submission, it is easy to focus on the "syntactically correct" code and "lack of programming technique" as an AI red flag. However, as Grader 1 pointed out, we see the same results from students who lack understanding.

The graders and the detector also disagreed about the authorship of two student‐written submissions. Graders thought the submission from Student 5 was possibly AI‐generated, but the detector determined it was very unlikely. Grader 1: "There is a great deal of technique demonstrated given the thorough misunderstanding of the fundamental operation and intention of the assignment." Grader 2: "Everything about it just screams ChatGPT to me."

The disagreement went the other way, too. The graders rated the submission from Student 3 as very unlikely to be AI‐generated, but the detector determined it was likely. The graders pointed out the use of certain variable names that seemed to them to be indicative of a student submission rather than ChatGPT. We can only speculate about the "reasoning" that led the detector to determine that an AI had produced that same code. However, it might be because Student 3 was a strong performer in the class who was rigorous about code structure.

The split on authorship results continued with HW7. Grader 1 was only sure of one code's authorship, rating only the naïve prompt submission as likely AI‐generated, but also considering Student 9's work to be possibly from an AI. Grader 2 fared better, flagging two of the ChatGPT submissions and rating only the copy prompt as unlikely to be AI‐generated. The detection tool tagged four submissions as possibly or likely being AI‐written, including the three submissions that were (two "likely" and one "possibly") and the submission from Student 5 ("possibly"). Again, we must speculate about the detector's conclusion, but it seems likely that the programming style of Student 5 that led to the incorrect conclusion from the graders on HW5 might be what the detector picked up on for this submission.

A summary of the numeric scores assigned by the graders is shown in Table 5. These scores show the grade on each assignment averaged across the two graders and the three submissions at each skill level. The results indicate that ChatGPT fared worse than all three categories of students, earning lower grades across all three assignments, with the single exception of the "medium" skill‐level group on HW5. This indicates that although AI‐generated results may not always be identifiable, they do not, as of yet, yield results approaching those of a diligent student.

TABLE 5 Summary of the average grade assigned and grade range by author skill level.

                     HW2                  HW5                  HW7
Author skill level   Average   Range      Average   Range      Average   Range
High                 100.00    0.00       92.50     7.50       91.00     12.50
Medium               94.33     13.50      75.33     35.00      89.17     9.00
Low                  87.17     19.00      85.67     28.50      87.17     17.50
ChatGPT              82.50     10.00      79.33     33.50      72.17     16.00

Interestingly, although ChatGPT suffered numerically on scores across the board, the generally lower scores differ significantly only from the high‐achieving group, as evidenced by the previously mentioned K–W test. This confirms the difficulty experienced graders have with differentiating AI‐generated code from code created by average and low‐achieving beginner programming students. Additionally, it confirms the above statement that AI is not yet able to create results consistent with high‐achieving students. The limitations of the small sample size associated with our natural experiment preclude further interpretation of the median values or the magnitude of the numeric differences. However, additional research using larger samples may well identify additional differences among these groups.

We summarize the AI detection results in Table 6 below. This table provides the number of positive predictions for each evaluator across the three submissions at each skill level. The true‐positive rate is the number of correct positive predictions divided by the number of actual ChatGPT submissions (i.e., three) for each assignment. The false‐positive rate is the number of incorrect positive predictions divided by the number of student submissions (i.e., nine) for each assignment.

TABLE 6 Summary of AI prediction accuracy by author skill level.

                      HW5 positive predictions          HW7 positive predictions
Author skill level    Grader 1   Grader 2   Detector    Grader 1   Grader 2   Detector
Student (high)        1          0          2           0          0          0*
Student (medium)      1          1          0           0          0          1
Student (low)         0          0          0           1          0          0*
ChatGPT               1          2          3           1          2          3
True-positive rate    0.33       0.67       1.00        0.33       0.67       1.00
False-positive rate   0.22       0.11       0.22        0.11       0.00       0.11

Note: An opinion of possibly or likely is considered a positive prediction.
* Three submissions did not contain enough characters to be processed by the detector.
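
To make the arithmetic behind Table 6 explicit, consider Grader 2 on HW5: two of the three ChatGPT submissions were flagged and one of the nine student submissions was flagged. The two‐line calculation below restates this.

```python
# True- and false-positive rates for Grader 2 on HW5 (counts taken from Table 6).
tp_rate = 2 / 3   # ChatGPT submissions correctly flagged, out of 3
fp_rate = 1 / 9   # student submissions incorrectly flagged, out of 9
print(f"TPR = {tp_rate:.2f}, FPR = {fp_rate:.2f}")   # prints TPR = 0.67, FPR = 0.11
```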

These results show the difficulties associated with identifying AI‐generated homework submissions. Although the AI detector was largely able to identify AI‐generated results accurately, it also produced as many false positives as the human graders, or more. Thus, simply abdicating the responsibility of identifying AI‐generated homework submissions to an AI detector is not a viable strategy. Instead, given its true‐positive rate, using it as a first step would be better, with the instructor assessing the code before forming any conclusions. This is similar to how many instructors use other tools to detect academic misconduct in writing assignments.

During an instructor's assessment of potentially AI‐generated submissions, some identifiable hallmarks may be helpful in determining authorship. Instructors might specify variable naming standards, filename expectations, comment requirements, and other conventions to which students must adhere. The AI tools will not know of these course‐specific requirements; thus, an instructor can use them to identify potentially AI‐generated submissions. However, a more advanced student will be able to adjust AI‐generated code to account for these expectations. Thus, like the other recommendations, these stylistic conventions would not be completely reliable.

We also submitted the instructor‐created key code files to the same AI detector that evaluated the other submissions. The authors have modified these code files over time as the class has been taught over the past few years, and we felt it would be interesting to see the detector's opinion of the code's authorship. Here, too, the results were mixed. The HW5 key code produced an "unclear" result, while the detector concluded that the key code from the other two assignments was "possibly" AI‐generated. This again highlights the problem of unquestioningly accepting the results of an AI detector and reinforces the requirement for rigorous instructor assessment of all suspicions of academic misconduct.

OpenAI is very forthcoming about the limitations of its detector of AI‐written text: "Our classifier is not fully reliable" (Kirchner et al., [15], emphasis in the original). In their own tests, the detector correctly identified AI‐written text only 26% of the time, and it incorrectly labeled human‐written text as AI‐written in 9% of cases. We found it to perform better than that in our testing. Hopefully, these detection tools will continue to improve.

SUGGESTIONS FOR INSTRUCTORS

Whether the use of AI tools is allowed in a class or not, instructors need to know how students use them. While we show that it is difficult to identify AI‐generated code with 100% accuracy, we identify strategies that we believe will assist instructors with that task.

The first suggestion is to modify the written assignment prompt to be less specific. This strategy will make it more difficult for students without an understanding of programming to receive a passing score by copying and pasting the assignment instructions into ChatGPT. ChatGPT's results on HW2 lend credence to this notion because the most important part of the instructions was based on the provided image of the stop sign, not the instruction text. Because students cannot copy and paste an image into the ChatGPT prompt, it produced code that (technically) did what it was asked but clearly drew the wrong picture. Including more images, such as examples of the desired output, is one strategy. Instructors might also introduce assignments with videos to avoid giving students text to copy and paste.

The next suggestion is to require students to add robust comments throughout the program. This strategy addresses issues on multiple fronts. First, if a student does not understand the techniques used in their program, they will not be able to comment accurately on code produced by an AI engine. Although ChatGPT will produce comments if asked, it does so in a standardized way. Requiring students to comment their code in a specific manner for each assignment will help instructors determine if students created that code. Second, if a student can accurately apply a specific method of commenting, dictated by the instructor, to code generated by an AI tool, then we believe the student has likely learned the techniques required by the assignment. Although we may not like the fact that they could be using ChatGPT in this way, we feel we can give them the benefit of the doubt if they have thoroughly and accurately commented throughout.

Another suggestion is to create prompts in the manner used in this study and to use ChatGPT to generate sample responses. The most reliable way for the graders in this study to predict the AI‐generated responses was to find techniques, mistakes, or code fragments that stood out from the other results. If there were two or three unusual submissions, they prompted the graders to take a closer look. For example, two of the submissions for HW7 included the following line of code: if __name__ == "__main__":. This technique is common in general Python programming but not used in our introductory programming course. After seeing that code as part of a response, it is much easier to spot moving forward. If instructors do the same for the assignments in their classes, those types of patterns will reveal themselves.

The final suggestion to manage the unauthorized use of AI tools in a programming class is to specify that students are only allowed to use techniques discussed and taught in the classroom or covered in the text to complete the assignments. Students seeking outside help with programming, or engaging in outright academic dishonesty, is nothing new; therefore, many instructors may already use this strategy. Due to the nature of programming with high‐level languages, every problem can be solved in a multitude of ways. If instructors limit students to the techniques covered in course materials, they can easily identify students who have received help to produce their submission, whether from AI tools or user communities like Stack Overflow.

LIMITATIONS

This study is not without limitations. One such limitation is that the prompts were crafted by course instructors, which may reflect a more thorough understanding of the assignments than that of the typical student. Another constraint is the limited candidate pool for reference submissions, as there were only seventeen students enrolled in the class. In addition, we use Python as the language of choice for our business programming courses because it is less syntactically demanding and easier for students to write in a narrative‐like style. We do not yet know if this study's results would hold for more syntactically rigid languages such as Java or C++.

Another limitation is that the graders knew there were three AI‐generated submissions for each assignment, which may have changed their approach to identifying the "fake" submissions. A regular instructor would not have that constraint to assist them. Additionally, the sanitizing process removed some information that would be helpful in determining the authorship of the code. For example, on HW5, both students and ChatGPT created custom names for their utility files, although they were explicitly told not to do that. While we treated this as a natural experiment, that assumption has limitations. Students could already be using internet search results, getting help from classmates, or buying solutions. Finally, there is a slight chance that an early adopter in this class might have already been utilizing ChatGPT or another AI tool.

CONCLUSION

Although the prospect of readily available LLMs is worrisome to instructors in all fields of study, things are likely not as concerning as many would believe. As AI tools and chatbots proliferate, educators must embrace the reality that these tools exist and that our students will use them. Some school systems and institutions have elected to forbid access to such tools—an approach we believe is not a viable long‐term strategy. Students, for example, can access AI tools via other networks not controlled by academic institutions. Additionally, these tools become more accessible every day, even for novice users, as internet search engines and many social media applications embed AI tools into their interfaces.

A more sustainable approach to controlling the impact of LLMs is to integrate them into courses as learning tools. Such an approach allows us to walk students through the benefits and shortcomings of AI‐based code generation in a structured manner. Writing usable programming code is a skill based on rules and standards. That is precisely the type of activity AI tools are good for. But the AI tools cannot replicate the individual creativity that is vital to success in most fields of study (Shidiq, [24]). That creativity in programming—problem‐solving with a computer—is the fundamental skill we want our students to learn. If properly applied, these ChatGPT‐type tools hold the potential to usher in a new and productive era of creative disruption in education and technology.

APPENDIX CHATGPT PROMPTS USED

HW2 (stop sign) prompts

Copy prompt:
Create a turtle graphics Python program that performs the following:
Draws a stop sign
Make sure your sign has the following characteristics:
8 equal sides (i.e., it is an octagon)
White border
The border should be approximately 5 pixels in width
Red fill color
The word STOP in bold upper‐case letters centered in the sign
The word should be appropriately sized for the sign
Black sign post
The post should be approximately 10 pixels in width
The drawing window should be approximately 500 × 600 pixels
Feel free to extend this size if you wish to add other drawing elements
The drawing window background should be a color not used in the other drawing elements. That means anything you like other than white, red, black, or green.
The drawing window should have an appropriate title.

Naïve prompt:
Draw a stop sign with a post using Python.

Informed prompt:
Create a Python program that draws a red stop sign with white border and a black sign post in a 500 by 600 pixel window. The word "stop" should be in bold upper‐case letters in the center.

HW5 (modularized drawing) prompts

Copy prompt:
Create a modularized turtle graphics program that draws one of two shapes at coordinates and of the size indicated by the end user.
Create a modularized turtle graphics Python program based on the following:
Create a utility module in a separate .PY file (see naming info below) that includes the following four functions: setup_window(), setup_turtle(), draw_circle(), draw_square()
Create a Python program based upon the following:
Import the drawing functions from the above utility module.
Ask the end user for the following four data points, validating the first entry (see Figure 1 below).
- What shape they want to draw: 1 for square or 2 for circle
- What is the X coordinate for the shape's location
- What is the Y coordinate for the shape's location
- What is the size of the shape: length for square, radius for circle
Draws the requested shape at the desired location (see Figure 2 below).
The drawing window should be a color not used in the other drawing elements.
The drawing window should have an appropriate title.
The program should continue to run until the user does not want to draw another shape (see example in Figure 1 below).
Upload your two .PY code files to the appropriate Blackboard assignment link.
Reminder: You will submit two files for this assignment! Turning in just one file will cost you 50% of the assignment points. You can attach them both to the same submission.
Remember: no global variables are allowed! You must pass arguments and return values when needed.

Naïve prompt:
Write a Python program to let a user choose to draw a triangle or a square at the location they enter as many times as they want. The program should use functions called setup_window(), setup_turtle(), draw_circle(), and draw_square() that are saved in a Python utility module.

Informed prompt:
Create a modularized Python program that draws either a circle or a square. The code to set up the window and the turtle, plus the code to draw a circle or square, will be in separate functions named setup_window(), setup_turtle(), draw_circle(), and draw_square() saved in a utility module. The main program file will allow the user to choose which shape to draw, the coordinates where it should be drawn, and it will call the functions to draw it. The program should run until the user stops it.

HW7 (weekly sales) prompts

Copy prompt:
Create a program that asks the user to enter a store's sales data for each day of one week using the names of the days in the prompt.
The days of the week will be stored in a list. The programmer will create this list and provide its contents.
Daily sales data will be stored in a list. This list will be created by the programmer, but elements will be added as the user enters data.
Use a loop to control the data entry process.
The sales data entered by the user must be validated. Valid sales numbers are not negative. There is no upper limit to them.
Pass the list as an argument to a function you will write called avg_sales(). The function can be written in the first cell in your notebook.
This function will loop through the list and calculate the total weekly sales and average daily sales.
The two values will be returned to the main program for output to the screen.
Example input and output is shown in Figure 1 below.
Helpful details:
You MUST use loops and functions. NO SHORTCUTS. Using any functionality that is beyond the scope of this course will result in a score of zero.
Because this is the final assignment of the class, extra attention will be paid to the details of your program. These details include how accurately you follow all the directions and specifications, how well you comment your code, and similar details we have covered throughout the semester.

Naïve prompt:
Use Python to write a program to allow a user to enter one week's sales data. Validate the input data. Display the total for the week. Create a function called avg_sales() to calculate the average sales for the week.

Informed prompt:
Write a Python program with comments to use a list to store validated daily sales data entered by the user and calculate the total sales for the week using loops. Also, create a function named avg_sales() to calculate the average weekly sales from the list of daily sales. The days of the week should be stored in a list and displayed in the sales input statements. Display the result on the screen.

REFERENCES

1 Anaconda. (2022) Anaconda Software Distribution (2022.05 64‐bit) [Python]. https://anaconda.com

2 Barbaranelli, C., Farnese, M.L., Tramontano, C., Fida, R., Ghezzi, V., Paciello, M. & Long, P. (2018) Machiavellian ways to academic cheating: A mediational and interactional model. Frontiers in Psychology, 9, 695. https://doi.org/10.3389/fpsyg.2018.00695

3 Barros, B., Conejo, R., Ruiz‐Sepulveda, A. & Triguero‐Ruiz, F. (2021) I explain, you collaborate, he cheats: An empirical study with social network analysis of study groups in a computer programming subject. Applied Sciences, 11 (19), 9328. https://doi.org/10.3390/app11199328

4 Bommarito II, M. & Katz, D.M. (2022) GPT takes the bar exam. ArXiv Preprint ArXiv:2212.14402. https://arxiv.org/abs/2212.14402

5 Driscoll, M. (2019, January 28) Jupyter Notebook: An introduction. Real Python. https://realpython.com/jupyter‐notebook‐introduction/

6 Ellis, M.E., Hill, G. & Barber, C.J. (2019) Using Python for introductory business programming classes. Quarterly Review of Business Disciplines, 6 (3), 237 – 254.

7 Frieder, S., Pinchetti, L., Griffiths, R.‐R., Salvatori, T., Lukasiewicz, T., Petersen, P.C., Chevalier, A. & Berner, J. (2023) Mathematical capabilities of ChatGPT. ArXiv Preprint ArXiv:2301.13867. https://arxiv.org/abs/2301.13867

8 Gaddis, T. (2017) Starting out with Python (4th ed.). Pearson.

9 Geer, D. (2005) Eclipse becomes the dominant Java IDE. Computer, 38 (7), 16 – 18. https://doi.org/10.1109/MC.2005.228

10 Geng, C., Yihan, Z., Pientka, B. & Si, X. (2023) Can ChatGPT pass an introductory level functional language programming course? ArXiv Preprint ArXiv:2305.02230.

11 Greitemeyer, T. & Kastenmüller, A. (2023) HEXACO, the dark triad, and Chat GPT: Who is willing to commit academic cheating? Heliyon, 9, 9.

12 IBM SPSS Statistics for Windows, Version 27. (2020) [Windows]. IBM Corp.

13 Kalla, D. & Smith, N. (2023) Study and analysis of Chat GPT and its impact on different fields of study. International Journal of Innovative Science and Research Technology, 8 (3), 827 – 833.

14 King, M.R. & ChatGPT. (2023) A conversation on artificial intelligence, chatbots, and plagiarism in higher education. Cellular and Molecular Bioengineering, 16 (1), 1 – 2.

15 Kirchner, J.H., Ahmad, L., Aaronson, S. & Leike, J. (2023, January 31) New AI classifier for indicating AI‐written text. OpenAI Blog. https://openai.com/blog/new‐ai‐classifier‐for‐indicating‐ai‐written‐text

16 Kruskal, W.H. & Wallis, W.A. (1952) Use of ranks in one‐criterion variance analysis. Journal of the American Statistical Association, 47 (260), 583 – 621.

17 McCabe, D.L. & Trevino, L.K. (1997) Individual and contextual influences on academic dishonesty: A multicampus investigation. Research in Higher Education, 38, 379 – 396.

18 OpenAI. (2022, November 30) Introducing ChatGPT. https://openai.com/blog/chatgpt

19 OpenAI Text Classifier. (2023) [Computer software]. OpenAI. https://platform.openai.com/ai‐text‐classifier

20 Papert, S. & Solomon, C. (1989) Twenty things to do with a computer. In: Soloway, E. & Spohrer, J.C. (Eds.) Studying the novice programmer. Psychology Press, pp. 3 – 28.

21 Python's Integrated Development and Learning Environment (IDLE) (3.11.3). (2023) [Computer software]. Python Software Foundation. https://docs.python.org/3/library/idle.html

22 Qadir, J. (2023) Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In: 2023 IEEE Global Engineering Education Conference (EDUCON). IEEE, pp. 1 – 9.

23 Raybaut, P. & Cordoba, C. (2022) Spyder: The Scientific Python Development Environment (5.4.2) [Computer software]. https://pypi.org/project/spyder/

24 Shidiq, M. (2023) The use of artificial intelligence‐based Chat‐GPT and its challenges for the world of education; from the viewpoint of the development of creative writing skills. Proceeding of International Conference on Education, Society and Humanity, 1 (1), 353 – 357.

25 Surameery, N.M.S. & Shakor, M.Y. (2023) Use Chat GPT to solve programming bugs. International Journal of Information Technology & Computer Engineering (IJITC), 3 (01), 17 – 22.

26 Zhang, S.J., Florin, S., Lee, A.N., Niknafs, E., Marginean, A., Wang, A., Tyser, K., Chin, Z., Hicke, Y., Singh, N., Udell, M., Kim, Y., Buonassisi, T., Solar‐Lezama, A. & Drori, I. (2023) Exploring the MIT mathematics and EECS curriculum using large language models. ArXiv Preprint ArXiv:2306.08997. https://doi.org/10.48550/arXiv.2306.08997

By Michael E. Ellis; K. Mike Casey and Geoffrey Hill

Michael E. Ellis is Associate Professor of Computer Information Systems & Analytics at the University of Central Arkansas. His work experience includes positions in sales, management, consulting, and as an entrepreneur before joining academia. His published research includes work in technology adoption, research methodology, and teaching pedagogy. Dr. Ellis' current teaching focuses on data analytics and machine learning, and he frequently conducts hands‐on workshops on these topics for academic and professional groups.

K. Mike Casey is Assistant Professor of Computer Information Systems & Analytics at the University of Central Arkansas. He completed his PhD in Computer and Information Science in 2019. He is an active researcher with publications focusing on business pedagogy, online learning, software and technology adoption, and environmental and social governance. Dr. Casey has taught online and traditional face‐to‐face courses in various areas, including business statistics, systems analysis, and programming.

Geoffrey Hill is Associate Professor of Computer Information Systems & Analytics at the University of Central Arkansas. Prior to joining academia, he had over 20 years of experience designing, implementing, supporting, and using very large‐scale information systems in capital‐intensive and life‐critical environments across the public and private sectors. His research interests include open‐source software, knowledge management, technology adoption, social media utilization, and text analytics. His research has been published in multiple journals, including The International Journal of Strategic Decision Sciences and The Journal of Organizational and End User Computing.