Title:
Detecting AI-generated texts using machine learning models.
Authors:
Kubrusly, J. (AUTHOR) jessicakubrusly@id.uff.br; dos Santos, Erica Dias (AUTHOR); Pelegrino, Livia Santiago (AUTHOR)
Source:
Communications in Statistics: Case Studies & Data Analysis. 2025, Vol. 11 Issue 4, p495-512. 18p.
Database:
Business Source Premier

Further Information

The remarkable success of ChatGPT and its ability to generate text in English and in many other languages have made it an essential tool that significantly aids various tasks. However, its capacity to produce opinion articles raises critical concerns, particularly regarding ethics and plagiarism. This project uses Text Mining and Machine Learning techniques to develop a program capable of distinguishing human-written texts from those generated by ChatGPT. To accomplish this, we used a public dataset provided by Guo et al., consisting of questions and answers. Both humans and ChatGPT responded to these questions, and the responses were labeled accordingly. Three document vectorization methods (BoW, TF-IDF, and doc2vec) were each combined with the XGBoost classification algorithm. The analysis was conducted in R. The dataset was split into 10 subsets for cross-validation. In each iteration, one subset served as the test set, while the remaining nine were used for training. The training set was used to fit the vectorization, train the model, and determine the ROC-based cutoff. Predictions were then made on the test set, and evaluation metrics were computed. The doc2vec + XGBoost combination achieved outstanding results, with AUC exceeding 0.99, accuracy above 0.95, and precision surpassing 0.97 in all 10 iterations. [ABSTRACT FROM AUTHOR]
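
The evaluation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it is written in Python rather than R, it uses a plain bag-of-words vectorizer, it replaces XGBoost with a hypothetical stand-in scorer (`fit_stand_in_model`), and it assumes that the "ROC-based cutoff" means the threshold maximizing Youden's J (TPR minus FPR), which the abstract does not specify. All function names are illustrative. The point it shows is that vocabulary, model, and cutoff are all fitted on the nine training folds, and the held-out fold is used only for prediction.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_vocabulary(texts):
    """Build a bag-of-words vocabulary from the given (training) texts."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: j for j, tok in enumerate(vocab)}

def bow_vector(text, vocab):
    """Count vector for one text; tokens outside the vocabulary are ignored."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    vec = [0] * len(vocab)
    for tok, c in counts.items():
        vec[vocab[tok]] = c
    return vec

def fit_stand_in_model(X, y):
    """Hypothetical stand-in for XGBoost (assumption, not the paper's model):
    weight of each feature = mean count in class 1 minus mean count in class 0.
    Returns a function mapping a count vector to a real-valued score."""
    n1 = sum(y)
    n0 = len(y) - n1
    w = [0.0] * len(X[0])
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            w[j] += v / n1 if label == 1 else -v / n0
    return lambda row: sum(wj * v for wj, v in zip(w, row))

def youden_cutoff(scores, labels):
    """Assumed cutoff rule: threshold t maximizing TPR(t) - FPR(t),
    where a score >= t is predicted as class 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t

def cross_validate(texts, labels, k=10, seed=0):
    """Return the test accuracy of each of the k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(texts), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(texts)) if i not in held_out]
        # Everything below is fitted on the training folds only.
        vocab = fit_vocabulary([texts[i] for i in train_idx])
        X_train = [bow_vector(texts[i], vocab) for i in train_idx]
        y_train = [labels[i] for i in train_idx]
        score = fit_stand_in_model(X_train, y_train)
        cutoff = youden_cutoff([score(x) for x in X_train], y_train)
        # The held-out fold is only vectorized and scored.
        correct = sum(
            int(score(bow_vector(texts[i], vocab)) >= cutoff) == labels[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return accuracies

# Made-up toy texts for illustration only (label 1 = machine, 0 = human).
toy_texts = [f"idk maybe try option {i} lol" for i in range(10)] + [
    f"certainly here are several steps for option {i}" for i in range(10)
]
toy_labels = [0] * 10 + [1] * 10
fold_accuracies = cross_validate(toy_texts, toy_labels, k=10)
```

Fitting the vocabulary and the cutoff inside the loop, rather than once on the full dataset, keeps information from the held-out fold out of training, which is what makes the per-fold metrics an honest estimate.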

Copyright of Communications in Statistics: Case Studies & Data Analysis is the property of Taylor & Francis Ltd and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)