Treffer: Comparative Analysis of Natural Language Processing Techniques in the Classification of Press Articles.
Weitere Informationen
The study undertook a comprehensive review and comparative analysis of natural language processing techniques for news article classification, with a particular focus on Java language libraries. The dataset comprised an excess of 200,000 items of news metadata sourced from The Huffington Post. The traditional algorithms based on mathematical statistics and deep machine learning were evaluated. The libraries chosen for tests were Apache OpenNLP, Stanford CoreNLP, Waikato Weka, and the Huggingface ecosystem with the Pytorch backend. The efficacy of the trained models in forecasting specific topics was evaluated, and diverse methodologies for the feature extraction and analysis of word-vector representations were explored. The study considered aspects such as hardware resource management, implementation simplicity, learning time, and the quality of the resulting model in terms of detection, and it examined a range of techniques for attribute selection, feature filtering, vector representation, and the handling of imbalanced datasets. Advanced techniques for word selection and named entity recognition were employed. The study compared different models and configurations in terms of their performance and the resources they consumed. Furthermore, it addressed the difficulties encountered when processing lengthy texts with transformer neural networks, and it presented potential solutions such as sequence truncation and segment analysis. The elevated computational cost inherent to Java-based languages may present challenges in machine learning tasks. OpenNLP model achieved 84% accuracy, Weka and CoreNLP attained 86% and 88%, respectively, and DistilBERT emerged as the top performer, with an accuracy rate of 92%. Deep learning models demonstrated superior performance, training time, and ease of implementation compared to conventional statistical algorithms. [ABSTRACT FROM AUTHOR]
Copyright of Applied Sciences (2076-3417) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)