Big data analytics approaches for treatment of imbalance and missing values problems on high dimensionality dataset.
Telecommunications datasets are difficult to model because they combine high dimensionality with imbalanced classes and missing values. When these issues were not handled properly, predictions were inaccurate and model performance declined. Because the churned customer class was far smaller than the active customer class, the accuracy paradox arose: although the model's accuracy reached 90%, that figure merely mirrored the underlying distribution of the classes. In addition, the large number of features, many of them redundant or unnecessary, added noise, hindered the learning process, and significantly prolonged the time required for learning and computation. The purpose of this study was therefore to determine the effect of feature selection, data imputation, and techniques for dealing with imbalanced data on model performance. The study proposed improving the development of voluntary churn models by combining techniques for dealing with class imbalance and missing data in high-dimensional datasets. Compared with the other combinations tested, the combination of Decision Trees + Mode Imputation + SMOTE with Random Undersampling, with Random Forest as the classifier, produced the highest classification accuracy, AUC, and F1-score. The study also suggested using Dask or PySpark to process the large telecommunications dataset, so that other machine learning algorithms in Python can run faster and more effectively through parallel computing. [ABSTRACT FROM AUTHOR]
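The accuracy paradox described in the abstract can be illustrated with a minimal sketch. The data here are hypothetical: a 900/100 split between active and churned customers is assumed purely so that a trivial majority-class predictor lands at 90% accuracy; the study's actual class counts are not given in the abstract. The helper names (`confusion_counts`, `accuracy`, `f1_score`) are illustrative and are not taken from the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; defined as 0.0 when there are no positives at all."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels (an assumption for illustration): 0 = active, 1 = churned.
y_true = [0] * 900 + [1] * 100

# A "model" that always predicts the majority class and never flags a churner.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = f1_score(y_true, y_pred, positive=1)

print(f"accuracy = {baseline_accuracy:.2f}")
print(f"F1 (churn class) = {baseline_f1:.2f}")
```

Under these assumed counts, the baseline's accuracy equals the share of the majority class while its F1-score for the churn class is zero, which is why the abstract reports AUC and F1-score alongside accuracy.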
Copyright of AIP Conference Proceedings is the property of American Institute of Physics and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)