Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach.
Many studies emphasize accuracy in machine learning regression models, yet scalability and execution efficiency, which are critical for large datasets and extensive computations, are often overlooked. This paper introduces a scalable, distributed Spark MLlib regression model, built with the best subset selection approach, to predict Covid-19 statistics in India with high accuracy, scalability, and execution efficiency. Limited research has focused on tree-based regression, particularly gradient boosted regression, for Covid-19 datasets. The proposed work optimizes regression models for accuracy and execution time on Spark clusters of varying sizes using the best subset selection approach. Accuracy is evaluated with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²), alongside an analysis of execution time. Results indicate that tree-based regression gives the best prediction accuracy: Gradient Boosted Tree Regression (GBTR) leads, and Random Forest Regression (RFR) surpasses Decision Tree Regression (DTR). Accuracy remains consistent across the Python library, Spark MLlib on a single machine, and clusters of varying sizes, while Spark MLlib shows lower execution times than Python's machine learning library on a single machine. Execution times decrease substantially on Spark clusters, particularly for the iterative GBTR. This research examines scalability and execution efficiency, highlights the accuracy of tree-based regression, and advocates Spark MLlib for improving execution efficiency, especially across multi-node clusters. [ABSTRACT FROM AUTHOR]
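The three accuracy measures named in the abstract (RMSE, MAE, and R²) can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function name `regression_metrics` and the toy values are hypothetical, and the paper itself reports results from Spark MLlib and a Python machine learning library rather than from a hand-written routine like this one.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for two equal-length numeric sequences.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Residual sum of squares, shared by RMSE and R^2.
    ss_res = sum(e * e for e in errors)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Total sum of squares around the mean of the observed values.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    # R^2 is undefined when the observed values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"rmse": rmse, "mae": mae, "r2": r2}


# Made-up toy values, not data from the study.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower RMSE and MAE and a higher R² indicate a better fit, which is the sense in which the abstract ranks GBTR ahead of RFR and DTR.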
Copyright of Multimedia Tools & Applications is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)