Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

dc.citation.issue1
dc.citation.volume9
dc.contributor.authorAhmed N
dc.contributor.authorBarczak ALC
dc.contributor.authorRashid MA
dc.contributor.authorSusnjak T
dc.date.accessioned2023-11-22T22:13:54Z
dc.date.accessioned2024-07-25T06:51:10Z
dc.date.available2022-05-19
dc.date.available2023-11-22T22:13:54Z
dc.date.available2024-07-25T06:51:10Z
dc.date.issued2022-12
dc.description.abstractDue to the rapid growth of available data, various platforms offer parallel infrastructure that efficiently processes big data. One of the critical issues is how to use these platforms to optimise resources, and for this reason, performance prediction has been an important topic in the last few years. There are two main approaches to the problem of predicting performance. One is to fit data into an equation based on analytical models. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we have investigated the difference in accuracy between these two approaches. While our experiments used an open-source platform called Apache Spark, the results obtained by this research are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the range of the prediction falls within that of the training data. We have investigated analytical and ML models based on interpolation and extrapolation methods with k-fold cross-validation techniques. Using the interpolation method, two analytical models, namely the 2D-plate and fully-connected models, outperform the older analytical models and the kernel ridge regression algorithm, but not the gradient boost regression algorithm. The average accuracies of the 2D-plate and fully-connected models using interpolation are 0.962 and 0.961, respectively. However, when using the extrapolation method, the analytical models are much more accurate than the ML regressors, particularly the two most recently proposed models (2D-plate and fully-connected). Both models are based on the communication patterns between the nodes. Using extrapolation, the average accuracies of kernel ridge, gradient boost, and the two proposed analytical models are 0.466, 0.677, 0.975, and 0.981, respectively.
This study shows that practitioners can benefit from analytical models by being able to accurately predict the runtime outside the range of the training data using only a few experimental runs.
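The abstract's central finding — tree-based ML regressors such as gradient boost are accurate inside the training range but extrapolate poorly, while fitted analytical models extrapolate well — can be illustrated with a minimal sketch. The Amdahl-style runtime model t(n) = a + b/n (serial part plus a parallel part divided over n executors) is an illustrative assumption for this sketch, not the paper's 2D-plate or fully-connected model.

```python
# Sketch (not the paper's code): tree-based regressors predict a flat
# value outside the training range, because trees can only output
# values averaged from training leaves; a fitted analytical formula
# keeps its shape and extrapolates.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import GradientBoostingRegressor

def runtime(n, a=10.0, b=400.0):
    # Illustrative Amdahl-style runtime curve: serial + parallel/n
    return a + b / n

train_n = np.arange(2, 17)        # train on 2..16 executors
test_n = np.array([32, 64])       # predict well outside that range
y_train = runtime(train_n)

# ML regressor: fit on the training range, then extrapolate.
gb = GradientBoostingRegressor().fit(train_n.reshape(-1, 1), y_train)
ml_pred = gb.predict(test_n.reshape(-1, 1))

# Analytical model: fit the two parameters, then apply the formula.
(a_fit, b_fit), _ = curve_fit(lambda n, a, b: a + b / n, train_n, y_train)
an_pred = a_fit + b_fit / test_n

print(ml_pred)   # both test points get the same flat value
print(an_pred)   # tracks runtime(32) and runtime(64) closely
```

Both test executors fall past every split learned from the training data, so the trees route them to the same rightmost leaves and the ML prediction is constant, whereas the fitted formula follows the true curve — the mechanism behind the extrapolation accuracies (0.466/0.677 for ML vs 0.975/0.981 for analytical) reported above.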
dc.description.confidentialfalse
dc.edition.editionDecember 2022
dc.identifier.author-urlhttps://doi.org/10.1186/s40537-022-00623-1
dc.identifier.citationAhmed N, Barczak ALC, Rashid MA, Susnjak T. (2022). Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models. Journal of Big Data. 9. 1.
dc.identifier.doi10.1186/s40537-022-00623-1
dc.identifier.eissn2196-1115
dc.identifier.elements-typejournal-article
dc.identifier.number67
dc.identifier.piis40537-022-00623-1
dc.identifier.urihttps://mro.massey.ac.nz/handle/10179/71011
dc.languageEnglish
dc.publisherBioMed Central Ltd
dc.publisher.urihttps://journalofbigdata.springeropen.com/articles/10.1186/s40537-022-00623-1
dc.relation.isPartOfJournal of Big Data
dc.rights(c) 2022 The Author/s
dc.rightsCC BY 4.0
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectBig data
dc.subjectPerformance prediction
dc.subjectMachine learning
dc.subjectSystem configuration
dc.subjectHiBench
dc.subjectApache Spark
dc.subjectExtrapolation and interpolation
dc.titleRuntime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models
dc.typeJournal article
pubs.elements-id454384
pubs.organisational-groupOther
Files
Original bundle (1 of 1)
Name: Published version
Size: 2.63 MB
Format: Adobe Portable Document Format