Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand

dc.confidential: Embargo : No [en_US]
dc.contributor.advisor: Barczak, Andre
dc.contributor.author: Ahmed, Nasim
dc.date.accessioned: 2022-07-04T03:38:21Z
dc.date.accessioned: 2022-10-11T20:14:33Z
dc.date.available: 2022-07-04T03:38:21Z
dc.date.available: 2022-10-11T20:14:33Z
dc.date.issued: 2022
dc.description.abstract: Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets on a Hadoop cluster. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance depends heavily on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark ships with default parameters and can deploy applications without much effort, the default settings do not always give the best cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model could therefore be a better solution for performance prediction and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted with the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine learning models. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they can outperform machine learning models in accuracy when extrapolating predictions. [en_US]
dc.identifier.uri: http://hdl.handle.net/10179/17614
dc.publisher: Massey University [en_US]
dc.rights: The Author [en_US]
dc.subject: Spark (Electronic resource : Apache Software Foundation) [en]
dc.subject: Apache Hadoop [en]
dc.subject: Big data [en]
dc.subject: Cluster analysis [en]
dc.subject: Data processing [en]
dc.subject: Electronic data processing [en]
dc.subject: Distributed processing [en]
dc.subject: Parallel processing (Electronic computers) [en]
dc.subject: Mathematical models [en]
dc.subject.anzsrc: 460699 Distributed computing and systems software not elsewhere classified [en]
dc.title: Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand [en_US]
dc.type: Thesis [en_US]
massey.contributor.author: Ahmed, Nasim [en_US]
thesis.degree.discipline: Computer Science [en_US]
thesis.degree.grantor: Massey University [en_US]
thesis.degree.level: Doctoral [en_US]
thesis.degree.name: Doctor of Philosophy (PhD) [en_US]
Files
Original bundle
Name: AhmedPhDThesis.pdf
Size: 8.07 MB
Format: Adobe Portable Document Format