An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand

dc.contributor.author: Liu, YuJia
dc.date.accessioned: 2020-03-26T00:47:38Z
dc.date.available: 2020-03-26T00:47:38Z
dc.date.issued: 2019
dc.description: Some possibly copyrighted figures have been retained for clarity of illustration. [en_US]
dc.description.abstract: Nowadays, big data has become a hot topic around the world. How to store, process and analyze this large volume of data has therefore become a challenge for many companies. The advent of distributed computing frameworks provides an efficient solution to the problem. Among these frameworks, Hadoop and Spark are the two most widely used and accepted by the big data community. On that basis, we conduct research to compare the performance of Hadoop and Spark and to examine how parameter tuning affects the results. The main objective of our research is to understand the differences between Spark and MapReduce and to find parameter settings that improve efficiency. In this paper, we extend a novel package called the HiBench suite, which provides multiple workloads to test the performance of clusters from many aspects. We select three workloads from the package that represent the most common applications in daily use: Wordcount (aggregation job), TeraSort (shuffle/sort job) and K-means (iterative job). Through a large number of experiments, we find that Spark is superior to Hadoop for aggregation and iterative jobs, while Hadoop shows its advantages when processing shuffle/sort jobs. We also provide suggestions for improving the efficiency of the three workloads through parameter tuning. In future work, we will investigate whether other factors affect the efficiency of the jobs. [en_US]
dc.identifier.uri: http://hdl.handle.net/10179/15304
dc.language.iso: en [en_US]
dc.publisher: Massey University [en_US]
dc.rights: The Author [en_US]
dc.subject: MapReduce (Computer file) [en_US]
dc.subject: Apache Hadoop [en_US]
dc.subject: Spark (Electronic resource : Apache Software Foundation) [en_US]
dc.subject: Big data [en_US]
dc.subject: Computer programs [en_US]
dc.subject: Electronic data processing [en_US]
dc.subject: Distributed processing [en_US]
dc.subject: big data [en_US]
dc.subject: HiBench suite [en_US]
dc.subject: parameters tuning [en_US]
dc.title: An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand [en_US]
dc.type: Thesis [en_US]
massey.contributor.author: Liu, YuJia
thesis.degree.discipline: Information Sciences [en_US]
thesis.degree.level: Masters [en_US]
thesis.degree.name: Master of Science (MSc) [en_US]
Files
Original bundle
Name: LiuMScThesis.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format