An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand

Liu, YuJia

dc.contributor.author	Liu, YuJia
dc.date.accessioned	2020-03-26T00:47:38Z
dc.date.available	2020-03-26T00:47:38Z
dc.date.issued	2019
dc.description	Some possibly copyrighted figures have been retained for clarity of illustration.	en_US
dc.description.abstract	Big data has become a prominent topic worldwide, and storing, processing and analysing such large volumes of data is a challenge for many companies. Distributed computing frameworks provide one efficient solution to this problem. Among these frameworks, Hadoop and Spark are the two most widely used and accepted by the big data community. We therefore compare the performance of Hadoop and Spark and examine how parameter tuning affects the results. The main objective of our research is to understand the differences between Spark and MapReduce and to find parameter settings that improve efficiency. In this thesis, we extend the HiBench suite, a package that provides multiple workloads for testing cluster performance from many aspects. From this package we select three workloads that represent the most common kinds of application: Wordcount (aggregation job), TeraSort (shuffle/sort job) and K-means (iterative job). Through a large number of experiments, we find that Spark is superior to Hadoop for aggregation and iterative jobs, while Hadoop shows its advantages when processing shuffle/sort jobs. We also provide suggestions for improving the efficiency of the three workloads through parameter tuning. In future work, we will investigate whether other factors affect the efficiency of these jobs.	en_US
dc.identifier.uri	http://hdl.handle.net/10179/15304
dc.identifier.wikidata	Q112949299
dc.identifier.wikidata-uri	https://www.wikidata.org/wiki/Q112949299
dc.language.iso	en	en_US
dc.publisher	Massey University	en_US
dc.rights	The Author	en_US
dc.subject	MapReduce (Computer file)	en_US
dc.subject	Apache Hadoop	en_US
dc.subject	Spark (Electronic resource : Apache Software Foundation)	en_US
dc.subject	Big data	en_US
dc.subject	Computer programs	en_US
dc.subject	Electronic data processing	en_US
dc.subject	Distributed processing	en_US
dc.subject	big data	en_US
dc.subject	HiBench suite	en_US
dc.subject	parameters tuning	en_US
dc.title	An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand	en_US
dc.type	Thesis	en_US
massey.contributor.author	Liu, YuJia
thesis.degree.discipline	Information Sciences	en_US
thesis.degree.level	Masters	en_US
thesis.degree.name	Master of Science (MSc)	en_US

Files

Original bundle

Name: LiuMScThesis.pdf
Size: 3.1 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations