A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Ahmed N; Barczak ALC; Susnjak T; Rashid MA

dc.citation.issue	1
dc.citation.volume	7
dc.contributor.author	Ahmed N
dc.contributor.author	Barczak ALC
dc.contributor.author	Susnjak T
dc.contributor.author	Rashid MA
dc.date.available	14/12/2020
dc.date.issued	14/12/2020
dc.description.abstract	Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters let system administrators deploy their applications without much effort and measure their cluster's performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? This study investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters using a trial-and-error approach based on a large number of experiments. For the comparative analysis of the frameworks, we selected two workloads: WordCount and TeraSort. Performance is measured on three criteria: execution time, throughput, and speedup. Our experimental results reveal that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis shows that Spark performs better than Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
dc.description.publication-status	Published
dc.identifier	http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000599799400001&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=c5bb3b2499afac691c2e3c1a83ef6fef
dc.identifier	ARTN 110
dc.identifier.citation	JOURNAL OF BIG DATA, 2020, 7 (1)
dc.identifier.doi	10.1186/s40537-020-00388-5
dc.identifier.eissn	2196-1115
dc.identifier.elements-id	436695
dc.identifier.harvested	Massey_Dark
dc.identifier.uri	https://hdl.handle.net/10179/16008
dc.publisher	BioMed Central Ltd
dc.relation.isPartOf	JOURNAL OF BIG DATA
dc.subject	HiBench
dc.subject	BigData
dc.subject	Hadoop
dc.subject	MapReduce
dc.subject	Benchmark
dc.subject	Spark
dc.subject.anzsrc	08 Information and Computing Sciences
dc.title	A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
dc.type	Journal article
pubs.notes	Not known
pubs.organisational-group	/Massey University
pubs.organisational-group	/Massey University/College of Sciences
pubs.organisational-group	/Massey University/College of Sciences/School of Food and Advanced Technology
pubs.organisational-group	/Massey University/College of Sciences/School of Mathematical and Computational Sciences

Files

Original bundle

Name: A comprehensive performance analysis of Apache Hadoop.pdf
Size: 1.91 MB
Format: Adobe Portable Document Format

Collections

Journal Articles