Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 3 of 3
  • Item
    Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand
    (Massey University, 2022) Ahmed, Nasim
    Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets on a Hadoop cluster. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance heavily depends on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark has default parameters and can deploy applications without much effort, a significant drawback of default parameter selection is that it is not always best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model could therefore be a better solution for performance prediction and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine learning. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and they outperform the accuracy of machine learning models when extrapolating predictions.
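The Amdahl's-law baseline that the abstract benchmarks against can be illustrated with a small, hypothetical sketch (this is not the thesis's 2D-Plate or Fully-Connected Node model): fit the serial fraction of a job from (executor count, runtime) observations. All function names and numbers below are made up for illustration.

```python
def amdahl_runtime(t1, s, n):
    """Predicted runtime on n executors under Amdahl's law: the serial
    fraction s runs sequentially, the remainder scales with n."""
    return t1 * (s + (1.0 - s) / n)

def fit_serial_fraction(observations):
    """Grid-search the serial fraction s that best explains a list of
    (executors, runtime) pairs; the n=1 observation gives t1."""
    t1 = dict(observations)[1]
    best_s, best_err = 0.0, float("inf")
    for i in range(1001):
        s = i / 1000.0
        err = sum((amdahl_runtime(t1, s, n) - t) ** 2 for n, t in observations)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# Synthetic runtimes generated with a 20% serial fraction
data = [(n, amdahl_runtime(100.0, 0.2, n)) for n in (1, 2, 4, 8, 16)]
print(fit_serial_fraction(data))  # recovers s = 0.2
```

A fitted model like this can then extrapolate runtime to executor counts that were never benchmarked, which is the kind of prediction the abstract evaluates.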
  • Item
    Building privacy-preservation models for distributed processing platforms : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy (Ph.D.) in Computer Science, Massey University, New Zealand
    (Massey University, 2020) Bazai, Sibghat Ullah
    The widespread proliferation of data collection has raised serious privacy concerns in recent years. Data anonymization approaches have been proposed as a technique to preserve the privacy of data. However, most existing data anonymization approaches were designed to work with a small number of datasets within a single-machine environment and are thus often not suitable for big data. To resolve these limitations, many scalable data anonymization solutions that work with distributed processing platforms (e.g., MapReduce and Spark) have emerged to take advantage of the scalability and other support required for big data. However, due to the lack of inherent support for the algorithms involved in data anonymization techniques, these existing proposals often encounter implementation and performance bottlenecks. In the studies presented in this thesis, we propose a set of novel data anonymization approaches that work well in the most popular distributed processing platforms for big data, such as MapReduce and Spark. Our first set of studies addresses the privacy concerns in the MapReduce platform, which processes sensitive data without appropriate privacy protection and may thus allow adversaries to break two very important security principles: data confidentiality and integrity. Firstly, we propose a privacy-preservation platform as an extra layer on MapReduce that provides a set of privacy services to produce different privacy-preserving anonymized datasets that can be safely processed by MapReduce. Secondly, we offer a privacy-preserving k-NN based classifier for MapReduce. Instead of working with plaintext, our k-NN classifier can work on anonymized datasets to protect the privacy of the input data while still providing accurate classification results. In our second set of studies, we address Apache Spark's lack of appropriate support for many popular data anonymization techniques.
We first investigate the types of support required by many data anonymization approaches, which often demand multiple read and write operations. We argue that existing approaches fail to support caching intermediate data in memory, which we found to contribute to performance degradation. To address this problem, we propose a Resilient Distributed Dataset (RDD) based data anonymization model that avoids expensive disk I/O. We also argue that many existing methods do not support the iteration-intensive operations that appear in many data anonymization techniques, such as subtree generalization. We propose a generic approach for implementing subtree-based data anonymization techniques on Spark that provides more effective support for iteration-intensive operations. Extending from this, we also provide a novel hybrid approach that can more effectively apply different data anonymization techniques to multi-dimensional data. We argue that our hybrid approach offers much better control over expensive RDD creation and the size of the partitions attached to each RDD, which reduces many overheads involved in re-computation, shuffle operations, message exchange, and cache management. The experimental studies confirm that our novel privacy-preserving models implemented on both MapReduce and Spark provide high performance and scalability while supporting high levels of data privacy and utility.
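For readers unfamiliar with the anonymization goal these models target, here is a minimal, hypothetical illustration of k-anonymity in plain Python (not the thesis's RDD-based implementation): a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records, and generalization (e.g., bucketing exact ages into ranges) is one common way to achieve that. The records, column names, and bucket width below are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values is shared
    by at least k records in the dataset."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(record, width=10):
    """One generalization step: replace an exact age with its bucket."""
    out = dict(record)
    lo = (record["age"] // width) * width
    out["age"] = f"{lo}-{lo + width - 1}"
    return out

rows = [
    {"age": 23, "zip": "1010", "disease": "flu"},
    {"age": 27, "zip": "1010", "disease": "cold"},
    {"age": 25, "zip": "1010", "disease": "flu"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # False: every exact age is unique
generalized = [generalize_age(r) for r in rows]
print(is_k_anonymous(generalized, ["age", "zip"], 2))  # True: all ages become 20-29
```

At cluster scale, the grouping step above becomes a distributed group-by over millions of records, which is why caching and partition control matter in the abstract's argument.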
  • Item
    An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand
    (Massey University, 2019) Liu, YuJia
    Nowadays, big data has become a hot topic around the world, and how to store, process, and analyse such large volumes of data has become a challenge for many companies. The advent of distributed computing frameworks provides an efficient solution to this problem. Among these frameworks, Hadoop and Spark are the two most widely used and accepted by the big data community. On this basis, we conducted research to compare the performance of Hadoop and Spark and to examine how parameter tuning can affect the results. The main objective of our research is to understand the differences between Spark and MapReduce and to find the parameters that improve efficiency. In this paper, we extend the HiBench suite, a package that provides multiple workloads for testing cluster performance from many aspects. We select three workloads from the package that represent the most common applications in daily life: Wordcount (an aggregation job), TeraSort (a shuffle/sort job), and K-means (an iterative job). Through a large number of experiments, we find that Spark is superior to Hadoop for aggregation and iterative jobs, while Hadoop shows its advantages when processing shuffle/sort jobs. We also provide suggestions for improving the efficiency of the three workloads through parameter tuning. In the future, we plan to further our research to find out whether other factors may affect job efficiency.
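The Wordcount workload compared in the abstract above follows the classic MapReduce pattern. As a minimal, framework-free sketch in plain Python (purely illustrative; the sample input is invented and this is not the HiBench implementation), the three phases look like this:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would
    before handing each key's values to a reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "spark and hadoop process big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In Hadoop, the shuffle step spills grouped pairs to disk between map and reduce, whereas Spark keeps intermediate data in memory, which is consistent with the abstract's finding that the frameworks excel on different job types.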