Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
Search Results
7 results
Item: Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models (BioMed Central Ltd, 2022-12). Ahmed N; Barczak ALC; Rashid MA; Susnjak T.
Due to the rapid growth of available data, various platforms offer parallel infrastructure that efficiently processes big data. One of the critical issues is how to use these platforms to optimise resources, and for this reason, performance prediction has been an important topic in the last few years. There are two main approaches to the problem of predicting performance. One is to fit data to an equation based on an analytical model. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we have investigated the difference in accuracy between these two approaches. While our experiments used an open-source platform called Apache Spark, the results obtained by this research are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the prediction range falls within that of the training data. We have investigated analytical and ML models based on interpolation and extrapolation methods, using k-fold cross-validation techniques. Using the interpolation method, two analytical models, namely the 2D-plate and fully-connected models, outperform older analytical models and the kernel ridge regression algorithm, but not the gradient boost regression algorithm. Under interpolation, the average accuracies of the 2D-plate and fully-connected models are 0.962 and 0.961, respectively. However, when using the extrapolation method, the analytical models are much more accurate than the ML regressors, particularly the two most recently proposed models (2D-plate and fully-connected). Both models are based on the communication patterns between the nodes. Under extrapolation, the average accuracies of kernel ridge, gradient boost, and the two proposed analytical models are 0.466, 0.677, 0.975, and 0.981, respectively. This study shows that practitioners can benefit from analytical models by being able to accurately predict the runtime outside of the range of the training data using only a few experimental operations.
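To make the contrast between the two approaches concrete, here is a minimal sketch, not the paper's pipeline: the synthetic data and the Amdahl-style analytical model are illustrative assumptions. It fits a two-coefficient analytical model with curve fitting, trains a gradient boosting regressor on the same points, and extrapolates both beyond the training range.

```python
# Minimal sketch: analytical model fit vs. gradient boosting regressor for
# runtime prediction. Synthetic data and the Amdahl-style model are
# illustrative assumptions, not the paper's actual models or measurements.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.ensemble import GradientBoostingRegressor

def amdahl_runtime(n, t_serial, t_parallel):
    """Analytical model: runtime = serial part + parallel part / n executors."""
    return t_serial + t_parallel / n

# Synthetic runtimes (seconds) for 1..16 executors, with mild noise.
rng = np.random.default_rng(0)
n_exec = np.arange(1, 17)
runtime = 20 + 300 / n_exec + rng.normal(0, 1, n_exec.size)

# Train on 1..8 executors, extrapolate to 9..16.
train, test = n_exec <= 8, n_exec > 8

# Analytical approach: fit the model's two coefficients to the training runs.
(t_s, t_p), _ = curve_fit(amdahl_runtime, n_exec[train], runtime[train])
pred_analytical = amdahl_runtime(n_exec[test], t_s, t_p)

# ML approach: gradient boosting on the same training points.
gbr = GradientBoostingRegressor().fit(n_exec[train].reshape(-1, 1), runtime[train])
pred_ml = gbr.predict(n_exec[test].reshape(-1, 1))

# Tree-based regressors cannot predict outside the target range seen in
# training, so the analytical fit wins on extrapolation, as the paper reports.
for name, pred in [("analytical", pred_analytical), ("gradient boost", pred_ml)]:
    mape = np.mean(np.abs((runtime[test] - pred) / runtime[test]))
    print(f"{name}: mean abs. pct. error = {mape:.3f}")
```
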
Item: Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand (Massey University, 2022). Ahmed, Nasim.
Big Data frameworks have received tremendous attention from industry and from academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets running on a Hadoop cluster. Spark has been established as one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance heavily depends on the selection of suitable parameters, and the configuration of these parameters is a challenging task. Although Spark has default parameters and can deploy applications without much effort, a significant drawback of default parameter selection is that it is not always the best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. Therefore, an analytical model can be a better solution for performance prediction and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed based on serial boundaries for a certain arrangement of executors and size of the data. In order to evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing approaches such as Amdahl's law, Gustafson's law, ERNEST, and machine learning models. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they outperform machine learning models in accuracy when extrapolating predictions.
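For context on the baselines named above, the sketch below shows an ERNEST-style fit: runtime is regressed on a small set of scaling features via non-negative least squares. The feature map follows the published ERNEST model; the machine counts and runtime numbers are synthetic placeholders, not the thesis's measurements.

```python
# Sketch of an ERNEST-style runtime model of the kind the proposed 2D-Plate
# and Fully-Connected models are benchmarked against. Data are synthetic.
import numpy as np
from scipy.optimize import nnls

def ernest_features(machines, data_scale):
    """Feature map: fixed cost, parallel work, tree-reduce term, per-machine overhead."""
    m = np.asarray(machines, dtype=float)
    s = np.asarray(data_scale, dtype=float)
    return np.column_stack([np.ones_like(m), s / m, np.log(m), m])

machines = np.array([2, 4, 8, 16, 32])
scale = np.ones(machines.size)                        # fixed input size
runtime = np.array([210.0, 118.0, 74.0, 55.0, 49.0])  # synthetic seconds

# Non-negative least squares keeps every cost term physically meaningful.
theta, _ = nnls(ernest_features(machines, scale), runtime)
pred = ernest_features([64], [1.0]) @ theta
print(f"predicted runtime on 64 machines: {pred[0]:.1f}s")
```
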
Item: Transforming scientific research and development in precision agriculture : the case of hyperspectral sensing and imaging : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Agriculture at Massey University, Manawatū, New Zealand (Massey University, 2021). Cushnahan, Megan.
There has been increasing social and academic debate in recent times surrounding the arrival of agricultural big data. Capturing and responding to real-world variability is a defining objective of the rapidly evolving field of precision agriculture (PA). While data have been central to knowledge-making in the field since its inception in the 1980s, research has largely operated in a data-scarce environment, constrained by time-consuming and expensive data collection methods. While there is a rich tradition of studying scientific practice within laboratories in other fields, PA researchers have rarely been the explicit focal point of detailed empirical studies, especially in the laboratory setting. The purpose of this thesis is to contribute new knowledge of the influence of big data technologies through an ethnographic exploration of a working PA laboratory. The researcher spent over 30 months embedded as a participant observer of a small PA laboratory, where researchers work with nascent data-rich remote sensing technologies. To address the research question, "How do the characteristics of technological assemblages affect PA research and development?", the ethnographic case study systematically identifies and responds to the challenges and opportunities faced by the science team as they adapt their scientific processes and resources to refine value from a new data ecosystem. The study describes the ontological characteristics of airborne hyperspectral sensing and imaging data employed by PA researchers. Observations of the researchers at work reveal a previously undescribed shift in the science process, where effort moves from the planning and performance of the data collection stage to the data processing and analysis stage. The thesis develops an argument that changing data characteristics are central to this shift in the scientific method researchers are employing to refine knowledge and value from research projects. Importantly, the study reveals that while researchers are working in a rapidly changing environment, there is little reflection on the implications of these changes for the practice of science-making. The study also identifies a disjunction between how science is done in the field and what is reported. We discover that the practices that provide disciplinary ways of doing science are not established in this field, and that moments to learn are siloed because of the commercial structures imposed in this case study of contemporary PA research.
Item: Design of a novel X-section architecture for FX-correlator in large interferometers : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Auckland, New Zealand (Massey University, 2021). Balu, Vignesh Raja.
In large radio-interferometers it is considerably challenging to perform signal correlations at input data-rates of over 11 Tbps, which involve vast amounts of storage, memory bandwidth and computational hardware. The primary objective of this research work is to reduce the memory-access and design complexity in matrix-architecture Big Data processing of the complex X-section of an FX-correlator employed in large-array radio-telescopes. This thesis presents a dedicated correlator-system-multiplier-and-accumulator (CoSMAC) cell architecture, based on the real input samples from antenna arrays, which produces two 16-bit complex multiplications in the same clock cycle. The novel correlator cell optimisation is achieved by utilising the flipped-mirror relationship between Discrete Fourier Transform (DFT) samples, owing to the symmetry and periodicity of the DFT coefficient vectors. The proposed CoSMAC structure is extended to build a new processing element (PE) which calculates both cross-correlation visibilities and auto-correlation functions simultaneously. Further, a novel mathematical model and a hardware design are derived to calculate two visibilities per baseline for quadrature signals (IQ-sampled signals, where I is the in-phase signal and Q is the 90-degree phase-shifted signal), named the Processing Element for IQ-sampled signals (PE_IQ). These three proposed dedicated correlator cells minimise the number of visibility calculations per baseline. The design methodology also targets the optimisation of the multiplier size in order to further reduce power and area in the CoSMAC, PE and PE_IQ. Various fast and efficient multiplier algorithms are compared and combined to achieve a novel multiplier, named the Modified-Booth-Wallace multiplier, implemented in the CoSMAC and PE cells. The dedicated multiplier is designed primarily for area and power optimisation without degrading performance. The conventional complex-multiplier-and-accumulators (CMACs) employed to perform the complex multiplications are replaced with these dedicated ASIC correlator cells, along with the optimised multipliers, to reduce the overall power and area requirements of a matrix correlator architecture. The proposed architecture lowers the number of ASIC processor cells required to calculate the overall baselines in an interferometer by eliminating redundant cells. Hence the new matrix-architecture minimisation is very effective in reducing hardware complexity by nearly 50% without affecting the overall speed and performance of very large interferometers such as the Square Kilometre Array (SKA).
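The redundancy the proposed cells eliminate can be illustrated in software: because the visibility matrix is Hermitian, an X-stage only needs to compute each baseline once and can fill the rest by conjugation. The numpy sketch below is an illustration of that symmetry under assumed array shapes, not the thesis's ASIC design.

```python
# Illustrative X-stage of an FX-correlator: V[j, i] is the complex conjugate
# of V[i, j], so only the upper triangle of the visibility matrix needs
# multiplications. Halving the work this way mirrors the redundancy the
# proposed hardware architecture removes (not the thesis's actual design).
import numpy as np

def x_stage(spectra):
    """spectra: (n_antennas, n_channels) complex array from the F-stage FFTs.
    Returns the visibility matrix, computing each baseline only once."""
    n_ant = spectra.shape[0]
    vis = np.zeros((n_ant, n_ant), dtype=complex)
    for i in range(n_ant):
        for j in range(i, n_ant):  # i == j gives the auto-correlation
            # Sum over channels of x_i * conj(x_j).
            vis[i, j] = np.vdot(spectra[j], spectra[i])
            if i != j:
                vis[j, i] = np.conj(vis[i, j])  # redundant baseline, no new multiplier
    return vis

rng = np.random.default_rng(1)
spectra = rng.normal(size=(4, 128)) + 1j * rng.normal(size=(4, 128))
V = x_stage(spectra)
assert np.allclose(V, V.conj().T)  # Hermitian by construction
```
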
Item: Management decision making in the age of big data : an exploration of the roles of analytics and human judgment : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Management at Massey University, Auckland, New Zealand (Massey University, 2019). Gressel, Simone.
This thesis explores the effects of data analytics and human judgment on management decision making in an increasingly data-driven environment. In recent years, the topics of big data and advanced analytics have gained traction and widespread interest among practitioners and academics. Today, big data is considered a buzzword by some and an essential prerequisite for future business success by others. Recent research highlights the potential of big data analytics for decision making, but also points out critical challenges and risks. The aim of this research is to take an in-depth look at management decision making by using qualitative case studies and critical incidents to carefully examine managers' decision-making processes. This exploration revolves around two main research questions: i) How do managers perceive the role of advanced analytics and big data in the decision-making process? ii) How do managers perceive the alignment of advanced analytics and big data with more traditional decision-making approaches such as human judgment? The content and thematic analyses of data from 25 semi-structured interviews with managers, executives, and business analysts from nine organizations provided several key insights. Managers were found to rely on data and human judgment in their decision making to varying extents and in different roles. The processes followed by the decision makers depended on the decisions at hand, the managers' characteristics and preferences, as well as environmental factors. The findings empirically support the development of an ecological systems framework, which provides a holistic picture of managerial decision making in the age of big data. The study contributes by applying dual process theory to the context of data-driven decision making. Practical implications for organizations are derived from the findings, identifying organizational considerations and prerequisites. The influence of the managers' environments on decision making emphasizes the organizations' need for a holistic approach when adopting a data-driven decision-making culture.

Item: Alignment of big data perceptions in New Zealand healthcare : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Management at Massey University, Albany, New Zealand (Massey University, 2019). Wannitilake Mudiyanselage, Kasuni Gayara Weerasinghe.
The growing use of information systems (IS) in the healthcare sector, on top of increasing patient populations, diseases and complicated medication regimens, is generating enormous amounts of unstructured and complex data that have the characteristics of 'big data'. Until recently, data-driven approaches to making use of large volumes of complex healthcare data were considered difficult, if not impossible, because available technology was not mature enough to handle such data. However, recent technological developments around big data have opened promising avenues for healthcare to make use of its big healthcare data for more effective healthcare delivery, in areas such as measuring outcomes, population health analysis, precision medicine, clinical care, and research and development. Being a recent IT phenomenon, big data research has leaned towards technical dynamics such as analytics, data security and infrastructure. However, to date, the social dynamics of big data (such as people's understanding and their perceptions of its value, application, challenges and the like) have not been adequately researched. This thesis addresses this research gap by exploring the social dynamics around the concept of big data at the level of policy-makers (the macro level), funders and planners (the meso level), and clinicians (the micro level) in the New Zealand (NZ) healthcare sector. Investigating and comparing the social dynamics of big data across these levels is important, as big data research has highlighted the importance of business-IT alignment to the successful implementation of big data technologies. Business-IT alignment can be investigated through many different dimensions. This thesis adopts a social-dimension lens to alignment, which promotes investigating alignment through people's understanding of big data and its role in their work. Taking a social-dimension lens to alignment fits well with the aim of this thesis, which is to understand perceptions around the notion of big data technologies that could influence the alignment of big data in healthcare policy and practice. With this understanding, the research question addressed is: how do perceptions of big data influence alignment across macro, meso, and micro levels in the NZ healthcare sector? This thesis is by publication, with four research articles that answer this question as a body of knowledge. A qualitative exploratory approach was taken to conduct an empirical study. Thirty-two in-depth interviews with policy-makers, senior managers and physicians were conducted across the NZ healthcare sector, using purposive and snowball sampling techniques. The interviews were transcribed verbatim and analysed using general inductive thematic analysis. Data were first analysed within each group (macro, meso, and micro) to understand perceptions of big data, then across groups to understand alignment. In order to investigate perceptions, Social Representations Theory (SRT), a theory from social psychology, was used as the basis for data collection. However, data analysis led to the decision to integrate SRT with Sociotechnical Systems Theory (SST), a well-known IS theory. This integration of SRT with SST developed the Theory of Sociotechnical Representations (TSR), which is a key theoretical contribution of this research. The thesis presents the concept and application of TSR by using it to frame the study's findings around perceptions of big data across the macro, meso and micro levels of the NZ healthcare sector. The practical contribution of this thesis is the demonstration of areas of alignment and misalignment of big data perceptions across the healthcare sector. Across the three levels, alignment was found in the shared understanding of the importance of data quality, the increasing challenges of privacy and security, and the importance of new types of data in measuring health outcomes. Aspects of misalignment included differing definitions of big data, as well as perceptions around data ownership, data sharing, use of patient-generated data, and interoperability. While participants identified measuring outcomes, clinical decision making, population health, and precision medicine as potential areas of application for big data technologies, the three groups expressed varying levels of interest, which could cause misalignment issues with implications for policy and practice.
Item: An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand (Massey University, 2019). Liu, YuJia.
Big data has become a hot topic around the world, and how to store, process and analyse such large volumes of data has become a challenge for many companies. The advent of distributed computing frameworks provides an efficient solution to this problem. Among these frameworks, Hadoop and Spark are the two most widely used and accepted by the big data community. On this basis, we conducted research comparing the performance of Hadoop and Spark and examining how parameter tuning affects the results. The main objective of our research is to understand the differences between Spark and MapReduce, as well as to find parameters that improve efficiency. In this thesis, we extend the HiBench suite, a benchmark package that provides multiple workloads to test cluster performance from many aspects. We select three workloads from the package that represent the most common classes of application: Wordcount (an aggregation job), TeraSort (a shuffle/sort job) and K-means (an iterative job). Through a large number of experiments, we find that Spark is superior to Hadoop for aggregation and iterative jobs, while Hadoop shows its advantages when processing shuffle/sort jobs. We also provide suggestions for improving the efficiency of the three workloads through parameter tuning. Future work will investigate whether other factors may affect job efficiency.
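As an illustration of the kind of job benchmarked here, below is a minimal PySpark WordCount with two example tunable parameters of the sort the thesis varies. The input/output paths and configuration values are placeholders for illustration, not HiBench's own settings.

```python
# Minimal PySpark WordCount of the kind HiBench's aggregation workload runs.
# Paths and configuration values are placeholders, not HiBench's settings.
from operator import add
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount")
         .config("spark.executor.memory", "4g")  # example tunable parameter
         .config("spark.executor.cores", "2")    # example tunable parameter
         .getOrCreate())

lines = spark.sparkContext.textFile("hdfs:///input/wordcount")  # placeholder path
counts = (lines.flatMap(lambda line: line.split())  # split each line into words
               .map(lambda word: (word, 1))         # emit (word, 1) pairs
               .reduceByKey(add))                   # shuffle and sum the counts
counts.saveAsTextFile("hdfs:///output/wordcount")   # placeholder path
spark.stop()
```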
