Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
6 results
Item Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand
(Massey University, 2022) Ahmed, Nasim

Big Data frameworks have received tremendous attention from industry and from academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets on a Hadoop cluster. Spark has become one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance depends heavily on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark provides default parameters and can deploy applications without much effort, a significant drawback of the default selection is that it is not always the best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model can therefore be a better solution for predicting performance and for establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed from serial boundaries for a given arrangement of executors and size of data. To evaluate cluster performance, various HiBench workloads were used, and each workload's empirical data were fitted to the models for performance prediction analysis.
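Analytical models of this kind predict runtime from a serial fraction and the degree of parallelism. As a purely illustrative sketch, this fits a simple Amdahl's-law-style model to made-up measurements; it is not the thesis's 2D-Plate or Fully-Connected Node model, and the numbers are invented:

```python
# Amdahl's-law-style runtime prediction: T(n) = T1 * (s + (1 - s) / n),
# where s is the serial fraction and n the number of executors.

def amdahl_runtime(t1: float, s: float, n: int) -> float:
    """Predicted runtime on n executors, given single-executor time t1."""
    return t1 * (s + (1.0 - s) / n)

def fit_serial_fraction(measured: dict[int, float]) -> float:
    """Grid-search the serial fraction that best fits measured runtimes."""
    t1 = measured[1]
    best_s, best_err = 0.0, float("inf")
    for i in range(1001):
        s = i / 1000.0
        err = sum((amdahl_runtime(t1, s, n) - t) ** 2 for n, t in measured.items())
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# Hypothetical runtimes in seconds, keyed by executor count.
runs = {1: 100.0, 2: 60.0, 4: 40.0, 8: 30.0}
s_fit = fit_serial_fraction(runs)
print(s_fit)                               # fitted serial fraction
print(amdahl_runtime(runs[1], s_fit, 16))  # extrapolated runtime on 16 executors
```

The extrapolation step mirrors the use case in the abstract: fit on small runs, then predict runtime for a larger executor count.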
The developed models were benchmarked against existing models such as Amdahl's, Gustafson's, ERNEST, and machine learning models. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they can outperform the accuracy of machine learning models when extrapolating predictions.

Item An empirical comparison between MapReduce and Spark : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Information Sciences at Massey University, Auckland, New Zealand
(Massey University, 2019) Liu, YuJia

Big data has become a hot topic around the world, and how to store, process, and analyse such large volumes of data has become a challenge for many companies. The advent of distributed computing frameworks provides an efficient solution to this problem. Among these frameworks, Hadoop and Spark are the two most widely used and accepted by the big data community. On this basis, we conducted research comparing the performance of Hadoop and Spark and examining how parameter tuning affects the results. The main objective of our research is to understand the differences between Spark and MapReduce and to find the parameters that best improve efficiency. In this work, we use the HiBench suite, a package that provides multiple workloads for testing cluster performance from many aspects. We selected three workloads from the package that represent the most common applications in daily use: WordCount (an aggregation job), TeraSort (a shuffle/sort job), and K-means (an iterative job). Through a large number of experiments, we found that Spark is superior to Hadoop for aggregation and iterative jobs, while Hadoop shows its advantages when processing shuffle/sort jobs. We also provide suggestions for improving the efficiency of the three workloads through parameter tuning.
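The parameter-tuning side of such a study amounts to a sweep over candidate settings. The sketch below shows the shape of that sweep: the configuration keys are real Spark settings, but the candidate values and the simulated `run_workload` timing function are invented for illustration (a real sweep would submit HiBench jobs to the cluster and time them):

```python
# Exhaustive sweep over candidate Spark settings, keeping the fastest combination.
import itertools

CANDIDATES = {
    "spark.executor.memory": ["2g", "4g", "8g"],
    "spark.executor.cores": [2, 4],
    "spark.sql.shuffle.partitions": [100, 200],
}

def run_workload(config: dict) -> float:
    """Stand-in for submitting a workload; returns a simulated runtime in minutes."""
    mem = {"2g": 3.0, "4g": 2.0, "8g": 1.5}[config["spark.executor.memory"]]
    return (mem
            + 4.0 / config["spark.executor.cores"]
            + 100.0 / config["spark.sql.shuffle.partitions"])

def best_config() -> tuple[dict, float]:
    """Try every combination of candidate values and return the fastest."""
    keys = list(CANDIDATES)
    best = None
    for values in itertools.product(*CANDIDATES.values()):
        cfg = dict(zip(keys, values))
        runtime = run_workload(cfg)
        if best is None or runtime < best[1]:
            best = (cfg, runtime)
    return best

cfg, runtime = best_config()
print(cfg, runtime)
```

The exhaustive product is affordable only for small candidate sets, which is why such studies pick a handful of values per parameter rather than sweeping continuously.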
In future work, we plan to investigate whether other factors may affect the efficiency of these jobs.

Item Function block programming for distributed control : a thesis presented in complete fulfilment of the requirements for the Master of Engineering, 216.899 thesis at Massey University, Wellington, New Zealand
(Massey University, 2004) Meek, Andrew Robert

This report discusses research and development using the draft IEC 61499 function block standard for distributed control in embedded microprocessor applications. IEC 61499 is a function block programming language, currently under development, for programming distributed control systems. The report covers what is required to develop an IEC 61499-compliant product and the standard's suitability for distributed control systems. To utilise the IEC 61499 standard, research and development of an embedded Java platform was performed; this required porting a Java virtual machine to run on an embedded microprocessor. An existing industrial network protocol, DeviceNet, was chosen for distributing data between the networked control devices. To achieve this, an existing DeviceNet communications engine had to be upgraded to support distributed control. A third-party IEC 61499 software application engine was ported to run on an embedded microprocessor; this option was chosen over developing a software engine from scratch as a commercial decision by the developer company, and it also allowed support from other companies and researchers working with the standard. To test distributed control using this function block programming standard, a test application consisting of a conveyor and a three-axis robot was developed. The test application demonstrated the feasibility of distributed control using IEC 61499 function blocks and some of the advantages of distributed control.
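The IEC 61499 execution idea — a block runs an algorithm in response to an input event, updates its output data, and emits output events to connected blocks — can be illustrated with a toy sketch. This is not an IEC 61499-compliant implementation, all names are hypothetical, and (unlike the standard, which separates event and data connections) data here simply rides along with the event for brevity:

```python
# Toy event-driven function block network: firing an input event runs the block's
# algorithm, then propagates the resulting output events along wired connections.
from typing import Callable

class FunctionBlock:
    def __init__(self, name: str):
        self.name = name
        self.data_inputs: dict[str, float] = {}
        self.data_outputs: dict[str, float] = {}
        self._handlers: dict[str, Callable[["FunctionBlock"], list[str]]] = {}
        self._wires: dict[str, list[tuple["FunctionBlock", str]]] = {}

    def on_event(self, event: str, algorithm) -> None:
        """Register the algorithm to run when an input event arrives."""
        self._handlers[event] = algorithm

    def connect(self, out_event: str, target: "FunctionBlock", in_event: str) -> None:
        """Wire an output event of this block to another block's input event."""
        self._wires.setdefault(out_event, []).append((target, in_event))

    def fire(self, event: str) -> None:
        """Run the algorithm for an input event, then propagate output events."""
        for out_event in self._handlers[event](self):
            for target, in_event in self._wires.get(out_event, []):
                target.data_inputs.update(self.data_outputs)  # data follows the event
                target.fire(in_event)

# A scaler block feeding a limiter block, REQ/CNF style.
def scale(fb: FunctionBlock) -> list[str]:
    fb.data_outputs["value"] = fb.data_inputs["raw"] * 0.1
    return ["CNF"]  # emit the "confirm" output event

def limit(fb: FunctionBlock) -> list[str]:
    fb.data_outputs["value"] = min(fb.data_inputs["value"], 10.0)
    return ["CNF"]

scaler, limiter = FunctionBlock("scaler"), FunctionBlock("limiter")
scaler.on_event("REQ", scale)
limiter.on_event("REQ", limit)
scaler.connect("CNF", limiter, "REQ")

scaler.data_inputs["raw"] = 250.0
scaler.fire("REQ")  # runs scale, then limit via the CNF -> REQ wire
print(limiter.data_outputs["value"])
```

In a distributed deployment, the two blocks would sit on different devices and the `connect` wire would be a network link (DeviceNet, in the thesis's case).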
Further outcomes of this research have highlighted some problems that must be rectified before this function block programming standard is feasible for commercial products.

Item A Java implementation of a Linda-like Tuplespace system with nested transactions : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Albany, New Zealand
(Massey University, 2006) Yao, Yinan

The Tuplespace model is considered a powerful option for the design and implementation of loosely coupled distributed systems. This report examines the features of the Tuplespace model and the issues involved in implementing such a system in Java. The system presented supports transactions: collections of operations that either all succeed or all fail. The system also permits nested transactions, an extension of transactions. Nested transactions have a multi-level grouping structure: each nested transaction consists of zero or more operations and possibly some nested transactions. The key advantages of nested transactions are that they allow the failure of an operation to be isolated within a certain scope without necessarily aborting the entire transaction, and that they allow programmers to subdivide a complex operation into a number of smaller and simpler concurrent operations. The other features of nested transactions are also examined in this report.
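The core Linda-style operations such a system builds on can be sketched in a few lines. This toy version is illustrative only: it shows `out` (write), `rd` (blocking non-destructive read), and `in` (blocking destructive read) with `None` as a template wildcard, and omits the transactions and nesting that are the thesis's focus:

```python
# Minimal Linda-like tuplespace with blocking reads, guarded by a condition variable.
import threading

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    @staticmethod
    def _matches(template, tup):
        """A template matches a tuple of the same length; None matches anything."""
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def out(self, tup):
        """Write a tuple into the space and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def rd(self, template):
        """Block until a matching tuple exists; return it without removing it."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._matches(template, tup):
                        return tup
                self._cond.wait()

    def in_(self, template):
        """Block until a matching tuple exists; remove and return it."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._matches(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

ts = TupleSpace()
ts.out(("job", 1, "pending"))
print(ts.rd(("job", None, None)))   # non-destructive read
print(ts.in_(("job", 1, None)))     # destructive read; the tuple is now gone
```

A transactional version would buffer each transaction's `out`/`in_` effects and apply or discard them atomically on commit or abort, which is where the nesting structure described above comes in.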
Finally, the testing results indicate that it is possible to build an efficient, scalable, transaction-secured distributed application that relies on the Tuplespace model and the system developed for this research.

Item J2EE application for clustered servers : focus on balancing workloads among clustered servers : a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Science in Computer Science at Massey University, Albany, New Zealand
(Massey University, 2006) Chen, Xi

J2EE has become a de facto platform for developing enterprise applications, not only through its standards-based methodology but also by reducing the cost and complexity of developing multi-tier enterprise applications. J2EE-based application servers keep business logic separate from front-end applications (client side) and back-end database servers. The standardised components and containers simplify J2EE application design: the containers automatically manage the fundamental system-level services for their components, enabling component design to focus on business requirements and business logic. This study applies J2EE technologies to configure an online benchmark enterprise application, the MG Project. The application focuses on three types of component design: servlets, entity beans, and session beans. The servlets run on the Tomcat web server; the EJB components (session beans and entity beans) run on the JBoss application server; and the database runs on the PostgreSQL database server. This benchmark application is used to test the performance of clustered JBoss under various load-balancing policies applied at the EJB level. In addition to the four built-in load-balancing policies (First Available, First Available Identical All Proxies, Random Robin, and Round Robin), the study extends the JBoss load-balance policy interface to design two dynamic load-balancing policies: a dynamic and a dynamic weight-based load-balancing policy. The purpose of the dynamic policies is to minimise response time and obtain better performance by dispatching incoming requests to the most appropriate server. However, a more accurate policy usually means more communication and calculation, which places an extra burden on a heavily loaded application server and can lead to drops in performance.

Item Scalable motif search in graphs using distributed computing : a thesis presented in partial fulfilment of the requirements for the degree of a Masters in Computer Science, Massey University, Turitea, New Zealand
(Massey University, 2012) Esler, Andrew

Motif detection allows software engineers to detect antipatterns in software; by decreasing the number of antipattern instances in a piece of software, its overall quality is improved. Current methods to find these antipatterns are slow and return results only when all antipatterns have been found. The GUERY framework is able to perform motif detection using multiple cores and deliver results as they are generated. By scaling GUERY to run on multiple machines, it was hoped that research requiring many queries on a graph could be performed significantly faster than is currently possible. The objective of this thesis was to research and prototype mechanisms whereby GUERY could be run on a cluster of computers, with results delivered as a stream to interested systems. A system capable of running on a cluster of machines and delivering a stream of results as they are computed was developed.
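The result-streaming idea above — delivering each motif match as soon as it is found, rather than after the whole search completes — can be sketched with a generator. The graph, the two-edge-path "motif", and the API here are invented for illustration and are not GUERY's actual query language:

```python
# Stream motif matches as they are found instead of batching them at the end.
from typing import Iterator

def find_paths(graph: dict[str, list[str]]) -> Iterator[tuple[str, str, str]]:
    """Yield every directed two-edge path a -> b -> c the moment it is found."""
    for a, succs in graph.items():
        for b in succs:
            for c in graph.get(b, []):
                yield (a, b, c)  # delivered to the consumer immediately

g = {"a": ["b"], "b": ["c", "d"], "c": []}
for match in find_paths(g):
    print(match)  # the consumer starts work before the search has finished
```

On a cluster, each machine would enumerate matches over its partition of the graph and push them onto a shared stream, so consumers see results from all workers as they arrive.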
