Browsing by Author "Ahmed N"
- A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench (BioMed Central Ltd, 14/12/2020) Ahmed N; Barczak ALC; Susnjak T; Rashid MA
  Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help system administrators deploy their applications without much effort and measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We used a trial-and-error approach for tuning these parameters based on a large number of experiments. For the comparative analysis of the frameworks, we selected two workloads: WordCount and TeraSort. Performance is evaluated on three criteria: execution time, throughput, and speedup. Our experimental results revealed that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark performs better than Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
- Effects of modulation techniques RZ, NRZ, and CSRZ on the operation of hybrid OCDMA/WDM system for gigabit passive optical networks (2017 IAMOT, 2017-07) Ahmed N; Rashid MA
  In this paper, the performance of a hybrid optical code division multiple access/wavelength division multiplexing (OCDMA/WDM) system is evaluated for a gigabit passive optical network (GPON). We have investigated, compared, and analyzed various modulation techniques over a 5 km distance with channel transmission rates of 2.5 Gbps and 5 Gbps for OCDMA and WDM, respectively. The Enhanced Double Weight (EDW) code is used as the signature address for this system in order to study its limitations, benefits, and capabilities for transmitting signals and handling high data traffic in future multi-gigabit optical networks. Simulation results revealed that the Non-Return-to-Zero (NRZ) modulation format provides better performance, considering a Bit-Error-Rate of 10E-13 and 11.608 dBm received optical power. The overall system performance using NRZ is increased by 17% and 33% against RZ and CSRZ, respectively.
- Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models (BioMed Central Ltd, 2022-12) Ahmed N; Barczak ALC; Rashid MA; Susnjak T
  Due to the rapid growth of available data, various platforms offer parallel infrastructure that efficiently processes big data. One of the critical issues is how to use these platforms to optimise resources, and for this reason, performance prediction has been an important topic in the last few years. There are two main approaches to the problem of predicting performance. One is to fit data to an equation based on an analytical model. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we have investigated the difference in accuracy between these two approaches. While our experiments used an open-source platform called Apache Spark, the results obtained by this research are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the range of the prediction follows that of the training. We have investigated analytical and ML models based on interpolation and extrapolation methods with k-fold cross-validation techniques. Using the interpolation method, two analytical models, namely the 2D-plate and fully-connected models, outperform older analytical models and the kernel ridge regression algorithm but not the gradient boost regression algorithm. We found that the average accuracies of the 2D-plate and fully-connected models using interpolation are 0.962 and 0.961. However, when using the extrapolation method, the analytical models are much more accurate than the ML regressors, particularly two of the most recently proposed models (2D-plate and fully-connected). Both models are based on the communication patterns between the nodes. We found that, using extrapolation, the average accuracies of kernel ridge, gradient boost, and the two proposed analytical models are 0.466, 0.677, 0.975, and 0.981, respectively. This study shows that practitioners can benefit from analytical models by being able to accurately predict the runtime outside of the range of the training data using only a few experimental operations.