Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
7 results
Search Results
Item
pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R.
(Oxford University Press, 2024-10-07) Kolisnik T; Keshavarz-Rahaghi F; Purcell RV; Smith ANH; Silander OK
Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R.
pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.

Item
The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Dogs (Canis familiaris): A Validation Study
(MDPI (Basel, Switzerland), 2024-09-13) Redmond C; Smit M; Draganova I; Corner-Thomas R; Thomas D; Andrews C; Fullwood DT; Bowden AE
Assessing the behaviour and physical attributes of domesticated dogs is critical for predicting the suitability of animals for companionship or specific roles such as hunting, military or service. Common methods of behavioural assessment can be time-consuming, labour-intensive, and subject to bias, making large-scale and rapid implementation challenging. Objective, practical and time-effective behaviour measures may be facilitated by remote and automated devices such as accelerometers. This study, therefore, aimed to validate the ActiGraph® accelerometer as a tool for behavioural classification. This study used a machine learning method that identified nine dog behaviours with an overall accuracy of 74% (range for each behaviour was 54 to 93%). In addition, overall body dynamic acceleration was found to be correlated with the amount of time spent exhibiting active behaviours (barking, locomotion, scratching, sniffing, and standing; R² = 0.91, p < 0.001).
Machine learning was an effective method to build a model to classify behaviours such as barking, defecating, drinking, eating, locomotion, resting-asleep, resting-alert, sniffing, and standing with high overall accuracy whilst maintaining a large behavioural repertoire.

Item
The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Cats (Felis catus): A Validation Study
(MDPI (Basel, Switzerland), 2023-08-14) Smit M; Ikurior SJ; Corner-Thomas RA; Andrews CJ; Draganova I; Thomas DG; Vanwanseele B
Animal behaviour can be an indicator of health and welfare. Monitoring behaviour through visual observation is labour-intensive and there is a risk of missing infrequent behaviours. Twelve healthy domestic shorthair cats were fitted with triaxial accelerometers mounted on a collar and harness. Over seven days, accelerometer and video footage were collected simultaneously. Identifier variables (n = 32) were calculated from the accelerometer data and summarized into 1 s epochs. Twenty-four behaviours were annotated from the video recordings and aligned with the summarised accelerometer data. Models were created using random forest (RF) and supervised self-organizing map (SOM) machine learning techniques for each mounting location. Multiple modelling rounds were run to select and merge behaviours based on performance values. All models were then tested on a validation accelerometer dataset from the same twelve cats to identify behaviours. The frequency of behaviours was calculated and compared using Dirichlet regression. Despite the SOM models having higher Kappa (>95%) and overall accuracy (>95%) compared with the RF models (64-76% and 70-86%, respectively), the RF models predicted behaviours more consistently between mounting locations.
These results indicate that triaxial accelerometers can identify cat-specific behaviours.

Item
Lost in the Forest: Encoding categorical variables and the absent levels problem
(Springer Nature, 2024-04-10) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC; Gama J
Levels of a predictor variable that are absent when a classification tree is grown cannot be subject to an explicit splitting rule. This is an issue if these absent levels are present in a new observation for prediction. To date, there remains no satisfactory solution for absent levels in random forest models. Unlike missing data, absent levels are fully observed and known. Ordinal encoding of predictors allows absent levels to be integrated and used for prediction. Using a case study on source attribution of Campylobacter species using whole genome sequencing (WGS) data as predictors, we examine how target-agnostic versus target-based encoding of predictor variables with absent levels affects the accuracy of random forest models. We show that a target-based encoding approach using class probabilities, with absent levels designated the highest rank, is systematically biased, and that this bias is resolved by encoding absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordinal encoding predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are encoded according to their similarity to each of the other levels in the training data. We show that the PCO-encoding method performs at least as well as the target-based approach and is not biased.

Item
Lost in the Forest
(Cold Spring Harbor Laboratory, 2022) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC
To date, there remains no satisfactory solution for absent levels in random forest models. Absent levels are levels of a predictor variable encountered during prediction for which no explicit rule exists.
Imposing an order on nominal predictors allows absent levels to be integrated and used for prediction. The ordering of predictors has traditionally been via class probabilities with absent levels designated the lowest order. Using a combination of simulated data and pathogen source-attribution models using whole-genome sequencing data, we examine how the method of ordering predictors with absent levels can (i) systematically bias a model, and (ii) affect the out-of-bag error rate. We show that the traditional approach is systematically biased and underestimates out-of-bag error rates, and that this bias is resolved by ordering absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordering predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are designated an order according to their similarity to each of the other levels in the training data. We show that the PCO method performs at least as well as the traditional approach of ordering and is not biased.

Item
Clustering algorithm for D2D communication in next generation cellular networks : thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering, Massey University, Auckland, New Zealand
(Massey University, 2021) Aslam, Saad
Next generation cellular networks will support many complex services for smartphones, vehicles, and other devices. To accommodate such services, cellular networks need to go beyond the capabilities of their previous generations. Device-to-Device communication (D2D) is a key technology that can help fulfil some of the requirements of future networks. The telecommunication industry expects a significant increase in the density of mobile devices, which puts more pressure on centralized schemes and poses risk in terms of outages, poor spectral efficiencies, and low data rates.
Recent studies have shown that a large part of the cellular traffic pertains to sharing popular contents. This highlights the need for decentralized and distributive approaches to managing multimedia traffic. Content-sharing via D2D clustered networks has emerged as a popular approach for alleviating the burden on the cellular network. Different studies have established that D2D communication in clusters can improve spectral and energy efficiency and achieve low latency while increasing the capacity of the network. To achieve effective content-sharing among users, appropriate clustering strategies are required. Therefore, the aim is to design and compare clustering approaches for D2D communication targeting content-sharing applications. Currently, most of the researched and implemented clustering schemes are centralized or predominantly dependent on Evolved Node B (eNB). This thesis proposes a distributed architecture that supports clustering approaches to incorporate multimedia traffic. A content-sharing network is presented where some D2D User Equipment (DUE) function as content distributors for nearby devices. Two promising techniques are utilized, namely, Content-Centric Networking and Network Virtualization, to propose a distributed architecture that supports efficient content delivery. We propose to use clustering at the user level for content-distribution. A weighted multi-factor clustering algorithm is proposed for grouping the DUEs sharing a common interest. Various performance parameters such as energy consumption, area spectral efficiency, and throughput have been considered for evaluating the proposed algorithm. The effect of the number of clusters on the performance parameters is also discussed. The proposed algorithm has been further modified to allow for a trade-off between fairness and other performance parameters.
A comprehensive simulation study is presented that demonstrates that the proposed clustering algorithm is more flexible and outperforms several well-known and state-of-the-art algorithms. The clustering process is subsequently evaluated from an individual user’s perspective for further performance improvement. We believe that some users, sharing common interests, are better off with the eNB rather than being in the clusters. We utilize machine learning algorithms, namely Deep Neural Network, Random Forest, and Support Vector Machine, to identify the users that are better served by the eNB and form clusters for the rest of the users. This proposed user segregation scheme can be used in conjunction with most clustering algorithms including the proposed multi-factor scheme. A comprehensive simulation study demonstrates that with such novel user segregation, the performance of individual users, as well as the whole network, can be significantly improved for throughput, energy consumption, and fairness.

Item
A Machine Learning Approach to Enhance the Performance of D2D-Enabled Clustered Networks
(IEEE, 2021-01-20) Aslam S; Alam F; Hasan SF; Rashid MA
Clustering has been suggested as an effective technique to enhance the performance of multicasting networks. Typically, a cluster head is selected to broadcast the cached content to its cluster members utilizing Device-to-Device (D2D) communication. However, some users can attain better performance by being connected with the Evolved Node B (eNB) rather than being in the clusters. In this article, we apply machine learning algorithms, namely Support Vector Machine, Random Forest, and Deep Neural Network, to identify the users that should be serviced by the eNB. We therefore propose a mixed-mode content distribution scheme where the cluster heads and eNB service the two segregated groups of users to improve the performance of existing clustering schemes.
A D2D-enabled multicasting scenario has been set up to perform a comprehensive simulation study that demonstrates that by utilizing the mixed-mode scheme, the performance of individual users, as well as the whole network, improves significantly in terms of throughput, energy consumption, and fairness. This study also demonstrates the trade-off between eNB loading and performance improvement for various parameters.

