Journal Articles

Permanent URI for this collection: https://mro.massey.ac.nz/handle/10179/7915

Now showing 1 - 6 of 6
  • Item
    pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R.
    (Oxford University Press, 2024-10-07) Kolisnik T; Keshavarz-Rahaghi F; Purcell RV; Smith ANH; Silander OK
    Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R. pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.
  • Item
    The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Dogs (Canis familiaris): A Validation Study
    (MDPI (Basel, Switzerland), 2024-09-13) Redmond C; Smit M; Draganova I; Corner-Thomas R; Thomas D; Andrews C; Fullwood DT; Bowden AE
    Assessing the behaviour and physical attributes of domesticated dogs is critical for predicting the suitability of animals for companionship or specific roles such as hunting, military or service. Common methods of behavioural assessment can be time-consuming, labour-intensive, and subject to bias, making large-scale and rapid implementation challenging. Objective, practical and time-effective behaviour measures may be facilitated by remote and automated devices such as accelerometers. This study, therefore, aimed to validate the ActiGraph® accelerometer as a tool for behavioural classification. This study used a machine learning method that identified nine dog behaviours with an overall accuracy of 74% (range for each behaviour was 54 to 93%). In addition, overall body dynamic acceleration was found to be correlated with the amount of time spent exhibiting active behaviours (barking, locomotion, scratching, sniffing, and standing; R² = 0.91, p < 0.001). Machine learning was an effective method to build a model to classify behaviours such as barking, defecating, drinking, eating, locomotion, resting-asleep, resting-alert, sniffing, and standing with high overall accuracy whilst maintaining a large behavioural repertoire.
  • Item
    The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Cats (Felis catus): A Validation Study
    (MDPI (Basel, Switzerland), 2023-08-14) Smit M; Ikurior SJ; Corner-Thomas RA; Andrews CJ; Draganova I; Thomas DG; Vanwanseele B
    Animal behaviour can be an indicator of health and welfare. Monitoring behaviour through visual observation is labour-intensive and there is a risk of missing infrequent behaviours. Twelve healthy domestic shorthair cats were fitted with triaxial accelerometers mounted on a collar and harness. Over seven days, accelerometer and video footage were collected simultaneously. Identifier variables (n = 32) were calculated from the accelerometer data and summarized into 1 s epochs. Twenty-four behaviours were annotated from the video recordings and aligned with the summarised accelerometer data. Models were created using random forest (RF) and supervised self-organizing map (SOM) machine learning techniques for each mounting location. Multiple modelling rounds were run to select and merge behaviours based on performance values. All models were then tested on a validation accelerometer dataset from the same twelve cats to identify behaviours. The frequency of behaviours was calculated and compared using Dirichlet regression. Despite the SOM models having higher Kappa (>95%) and overall accuracy (>95%) compared with the RF models (64-76% and 70-86%, respectively), the RF models predicted behaviours more consistently between mounting locations. These results indicate that triaxial accelerometers can identify cat-specific behaviours.
  • Item
    Lost in the Forest: Encoding categorical variables and the absent levels problem
    (Springer Nature, 2024-04-10) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC; Gama J
    Levels of a predictor variable that are absent when a classification tree is grown cannot be subject to an explicit splitting rule. This is an issue if these absent levels are present in a new observation for prediction. To date, there remains no satisfactory solution for absent levels in random forest models. Unlike missing data, absent levels are fully observed and known. Ordinal encoding of predictors allows absent levels to be integrated and used for prediction. Using a case study on source attribution of Campylobacter species using whole genome sequencing (WGS) data as predictors, we examine how target-agnostic versus target-based encoding of predictor variables with absent levels affects the accuracy of random forest models. We show that a target-based encoding approach using class probabilities, with absent levels designated the highest rank, is systematically biased, and that this bias is resolved by encoding absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordinal encoding predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are encoded according to their similarity to each of the other levels in the training data. We show that the PCO-encoding method performs at least as well as the target-based approach and is not biased.
  • Item
    Lost in the Forest
    (Cold Spring Harbor Laboratory, 2022) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC
    To date, there remains no satisfactory solution for absent levels in random forest models. Absent levels are levels of a predictor variable encountered during prediction for which no explicit rule exists. Imposing an order on nominal predictors allows absent levels to be integrated and used for prediction. The ordering of predictors has traditionally been via class probabilities with absent levels designated the lowest order. Using a combination of simulated data and pathogen source-attribution models using whole-genome sequencing data, we examine how the method of ordering predictors with absent levels can (i) systematically bias a model, and (ii) affect the out-of-bag error rate. We show that the traditional approach is systematically biased and underestimates out-of-bag error rates, and that this bias is resolved by ordering absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordering predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are designated an order according to their similarity to each of the other levels in the training data. We show that the PCO method performs at least as well as the traditional approach of ordering and is not biased.
  • Item
    A Machine Learning Approach to Enhance the Performance of D2D-Enabled Clustered Networks
    (IEEE, 2021-01-20) Aslam S; Alam F; Hasan SF; Rashid MA
    Clustering has been suggested as an effective technique to enhance the performance of multicasting networks. Typically, a cluster head is selected to broadcast the cached content to its cluster members utilizing Device-to-Device (D2D) communication. However, some users can attain better performance by being connected with the Evolved Node B (eNB) rather than being in the clusters. In this article, we apply machine learning algorithms, namely Support Vector Machine, Random Forest, and Deep Neural Network to identify the users that should be serviced by the eNB. We therefore propose a mixed-mode content distribution scheme where the cluster heads and eNB service the two segregated groups of users to improve the performance of existing clustering schemes. A D2D-enabled multicasting scenario has been set up to perform a comprehensive simulation study that demonstrates that by utilizing the mixed-mode scheme, the performance of individual users, as well as of the whole network, improves significantly in terms of throughput, energy consumption, and fairness. This study also demonstrates the trade-off between eNB loading and performance improvement for various parameters.
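
Illustrative note on the "Lost in the Forest" entries above: the abstracts describe target-based encoding of a categorical predictor by class probabilities, with absent levels placed at the a priori value of equal class probability. The sketch below is one possible reading of that idea for a two-class problem, not the authors' implementation. The function names `fit_target_encoding` and `encode`, the toy data, and the choice of 0.5 as the equal-probability score for two classes are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_target_encoding(levels, labels, positive=1):
    """Map each training level to the observed proportion of the positive class.

    Hypothetical helper: a binary-class stand-in for ordering levels
    by class probability.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [n_positive, n_total]
    for level, label in zip(levels, labels):
        counts[level][0] += int(label == positive)
        counts[level][1] += 1
    return {level: pos / total for level, (pos, total) in counts.items()}


def encode(levels, scores, absent_score=0.5):
    """Replace each level by its score; levels unseen in training get absent_score.

    absent_score=0.5 is the assumed 'equal class probability' value
    for two classes.
    """
    return [scores.get(level, absent_score) for level in levels]


# Toy training data: level "z" never appears here, so it is an absent level.
train_levels = ["a", "a", "b", "b", "b", "c"]
train_labels = [1, 1, 0, 0, 1, 0]

scores = fit_target_encoding(train_levels, train_labels)
encoded_new = encode(["a", "z", "c"], scores)
print(scores)
print(encoded_new)
```

Because tree splits depend only on the ordering of a numeric predictor, encoding by the probability score gives the same splits on training levels as encoding by rank. The absent level "z" lands between the observed levels rather than at either extreme of the ordering.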