Journal Articles
Permanent URI for this collection: https://mro.massey.ac.nz/handle/10179/7915
5 results
Search Results
Item pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R. (Oxford University Press, 2024-10-07) Kolisnik T; Keshavarz-Rahaghi F; Purcell RV; Smith ANH; Silander OK
Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R.
pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.
Item The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Dogs (Canis familiaris): A Validation Study (MDPI (Basel, Switzerland), 2024-09-13) Redmond C; Smit M; Draganova I; Corner-Thomas R; Thomas D; Andrews C; Fullwood DT; Bowden AE
Assessing the behaviour and physical attributes of domesticated dogs is critical for predicting the suitability of animals for companionship or specific roles such as hunting, military or service. Common methods of behavioural assessment can be time-consuming, labour-intensive, and subject to bias, making large-scale and rapid implementation challenging. Objective, practical and time-effective behaviour measures may be facilitated by remote and automated devices such as accelerometers. This study, therefore, aimed to validate the ActiGraph® accelerometer as a tool for behavioural classification. This study used a machine learning method that identified nine dog behaviours with an overall accuracy of 74% (range for each behaviour was 54 to 93%). In addition, overall body dynamic acceleration was found to be correlated with the amount of time spent exhibiting active behaviours (barking, locomotion, scratching, sniffing, and standing; R² = 0.91, p < 0.001).
Machine learning was an effective method to build a model to classify behaviours such as barking, defecating, drinking, eating, locomotion, resting-asleep, resting-alert, sniffing, and standing with high overall accuracy whilst maintaining a large behavioural repertoire.
Item The Use of Triaxial Accelerometers and Machine Learning Algorithms for Behavioural Identification in Domestic Cats (Felis catus): A Validation Study (MDPI (Basel, Switzerland), 2023-08-14) Smit M; Ikurior SJ; Corner-Thomas RA; Andrews CJ; Draganova I; Thomas DG; Vanwanseele B
Animal behaviour can be an indicator of health and welfare. Monitoring behaviour through visual observation is labour-intensive and there is a risk of missing infrequent behaviours. Twelve healthy domestic shorthair cats were fitted with triaxial accelerometers mounted on a collar and harness. Over seven days, accelerometer and video footage were collected simultaneously. Identifier variables (n = 32) were calculated from the accelerometer data and summarized into 1 s epochs. Twenty-four behaviours were annotated from the video recordings and aligned with the summarised accelerometer data. Models were created using random forest (RF) and supervised self-organizing map (SOM) machine learning techniques for each mounting location. Multiple modelling rounds were run to select and merge behaviours based on performance values. All models were then tested on a validation accelerometer dataset from the same twelve cats to identify behaviours. The frequency of behaviours was calculated and compared using Dirichlet regression. Despite the SOM models having higher Kappa (>95%) and overall accuracy (>95%) compared with the RF models (64-76% and 70-86%, respectively), the RF models predicted behaviours more consistently between mounting locations.
These results indicate that triaxial accelerometers can identify cat-specific behaviours.
Item Lost in the Forest: Encoding categorical variables and the absent levels problem (Springer Nature, 2024-04-10) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC; Gama J
Levels of a predictor variable that are absent when a classification tree is grown cannot be subject to an explicit splitting rule. This is an issue if these absent levels are present in a new observation for prediction. To date, there remains no satisfactory solution for absent levels in random forest models. Unlike missing data, absent levels are fully observed and known. Ordinal encoding of predictors allows absent levels to be integrated and used for prediction. Using a case study on source attribution of Campylobacter species using whole genome sequencing (WGS) data as predictors, we examine how target-agnostic versus target-based encoding of predictor variables with absent levels affects the accuracy of random forest models. We show that a target-based encoding approach using class probabilities, with absent levels designated the highest rank, is systematically biased, and that this bias is resolved by encoding absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordinal encoding predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are encoded according to their similarity to each of the other levels in the training data. We show that the PCO-encoding method performs at least as well as the target-based approach and is not biased.
Item Lost in the Forest (Cold Spring Harbor Laboratory, 2022) Smith HL; Biggs PJ; French NP; Smith ANH; Marshall JC
To date, there remains no satisfactory solution for absent levels in random forest models. Absent levels are levels of a predictor variable encountered during prediction for which no explicit rule exists.
Imposing an order on nominal predictors allows absent levels to be integrated and used for prediction. The ordering of predictors has traditionally been via class probabilities with absent levels designated the lowest order. Using a combination of simulated data and pathogen source-attribution models using whole-genome sequencing data, we examine how the method of ordering predictors with absent levels can (i) systematically bias a model, and (ii) affect the out-of-bag error rate. We show that the traditional approach is systematically biased and underestimates out-of-bag error rates, and that this bias is resolved by ordering absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordering predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are designated an order according to their similarity to each of the other levels in the training data. We show that the PCO method performs at least as well as the traditional approach of ordering and is not biased.
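
As a rough illustration of the absent-levels idea in the abstract above, the following Python sketch orders the levels of a nominal predictor by class probability in a two-class setting and contrasts two placements for a level never seen in training: the lowest order, and the position implied by equal class probability (0.5 for two classes). It is a simplified sketch and not the authors' implementation; it does not cover the PCO method, and the function names, the sentinel value of -1.0, and the toy level and class labels ("A" to "D", "poultry", "cattle") are all hypothetical.

```python
from collections import defaultdict


def fit_level_scores(levels, labels, positive):
    """Estimate P(label == positive | level) for each level seen in training."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n_positive, n_total]
    for level, label in zip(levels, labels):
        counts[level][0] += int(label == positive)
        counts[level][1] += 1
    return {level: pos / total for level, (pos, total) in counts.items()}


def encode(levels, scores, absent_score):
    """Replace each level by its score; levels unseen in training get absent_score."""
    return [scores.get(level, absent_score) for level in levels]


# Hypothetical training data: one nominal predictor and a two-class target.
train_levels = ["A", "A", "B", "B", "B", "C"]
train_labels = ["poultry", "poultry", "cattle", "poultry", "cattle", "cattle"]
scores = fit_level_scores(train_levels, train_labels, positive="poultry")

# New observations; level "D" is absent from the training data.
new_levels = ["A", "D", "C"]

# Absent level placed below every observed level (assumed sentinel of -1.0).
lowest = encode(new_levels, scores, absent_score=-1.0)

# Absent level placed where equal class probability would put it.
equal = encode(new_levels, scores, absent_score=0.5)

print(lowest)
print(equal)
```

Because tree splits depend only on the order of the encoded values, the two placements differ in which side of a split the absent level "D" falls on: with the sentinel it always goes with the level least associated with the positive class, whereas with the equal-probability score it sits between levels "B" and "A".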
