Journal Articles

Permanent URI for this collection: https://mro.massey.ac.nz/handle/10179/7915

  • Item
    pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R.
    (Oxford University Press, 2024-10-07) Kolisnik T; Keshavarz-Rahaghi F; Purcell RV; Smith ANH; Silander OK
    Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R. pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.
  • Item
    SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data.
    (Microbiology Society, 2018-06-15) Epping L; van Tonder AJ; Gladstone RA; The Global Pneumococcal Sequencing Consortium; Bentley SD; Page AJ; Keane JA
    Streptococcus pneumoniae is responsible for 240 000-460 000 deaths in children under 5 years of age each year. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines. Recent efforts have been made to infer serotypes directly from genomic data but current software approaches are limited and do not scale well. Here, we introduce a novel method, SeroBA, which uses a k-mer approach. We compare SeroBA against real and simulated data and present results on the concordance and computational performance against a validation dataset, the robustness and scalability when analysing a large dataset, and the impact of varying the depth of coverage on sequence-based serotyping. SeroBA can predict serotypes, by identifying the cps locus, directly from raw whole genome sequencing read data with 98 % concordance using a k-mer-based method, can process 10 000 samples in just over 1 day using a standard server and can call serotypes at a coverage as low as 15-21×. SeroBA is implemented in Python3 and is freely available under an open source GPLv3 licence from: https://github.com/sanger-pathogens/seroba.
  • Item
    Visual Integration of Genome-Wide Association Studies and Differential Expression Results with the Hidecan R Package.
    (MDPI (Basel, Switzerland), 2024-09-25) Angelin-Bonnet O; Vignes M; Biggs PJ; Baldwin S; Thomson S; Hojsgaard D
    Background/Objectives: We present hidecan, an R package for generating visualisations that summarise the results of one or more genome-wide association studies (GWAS) and differential expression analyses, as well as manually curated candidate genes, e.g., extracted from the literature. This tool is applicable to all ploidy levels; we notably provide functionalities to facilitate the visualisation of GWAS results obtained for autotetraploid organisms with the GWASpoly package. Results: We illustrate the capabilities of hidecan with examples from two autotetraploid potato datasets. Conclusions: The hidecan package is implemented in R and is publicly available on the CRAN repository and on GitHub. A description of the package, as well as a detailed tutorial, is made available alongside the package. It is also part of the VIEWpoly tool for the visualisation and exploration of results from polyploid computational tools.
  • Item
    A multi-objective genetic algorithm to find active modules in multiplex biological networks
    (PLOS, 2021-08-30) Novoa-Del-Toro EM; Mezura-Montes E; Vignes M; Térézol M; Magdinier F; Tichit L; Baudot A; Jensen P
    The identification of subnetworks of interest, or active modules, by integrating biological networks with molecular profiles is a key resource for understanding the processes perturbed in different cellular conditions. Here we propose MOGAMUN, a Multi-Objective Genetic Algorithm to identify active modules in MUltiplex biological Networks. MOGAMUN optimizes both the density of interactions and the scores of the nodes (e.g., their differential expression). We compare MOGAMUN with state-of-the-art methods, representative of different algorithms dedicated to the identification of active modules in single networks. MOGAMUN identifies dense and high-scoring modules that are also easier to interpret. In addition, to our knowledge, MOGAMUN is the first method able to use multiplex networks. Multiplex networks are composed of different layers of physical and functional relationships between genes and proteins. Each layer is associated with its own meaning, topology, and biases; the multiplex framework makes it possible to exploit this diversity of biological networks. We applied MOGAMUN to identify cellular processes perturbed in Facio-Scapulo-Humeral muscular Dystrophy, by integrating RNA-seq expression data with a multiplex biological network. We identified different active modules of interest, thereby providing new angles for investigating the pathomechanisms of this disease.
  • Item
    LineageSpecificSeqgen: generating sequence data with lineage-specific variation in the proportion of variable sites
    (Biomed Central, 2008-11-21) Grievink, Liat Shavit; Penny, David; Hendy, Mike D; Holland, Barbara R
    Background: Commonly used phylogenetic models assume a homogeneous evolutionary process throughout the tree. It is known that these homogeneous models are often too simplistic, and that with time some properties of the evolutionary process can change (due to selection or drift). In particular, as constraints on sequences evolve, the proportion of variable sites can vary between lineages. This affects the ability of phylogenetic methods to correctly estimate phylogenetic trees, especially for long timescales. To date there is no phylogenetic model that allows for change in the proportion of variable sites, and the degree to which this affects phylogenetic reconstruction is unknown. Results: We present LineageSpecificSeqgen, an extension to the seq-gen program that allows generation of sequences with both changes in the proportion of variable sites and changes in the rate at which sites switch between being variable and invariable. In contrast to seq-gen and its derivatives to date, we interpret branch lengths as the mean number of substitutions per variable site, as opposed to the mean number of substitutions per site (which is averaged over all sites, including invariable sites). This allows specification of the substitution rates of variable sites, independently of the proportion of invariable sites. Conclusion: LineageSpecificSeqgen allows simulation of DNA and amino acid sequence alignments under a lineage-specific evolutionary process. The program can be used to test current models of evolution on sequences that have undergone lineage-specific evolution. It facilitates the development of both new methods to identify such processes in real data, and means to account for such processes. The program is available at: http://awcmee.massey.ac.nz/downloads.htm.
  • Item
    Augmented reality for pedestrian evacuation research: Promises and limitations
    (Elsevier Ltd, 2020-08) Lovreglio R; Kinateder M
    Evacuation effectively mitigates potential harm for building occupants in case of emergencies. Virtual and Augmented Reality (VR and AR) have emerged as research tools and means to enhance evacuation preparedness and effectiveness. Unlike VR, where users are immersed in computer-generated environments, the more novel AR technology allows users to experience digital content merged into the real world. Here, we review current (2020) relevant literature on AR as a tool to study and improve building evacuation triggered by a variety of disasters such as fires, earthquakes or tsunami. Further, we provide an overview of application goals, existing hardware and what evacuation stages can be influenced by AR applications. Finally, we discuss strengths, weaknesses, opportunities, and threats (SWOT) of AR to study evacuation behaviour and for research purposes.
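
Note on the LineageSpecificSeqgen record above: the abstract distinguishes branch lengths measured as substitutions per variable site from branch lengths measured as substitutions per site, averaged over all sites. The relation between the two can be sketched as simple arithmetic, assuming the proportion of invariable sites is constant along the branch and that invariable sites contribute zero substitutions. The function name and parameters below are hypothetical illustrations and are not taken from the program itself.

```python
def per_site_branch_length(per_variable_site_length: float,
                           proportion_invariable: float) -> float:
    """Convert a branch length in substitutions per variable site into
    substitutions per site, averaged over all sites.

    Assumption (not from the program): the proportion of invariable sites
    is fixed along the branch, and invariable sites never change.
    """
    if per_variable_site_length < 0.0:
        raise ValueError("branch length must be non-negative")
    if not 0.0 <= proportion_invariable < 1.0:
        raise ValueError("proportion of invariable sites must be in [0, 1)")
    # Only the variable fraction of sites accumulates substitutions,
    # so the all-site average is scaled down by that fraction.
    return per_variable_site_length * (1.0 - proportion_invariable)


if __name__ == "__main__":
    # Same rate at variable sites, two different invariable proportions.
    for p_inv in (0.2, 0.6):
        print(p_inv, per_site_branch_length(0.5, p_inv))
```

Under this sketch, two lineages with the same substitution rate at variable sites but different invariable proportions have different per-site branch lengths, which is why specifying lengths per variable site keeps the two quantities independent.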