Journal Articles

Permanent URI for this collection: https://mro.massey.ac.nz/handle/10179/7915

Search Results

Now showing 1 - 5 of 5
  • Item
    pyRforest: a comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R.
    (Oxford University Press, 2024-10-07) Kolisnik T; Keshavarz-Rahaghi F; Purcell RV; Smith ANH; Silander OK
    Random Forest models are widely used in genomic data analysis and can offer insights into complex biological mechanisms, particularly when features influence the target in interactive, nonlinear, or nonadditive ways. Currently, some of the most efficient Random Forest methods in terms of computational speed are implemented in Python. However, many biologists use R for genomic data analysis, as R offers a unified platform for performing additional statistical analysis and visualization. Here, we present an R package, pyRforest, which integrates Python scikit-learn "RandomForestClassifier" algorithms into the R environment. pyRforest inherits the efficient memory management and parallelization of Python, and is optimized for classification tasks on large genomic datasets, such as those from RNA-seq. pyRforest offers several additional capabilities, including a novel rank-based permutation method for biomarker identification. This method can be used to estimate and visualize P-values for individual features, allowing the researcher to identify a subset of features for which there is robust statistical evidence of an effect. In addition, pyRforest includes methods for the calculation and visualization of SHapley Additive exPlanations values. Finally, pyRforest includes support for comprehensive downstream analysis for gene ontology and pathway enrichment. pyRforest thus improves the implementation and interpretability of Random Forest models for genomic data analysis by merging the strengths of Python with R. pyRforest can be downloaded at: https://www.github.com/tkolisnik/pyRforest with an associated vignette at https://github.com/tkolisnik/pyRforest/blob/main/vignettes/pyRforest-vignette.pdf.
  • Item
    Visual Integration of Genome-Wide Association Studies and Differential Expression Results with the Hidecan R Package.
    (MDPI (Basel, Switzerland), 2024-09-25) Angelin-Bonnet O; Vignes M; Biggs PJ; Baldwin S; Thomson S; Hojsgaard D
    Background/Objectives: We present hidecan, an R package for generating visualisations that summarise the results of one or more genome-wide association studies (GWAS) and differential expression analyses, as well as manually curated candidate genes, e.g., extracted from the literature. This tool is applicable to all ploidy levels; we notably provide functionalities to facilitate the visualisation of GWAS results obtained for autotetraploid organisms with the GWASpoly package. Results: We illustrate the capabilities of hidecan with examples from two autotetraploid potato datasets. Conclusions: The hidecan package is implemented in R and is publicly available on the CRAN repository and on GitHub. A description of the package, as well as a detailed tutorial, is made available alongside the package. It is also part of the VIEWpoly tool for the visualisation and exploration of results from computational tools for polyploids.
  • Item
    DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network
    (BioMed Central Ltd, 2023-09-18) Zhang J; Liu B; Wu J; Wang Z; Li J
    Understanding gene expression processes necessitates the accurate classification and identification of transcription factors, which is supported by high-throughput sequencing technologies. However, these techniques suffer from inherent limitations such as time consumption and high costs. To address these challenges, the field of bioinformatics has increasingly turned to deep learning technologies for analyzing gene sequences. Nevertheless, the pursuit of improved experimental results has led to the inclusion of numerous complex analysis function modules, resulting in models with a growing number of parameters. To overcome these limitations, we propose a novel approach for analyzing DNA transcription factor sequences, named DeepCAC. This method leverages deep convolutional neural networks with a multi-head self-attention mechanism. By employing convolutional neural networks, it can effectively capture local hidden features in the sequences. Simultaneously, the multi-head self-attention mechanism enhances the identification of hidden features with long-distance dependencies. This approach reduces the overall number of parameters in the model while harnessing the computational power of sequence data from multi-head self-attention. Through training with labeled data, experiments demonstrate that this approach significantly improves performance while requiring fewer parameters compared to existing methods. Additionally, the effectiveness of our approach is validated in accurately predicting DNA transcription factor sequences.
  • Item
    A multi-objective genetic algorithm to find active modules in multiplex biological networks
    (PLOS, 2021-08-30) Novoa-Del-Toro EM; Mezura-Montes E; Vignes M; Térézol M; Magdinier F; Tichit L; Baudot A; Jensen P
    The identification of subnetworks of interest (or active modules) by integrating biological networks with molecular profiles is a key resource to inform on the processes perturbed in different cellular conditions. We here propose MOGAMUN, a Multi-Objective Genetic Algorithm to identify active modules in MUltiplex biological Networks. MOGAMUN optimizes both the density of interactions and the scores of the nodes (e.g., their differential expression). We compare MOGAMUN with state-of-the-art methods, representative of different algorithms dedicated to the identification of active modules in single networks. MOGAMUN identifies dense and high-scoring modules that are also easier to interpret. In addition, to our knowledge, MOGAMUN is the first method able to use multiplex networks. Multiplex networks are composed of different layers of physical and functional relationships between genes and proteins. Each layer is associated with its own meaning, topology, and biases; the multiplex framework allows this diversity of biological networks to be exploited. We applied MOGAMUN to identify cellular processes perturbed in Facio-Scapulo-Humeral muscular Dystrophy, by integrating RNA-seq expression data with a multiplex biological network. We identified different active modules of interest, thereby providing new angles for investigating the pathomechanisms of this disease.
  • Item
    Whole-genome sequencing and ad hoc shared genome analysis of Staphylococcus aureus isolates from a New Zealand primary school
    (Springer Nature Limited, 2021-10-13) Scott P; Zhang J; Anderson T; Priest PC; Chambers S; Smith H; Murdoch DR; French N; Biggs PJ
    Epidemiological studies of communicable diseases increasingly use large whole-genome sequencing (WGS) datasets to explore the transmission of pathogens. It is important to obtain an initial overview of datasets and identify closely related isolates, but this can be challenging with large numbers of isolates and imperfect sequencing. We used an ad hoc whole-genome multilocus sequence typing method to summarise data from a longitudinal study of Staphylococcus aureus in a primary school in New Zealand. Each pair of isolates was compared and the number of genes where alleles differed between isolates was tallied to produce a matrix of "allelic differences". We plotted histograms of the number of allelic differences between isolates for: all isolate pairs; pairs of isolates from different individuals; and pairs of isolates from the same individual. A total of 340 sequenced isolates were included, and the ad hoc shared genome contained 445 genes. There were between 0 and 420 allelic differences between isolate pairs and the majority of pairs had more than 260 allelic differences. We found many genetically closely related S. aureus isolates from single individuals and a smaller number of closely related isolates from separate individuals. Multiple S. aureus isolates from the same individual were usually very closely related or identical over the ad hoc shared genome. Siblings carried genetically similar, but not identical, isolates. An ad hoc shared genome approach to WGS analysis can accommodate imperfect sequencing of the included isolates, and can provide insights into relationships between isolates in epidemiological studies with large WGS datasets containing diverse isolates.
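
The pairwise tally described in the last abstract above can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that the "ad hoc shared genome" means the set of genes with an allele call in every included isolate, which the abstract does not spell out. The function names (`shared_genes`, `allelic_differences`, `split_by_individual`), the isolate and gene labels, and the toy allele numbers are all hypothetical.

```python
from itertools import combinations


def shared_genes(profiles):
    """Return the genes that have an allele call in every isolate.

    Assumption: this is one plausible reading of an "ad hoc shared genome";
    a missing call is represented here by None or by an absent key.
    """
    all_genes = set()
    for calls in profiles.values():
        all_genes.update(calls)
    return sorted(
        gene for gene in all_genes
        if all(calls.get(gene) is not None for calls in profiles.values())
    )


def allelic_differences(profiles):
    """Count, for each isolate pair, the shared genes whose alleles differ."""
    genes = shared_genes(profiles)
    names = sorted(profiles)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        diff = sum(profiles[a][g] != profiles[b][g] for g in genes)
        matrix[a][b] = matrix[b][a] = diff  # the matrix is symmetric
    return genes, matrix


def split_by_individual(matrix, owners):
    """Split pairwise counts into same-individual and different-individual lists."""
    same, different = [], []
    for a, b in combinations(sorted(matrix), 2):
        (same if owners[a] == owners[b] else different).append(matrix[a][b])
    return same, different


# Hypothetical toy data: isolate -> {gene: allele number}, None = no call.
profiles = {
    "iso1": {"geneA": 1, "geneB": 2, "geneC": 3, "geneD": None},
    "iso2": {"geneA": 1, "geneB": 2, "geneC": 4, "geneD": 7},
    "iso3": {"geneA": 5, "geneB": 6, "geneC": 3, "geneD": 7},
}
# Hypothetical mapping of each isolate to the individual it came from.
owners = {"iso1": "child1", "iso2": "child1", "iso3": "child2"}

genes, matrix = allelic_differences(profiles)
same, different = split_by_individual(matrix, owners)
```

In this toy case geneD is dropped from the shared genome because iso1 has no call for it, which is how imperfect sequencing is accommodated under the stated assumption. The `same` and `different` lists are the values that would feed the per-group histograms.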