Bioinformatics tools and explainable machine learning approaches for colorectal cancer genomic and metagenomic data analysis : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences at Massey University, Albany Campus, Auckland, New Zealand

Date
2024-10-28
Publisher
Massey University
Rights
© The Author
Abstract
Colorectal cancer (CRC) is a leading cause of cancer-related mortality worldwide and is influenced by complex interactions between genetic factors and the microbiome. The advent of high-throughput sequencing technologies has generated vast amounts of genomic and metagenomic data, providing opportunities to uncover novel biological markers (biomarkers) for CRC diagnosis, prognosis, and treatment. However, analyzing such datasets poses significant computational and interpretive challenges, necessitating the development of efficient and user-friendly bioinformatics tools. Recent advancements in machine learning, particularly Random Forest (RF) models, have shown promise in identifying predictive features in genomic data. Yet existing implementations often face limitations in scalability and interpretability, especially when applied to large genomic studies. Additionally, integrating host genomic and microbial metagenomic data remains a complex task due to the heterogeneity of the data types and the sophisticated analytical methods required.
This thesis focuses on the development of computational tools and the application of machine learning techniques to enhance the analysis of genomic and metagenomic data for colorectal cancer research. Firstly, I present the MetaFunc App, an interactive Shiny application designed to facilitate the exploration of data generated by the MetaFunc pipeline, an analysis pipeline for host and microbiome transcriptome data. The app provides a user-friendly interface for visualizing and analyzing microbial taxonomic profiles alongside host gene expression data, linking functional annotations to specific microbial taxa. This integration facilitates a deeper understanding of microbial contributions to a designated target outcome (e.g. cancer versus normal) and aids in identifying potential microbial biomarkers and elucidating their functions.
Secondly, I apply Random Forest machine learning models to identify genomic and microbial biomarkers that differentiate right-sided colorectal cancer (RCC) from left-sided colorectal cancer (LCC). Utilizing RNA-seq data for 58,677 coding and non-coding human genes, and count data for 28,577 microbial taxa from 308 patient tumour samples, I develop three models: a genes-only model, a microbes-only model, and a combined genes-and-microbes model. The genes-only model achieves an accuracy of 90%, identifying significant genomic features such as PRAC1, HOXB13, HOXC4, and HOXC6, which are associated with colorectal cancer location and development. The microbes-only model achieves an accuracy of 70%, identifying significant microbial features including Ruminococcus gnavus and Fusobacterium nucleatum. The combined model achieves an accuracy of 87%, which may reflect an association between microbial communities and host gene expression in CRC.
Finally, I present pyRforest, an R package that integrates Python’s scikit-learn RandomForestClassifier into the R environment via the reticulate package to improve computational efficiency and memory management when using Random Forest models in R to analyze large genomic datasets. pyRforest also includes a novel rank-based permutation method for calculating p-values of individual features for feature identification. Additionally, it can calculate and plot SHapley Additive exPlanations (SHAP) values to interpret the contribution of each feature to model predictions, enhancing the explainability of Random Forest model results. The utility of pyRforest is demonstrated through a case study, in which it is used to identify candidate biomarkers and provide insights into their biological significance.
Collectively, this work advances the understanding of genomic and microbial factors influencing colorectal cancer and provides computational tools that can be reused in other analyses. The MetaFunc App and the pyRforest package facilitate the integration and interpretation of complex genomic and metagenomic data, and represent valuable resources for biomarker discovery. By addressing current challenges in data analysis and in bioinformatics software development, this thesis lays the groundwork for future research in bioinformatics and oncology, with the ultimate aim of creating and implementing tools for improved analysis of genomic and metagenomic cancer datasets.
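To make the idea of permutation-based p-values for Random Forest features more concrete, the following is a minimal sketch in Python using scikit-learn's RandomForestClassifier, the same classifier the abstract says pyRforest wraps. It is not the pyRforest implementation: the abstract does not describe the procedure, so the rank-wise comparison here (the observed importance at each rank is compared with the importance at the same rank after shuffling the class labels) is only one plausible reading of "rank-based". The data are synthetic, and the function names `sorted_importances` and `rank_permutation_pvalues` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def sorted_importances(X, y, seed):
    """Fit a Random Forest; return importances sorted high to low and the feature order."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return model.feature_importances_[order], order


def rank_permutation_pvalues(X, y, n_permutations=30, seed=0):
    """Assumed rank-wise permutation test (illustrative, not the thesis's method).

    For each rank k, count how often the k-th largest importance obtained with
    shuffled labels is at least as large as the observed k-th largest importance.
    """
    rng = np.random.default_rng(seed)
    observed, order = sorted_importances(X, y, seed)
    exceed = np.zeros(X.shape[1])
    for i in range(n_permutations):
        y_perm = rng.permutation(y)  # break any link between features and labels
        null_sorted, _ = sorted_importances(X, y_perm, seed + i + 1)
        exceed += null_sorted >= observed
    # Add-one correction so that no p-value is exactly zero.
    pvals = (exceed + 1) / (n_permutations + 1)
    return order, observed, pvals


# Synthetic stand-in for an expression matrix: 120 samples, 20 features.
rng = np.random.default_rng(42)
n_samples, n_features = 120, 20
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)  # two classes, e.g. right- vs left-sided
X[:, 0] += 3.0 * y  # only feature 0 carries signal

order, observed, pvals = rank_permutation_pvalues(X, y)
for rank in range(3):
    print(f"rank {rank + 1}: feature {order[rank]}, "
          f"importance {observed[rank]:.3f}, p = {pvals[rank]:.3f}")
```

With real data, the number of permutations would need to be far larger than 30 to resolve small p-values, and the resulting p-values would normally be corrected for multiple testing.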
Keywords
Bioinformatics, Biostatistics, Machine Learning, Computer Science, Biomarker Identification, Random Forest Models, Colorectal Cancer, Genomics, Metagenomics