Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author. Bioinformatics Tools and Explainable Machine Learning Approaches for Colorectal Cancer Genomic and Metagenomic Data Analysis A thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science School of Mathematical and Computational Sciences at Massey University, Albany Campus, Auckland, New Zealand Tyler Kolisnik 2024 Copyright: © 2024 Tyler Kolisnik. This work is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/. Supervisory Team: Dr. Olin K. Silander, PhD, Senior Research Fellow, Liggins Institute, University of Auckland, Auckland, New Zealand Dr. Sebastian Schmeier, PhD, Senior Lecturer in Bioinformatics, School of Natural and Computational Sciences, Massey University, Auckland, New Zealand (former) Dr. Steven Jones, PhD, Head of Bioinformatics and Co-Director, Canada’s Michael Smith Genome Sciences Centre at BC Cancer & Director, Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada Dr. Adam N. H. Smith, PhD, Senior Lecturer in Statistics, School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand Examination Panel: Dr. Justin O’Sullivan, PhD, Director, Liggins Institute, University of Auckland, Auckland, New Zealand Dr. Xochitl Morgan, PhD, Director, Harvard Chan Microbiome Analysis Core, Harvard University, USA Dr. Jonathan Marshall, PhD, Associate Professor in Statistics, School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand ii https://creativecommons.org/licenses/by/4.0/ Abstract Colorectal cancer (CRC) is a leading cause of cancer-related mortality worldwide and is influenced by complex interactions between genetic factors and the microbiome. The advent of high-throughput sequencing technologies has led to the generation of vast amounts of genomic and metagenomic data, providing opportunities to uncover novel biological markers (biomarkers) for CRC diagnosis, prognosis, and treatment. However, analyzing such datasets poses significant computational and interpretive challenges, necessitating the development of efficient and user-friendly bioinformatics tools. Recent advancements in machine learning, particularly Random Forest (RF) models, have shown promise in identifying predictive features in genomic data. Yet, existing implementations often face limitations in scaling and interpretability, especially when applied to large genomic studies. Additionally, integrating host genomic and microbial metagenomic data remains a complex task due to the heterogeneity of data types and sophisticated analytical methods required. This thesis focuses on the development of computational tools and the application of machine learning techniques to enhance the analysis of genomic and metagenomic data for colorectal cancer research. Firstly, I present the MetaFunc App, an interactive Shiny application designed to facilitate the exploration of data generated from the MetaFunc pipeline, an analysis iii pipeline for host and microbiome transcriptome data. The app provides a user-friendly interface for visualizing and analyzing microbial taxonomic profiles alongside host gene expression data, linking functional annotations to specific microbial taxa. This integration facilitates a deeper understanding of microbial contributions to a designated target outcome, e.g. cancer versus normal, and aids in identifying potential microbial biomarkers and eliciting their functions. Secondly, I apply Random Forest machine learning models to identify genomic and microbial biomarkers that differentiate right-sided colorectal cancer (RCC) from left-sided colorectal cancer (LCC). Utilizing RNA-seq data for 58,677 coding and non-coding human genes, and count data for 28,577 microbial taxa from 308 patient tumour samples, I develop three models: a genes-only model, a microbes-only model, and a combined genes-and-microbes model. The genes-only model achieves an accuracy of 90%, identifying significant genomic features such as PRAC1, HOXB13, HOXC4, and HOXC6, which are associated with colorectal cancer location and development. The microbes-only model achieves an accuracy of 70%, identifying significant microbial features including Ruminococcus gnavus and Fusobacterium nucleatum. The combined model achieves an accuracy of 87%, which may reflect an association between microbial communities and host gene expression in CRC. Finally, I present pyRforest, an R package that integrates Python’s scikit-learn RandomForestClassifier into the R environment via the reticulate package to enhance computational efficiency and memory management when using Random Forest models in R to iv analyze large genomic datasets. pyRforest also includes a novel rank-based permutation method for calculating p-values of individual features for feature identification. Additionally, it includes the capacity for calculating and plotting SHapley Additive exPlanations (SHAP) to interpret the contribution of each feature to model predictions, enhancing the explainability of Random Forest model results. The utility of pyRforest is demonstrated through a case study, where it is used to identify candidate biomarkers and provide insights into biological significance. Collectively, this work advances the understanding of genomic and microbial factors influencing colorectal cancer and provides advanced computational tools that can be used in other analyses. The MetaFunc App and pyRforest package facilitate the integration and interpretation of complex genomic and metagenomic data, and represent valuable resources for biomarker discovery. By addressing current challenges in data analysis and in bioinformatics software development, this thesis lays the groundwork for future research in bioinformatics and oncology, ultimately aiming to create and implement tools for improved genomic and metagenomic cancer dataset analysis. v Acknowledgements I would like to express my sincerest thanks to my doctoral supervisors, Dr. Olin Silander, Dr. Adam Smith, Dr. Sebastian Schmeier, and Dr. Steven Jones, this thesis would not have been possible without their guidance and support. I’m profoundly thankful to Dr. Jones, whose supervision at the British Columbia Genome Sciences Centre allowed me to return home to Canada to continue my studies at Massey from “abroad” during the uncertain times of the global COVID-19 pandemic. I am especially thankful to all my fellow students and colleagues at Massey University, especially Arielle Sulit, and my fellow students at the Genome Sciences Centre, especially Faeze Keshavarz-Rahaghi. I would like to also thank Lucia Lam, who has been an amazing mentor and has helped me so much with my career in bioinformatics. I would like to thank my high school math and physics teacher, Roger Pavey, for everything he taught me that set me up for success with my education. I would like to thank my amazing neighbours and friends Samina and Aodhan for making us feel welcome in New Zealand. My PhD studies would not have been possible without a PhD scholarship from the Massey University School of Natural Sciences, for which I am profoundly grateful. Thank you to my parents Teresa and Steven, and my brother Kelly for always providing me with encouragement, and for always being supportive of my decisions. Thank you to my partner Layton and my cat Fry for their love and support and for coming on the journey abroad to New Zealand with me. I would also like to express my deepest appreciation for the support of my friends and extended family throughout my PhD. Thank you to my amazing friends Shyy, Stefanie, Rachel, and Naomi. Thank you to my friend Jeff for making sure that not all of my time on computers is spent coding. Thank you to my in-laws Caron and Blaine for always being there to help us during tough times. A special thank you goes out to my cousins, Elise, Mercedes, Mike, Troy, Jen, and Jade, my aunts and uncles, especially my aunts Rita and Rose, and my uncle Aaron, and all those who sent me packages from Canada while I was studying abroad, it made me feel a lot closer to home. “Things are only impossible until they are not.” - Captain Jean-Luc Picard “Tea, earl grey, hot.” - Also Captain Jean-Luc Picard vi List of Publications The following published works were completed during my doctoral candidature. The results presented in Chapter 2 of this thesis contributed to the following publication, which is attached in Appendix A: Sulit, A. K., Kolisnik, T., Frizelle, F. A., Purcell, R., & Schmeier, S. (2023). MetaFunc: taxonomic and functional analyses of high throughput sequencing for microbiomes. Gut Microbiome, 4(4), 1-21. https://doi.org/10.1017/gmb.2022.12. The results presented in Chapter 3 of this thesis were published as: Kolisnik, T., Sulit, A. K., Schmeier, S., Frizelle, F., Purcell, R., Smith, A., & Silander, O. (2023). Identifying important microbial and genomic biomarkers for differentiating right- versus left-sided colorectal cancer using random forest models. BMC Cancer, 23(647), 1-11. https://doi.org/10.1186/s12885-023-10848-9. The results presented in Chapter 4 of this thesis were published as: Kolisnik, T., Keshavarz-Rahaghi, F., Purcell, R., Smith, A., & Silander, O. (2024). pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R. Briefings in Functional Genomics, 2024(38), 1-9. https://doi.org/10.1093/bfgp/elae038. In addition, during my time studying at the BC Genome Sciences Centre I contributed to the following publication: Keshavarz-Rahaghi, F., Pleasance, E., Kolisnik, T., & Jones, S. J. M. (2022). A p53 transcriptional signature in primary and metastatic cancers derived using machine learning. Frontiers in Genetics, 13, 987238. https://doi.org/10.3389/fgene.2022.987. vii https://doi.org/10.1017/gmb.2022.12 https://doi.org/10.1186/s12885-023-10848-9 https://doi.org/10.1093/bfgp/elae038 https://doi.org/10.1093/bfgp/elae038 https://doi.org/10.3389/fgene.2022.987238 List of Abbreviations AI Artificial Intelligence APC Adenomatous Polyposis Coli (Gene) AUC Area Under the Curve AUROC Area Under the Receiver Operator Characteristic Curve BRAF B-Raf proto-oncogene (Gene) CMS Consensus Molecular Subtypes CPM Counts-Per-Million CRC Colorectal Cancer DE Differential Expression DNA Deoxyribonucleic Acid EMT Epithelial-Mesenchymal Transition EGFR Epidermal Growth Factor Receptor (Gene) FMT Fecal Microbiota Transplant FPKM Fragments Per Kilobase of transcript per Million mapped reads GB Gigabyte GHz Gigahertz HOXB13 Homeobox B13 (Gene) HOXC4 Homeobox C4 (Gene) HOXC6 Homeobox C6 (Gene) HOXC8 Homeobox C8 (Gene) KEGG Kyoto Encyclopedia of Genes and Genomes KNN k-Nearest Neighbours KRAS Kirsten RAt Sarcoma LCC Left-sided Colorectal Cancer ML Machine Learning MLH1 MultiL Homolog 1 (Gene) MSH1 MutS Homolog 1 (Gene) MSH6 MutS Homolog 6 (Gene) viii MSI Microsatellite Instability NCBI National Center for Biotechnology Information NeSI New Zealand e-Science Infrastructure PCA Principal Component Analysis PIK3CA Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha (Gene) PMS2 Postmeiotic Segregation Increased 2 (Gene) PRAC1 Prostate Cancer Susceptibility Candidate 1 (Gene) RCC Right-sided Colorectal Cancer RF Random Forest RNA Ribonucleic Acid RNA-seq Ribonucleic Acid Sequencing RNN Recurrent Neural Networks SHAP SHapley Additive exPlanations SVM Support Vector Machines TGFBR2 Transforming Growth Factor Beta Receptor 2 (Gene) TNM Tumour Node Metastasis (Cancer staging system) TPM Transcripts-Per-Million WES Whole-Exome Sequencing WGS Whole-Genome Sequencing ix Table of Contents Abstract iii Acknowledgements vi List of Publications vii List of Abbreviations viii Table of Contents x List of Figures xiv List of Tables xvi Chapter 1 - Introduction 1 1.1 Colorectal Cancer 1 1.1.1 Epidemiology 1 1.1.2 Colorectal Cancer Nomenclature 1 1.1.3 CRC Sidedness and Clinical Implications 2 1.1.4 Molecular and Genetic Differences between RCC and LCC 3 1.1.5 CRC Treatments 4 1.1.6 The Gut Microbiome in CRC 5 1.1.7 Advances in Genomic Technologies 7 1.1.7.1 High-Throughput Sequencing Technologies 7 1.1.7.2 Challenges in Data Normalization and Scaling 8 1.2 Computational Approaches in Genomic Research 9 1.2.1 Use of Applications for Exploring Metagenomic Data 9 1.2.2 Existing Applications for Exploring Metagenomic Data 10 1.2.3 Application of Machine Learning Approaches in CRC Research 11 1.2.3.1 Historical use of ML for CRC Research 11 1.2.3.2 Random Forest Models 12 1.2.3.3 Features in Random Forests 17 1.2.3.4 Feature Importance Scoring in Random Forests 18 1.2.3.5 Permutation Importance 20 1.2.3.6 SHAP values 21 1.2.3.7 Functional Analysis 21 1.2.3.8 Biomarker Validation 22 1.3 Overview and Structure of this Thesis 23 Chapter 2 - The MetaFunc App: An R Shiny Application for Joint Exploration of Microbial Diversity, Function, and Host Gene Expression 26 2.1 Introduction 28 2.1.1 Overview 28 2.1.2 Development History and Specific Contributions 29 x 2.1.3 Motivation and Research Questions 31 2.2 Background 32 2.2.1 Shiny Applications 32 2.2.2 Shiny Apps in Bioinformatics 33 2.3 Methodologies 34 2.3.1 Workflow Integration and Application Launch 34 2.3.2 Database Development & Design 34 2.3.3 Downloading and Installation of the MetaFunc App 36 2.3.4 Technical implementation of the MetaFunc App 37 2.3.4.1 Project Structure 37 2.3.4.2 Front-End Development 37 2.3.4.3 Back-End Development 38 2.3.5 Speed and Scalability Optimization 40 2.3.6 Data Availability 41 2.3.7 Testing and Validation 41 2.4 Results 43 2.4.1 Overview 43 2.4.2 User Interface and Case Study Exploration 43 2.4.2.1 Home Page 43 2.4.2.2 Microbiome - Abundances 45 2.4.2.3 Microbiome - Gene Ontology 47 2.4.2.4 Microbiome - GO to TaxIDs 49 2.4.2.5 Microbiome - TaxIDs to GO 50 2.4.2.6 Single Sample vs Grouped Analysis 52 2.4.2.7 Host - Abundances 53 2.4.3 Benchmarking Results 55 2.5 Discussion 56 2.5.1 Summary 56 2.5.2 Implications for Data Visualization and Analysis 57 2.5.3 Comparisons with Other Tools 57 2.5.4 Limitations and Future Directions 58 2.6 Conclusion 59 Chapter 3 - Identifying important microbial and genomic biomarkers for differentiating right- versus left-sided colorectal cancer using Random Forest models 60 3.1 Abstract 62 3.2 Introduction 63 3.2.1 Background 63 3.3 Methods & Materials 65 xi 3.3.1 Patients, Samples and Processing 65 3.3.2 Random Forest Model Generation 65 3.3.3 Feature Importance and Retention 68 3.3.4 Differential Expression, Feature Side-Assignment and Heatmap Generation 69 3.4 Results 69 3.4.1 Random Forest Model Performance 69 3.4.2 Significant Model Features 72 3.5 Discussion 77 3.5.1 Patterns in RCC 79 3.5.2 Patterns in LCC 80 3.6 Conclusions 82 3.7 Declarations 82 3.8 Supplementary Material 84 Chapter 4 - pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R 97 4.1 Abstract 99 4.2 Introduction 100 4.3 Methods and Design 102 4.3.1 Integration of R and Python 102 4.3.2 Dataset Preparation and Hyperparameter Optimization 103 4.3.3 Model Tuning and Evaluation 103 4.3.4 Post-Hoc Feature Importance Significance Testing 104 4.3.5 SHAPley Additive exPlanations (SHAP) 106 4.3.6 Biological Interpretation 106 4.4 Comparisons with Alternative Implementations 107 4.5 Case Study 110 4.5.1 RNA-seq Data Analysis in Colorectal Cancer 110 4.5.2 Data Preparation and Model Training 110 4.5.3 Model Performance 111 4.5.4 Identifying Significantly Important Features 111 4.5.5 SHAP Analysis 113 4.5.6 clusterProfiler and g:Profiler Analysis 115 4.6 Discussion 117 4.7 Conclusion 121 4.8 Author Statements 121 4.9 Supplementary Material 123 Chapter 5 Conclusions 126 5.1. Overview 126 xii 5.2 Summary of Findings 126 5.2.1 Development of the MetaFunc App (Chapter 2) 126 5.2.2 Identifying Biomarkers for Colorectal Cancer Sidedness Using Random Forests (Chapter 3) 127 5.2.3 Development of the pyRforest R package (Chapter 4) 128 5.3 Future Directions 129 5.3.1 Enhancing the MetaFunc App 129 5.3.2 Expanding the use of Machine Learning for Biomarker Identification 130 5.3.3 Further Development of pyRforest 130 5.4 Concluding Remarks 131 References 132 Appendix A 152 xiii List of Figures Figure 1.1 Summary of differences between left and right-side colorectal cancer. 3 Figure 1.2 A visualization of the flow of testing and training data in 5-fold cross validation. 15 Figure 2.1 The MetaFunc Workflow. 30 Figure 2.2 The Home Page of the MetaFunc App. 44 Figure 2.3 The “Microbial - Abundances” subtab of the MetaFunc App. 46 Figure 2.4 The “Gene Ontology” subtab of the MetaFunc App. 48 Figure 2.5 The “GO to TaxIDs” subtab. 50 Figure 2.6 The “TaxIDs to GO” subtab. 51 Figure 2.7 Individual Microbial Abundances Table. 52 Figure 2.8 Grouped Microbial Abundances Table. 53 Figure 2.9 The “Host - Abundances” subtab of the MetaFunc App. 54 Figure 3.1 Receiver Operating Characteristic Curves (ROC) as calculated on the held-out validation set. 71 Figure 3.2 Feature importance plots showing rank-based feature importance scores of the permuted data and the scores of the real (unpermuted) data. 72 Figure 3.3 A heatmap of scaled gene expression values of the top-scoring genomic features discovered by the genes-only RF model and clinical characteristics. 76 Supplementary Figure 3.1 Genes-Only model mean test F1 scores over 8 hyperparameters set to a varying range of intervals. 84 xiv xv Supplementary Figure 3.2 Microbes-only model mean test F1 scores over 8 hyperparameters set to a varying range of intervals. 85 Supplementary Figure 3.3 Genes-and-microbes model mean test F1 scores over eight hyperparameters set to a varying range of intervals. 86 Supplementary Figure 3.4 A heatmap of scaled gene expression values of the top-scoring microbial features discovered by the microbes-only RF model and clinical characteristics. 87 Supplementary Figure 3.5 A heatmap of scaled genomic and microbial expression values of the top-scoring features discovered by the genes-and-microbes RF model and clinical characteristics. 88 Figure 4.1 Feature importance plot showing individual and mean rank-based feature importance scores of the permuted data. 113 Figure 4.2 SHAPley Additive Explanation plots which illustrate the impact of individual features on model prediction. 114 Figure 4.3 Gene Ontology enrichment analysis using clusterProfiler on the significant features identified by pyRforest. 116 Figure 4.4 Enrichment analysis Manhattan plot from g:Profiler calculated on the 83 significant features found in our CRC case study. 117 List of Tables Table 1.1 CMS Subtypes and their observed pathophysiological differences. 13 Table 1.2 Common statistical metrics for evaluating performance of machine learning classifiers and their definitions. 16 Table 2.1 Benchmarking results for the MetaFunc App. 55 Table 3.1 Patient Demographics & Cancer Characteristics. 66 Table 3.2 Random Forest Model Results. 70 Table 3.3 Top ranking features from the RF model trained on the genes-only dataset (Left). 73 Table 3.4. Top ranking features with p-values less than 0.05 and their importance scores discovered by our microbes-only model (Left). 74 Table 3.5 Top ranking features with p-values less than 0.05 and their importance scores discovered by our genes-and-microbes model (Left). 75 Supplementary Table 3.1 Full version of Table 3.4. 89 Supplementary Table 3.2 Full version of Table 3.5. 92 Supplementary Table 3.3 Patient Demographics & Cancer Characteristics Stratified by Cancer Side. 96 Table 4.1 A benchmarking analysis comparing memory usage and run time of Random Forest model fitting across different datasets. 107 Table 4.2 K-fold cross-validation scoring metrics for the RF model, stratified by dataset splits (Validation, Testing) on the CRC dataset from our case study. 111 Supplementary Table 4.1 A comparative analysis of feature identification approaches. 123 xvi Supplementary Table 4.2 Feature and Capability Comparison of RandomForest Implementations for Genomic Data Analysis. 123 Supplementary Table 4.3 The top-scoring model hyperparameters as optimized by the pyRforest `tune_and_train_rf_model` function which makes use of scikit-learn `GridSearchCV`. 125 xvii Chapter 1 Introduction 1.1 Colorectal Cancer 1.1.1 Epidemiology Worldwide, colorectal cancer (CRC) is the second most prevalent cancer in women, and the third most prevalent cancer in men, with approximately 2 million newly diagnosed cases occurring every year (WCRF International, 2022). CRC represents 10% of all global cancers (IARC, 2019). Age and family history are the greatest risk factors for CRC, with over 99% of cases occurring in people over 40. CRC is known to arise from polyps, which are groups of cells that begin as benign and then later become malignant (Ballinger & Anggiansah, 2007). Due to the invasive nature of colonoscopy, which is the primary diagnostic method for colorectal cancer, and the non-specific symptoms of CRC, many cases go undiagnosed until the disease has progressed to an advanced stage. 1.1.2 Colorectal Cancer Nomenclature Colorectal cancer refers to cancers that can arise in the colon, rectosigmoid junction or rectum (IARC, 2019; Mayo Clinic, 2022). The term CRC has become the preferred term by many clinicians and researchers because it encompasses all malignancies across the lower gastrointestinal tract. However, historically, there has been a lack of standardization in the nomenclature for CRC, which presents an issue when reconciling data from older studies with modern research (Paschke et al., 2018). Some clinicians prefer to refer to specific anatomical locations, leading to distinctions such as ‘colon cancer’ versus ‘rectal cancer’, while others continue to use the more general terms ‘bowel cancer’ or ‘colorectal cancer’. Recently, further distinctions within CRC have become an important topic in the research community. Terms such 1 as ‘proximal’ or ‘right-sided’ and ‘distal’ or ‘left-sided’ colorectal cancer are now becoming more widely used to describe tumours based on their specific location within the colon. This is due to emerging evidence that right-sided and left-sided CRCs exhibit distinct molecular characteristics, clinical presentations, and responses to treatment (Lee et al., 2015). 1.1.3 CRC Sidedness and Clinical Implications Cancers located in the proximal part of the colon, comprising cancers of the cecum, ascending colon, hepatic flexure, and proximal part of the transverse colon are referred to as right-sided colorectal cancers (RCC) (Mukund et al., 2020, Stintzing et al., 2017). Cancers located in the distal portion of the colon (beyond the splenic flexure), descending colon, sigmoid colon, and rectum, are referred to as left-sided colorectal cancers (LCC). These two portions of the colon, while linked in sequence, play different roles in the digestive process and have different developmental origins, with the right-sided colon developing from the embryonic midgut, and the left-sided colon developing from the embryonic hindgut (Kostouros et al., 2020). Figure 1.1 illustrates several differences in the pathology and treatment of right versus left-sided colorectal cancer (Baran et al., 2018; Lee et al., 2015). In terms of epidemiology, RCC is more common in women than in men, and typically presents with larger tumours, higher tumour node metastasis (TNM) classification stage and poorer overall survival outcomes (Liang et al., 2018; Yang et al., 2016; Zhao et al., 2020). In contrast, LCC is more common overall than RCC, especially in men, and it is often identified at a lower TNM (Tumour Node Metastasis) stage, with the tumours being smaller. LCCs generally have a more favourable prognosis than RCCs. 2 Figure 1.1 Summary of differences between left and right-side colorectal cancer. Note. This figure is a derivative work based on information from Lee et al. 2015, with additional information added from Baran et al. 2018. Permission to create a derivative work was obtained from Elsevier under license number 5896941135490. 1.1.4 Molecular and Genetic Differences between RCC and LCC The differences between RCC and LCC are not merely anatomical but extend to their molecular and genetic landscapes. RCC is characterized by a higher frequency of mutations caused by defects in DNA repair called microsatellite instability (MSI), BRAF (B-Raf proto-oncogene) mutations, and CpG island methylator phenotype positivity, all of which contribute to its rapid progression and poorer prognosis (Baran et al., 2018; Lee et al., 2015). RCC tumours tend to be more resistant to standard chemotherapy and respond better to immunotherapy. In contrast, LCC tumours typically exhibit chromosomal instability and 3 mutations in genes like the proto-oncogenes APC (Adenomatous Polyposis Coli) and KRAS (Kirsten RAt Sarcoma), which are more commonly targeted by conventional treatments like chemotherapy and EGFR (Epidermal Growth Factor Receptor) inhibitors (Natsume et al., 2018). Understanding the molecular genetic distinctions between RCC and LCC is critical as it impacts prognosis and shapes the therapeutic strategies available to patients. 1.1.5 CRC Treatments Treatment options for colorectal cancer can include surgery, chemotherapy, immunotherapy, and less commonly, radiotherapy (Kuipers et al., 2015). Early-stage CRC can often be treated with minimally invasive surgical techniques such as mucosal resection or submucosal dissection during colonoscopy. For later stage cancers, more radical surgical options such as partial colectomy are common. Chemotherapy is a cornerstone of treatment for CRC, and may be delivered in a neoadjuvant (before surgery) or adjuvant (after surgery) setting. Frequently used chemotherapy drugs include 5-flurouracil, capacitabine, oxaliplatin, irinotecan, and targeted biologics such as bevacizumab, aflibercept, cetuximab, and panitumumab, depending on the genetic profile of the tumour (Kumar et al., 2023). Radiation therapy is not commonly used, except for low-stage rectal cancers, due to the sensitivity of the bowel to radiation (Häfner & Debus, 2016). Among the most promising treatments in colorectal cancer are the immunotherapeutic options. A recent study on 12 patients with left-sided CRC showed that treatment with dostarlamib, an anti-PD-1 monoclonal antibody, achieved complete clinical remission for 12 months after their last dose (Cercek et al., 2022). While the clinically complete response in this trial is promising, there are two key caveats that these patients specifically had rectal adenocarcinoma (a form of LCC) with a genetic mutations for loss of expression of the MLH1, 4 MSH1, MSH6, and PMS2 genes. This study placed an emphasis on anatomic location, the microbiome, and gene expression in shaping treatment decisions. 1.1.6 The Gut Microbiome in CRC The human gastrointestinal tract hosts a diverse community of microorganisms, collectively known as the gut microbiome (Thursby & Juge, 2017). This microbial ecosystem plays an important role in maintaining homeostasis by aiding digestion, synthesizing vitamins, and modulating the immune system. The balance of microbial populations is crucial for health, and disruptions can lead to various diseases. In the past twenty years, projects such as The Human Microbiome Project (Human Microbiome Project Consortium, 2012) have started to characterize the importance of microbial interactions and colon epithelial cells in the human gut. However, the roles that specific microbes play in the development and progression of colorectal cancer remain poorly understood (Zou et al., 2018). Therapeutic interventions targeting the gut microbiome have shown significant success in other conditions; for instance, fecal microbiota transplants (FMT) are now a standard of care for recurrent Clostridium difficile infections, with cure rates of up to 90% (Rohlke & Stollman, 2012). This success raises the possibility of exploring similar microbiome-targeted therapies for CRC. Despite this success, stool transplants have not been extensively trialed for treatment of CRC, although some early studies such as those conducted in mice are showing promising results (Yu et al., 2023). In this study, FMT inhibited CRC progression in mice by reversing intestinal microbial dysbiosis, reducing tumour size and number, and significantly prolonging survival rates when compared to the group that did not receive FMT. The study also noted an increase in infiltration of CD8+ T cells, and inflammatory cytokines, suggesting an enhanced anti-cancer immune response in the FMT group. 5 The composition of the gut microbiome is unique to every individual, which presents a significant challenge when studying microbe-gene interactions (Gilbert et al., 2018). In addition to cancer, numerous other diseases, such as Crohn’s disease, irritable bowel syndrome (IBS), obesity and autism, have been suggested to have links to the gut microbiome; however, therapeutic options targeting the microbiome remain largely unsuccessful at treating these diseases, and the specific microbes involved and their roles in influencing gene expression remain somewhat of a mystery (Gilbert et al., 2018). In colorectal cancer, one of the leading theories postulates that a dysbiosis of the gut microbiome results in inflammation which then leads to tumour initiation, progression, and proliferation (Kostic et al., 2013). Inflammation is considered a hallmark of cancer development (Hanahan & Weinberg, 2011), and microbes have been shown to produce substances that induce cellular inflammation, contributing to tumourigenesis (Brennan et al., 2016; Grivennikov et al., 2010). In CRC, the gut microbiome plays a significant role by modulating inflammation, epithelial cell proliferation, and apoptosis (Gao et al., 2015). Certain microbial species have been implicated in CRC development through their ability to produce pro-inflammatory and genotoxic substances (Kostic et al., 2013). One such microbe is Fusobacterium nucleatum, an anaerobic Gram-negative bacterium found in higher abundance in CRC tissues compared to normal tissues (Kostic et al., 2013; Wang et al., 2022). F. nucleatum has been shown to promote carcinogenesis by adhering to and invading epithelial cells via FadA adhesin to E-cadherin binding and activation of β-catenin signalling pathways which promote inflammatory responses (Rubinstein et al., 2013). Additionally, certain strains of Escherichia coli can produce colibactin, a genotoxin that induces 6 DNA double-strand breaks, potentially promoting tumour initiation (Rosendahl Huber et al., 2024). Ruminococcus and Akkermansia species have also been associated with CRC. Both of these bacterial species are involved in the degradation of mucin and can disrupt the mucus layer, potentially facilitating contact between luminal contents and the epithelium, thereby promoting inflammation and subsequent carcinogenesis (Coleman et al., 2021). Identifying specific microbes associated with CRC can aid in early diagnosis and risk assessment (Zackular et al., 2014). Exploring microbial biomarkers enhances our understanding of disease mechanisms and opens up possibilities for microbiome-targeted interventions, such as dietary modifications, probiotics, prebiotics, antibiotics, and fecal transplants. There remains a significant gap in our understanding of how complex microbial communities interact with host factors to influence CRC development and progression (Brennan et al., 2016). Current research often focuses on individual bacterial species or specific pathways, but the synergistic effects of the microbiome as a whole and its interplay with gene expression have not been fully elucidated, especially while also considering the implications of right-side versus left-side CRC tumour location. This demonstrates the growing recognition of the importance of integrating host genomic data with not only tumour location, but also microbiome profiles to understand the complex interactions between the tumour, host, and microbiome (Garza et al., 2020). 1.1.7 Advances in Genomic Technologies 1.1.7.1 High-Throughput Sequencing Technologies The advent of next-generation sequencing (NGS) technologies has revolutionized genomic research by enabling rapid and cost-effective large-scale sequencing of DNA and RNA (Goodwin et al., 2016). Techniques such as whole-genome sequencing (WGS), whole-exome 7 sequencing (WES), and RNA sequencing (RNA-seq) have become indispensable tools in cancer genomics, facilitating the identification of genetic mutations, gene expression profiles, and alternative splicing events (Metzker, 2010). Metagenomic sequencing allows for characterization of microbial communities, providing insights into microbial diversity, function, and interactions with the host without the use of sample cultures (Sharpton, 2014). 1.1.7.2 Challenges in Data Normalization and Scaling The generation of large-scale genomic and metagenomic datasets presents significant analytical challenges. These datasets often contain tens of thousands of features (e.g. genes, microbial taxa) and require computational methods that can manage their complexity while minimizing potential sources of error. Data normalization is a critical step to adjust for technical variation such as sequencing depth, gene length, and library size, ensuring that comparisons reflect true biological differences (Conesa et al., 2016). For RNA-seq data, normalization approaches like transcripts per million (TPM) and fragments per kilobase of transcript per million mapped reads (FPKM) are commonly used to account for sequencing depth and gene length (Conesa et al., 2016). TPM normalization facilitates comparison of gene expression levels across samples by scaling for both the length of the gene and the total number of reads across samples, thus providing a consistent framework for downstream analysis. In metagenomic analyses, normalization techniques must account for differences in sequencing effort, sample composition, and the inherently uneven distribution of microbial taxa. As used in Chapter 2, microbial abundances can be expressed as percentages, calculated by dividing the count of each taxon by the total read counts mapped to all taxa in a sample and 8 multiplying by 100 (Sulit et al., 2023). This method provides a straightforward interpretation of the proportion of each microbial taxon within a sample. In Chapters 3 and 4, we utilize counts per million (CPM) normalization which adjusts for differences in library sizes by scaling raw counts to a common scale based on total counts, enabling comparison of relative abundances across samples, and ensuring consistency of scaling with the TPM normalized genomic data. 1.2 Computational Approaches in Genomic Research 1.2.1 Use of Applications for Exploring Metagenomic Data The analysis of metagenomic data presents significant challenges due to the complexity and vast diversity present in microbial communities. Metagenomic datasets typically contain a mixture of sequences from different bacterial, viral, and eukaryotic organisms (Lapidus et al., 2021). These sequences can be analyzed to identify genes, proteins, and infer potential functions, providing insights into the functional profiles of the gut microbiome (Sharpton, 2014). Extracting meaningful biological insights from these datasets requires appropriate analytical tools. When it comes to building custom pipelines for genomic data analysis, R Shiny has become a popular tool amongst bioinformaticians for building applications that make it possible for researchers without coding expertise to display and interact with biological data. Currently, there is an absence of integrated solutions that allow researchers to perform gene ontology-based functional microbiome analyses within a single platform. No pre-existing R Shiny tools enable the exploration of host and microbial abundances and their taxonomic profiles while linking these to gene ontology annotations. Consequently, researchers must employ multiple separate tools, resulting in fragmented workflows that take considerable time to set up. This lack of an integrated solution poses challenges for conducting holistic studies on 9 host-microbiome interactions, particularly in the context of diseases like colorectal cancer, where understanding the functional roles of microbial communities is essential. 1.2.2 Existing Applications for Exploring Metagenomic Data Despite the lack of integrated solutions, some applications have been developed to make certain aspects of microbiome analysis more accessible and interactive. Phyloseq (McMurdie & Holmes, 2012) is a widely used R shiny app that has tools for analysis and display of microbiome census data. It enables the production of graphical plots for ecological analyses. However, Phyloseq is primarily focused on phylogenetic analyses and lacks comprehensive functional annotation capabilities. It does not provide functionalities for displaying host gene expression data or performing gene ontology analyses, which limits its utility in studies aiming to understand host-microbiome interactions at the functional level. Animalcules (Zhao et al., 2020) is another R Shiny application designed for microbiome data analysis and visualization. It offers an interactive interface for exploring microbiome datasets, including functionalities for differential abundance analysis, diversity analysis, and visualization of microbiome community compositions. Animalcules does not offer functional insights into gene ontology analysis and requires that users preprocess their data from existing tools. While these applications have made certain aspects of microbiome analysis more accessible and interactive, they lack the ability to perform gene ontology based functional microbiome analysis. In addition to accessible tools for host-microbiome exploration, new methods that can lead to candidate biomarker discovery and new biological insights are required. 10 1.2.3 Application of Machine Learning Approaches in CRC Research Machine learning (ML) is a broad range of statistical techniques that enable computers to learn from data and fit models that can be used to make predictions, classify data, or to support decision-making (Davenport et al., 2019). Rather than being explicitly programmed to perform specific tasks, ML algorithms identify patterns from training data and can generalize to make predictions on unseen data. In biomedical research, ML excels at analyzing complex datasets to reveal insights that can improve diagnosis, prognosis, and inform treatment strategies. ML techniques have been applied to many aspects of CRC research, including early detection, classification of tumour subtypes, prediction of patient outcomes, and identification of candidate biomarkers (Alboaneen et al., 2023). 1.2.3.1 Historical use of ML for CRC Research ML algorithms have been utilized in CRC research to analyze many types of data, including genomic, transcriptomic, proteomic, and imaging data (Kourou et al., 2015). ML algorithms fall into different non-exclusive categories including unsupervised, supervised, deep learning and ensemble, each with their own strengths and weaknesses (Jordan & Mitchell, 2015). Supervised algorithms learn from labeled training data, where each input is associated with a known output label to predict labels on new unseen data. This includes approaches such as support vector machines (SVM) (Cortes et al., 1995), k-nearest neighbours (KNN) (Cover & Hart, 1967), decision trees (Quinlan, 1986), and Random Forests (Breiman, 2001). In unsupervised approaches, algorithms learn from data identifying patterns, data structures and relationships without any predefined outcome labels. Unsupervised techniques include clustering (e.g. hierarchical clustering (Lloyd, 1982) or k-means clustering (Johnson, 1967)), and dimensionality reduction techniques like principal component analysis (PCA) (Jolliffe, 2013). 11 Deep learning is another subset of machine learning which involves neural networks with multiple layers (called deep architectures) that can learn hierarchical representations of data (LeCun et al., 2015). This includes convolutional neural networks (CNNs) (O’Shea et al., 2015), recurrent neural networks (RNNs) (Sherstinsky, 2018). Lastly, ensemble approaches are another method which combine predictions from multiple models to improve overall performance, this includes methods like gradient boosting machines (Friedman, 2001), and Random Forests (Breiman, 2001), which are a primary topic of this thesis work. 1.2.3.2 Random Forest Models The Random Forest (RF) model is a supervised ensemble ML method developed by Breiman (2001) which constructs multiple decision trees during training. Each tree is built from a random subset of the data and features, and the final prediction is made by aggregating the outcomes (i.e., taking the majority vote) of all trees. This approach enhances the stability and accuracy of the model while reducing overfitting. RF models are particularly well-suited for genomic research due to their ability to handle large, complex datasets with many features (genes, microbial taxa, etc.) (Chen & Ishwaran, 2012). RFs are non-parametric, making no assumptions about data distribution, and are robust to outliers and noise typical of genomic data (Miniati et al., 2016). Historically, one of the most widely cited studies that makes use of RF models for colorectal cancer is the CMS stratification study titled “The consensus molecular subtypes of colorectal cancer” (Guinney et al., 2015). In this study, DNA microarray and RNA-sequencing data were used to develop a Random Forest classifier for colorectal cancer which classified it into four subtypes, now known as the consensus molecular subtypes (CMS). 12 This CMS classification is integrated into an R package, CMScaller (Eide et al., 2017), and provides a framework for understanding the molecular diversity of CRC by associating each sample with a subtype associated with specific molecular characteristics and clinical implications. Table 1.1 summarizes the observed pathophysiological differences amongst the CMS subtypes. Table 1.1 CMS Subtypes and their observed pathophysiological differences. CMS Subtype Molecular Characteristics Clinical Implications CMS 1 “MSI Immune” MSI High, BRAF & TGFBR2 mutations; Immune infiltration & activation Worse survival after relapse; better response to immunotherapy CMS 2 “Canonical” Chromosomal Instability, WNT/MYC activation; APC & TP53 mutations Good prognosis; responsive to conventional chemotherapy CMS 3 “Metabolic” Metabolic dysregulation, KRAS & APC mutations Variable prognosis; potential metabolic therapy targets CMS 4 “Mesenchymal” TGF-β activation, stromal invasion, EMT activation, angiogenesis Poor prognosis; high recurrence; prone to metastasis Note. This table is a derivative work based Guinney et al. 2015, with additional information from Inamura 2018 (Creative Commons Attribution License). Permission to create a derivative work from Guinney et al. 2015 was obtained from Springer Nature under license number 5896940821640. While the CMS classification offers valuable insights into the molecular heterogeneity of CRC, it has recently come under scrutiny for its lack of reproducible results and poor prognostic values (Eide et al., 2021). Multiple samples from the same CRC tumours frequently have different CMS subtype classifications (Mouillet-Richard et al., 2024). In addition, no CMS subtypes consistently correlate with LCC and RCC locations, while other studies have demonstrated that RCC and LCC exhibit distinct molecular and clinical features that impact 13 prognosis and treatment response (Lee et al., 2017). This discrepancy highlights a potential gap in the CMS subtype classification. Therefore, there is an additional need for research that explicitly focuses and accounts for tumour location to develop more precise classifiers and improve personalized treatment strategies for CRC patients. Moreover, while Guinney et al. (2015) utilized a RF classifier to develop the CMS subtypes, they provided little detail on the inner workings of their model. Specifically, the study treated the model as a “black box”, and did not extensively explain the feature importance scores or the specific genes that contributed most significantly to the classification. This lack of model interpretability makes it challenging to understand the biological drivers of each subtype and limits the ability to translate these findings in clinical practice. Therefore, new studies in the area must not only account for tumour location, but also emphasize the interpretability of the ML model used. One way to interpret RF models is to use feature importance, a type of scoring metric for each feature which can offer insights into which variables contribute most to the model predictions. The built-in feature importance scoring allows RF models to address some of the black-box issues by highlighting the most relevant features driving the predictions. This makes RF more interpretable compared to other ML methods such as Support Vector Machines (SVM) which relies on complex kernel functions, and deep learning which utilizes complex internal representations, limiting the ability to easily interpret how specific features influence final predictions (Breiman, 2001; Cortes et al., 1995; LeCun et al., 2015). However, even with these advantages, RF models, like many other ML models, can suffer from overfitting. Overfitting occurs when a model learns not only the underlying patterns in the training data but also the random noise specific to the training set, leading to poor 14 generalization on new, unseen data (Hastie et al., 2009). Overfitted models perform well on training data but fail to predict accurately on testing or real-world data. To mitigate overfitting and ensure model robustness, cross-validation techniques such as k-fold cross validation (as illustrated in Figure 1.2) can be applied. Cross-validation is a statistical method used to evaluate performance of a classifier by partitioning the data into multiple subsets (Hastie et al., 2009). In k-fold cross validation, the dataset is divided into k subsets (or folds). The model is trained on k-1 subsets and tested on the remaining one, and this process is repeated k times, with each subset serving as the test set once. Figure 1.2 A visualization of the flow of testing and training data in 5-fold cross validation. Note. This figure is a derivative work based on information from scikit-learn (2022) which is licensed under the Copyright BSD License 2007-2024, scikit-learn developers. This method also allows for the calculation of an estimate of the model’s performance on unseen data. Evaluating RF model performance typically employs six common statistical metrics: accuracy, precision, recall (sensitivity), specificity, F1 score, and the area under the 15 Receiver Operating Characteristic curve (ROC AUC). Table 1.2 summarizes these metrics and their definitions. Table 1.2 Common statistical metrics for evaluating performance of machine learning classifiers and their definitions. Metric Formula Description Accuracy 𝑇𝑃 + 𝑇𝑁 𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁 The percentage of overall correct classification. Precision 𝑇𝑃 𝑇𝑃 + 𝐹𝑃 The proportion of positive cases which were classified correctly out of all positive cases. Recall (Sensitivity) 𝑇𝑃 𝑇𝑃 + 𝐹𝑁 The proportion of correctly classified positive cases. Specificity 𝑇𝑁 𝑇𝑁 + 𝐹𝑃 The proportion of correctly classified true negatives in negative cases. F1 Score 2 × 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙 Conveys the balance between precision and recall. ROC AUC (Area Under the Receiver Operating Characteristic Curve) Calculated by taking the area under the receiver operating characteristic curve using the trapezoidal rule. An accuracy-like measurement for classifier output quality across all possible classification thresholds. Note. Where TP: True Positives; TN: True Negatives; FP: False Positives; FN: False Negatives. This table contains information derived from Pellegrino et al. 2021 with permission obtained under a Creative Commons Attribution 4.0 International License, with additional information added from Baratloo et al. 2015. In developing RF models, the data are typically partitioned into training, validation, and testing sets. Performance metrics (Table 1.2) are often recalculated on the testing and/or 16 validation set many times during model optimization, and then calculated a single time on the testing set for final performance reporting. 1.2.3.3 Features in Random Forests Improving the explainability of RF models fit to genomic data is crucial for translating the models into actionable biological insights. Feature selection plays a role in enhancing model interpretability by identifying the most relevant features (e.g., genes, proteins, microbes, or metabolites) that are significantly associated with patient outcomes, disease progression, or treatment responses. By focusing on the most important features to the RF model, it is possible to identify features that may have important biological mechanisms. Feature selection can be categorized into pre-hoc and post-hoc methods, each with unique advantages and challenges. Pre-hoc feature selection involves selecting or reducing features before model training, based on certain criteria, such as variance, correlation with the target variable, or domain knowledge. This approach can help in reducing dimensionality, minimizing noise, and improving model performance, however pre-hoc methods are prone to erroneously excluding important features (Saeys et al., 2007). In contrast, post-hoc feature selection involves evaluating the importance of features after the model has been trained. This method leverages the model’s inherent ability to assess feature importance, allowing for a more informed selection of relevant candidate biomarkers based on their contribution to the model’s predictive power. Unlike pre-hoc methods, post-hoc feature-selection methods can capture complex interactions and non-linear relationships between features and the target variable. Despite the prevalence of pre-hoc feature-selection techniques in machine-learning based genomics research, post-hoc methods, particularly in the form of feature-importance scoring in 17 RF models can be a powerful alternative for identifying important features without risking erroneously excluding features prior to model building. 1.2.3.4 Feature Importance Scoring in Random Forests RF models have the ability to provide feature-importance scores, indicating the contribution of each feature to the model’s predictive power. A popular choice for feature importance calculation in RF models is based on the Gini impurity, which was first pioneered as a measure of statistical dispersion to quantify income inequality by Corrado Gini (1912), but made popular for use in ML models by Breiman et al. 2001 & 2017. Gini impurity measures the likelihood of an incorrect classification of a new instance if it were randomly labeled according to the distribution of labels in the dataset. Gini impurity (Breiman et al. 2001, 2017; Homola 2017), at a node n is calculated by: IG(n)=1- 𝑖=1 𝐽 ∑ (𝑃 𝑖 )2 Where: IG(n): Gini Impurity at a node (n) J: Number of classification labels Pi: Proportion of samples belonging to class i This is first calculated for the root node, and then recursively for all subsequent child nodes. The further down the tree, the less the total weighted Gini Impurity, and the last layer of nodes has a Gini impurity of 0. The decrease in Gini impurity when a feature is used to split a node contributes to that feature’s importance. The importance nij of node j is calculated as: 18 nij= wj Cj-wleft(j)Cleft(j)–Wright(j)Cright(j) Where: wj: Weighted number of samples reaching node j Cj: Impurity value of node j leftj, rightj: Child node from split on node j This formula calculates the importance for each individual node in the tree. This must then be summarized to the level of each feature (i.e., gene or microbe in our dataset) in the tree, as multiple nodes may base their decision on information from the same feature. Individual feature importance can then be calculated by: 𝑓𝑖 𝑖 = ∑𝑗:𝑛𝑜𝑑𝑒 𝑗 𝑠𝑝𝑙𝑖𝑡𝑠 𝑜𝑛 𝑓𝑒𝑎𝑡𝑢𝑟𝑒 𝑖𝑛𝑖 𝑗 Σ𝑘ϵ 𝑎𝑙𝑙 𝑛𝑜𝑑𝑒𝑠 𝑛𝑖 𝑘 fii: The importance of feature i nij: The importance of node j nik: The importance of node k These values are then scaled to a value between 0 and 1 by dividing by the sum of all feature importance values: 𝑛𝑜𝑟𝑚𝑓𝑖 𝑖 = 𝑓𝑖 𝑖 Σ𝑗ϵ 𝑎𝑙𝑙 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 𝑓𝑖 𝑗 The final task is to calculate the feature’s importance over the entire Random Forest. This is calculated as the sum of the normalized features importance in each individual tree, divided by the total number of trees. 𝑅𝐹𝑓𝑖 𝑖 = Σ𝑗ϵ 𝑎𝑙𝑙 𝑡𝑟𝑒𝑒𝑠 𝑛𝑜𝑟𝑚𝑓𝑖 𝑗 𝑇 Where: RFfi: Importance of feature i calculated from all trees in the RF model 19 normfiij: Normalized feature importance for i in tree j T: Total number of trees This provides a normalized feature importance score between 0 and 1, indicating the relative importance of each feature across the entire forest. By identifying the most important features and quantifying their relative importance, where these features represent biological inputs (e.g., genes, microbes), researchers can gain insights into their biological relevance at predicting the target (outcome) variable (Breiman et al., 2001, 2017). 1.2.3.5 Permutation Importance Despite the inherent feature importance scores included with most RF model implementations, there is often a need for more robust approaches to assessing the significance of each feature in the model. Unlike base feature importance scores, permutation importance offers an unbiased estimate of the feature’s true contribution to the model’s predictive performance. The traditional approach to permutation importance for RF models, as included in many implementations such as scikit-learn, involves randomly shuffling the values of each feature and observing the effect on the model’s predictive performance (Pedregosa et al., 2011). This disrupts any existing relationships between the feature and the target variable, but also disrupts the relationships between features themselves. In cases where features may be correlated (such as genomics), this disruption can lead to inappropriate null values of importance scores, thus this type of importance is not suitable for such applications (Altmann et al., 2010). Rank-based permutation importance approaches are more suitable for cases of complex interactions and correlations between features (Altmann et al., 2010). Rather than shuffling individual feature values, it ranks the features by their importance scores and compares the observed feature at each rank with the distribution of importance scores obtained through 20 permutations. Existing rank-based permutation implementations tend to evaluate the importance of each feature individually, comparing the observed importance to feature-specific null-distributions. While this is effective in many contexts, this approach may overlook the interactions between correlated features, which is not ideal for genomic datasets. There remains a need for rank-based permutation methods that better account for the interdependencies between features and consider the shared importance of correlated features. In addition to determining feature importance, it is critical to understand how these features interact and contribute to the predictions of the model, in terms of directionality, magnitude, and on an individual and combined basis. 1.2.3.6 SHAP values SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) values have emerged as a powerful tool for enhancing the interpretability of ML models. Originating from game theory, SHAP values provide a framework for interpreting ML model output by assigning each feature an importance value for a particular prediction, reflecting both the magnitude and directionality of its effect. In the context of RF models, SHAP values provide global and local interpretability, allowing researchers to understand how each feature contributes to the prediction for individual samples or the sample set. Unlike traditional feature importance metrics, SHAP values offer more detailed insights into the specific influence of each feature which may facilitate the identification of candidate biomarkers in the context of genomics research. 1.2.3.7 Functional Analysis Understanding the functional roles of identified genes (or candidate biomarkers) is crucial for elucidating their biological significance. Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005) is a technique that can assess whether predefined sets of genes show 21 statistically significant differences between two biological states (e.g. tumour versus normal). This allows researchers to determine whether specific biological pathways or processes are overrepresented among the identified genes. Gene Ontology (GO) (Ashburner et al., 2000) is another technique for functional annotation, GO is a bioinformatics initiative for the creation of a standardized vocabulary to describe genes and gene products across different species. GO serves as a database resource that provides standardized annotations that classify gene function into three main categories: biological processes, cellular components, and molecular functions. Biological processes describe the pathways and larger processes made up of the activities of multiple gene products, such as cell cycle, signal transduction, or metabolism. Cellular components describe the locations relative to cellular structures in which a gene product performs a function, such as the nucleus, mitochondrion, or cell membrane. Molecular functions are the elemental activities of a gene product at the molecular level, such as binding or catalysis. By integrating GO and GSEA researchers can attribute biological function to a list of candidate biomarkers to attribute them with meaningful biological insights, assess worthiness for future study, and aid in the identification of therapeutic targets and the development of targeted intervention. 1.2.3.8 Biomarker Validation An ultimate goal of interpreting ML models in a cancer genomics setting is the identification of biomarkers which can inform diagnosis, prognosis, and treatment strategies. Biomarkers are biological indicators that can be objectively measured, such as genes, proteins, or microbes, that can influence specific disease states or outcomes (Strimbu & Tavel, 2010). By leveraging the feature importance scores from RF models, researchers can pinpoint the most influential features that contribute to the prediction of clinical end points (often called targets), 22 such as tumour location or patient prognosis. Biomarker identification can involve many steps, from identifying features of interest through feature selection, attributing statistical significance, comparisons or rankings with other features, biological validation through cross-referencing with existing literature, functional assays, and experimental studies. The ultimate test of the utility of a biomarker lies in its clinical validation during clinical trials, where its effectiveness and reliability are confirmed in studies on patient populations. 1.3 Overview and Structure of this Thesis Understanding the complex interplay between host genetics and the gut microbiome is crucial for advancing diagnosis, prognosis, and personalized treatment strategies in CRC. Despite significant advancements in genomic technologies and computational methods, challenges remain in effectively integrating and interpreting multi-omics data. The overarching goal of this thesis is to enhance the analysis and interpretation of genomic and metagenomic data within the context of CRC research by creating interactive applications for exploring results, integrating multi-omics datasets, developing code for efficient machine learning techniques in R, and facilitating biomarker discovery through functional analyses. As discussed above, while numerous studies have investigated the genomic and microbial aspects of CRC, many have not fully integrated host genomic data with microbiome profiles. Additionally, tumour location, an important factor influencing CRC biology and patient outcomes, was historically overlooked and is still often not adequately considered in existing analyses. Furthermore, there is a lack of user-friendly, integrated tools for exploring complex metagenomic and metatranscriptomic data related to host-microbe interactions in CRC. An opportunity to address these challenges is presented by our RNA-seq datasets which encompass both host gene expression and microbial profiles from CRC patient samples. By 23 developing interactive applications for data exploration, and code packages for uncovering candidate biomarkers via explainable RF models, this thesis seeks to facilitate the exploration of such complex datasets and uncover novel biomarkers associated with CRC tumour sidedness. The main aims of this thesis are to: Aim 1: Develop an interactive application for genomic and metagenomic data exploration, linking microbial profiles with gene ontology to enhance understanding of host-microbiome interactions in colorectal cancer; Aim 2: Apply Random Forest Models to identify candidate biomarkers that distinguish right- and left-sided colorectal cancer, and employ rank-based permutation testing to improve model interpretability by quantifying the contributions of individual features; Aim 3: Develop a code package to facilitate building efficient RF models in R for use in genomic biomarker discovery, while incorporating SHAP analysis and functional gene ontology assessment. This thesis is structured into five main chapters, each contributing to the objective of enhancing the analysis and interpretation of colorectal cancer genomic and metagenomic data through a combination of software development, discovery and hypothesis generation, and hypothesis testing. Chapter 2 focuses on software development, detailing the creation of the MetaFunc App, an interactive R Shiny application designed to facilitate the exploration of data generated by the MetaFunc pipeline. This application addresses the need for integrated tools that can analyze microbial abundances, taxonomic profiles, and link them to gene ontology annotations within a single platform. By streamlining the MetaFunc workflow, the MetaFunc App supports researchers in conducting comprehensive studies on host-microbiome interactions. 24 Chapter 3 involves discovery and hypothesis generation as well as hypothesis testing. By using RF machine learning models, this study aims to identify and explain the contributions of significant genomic and microbial candidate biomarkers that differentiate right-sided and left-sided CRC. This chapter also introduces a rank-based feature permutation test for post-hoc feature selection which is later integrated into the R package introduced in Chapter 4. Chapter 4 is dedicated to software development and hypothesis testing. This chapter introduces pyRforest, an R package that integrates Python’s optimized RF algorithms into the R environment. This code package provides users with an end-to-end workflow for building and analyzing RF models for genomic research. pyRforest features memory-efficient algorithms, advanced feature selection methods, SHAP analysis for model interpretability, and gene ontology analysis for functional assessment. Chapter 5 summarizes the findings, discusses their implications for genomic research and suggests directions for future work. This includes exploring additional machine learning techniques, integration with additional multi-omics datasets, and further validation of identified candidate biomarkers. 25 Chapter 2 The MetaFunc App: An R Shiny Application for Joint Exploration of Microbial Diversity, Function, and Host Gene Expression The MetaFunc App is featured in the following publication (Appendix A): Sulit, A. K., Kolisnik, T., Frizelle, F. A., Purcell, R., & Schmeier, S. (2023). MetaFunc: taxonomic and functional analyses of high throughput sequencing for microbiomes. Gut Microbiome, 4(4), 1-21. https://doi.org/10.1017/gmb.2022.12. Authors’ contributions to the above publication: Conceptualisation: A.K.S, R.P., S.S.; Formal analysis: A.K.S, F.A.F., R.P., S.S.; Funding Acquisition: R.P.; Investigation: A.K.S., R.P., S.S.; Methodology: A.K.S, T.K., R.P., S.S.; Software: A.K.S, T.K., S.S.; Supervision: R.P., S.S.; Validation: A.K.S, S.S.; Visualisation: A.K.S; Writing – original draft: A.K.S; Writing – review and editing: A.K.S, T.K., F.A.F, R.P., S.S. A.K.S. and S.S. developed and co-wrote the pipeline which ultimately led to MetaFunc, and were involved with the majority of the design. T.K. developed the shiny application that is integrated in the pipeline. A.K.S. wrote the manuscript with editorial input from T.K., S.S. and R.P. R.P. further contributed to the design of the pipeline. F.A.F. provided guidance about all clinical aspects of the manuscript. Author’s Note: While I hold a secondary authorship on the aforementioned manuscript, my contributions to the MetaFunc App are substantial and significant to the MetaFunc project, and include over 2000 lines of code. This chapter focuses on the development of the MetaFunc App, which I led, and presents novel, unpublished content that does not overlap with the Sulit et al. 2023 paper (Appendix A). I have full permissions from all MetaFunc project authors to include this chapter in my thesis work. Creative Commons Attribution License CC BY 4.0 International: https://creativecommons.org/licenses/by/4.0/. 26 https://doi.org/10.1017/gmb.2022.12 https://creativecommons.org/licenses/by/4.0/ 27 2.1 Introduction 2.1.1 Overview Metatranscriptomic RNA-sequencing has transformed the study of complex interactions between host organisms and their associated microbiomes and pathogens; yet, the vast scale and complexity of these datasets presents analytical challenges. Extracting meaningful insights from such data requires tools that can integrate microbial and host-gene functions to provide a deeper understanding of biological processes. MetaFunc (Sulit et al., 2023) (Appendix A) is a command line-implemented pipeline tool designed to facilitate this process, allowing functional annotation of microbial and host metagenomic and metatranscriptomic data by integrating the processes of protein prediction, functional annotation, and taxonomic classification (Appendix A). In order to ease the interpretation and exploration of MetaFunc results, we developed a user-friendly R Shiny application (the MetaFunc App). This Shiny app allows interactive exploration of host and microbial abundances, as well as taxonomic and gene ontology tables generated by MetaFunc. The MetaFunc App allows users to pinpoint specific Gene Ontology (GO) (Harris et al., 2004) terms and track the species linked with taxonomic annotations. It also features the ability to invert this process by keying in a species to obtain all associated GO terms. These features are particularly valuable for researchers aiming to understand the functional repertoire of a certain microbial species in the host environment, or compare functional profiles across different species. By linking GO terms to species, researchers can pinpoint specific biological processes within microbial species and track functional traits across them. Conversely, by linking species to GO terms, researchers can explore the functional contributions of individual species across broader phylogenetic contexts. 28 In this chapter, I present the motivation and development of MetaFunc, detail the methodologies used in the development and application of the MetaFunc App, and provide benchmarking results and a case study. Additionally, I discuss the significance of the MetaFunc App, compare it to existing applications, address its scope and limitations, and suggest future directions. 2.1.2 Development History and Specific Contributions In order to effectively analyze complex metatranscriptomic data, interactive data tools that can integrate multiple data types are crucial. One effective solution is the use of an R shiny application, a web-browser based framework that enables dynamic data displays. Visual inspection of genomic and metagenomic output is useful, including the ability to sort, search, and filter columns (e.g., counts of taxa or functional categories). Previous examples of similar applications that can rank, sort, and rearrange tabular data have proven useful in other contexts, such as the Pavian application (Breitwieser & Salzberg, 2020), which was designed for the exploration of microbial taxa in metagenomic or metabarcoding results. In addition, data exploration via Shiny negates the need for more complicated manipulation steps, for example, sorting or selecting taxa via R. 29 Figure 2.1. The MetaFunc Workflow. Note. Reproduced in part from Sulit et al. 2023 (includes Kolisnik) under the Creative Common Attribution License. Before detailing the design and behaviour of the MetaFunc App, I outline the MetaFunc pipeline itself. Briefly, the MetaFunc pipeline (Figure 2.1) begins by aligning reads to the human reference genome using the STAR aligner (Dobin et al., 2013). Human-mapped reads are used for host gene expression and functional analysis, while the remaining human unmapped reads are processed using the metagenomic classifier Kaiju (Menzel et al., 2016) for protein-based taxonomic classification. Kaiju classifies microbial sequences by translating them and matching them with protein sequences and associated accession numbers. These protein accession numbers are subsequently mapped to functional GO (Harris et al., 2004) terms via a custom database 30 (designated nrgo). The pipeline also calculates microbial species abundances per sample or condition and performs Spearman correlation analysis between host gene expression and microbial species abundances. The resulting data, including microbial abundances, functional annotations, and host-microbe interactions, are stored in an SQL database for efficient access and retrieval. Users can explore these results interactively with the MetaFunc App, which allows for the visualization of taxonomy and GO tables, facilitating deeper and more detailed investigation into the biological processes, molecular functions, and cellular components present in various species across different datasets and conditions. 2.1.3 Motivation and Research Questions The motivation for the development of the MetaFunc App was the need for a more intuitive way for the user to sort, search, and interact with the complex taxonomic and functional results generated by the MetaFunc pipeline. The output of this pipeline is more easily interpreted with interactive data displays. The design of the MetaFunc App was centered on fulfilling the following aims: 1. To optimize the accessibility and usability of the complex outputs of MetaFunc for users with varying levels of bioinformatics expertise. 2. To enable users to effectively explore and derive meaningful insights from host and microbial abundances, as well as taxonomic and gene ontology tables produced by the MetaFunc pipeline. 3. To minimise computational overhead and ensure the MetaFunc App remains accessible on platforms with varying compute power and scales to handle large datasets. 31 In the following section I discuss this development process, including the selection of R Shiny as the development framework, and the methods I implemented to fulfill the above aims. 2.2 Background 2.2.1 Shiny Applications R Shiny (Chang et al., 2024) was used to create an application to support data display and exploration with integration to the MetaFunc pipeline. Developed by Posit (formerly Rstudio), Shiny was chosen because of its versatility and accessibility, particularly for users without extensive computational backgrounds. It offers an interactive, user-friendly interface, which allows researchers to display and explore their data without needing advanced programming skills in R (R Core Team, 2023) or SQL (Gilmore, 2008). This enables users to focus on deriving meaningful insights from datasets rather than navigating complex computational tasks. Shiny addresses traditional limitations for data manipulation and analysis steps in R, which typically rely on static tools like ggplot2 and dplyr (Wickham 2016, 2019). Instead, Shiny allows for dynamic interactivity in a browser-based application interface in which users can interact with data in real time using features such as drop-down menus, sliders, and buttons, triggering actions such as recalculating results or refreshing data tables. Shiny integrates a server-side back end with a front-end user interface (UI) that enables interactive data manipulation and display. Additionally, Shiny’s server-side processing supports dynamic rendering, enabling the application to display only relevant portions of the data at any given time, improving both speed and performance especially when visualizing large data tables. Shiny apps can be hosted on local or on servers, allowing users to access them remotely without the need for reinstallation or complex configuration. 32 For more experienced users, Shiny offers a range of advanced functionalities and integration with other tools. For example, it can connect with tools such shinyjs (Attali, 2022), which allows users to store cookies for session management or enhance displays through JavaScript. Shiny also interfaces with R packages such as DBI (R Special Interest Group on Databases (R-SIG-DB) et al., 2024) to manage database connectivity with SQL databases. This versatility has contributed to Shiny’s widespread adoption in the development of bioinformatics applications as a powerful framework and useful tool for data analysis. 2.2.2 Shiny Apps in Bioinformatics Due to its versatility and usefulness, Shiny is used for a number of bioinformatics applications, including the following examples. shinyGEO (Dumas et al., 2016) imports data directly from the Gene Expression Omnibus (GEO) database to perform differential expression analysis and visualize results. This allows the user to perform complicated biological analyses without having an extensive computational or bioinformatic background. pcaExplorer (Ludt et al., 2022; Marini & Binder, 2019) helps users perform principal component analysis (PCA) on RNA-seq data to visualize and explain the variance within their dataset, helping to identify patterns and outliers. It provides an interactive graphical interface that allows users to adjust parameters, filter data points, and investigate how specific features contribute to overall variance. explain the variance in their data and identify patterns and outliers. This interactivity enhances data exploration by allowing researchers to assess the relationships between variables and samples, which is particularly useful in the early stages of RNA-seq data analysis. The popularity and utility of this platform is indicated by its number of citations: more than 200 over four years. 33 shinyGO (Ge et al., 2020) provides users with a tool for gene ontology (GO) enrichment analysis for plants and animals. It provides interactive bar charts of GO terms, enrichment plots for GO-term enrichment, and pathway diagrams that link to KEGG pathways. Gene ontology is useful for associating gene lists to underlying biological functions, cellular components, and molecular components. Again, the utility of the Shiny approach is indicated by the high citation rate of ShinyGO, almost 2,500 since its publication in 2020. These popular examples illustrate how Shiny apps have enhanced bioinformatics research by making complex analyses more accessible and interactive, and highlight the growing demand for such applications. 2.3 Methodologies 2.3.1 Workflow Integration and Application Launch The MetaFunc App is integrated with the rest of the MetaFunc pipeline, which allows the app to function as part of a cohesive workflow, rather than as a standalone tool. MetaFunc relies on snakemake (Köster & Rahmann, 2012), a workflow management system that automates the execution of data processing pipelines. Upon execution, the MetaFunc pipeline automatically generates an SQLite database and launches the MetaFunc App as the final step of workflow, or it can be initiated manually by running the app.R file. This setup accommodates new datasets as they become available; this means that once the pipeline is set up once, any new databases from subsequent pipeline runs will automatically become selectable for viewing within the app, provided they have been correctly placed in the proper directory. 34 2.3.2 Database Development & Design The SQL database which serves as the backbone of the MetaFunc App is constructed by the database build script that is integrated into the snakemake workflow. The MetaFunc pipeline results directory consists of a complex list of 290 output files of varying formats. This script locates and processes 8 .tsv files into a unified database structure with the following tables. Microbial Taxonomy Table: This table catalogs microbial % abundances indexed by taxonomy identifier (TaxID), and is the back end for the Microbiome-Abundances tab in the app. This table contains (1) TaxIDs which are used as a primary key to join with the mapping tables and (2) microbial abundance data for each sample on a per-Tax ID level; (3) the URL for the corresponding NCBI taxonomy browser page for the TaxID. Gene Ontology Tables: The Microbiome-Gene Ontology tab depends on two separate gene ontology tables, one for storing individual data and one for storing grouped analysis data. The GO individual table stores the following information: (1) The GO ID number; (2) The GO term description; (3) The GO category; (4) the URL of the corresponding AmiGO page for the GO term; (5) the microbial abundance data for each sample on a per-GO term level. The GO grouped table features the same columns, although microbial abundance data is stored for each experimental group. Mapped Gene Ontology and TaxID Tables: The GO to TaxIDs and TaxIDs to GO tabs depend on two tables, one for individual and one for grouped data. These tables link GO terms to microbial TaxIDs, enabling researchers to trace functional annotations to specific microbial taxa and vice versa. Each table stores the (1) GO ID, (2) the description of the GO term (3) abundances for each sample or group and (4) details on the TaxIDs associated with each GO term. The TaxIDs are mapped to additional information via the mapping tables. 35 Human Gene Expression Table: This table is the back-end of the Host-Abundances tab, and contains normalized host gene expression data as Transcripts Per Million (TPM) values. The table also includes a URL column that links each gene to its Ensembl entry, allowing users to explore additional gene-specific information. Mapping Tables: The database includes mapping tables to ensure proper association between taxonomic identifiers, GO identifiers, and sample metadata. The sample mapping table contains metadata for each sample, including (1) sample name, (2) sample type, and (3) internal sample id. The tax id mapping table maps the primary key of TaxIDs to fully parsed taxonomic name data such as Kingdom and lower taxa. The database is designed using standard database concepts such as referential integrity and normalization, to ensure efficient data management and retrieval. Primary keys such as Tax ID, GO ID and sample names are used to enforce the uniqueness of records, and facilitate rapid lookups and relational joins between tables. With the exception of the mapped GO to Tax ID tables which store some duplicate data to reduce querying time, the database schema adheres to the principles of third normal form, minimizing redundancy and promoting consistency. 2.3.3 Downloading and Installation of the MetaFunc App The MetaFunc App is distributed as part of the MetaFunc pipeline and can be downloaded directly from the pipeline’s GitLab repository at https://gitlab.com/schmeierlab/workflows/metafunc for example using `git clone https://gitlab.com/schmeierlab/workflows/metafunc.git`. The MetaFunc App files can then be found within the /metafunc-shiny/ subdirectory of the repository. This directory contains all necessary files required to run the app. 36 https://gitlab.com/schmeierlab/workflows/metafunc Users should follow the MetaFunc pipeline installation instructions at: https://metafunc.readthedocs.io/en/latest/usage.html. Additional MetaFunc App instructions can be found at https://metafunc.readthedocs.io/en/latest/rshiny.html#label-shiny. MetaFunc App dependencies are the following R package: shiny, shinyjs, DT, DBI, formattable, tidyverse, ggplot2, shinythemes, shinydashboard, data.table, dplyr, stringr, reshape2, RSQLite, openxlsx, and splitstackshape. Exact versions are specified in the app .yaml file and are automatically downloaded when the pipeline is installed. 2.3.4 Technical implementation of the MetaFunc App 2.3.4.1 Project Structure To ensure ease of maintenance and scalability, the MetaFunc App follows best practices for Shiny app development by separating server logic and UI definitions into distinct component scripts. This modular approach supports updates and troubleshooting. The database querying script and initialization variables are organized in separate files, which are integrated into the app through a master script. Databases generated by the MetaFunc pipeline are stored in a subdirectory that is read by the MetaFunc App to populate the internal database selection tool. 2.3.4.2 Front-End Development The MetaFunc App user-interface (UI) is structured into three main tabs and four subtabs: The Home tab is the initial entry point for the user, and features the database selection and load buttons, usage instructions, and an embedded workflow diagram of the MetaFunc pipeline. A left-justified panel houses the three main tabs and their respective subtabs. The current selected tab is highlighted as a visual indicator for navigation. The Microbiome tab is divided into four subtabs: “Abundances”, “Gene Ontology”, “GO to TaxIDs”, and “TaxIDs to GO”. These subtabs are structured to allow dynamic display of data 37 https://metafunc.readthedocs.io/en/latest/usage.html https://metafunc.readthedocs.io/en/latest/rshiny.html#label-shiny tables. Key functionalities include filtering, sorting, and exporting data to Excel, TSV, or CSV formats. The “Abundances” subtab displays a table of microbial abundance data for each sample or sample group. The “Gene Ontology” subtab displays GO data for each sample or sample group. subtabs with GO data feature a toggle to filter by one of three GO categories: Biological Process (BP), Cellular Component (CC), or Molecular Function (MF). The “GO to TaxIDs” and “TaxIDs to GO” subtabs effectively combine the “Abundances” and “Gene Ontology” subtabs, featuring linked tables of abundance and taxonomy data that update dynamically based on user selections. For example, in the “GO to TaxIDs” subtab, the user can select one or more rows of GO information in the top table, and the bottom table will dynamically update to show only the microbial taxa associated with those GO terms. Conversely, in the “TaxIDs to GO” tab users can select one or more rows of taxonomic information in the top table, and the bottom table will update to display the associated GO terms. A reset button is also included to allow users to clear filters and return the tables to their original state. The Host tab includes a single subtab titled “Abundances”, which provides users with data displaying and filtering capabilities for host count (TPM) data. Example screenshots of the user interface can be seen in the `screenshots` folder of the metafunc-shiny gitlab repository. 2.3.4.3 Back-End Development The back-end development of the MetaFunc App centers around integrating reactive programming and server-side processing to handle the exploration of large datasets. Custom server-side processing functions are responsible for fetching and preprocessing the data from the SQLite database, ensuring the data is formatted consistently for the data tables. Additionally, the app uses shinyjs (Attali, 2022) to store session data for the database path in cookies, allowing the 38 app to recall which database was previously loaded and automatically reloads it without requiring manual input from the user. Central to the data table display functionality is the DT R package (Xie et al., 2022), an R wrapper for the JavaScript library DataTables, which facilitates the creation of highly interactive tables. These tables allow the user to sort, search, filter (including setting ranges for numeric values), reorder columns, and paginate the data. Additionally, DT supports embedded HTML URLs, allowing provision of direct links to external resources, such as Ensembl (Martin et al., 2023), NCBI (Sayers et al., 2022), and AmiGO (Gene Ontology Consortium, 2024). The “GO to TaxIDs” and “TaxIDs to GO” tabs leverage reactive programming to create a dynamic linkage between two interconnected tables. In the “GO to TaxIDs” tab, the upper table displays GO terms, and based on the selected rows, the app triggers a reactive update that fetches and displays the corresponding taxonomic information in the lower table. This functionality is enabled by server-side processing, ensuring that user selections in the first table are instantly passed to the server, which retrieves the relevant taxonomic data and filters and re-renders the second table in real time. Similarly, the “Tax IDs to GO” tab implements the reverse operation. The MetaFunc App also incorporates CSS-based loading indicators to provide users with visual feedback while data is being processed, allowing users to be aware of background processes to avoid confusion during longer wait times. Lastly, the app features comprehensive export capabilities, allowing users to download entire tables, or specific rows in CSV, TSV, and XLSX formats. This functionality is achieved through the combination of DT, base R, and openxlsx (Schauberger & Walker, 2024) packages. Custom functions generate a timestamp and append it to the filename to provide users with a unique filename and time record for each export. 39 2.3.5 Speed and Scalability Optimization Ensuring scalability without compromising speed was one of the most challenging issues encountered during the development of the MetaFunc App. Initial versions of the app relied on large TSV files for data storage and retrieval, resulting in slow performance as the size of the datasets scaled. Transitioning to an SQLite database considerably improved performance in data joining and retrieval. To handle large datasets more efficiently, table partitioning and pagination were also introduced. Partitioning divided larger tables into smaller, manageable segments allowing only the data necessary for each tab of the app to be loaded, while pagination minimized the memory footprint and client-side rendering time by loading data incrementally. All calculations and database processing are server-side to offload the computational burden from the client. and dependent only on the hardware of the hosting computer. Benchmarking tests (in Results) were conducted to assess performance with datasets of varying size on a Windows 10 system with 32GB RAM and an AMD Ryzen 9 5900X 12-Core Processor, running at 3.7 GHz with 24 logical processors. The metrics represent an average of five measurements. Response time was recorded using Shiny’s showcase mode to track the delay between user input and app reaction. Memory usage was recorded using the profvis R package, and CPU utilization was tracked using the system task manager. CPU usage was monitored using Activity Monitor. Latency, the delay between input and output visualization was recorded using Firefox Developer Tools. Database load times were tracked using a custom wrapper function to measure response times for initial load time of the largest tables in the app, those of the GO to TaxIDs tab. It is important to note that subsequent load times of the tables will always be shorter than the initial load. 40 2.3.6 Data Availability Datasets used for the development, validation, and demonstration sets are available at: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA413956 and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA788974. 2.3.7 Testing and Validation We conducted several structured phases of testing to ensure the functionality and performance of the app. Testing was led by myself and was repeated for validation by the other members of the MetaFunc project. These tests focused on verifying data integrity, app functionality, and performance in the standalone version of the app. Following pipeline integration, additional integration and deployment testing was performed to ensure the app was installable, handled real-world datasets and interacted with other components of the pipeline. The first phase of testing was installation and deployment testing. In order to ensure the app could reliably be installed on multiple operating system environments, installation was tested on Windows 10 and 11, macOS Mojave and Catalina, and Ubuntu 20.4 LTS systems. The second phase of testing focused on data integrity checks, with particular attention to identifying errors in data linkage and display. In a worst-case scenario, mislinked or missing samples, GO, or taxonomy information could have led to inaccuracies in the display and skewed functional analysis. This phase involved verifying the accurate linkage and representation of sample ids, GO IDs and Tax IDs, and data values (e.g. abundances, TPM values) within the original TSV files, database, and the app display tables. The third phase of testing involved functional testing, which aimed to ensure that all user interface elements responded correctly to user inputs. Tests were conducted on every interactive feature, including buttons, row selection, table search, filters, sliders, ability to add or remove 41 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA413956 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA788974 columns. Filtering edge-cases with unusual input values (e.g. negative values, characters) were tested. Functional testing also included checking user experience under different browser environments (Chrome, Firefox) to ensure consistency in app behaviour across platforms. The final phase of testing focused on performance, specifically measuring how the core functionalities of the MetaFunc App, including database loading, data filtering, and rendering of interactive tables performed under varying database input sizes and user input loads. This was aimed at assessing the ability of the app to handle both typical and extreme data loads (e.g. 2000 samples, 100,000 genes) and inputs. We also tested its resilience by simulating high-intensity user interactions, such as rapidly switching tabs, applying multiple filters, and by pressing multiple buttons at intervals of less than 3 seconds. These tests aimed to identify any bottlenecks or points of failure, including attempts to make the app crash under extreme conditions. We also evaluated stability over extended periods of use (more than one hour) by sending commands at 10-minute intervals, observing CPU usage, memory consumption, and system temperature, to identify any signs of resource exhaustion. This phase provided a comprehensive understanding of the stability of the MetaFunc App during heavy usage scenarios. These structured testing phases allowed us to systematically identify and resolve issues, thereby allowing us to improve the user experience. Additionally, the MetaFunc App, as part of the broader MetaFunc project, has undergone further validation through peer review processes associated with its publication in Gut Microbiome (Sulit et al., 2023). 42 2.4 Results 2.4.1 Overview In this section, we demonstrate the functionality and performance of the MetaFunc App using a step-by-step walkthrough with screenshots. We also present our benchmarking results for performance evaluation of the MetaFunc App. Note on Image Quality: Due to the constraints of the document format, some screenshots included in this section may appear difficult to read. Higher resolution versions of the screenshot images in this chapter can be accessed in the `/screenshots` folder of the metafunc-shiny gitlab repository. 2.4.2 User Interface and Case Study Exploration 2.4.2.1 Home Page Analysis begins on the Home Page of the MetaFunc App, which provides usage instructions and access to the main functionalities. As shown in Figure 2.2, users are directed to select and load a database from the list of databases created by the MetaFunc pipeline. 43 Figure 2.2 The Home Page of the MetaFunc App. Note. This homepage showcases user navigation options and usage instructions. 44 2.4.2.2 Microbiome - Abundances After loading the database, the user is directed by the documentation to choose a tab in the left side panel, “Microbiome” or “Host” depending on what type of data they want to explore. After clicking the “Microbiome” tab, the user is greeted with 4 subtabs; the “Abundances” subtab, illustrated in Figure 2.3, indicates the % abundances for each taxon, and can be further filtered, sorted, and searched. In this example the data has been filtered on the ‘Genus’ column by the name ‘Escherichia’. In this instance, the user can then identify which Escherichia genus microbes are of high or low abundance, and in which samples or sample groups. By highlighting the rows, this data can be exported just for the selected microbes. 45 Figure 2.3 The “Microbial - Abundances” subtab of the MetaFunc App. Note. This tab allows users to display and filter microbial abundance data (% read count per taxonomy id) across various samples. The interactive table enables extensive filtering and selection options. The table also includes functionality for resetting filters and exporting the filtered data in .csv or Excel formats. 46 2.4.2.3 Microbiome - Gene Ontology If the user wishes to explore GO terms present in their dataset these are accessible via the Gene Ontology subtab, shown in Figure 2.4. Users can explore which GO terms have high or low % read counts across one or more samples or sample groups to help with hypothesis generation, or they can quantify abundances associated with predetermined GO terms for hypothesis testing. In this example, the table has been filtered on the ‘Description’ column by the partial search term ‘polyamine’, which results in four GO terms. Polyamines are a target of interest in our case study for their known association with tumour progression and cell growth (Sulit et al., 2023). 47 Figure 2.4 The “Gene Ontology” subtab of the MetaFunc App. Note. Displayed are GO terms % reads mapped to each GO identifier). 48 2.4.2.4 Microbiome - GO to TaxIDs In the next subtab, GO to TaxIDs, Figure 2.5, the top table mirrors the gene ontology table shown in the Microbiome - Gene Ontology subtab (Figure 2.4), and the bottom table mirrors the taxonomy table shown in the Microbiome - Abundances subtab (Figure 2.3), except it starts out with no data shown. When the user selects one or more GO identifiers of interest in the top table, the bottom table updates to display corresponding associated taxonomy information and taxonomy abundances. This allows the user to see all microbes associated with a particular GO biological process, cellular component, or molecular function. For example, in Figure 2.5, the GO term ‘polyamine biosynthetic process’ has been selected, and the bottom table has populated with the abundance information of 126 microbial taxa associated with this GO term, it has been sorted by highest average abundance to lowest, showing Escherichia coli has the highest average abundance of all microbes associated with this GO term. 49 Figure 2.5 The “GO to TaxIDs” subtab. Note. This tab features two interactive tables, where the selection of one or more GO terms in the top table updates the bottom table to display the microbial taxa associated with this GO term. 2.4.2.5 Microbiome - TaxIDs to GO Similarly, in the TaxIDs to GO tab, shown in Figure 2.6, the inverse operation is shown. In this example, the data has been filtered on the genus column for ‘Escherichia’, and the microbe Escherichia coli has been selected in the top microbial abundances table. The bottom table has rendered the list of 1268 GO biological processes associated with Escherichia coli. A filter has 50 subsequently been applied to the bottom GO table by filtering the description column for ‘polyamine’, showing four related GO terms. The % reads mapped to each GO term for the different polyamine related biological processes associated with Escherichia coli are shown. Figure 2.6 The “TaxIDs to GO” subtab. Note. This tab features two interactive tables, where the selection of one or more rows of microbial taxa in the top table updates the bottom table to reflect the GO terms associated with the selected taxa. 51 2.4.2.6 Single Sample vs Grouped Analysis All subtabs under the Microbiome tab of the MetaFunc App can display both individual and grouped analyses, depending on the research objectives. Individual analysis focuses on the exploration of single samples with no additional group context. Grouped analysis allows for the comparison of microbial taxa and their GO terms between predefined groups, such as disease states, treatment types, or molecular subtypes. By comparing Figures 2.7 and 2.8 below, we can see the differences between the display of individual and grouped analyses on the same table of microbial abundan