Copyright is owned by the Author of the thesis.  Permission is given for 
a copy to be downloaded by an individual for the purpose of research and 
private study only.  The thesis may not be reproduced elsewhere without 
the permission of the Author. 
 

Bioinformatics Tools and Explainable Machine Learning Approaches for
Colorectal Cancer Genomic and Metagenomic Data Analysis

A thesis presented in partial fulfilment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

School of Mathematical and Computational Sciences at Massey University, Albany Campus,

Auckland, New Zealand

Tyler Kolisnik

2024


Copyright:

© 2024 Tyler Kolisnik. This work is licensed under a Creative Commons Attribution 4.0
International License: https://creativecommons.org/licenses/by/4.0/.

Supervisory Team:

Dr. Olin K. Silander, PhD,
Senior Research Fellow, Liggins Institute, University of Auckland, Auckland, New

Zealand

Dr. Sebastian Schmeier, PhD,
Senior Lecturer in Bioinformatics, School of Natural and Computational Sciences,

Massey University, Auckland, New Zealand (former)

Dr. Steven Jones, PhD,
Head of Bioinformatics and Co-Director, Canada’s Michael Smith Genome Sciences

Centre at BC Cancer & Director, Bioinformatics Graduate Program, University of British
Columbia, Vancouver, BC, Canada

Dr. Adam N. H. Smith, PhD,
Senior Lecturer in Statistics, School of Mathematical and Computational Sciences,

Massey University, Auckland, New Zealand

Examination Panel:

Dr. Justin O’Sullivan, PhD,
Director, Liggins Institute, University of Auckland, Auckland, New Zealand

Dr. Xochitl Morgan, PhD,
Director, Harvard Chan Microbiome Analysis Core, Harvard University, USA

Dr. Jonathan Marshall, PhD,
Associate Professor in Statistics, School of Mathematical and Computational Sciences,

Massey University, Auckland, New Zealand

ii

https://creativecommons.org/licenses/by/4.0/


Abstract

Colorectal cancer (CRC) is a leading cause of cancer-related mortality worldwide and is

influenced by complex interactions between genetic factors and the microbiome. The advent of

high-throughput sequencing technologies has led to the generation of vast amounts of genomic

and metagenomic data, providing opportunities to uncover novel biological markers

(biomarkers) for CRC diagnosis, prognosis, and treatment. However, analyzing such datasets

poses significant computational and interpretive challenges, necessitating the development of

efficient and user-friendly bioinformatics tools.

Recent advancements in machine learning, particularly Random Forest (RF) models,

have shown promise in identifying predictive features in genomic data. Yet, existing

implementations often face limitations in scaling and interpretability, especially when applied to

large genomic studies. Additionally, integrating host genomic and microbial metagenomic data

remains a complex task due to the heterogeneity of data types and sophisticated analytical

methods required.

This thesis focuses on the development of computational tools and the application of

machine learning techniques to enhance the analysis of genomic and metagenomic data for

colorectal cancer research. Firstly, I present the MetaFunc App, an interactive Shiny application

designed to facilitate the exploration of data generated from the MetaFunc pipeline, an analysis

iii


pipeline for host and microbiome transcriptome data. The app provides a user-friendly interface

for visualizing and analyzing microbial taxonomic profiles alongside host gene expression data,

linking functional annotations to specific microbial taxa. This integration facilitates a deeper

understanding of microbial contributions to a designated target outcome, e.g. cancer versus

normal, and aids in identifying potential microbial biomarkers and eliciting their functions.

Secondly, I apply Random Forest machine learning models to identify genomic and

microbial biomarkers that differentiate right-sided colorectal cancer (RCC) from left-sided

colorectal cancer (LCC). Utilizing RNA-seq data for 58,677 coding and non-coding human

genes, and count data for 28,577 microbial taxa from 308 patient tumour samples, I develop

three models: a genes-only model, a microbes-only model, and a combined genes-and-microbes

model. The genes-only model achieves an accuracy of 90%, identifying significant genomic

features such as PRAC1, HOXB13, HOXC4, and HOXC6, which are associated with colorectal

cancer location and development. The microbes-only model achieves an accuracy of 70%,

identifying significant microbial features including Ruminococcus gnavus and Fusobacterium

nucleatum. The combined model achieves an accuracy of 87%, which may reflect an association

between microbial communities and host gene expression in CRC.

Finally, I present pyRforest, an R package that integrates Python’s scikit-learn

RandomForestClassifier into the R environment via the reticulate package to enhance

computational efficiency and memory management when using Random Forest models in R to

iv


analyze large genomic datasets. pyRforest also includes a novel rank-based permutation method

for calculating p-values of individual features for feature identification. Additionally, it includes

the capacity for calculating and plotting SHapley Additive exPlanations (SHAP) to interpret the

contribution of each feature to model predictions, enhancing the explainability of Random Forest

model results. The utility of pyRforest is demonstrated through a case study, where it is used to

identify candidate biomarkers and provide insights into biological significance.

Collectively, this work advances the understanding of genomic and microbial factors

influencing colorectal cancer and provides advanced computational tools that can be used in

other analyses. The MetaFunc App and pyRforest package facilitate the integration and

interpretation of complex genomic and metagenomic data, and represent valuable resources for

biomarker discovery. By addressing current challenges in data analysis and in bioinformatics

software development, this thesis lays the groundwork for future research in bioinformatics and

oncology, ultimately aiming to create and implement tools for improved genomic and

metagenomic cancer dataset analysis.

v


Acknowledgements

I would like to express my sincerest thanks to my doctoral supervisors, Dr. Olin Silander,

Dr. Adam Smith, Dr. Sebastian Schmeier, and Dr. Steven Jones, this thesis would not have been

possible without their guidance and support. I’m profoundly thankful to Dr. Jones, whose

supervision at the British Columbia Genome Sciences Centre allowed me to return home to

Canada to continue my studies at Massey from “abroad” during the uncertain times of the global

COVID-19 pandemic. I am especially thankful to all my fellow students and colleagues at

Massey University, especially Arielle Sulit, and my fellow students at the Genome Sciences

Centre, especially Faeze Keshavarz-Rahaghi. I would like to also thank Lucia Lam, who has

been an amazing mentor and has helped me so much with my career in bioinformatics. I would

like to thank my high school math and physics teacher, Roger Pavey, for everything he taught me

that set me up for success with my education. I would like to thank my amazing neighbours and

friends Samina and Aodhan for making us feel welcome in New Zealand.

My PhD studies would not have been possible without a PhD scholarship from the

Massey University School of Natural Sciences, for which I am profoundly grateful.

Thank you to my parents Teresa and Steven, and my brother Kelly for always providing

me with encouragement, and for always being supportive of my decisions. Thank you to my

partner Layton and my cat Fry for their love and support and for coming on the journey abroad to

New Zealand with me.

I would also like to express my deepest appreciation for the support of my friends and

extended family throughout my PhD. Thank you to my amazing friends Shyy, Stefanie, Rachel,

and Naomi. Thank you to my friend Jeff for making sure that not all of my time on computers is

spent coding. Thank you to my in-laws Caron and Blaine for always being there to help us

during tough times. A special thank you goes out to my cousins, Elise, Mercedes, Mike, Troy,

Jen, and Jade, my aunts and uncles, especially my aunts Rita and Rose, and my uncle Aaron, and

all those who sent me packages from Canada while I was studying abroad, it made me feel a lot

closer to home.

“Things are only impossible until they are not.” - Captain Jean-Luc Picard

“Tea, earl grey, hot.” - Also Captain Jean-Luc Picard

vi


List of Publications

The following published works were completed during my doctoral candidature.

The results presented in Chapter 2 of this thesis contributed to the following publication, which

is attached in Appendix A:

Sulit, A. K., Kolisnik, T., Frizelle, F. A., Purcell, R., & Schmeier, S. (2023). MetaFunc:

taxonomic and functional analyses of high throughput sequencing for microbiomes. Gut

Microbiome, 4(4), 1-21. https://doi.org/10.1017/gmb.2022.12.

The results presented in Chapter 3 of this thesis were published as:

Kolisnik, T., Sulit, A. K., Schmeier, S., Frizelle, F., Purcell, R., Smith, A., & Silander, O. (2023).

Identifying important microbial and genomic biomarkers for differentiating right- versus

left-sided colorectal cancer using random forest models. BMC Cancer, 23(647), 1-11.

https://doi.org/10.1186/s12885-023-10848-9.

The results presented in Chapter 4 of this thesis were published as:

Kolisnik, T., Keshavarz-Rahaghi, F., Purcell, R., Smith, A., & Silander, O. (2024). pyRforest: A

comprehensive R package for genomic data analysis featuring scikit-learn Random Forests

in R. Briefings in Functional Genomics, 2024(38), 1-9.

https://doi.org/10.1093/bfgp/elae038.

In addition, during my time studying at the BC Genome Sciences Centre I contributed to the

following publication:

Keshavarz-Rahaghi, F., Pleasance, E., Kolisnik, T., & Jones, S. J. M. (2022). A p53

transcriptional signature in primary and metastatic cancers derived using machine learning.

Frontiers in Genetics, 13, 987238. https://doi.org/10.3389/fgene.2022.987.

vii

https://doi.org/10.1017/gmb.2022.12
https://doi.org/10.1186/s12885-023-10848-9
https://doi.org/10.1093/bfgp/elae038
https://doi.org/10.1093/bfgp/elae038
https://doi.org/10.3389/fgene.2022.987238


List of Abbreviations

AI Artificial Intelligence

APC Adenomatous Polyposis Coli (Gene)

AUC Area Under the Curve

AUROC Area Under the Receiver Operator Characteristic Curve

BRAF B-Raf proto-oncogene (Gene)

CMS Consensus Molecular Subtypes

CPM Counts-Per-Million

CRC Colorectal Cancer

DE Differential Expression

DNA Deoxyribonucleic Acid

EMT Epithelial-Mesenchymal Transition

EGFR Epidermal Growth Factor Receptor (Gene)

FMT Fecal Microbiota Transplant

FPKM Fragments Per Kilobase of transcript per Million mapped reads

GB Gigabyte

GHz Gigahertz

HOXB13 Homeobox B13 (Gene)

HOXC4 Homeobox C4 (Gene)

HOXC6 Homeobox C6 (Gene)

HOXC8 Homeobox C8 (Gene)

KEGG Kyoto Encyclopedia of Genes and Genomes

KNN k-Nearest Neighbours

KRAS Kirsten RAt Sarcoma

LCC Left-sided Colorectal Cancer

ML Machine Learning

MLH1 MultiL Homolog 1 (Gene)

MSH1 MutS Homolog 1 (Gene)

MSH6 MutS Homolog 6 (Gene)

viii


MSI Microsatellite Instability

NCBI National Center for Biotechnology Information

NeSI New Zealand e-Science Infrastructure

PCA Principal Component Analysis

PIK3CA Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha (Gene)

PMS2 Postmeiotic Segregation Increased 2 (Gene)

PRAC1 Prostate Cancer Susceptibility Candidate 1 (Gene)

RCC Right-sided Colorectal Cancer

RF Random Forest

RNA Ribonucleic Acid

RNA-seq Ribonucleic Acid Sequencing

RNN Recurrent Neural Networks

SHAP SHapley Additive exPlanations

SVM Support Vector Machines

TGFBR2 Transforming Growth Factor Beta Receptor 2 (Gene)

TNM Tumour Node Metastasis (Cancer staging system)

TPM Transcripts-Per-Million

WES Whole-Exome Sequencing

WGS Whole-Genome Sequencing

ix


Table of Contents

Abstract iii
Acknowledgements vi
List of Publications vii
List of Abbreviations viii
Table of Contents x
List of Figures xiv
List of Tables xvi
Chapter 1 - Introduction 1

1.1 Colorectal Cancer 1
1.1.1 Epidemiology 1
1.1.2 Colorectal Cancer Nomenclature 1
1.1.3 CRC Sidedness and Clinical Implications 2
1.1.4 Molecular and Genetic Differences between RCC and LCC 3
1.1.5 CRC Treatments 4
1.1.6 The Gut Microbiome in CRC 5
1.1.7 Advances in Genomic Technologies 7

1.1.7.1 High-Throughput Sequencing Technologies 7
1.1.7.2 Challenges in Data Normalization and Scaling 8

1.2 Computational Approaches in Genomic Research 9
1.2.1 Use of Applications for Exploring Metagenomic Data 9
1.2.2 Existing Applications for Exploring Metagenomic Data 10
1.2.3 Application of Machine Learning Approaches in CRC Research 11

1.2.3.1 Historical use of ML for CRC Research 11
1.2.3.2 Random Forest Models 12
1.2.3.3 Features in Random Forests 17
1.2.3.4 Feature Importance Scoring in Random Forests 18
1.2.3.5 Permutation Importance 20
1.2.3.6 SHAP values 21
1.2.3.7 Functional Analysis 21
1.2.3.8 Biomarker Validation 22

1.3 Overview and Structure of this Thesis 23
Chapter 2 - The MetaFunc App: An R Shiny Application for Joint Exploration of
Microbial Diversity, Function, and Host Gene Expression 26

2.1 Introduction 28
2.1.1 Overview 28
2.1.2 Development History and Specific Contributions 29

x


2.1.3 Motivation and Research Questions 31
2.2 Background 32

2.2.1 Shiny Applications 32
2.2.2 Shiny Apps in Bioinformatics 33

2.3 Methodologies 34
2.3.1 Workflow Integration and Application Launch 34
2.3.2 Database Development & Design 34
2.3.3 Downloading and Installation of the MetaFunc App 36
2.3.4 Technical implementation of the MetaFunc App 37

2.3.4.1 Project Structure 37
2.3.4.2 Front-End Development 37
2.3.4.3 Back-End Development 38

2.3.5 Speed and Scalability Optimization 40
2.3.6 Data Availability 41
2.3.7 Testing and Validation 41

2.4 Results 43
2.4.1 Overview 43
2.4.2 User Interface and Case Study Exploration 43

2.4.2.1 Home Page 43
2.4.2.2 Microbiome - Abundances 45
2.4.2.3 Microbiome - Gene Ontology 47
2.4.2.4 Microbiome - GO to TaxIDs 49
2.4.2.5 Microbiome - TaxIDs to GO 50
2.4.2.6 Single Sample vs Grouped Analysis 52
2.4.2.7 Host - Abundances 53

2.4.3 Benchmarking Results 55
2.5 Discussion 56

2.5.1 Summary 56
2.5.2 Implications for Data Visualization and Analysis 57
2.5.3 Comparisons with Other Tools 57
2.5.4 Limitations and Future Directions 58

2.6 Conclusion 59
Chapter 3 - Identifying important microbial and genomic biomarkers for differentiating
right- versus left-sided colorectal cancer using Random Forest models 60

3.1 Abstract 62
3.2 Introduction 63

3.2.1 Background 63
3.3 Methods & Materials 65

xi


3.3.1 Patients, Samples and Processing 65
3.3.2 Random Forest Model Generation 65
3.3.3 Feature Importance and Retention 68
3.3.4 Differential Expression, Feature Side-Assignment and Heatmap Generation 69

3.4 Results 69
3.4.1 Random Forest Model Performance 69
3.4.2 Significant Model Features 72

3.5 Discussion 77
3.5.1 Patterns in RCC 79
3.5.2 Patterns in LCC 80

3.6 Conclusions 82
3.7 Declarations 82
3.8 Supplementary Material 84

Chapter 4 - pyRforest: A comprehensive R package for genomic data analysis featuring
scikit-learn Random Forests in R 97

4.1 Abstract 99
4.2 Introduction 100
4.3 Methods and Design 102

4.3.1 Integration of R and Python 102
4.3.2 Dataset Preparation and Hyperparameter Optimization 103
4.3.3 Model Tuning and Evaluation 103
4.3.4 Post-Hoc Feature Importance Significance Testing 104
4.3.5 SHAPley Additive exPlanations (SHAP) 106
4.3.6 Biological Interpretation 106

4.4 Comparisons with Alternative Implementations 107
4.5 Case Study 110

4.5.1 RNA-seq Data Analysis in Colorectal Cancer 110
4.5.2 Data Preparation and Model Training 110
4.5.3 Model Performance 111
4.5.4 Identifying Significantly Important Features 111
4.5.5 SHAP Analysis 113
4.5.6 clusterProfiler and g:Profiler Analysis 115

4.6 Discussion 117
4.7 Conclusion 121
4.8 Author Statements 121
4.9 Supplementary Material 123

Chapter 5 Conclusions 126
5.1. Overview 126

xii


5.2 Summary of Findings 126
5.2.1 Development of the MetaFunc App (Chapter 2) 126
5.2.2 Identifying Biomarkers for Colorectal Cancer Sidedness Using Random Forests
(Chapter 3) 127
5.2.3 Development of the pyRforest R package (Chapter 4) 128

5.3 Future Directions 129
5.3.1 Enhancing the MetaFunc App 129
5.3.2 Expanding the use of Machine Learning for Biomarker Identification 130
5.3.3 Further Development of pyRforest 130

5.4 Concluding Remarks 131
References 132
Appendix A 152

xiii


List of Figures

Figure 1.1 Summary of differences between left and right-side colorectal cancer. 3

Figure 1.2 A visualization of the flow of testing and training data in 5-fold cross
validation.

15

Figure 2.1 The MetaFunc Workflow. 30

Figure 2.2 The Home Page of the MetaFunc App. 44

Figure 2.3 The “Microbial - Abundances” subtab of the MetaFunc App. 46

Figure 2.4 The “Gene Ontology” subtab of the MetaFunc App. 48

Figure 2.5 The “GO to TaxIDs” subtab. 50

Figure 2.6 The “TaxIDs to GO” subtab. 51

Figure 2.7 Individual Microbial Abundances Table. 52

Figure 2.8 Grouped Microbial Abundances Table. 53

Figure 2.9 The “Host - Abundances” subtab of the MetaFunc App. 54

Figure 3.1 Receiver Operating Characteristic Curves (ROC) as calculated on the
held-out validation set.

71

Figure 3.2 Feature importance plots showing rank-based feature importance scores
of the permuted data and the scores of the real (unpermuted) data.

72

Figure 3.3 A heatmap of scaled gene expression values of the top-scoring genomic
features discovered by the genes-only RF model and clinical
characteristics.

76

Supplementary
Figure 3.1

Genes-Only model mean test F1 scores over 8 hyperparameters set to a
varying range of intervals.

84

xiv


xv

Supplementary
Figure 3.2

Microbes-only model mean test F1 scores over 8 hyperparameters
set to a varying range of intervals.

85

Supplementary
Figure 3.3

Genes-and-microbes model mean test F1 scores over eight
hyperparameters set to a varying range of intervals.

86

Supplementary
Figure 3.4

A heatmap of scaled gene expression values of the top-scoring
microbial features discovered by the microbes-only RF model and
clinical characteristics.

87

Supplementary
Figure 3.5

A heatmap of scaled genomic and microbial expression values of
the top-scoring features discovered by the genes-and-microbes RF
model and clinical characteristics.

88

Figure 4.1 Feature importance plot showing individual and mean rank-based
feature importance scores of the permuted data.

113

Figure 4.2 SHAPley Additive Explanation plots which illustrate the impact of
individual features on model prediction.

114

Figure 4.3 Gene Ontology enrichment analysis using clusterProfiler on the
significant features identified by pyRforest.

116

Figure 4.4 Enrichment analysis Manhattan plot from g:Profiler calculated on
the 83 significant features found in our CRC case study.

117


List of Tables

Table 1.1 CMS Subtypes and their observed pathophysiological differences. 13

Table 1.2 Common statistical metrics for evaluating performance of machine
learning classifiers and their definitions.

16

Table 2.1 Benchmarking results for the MetaFunc App. 55

Table 3.1 Patient Demographics & Cancer Characteristics. 66

Table 3.2 Random Forest Model Results. 70

Table 3.3 Top ranking features from the RF model trained on the genes-only
dataset (Left).

73

Table 3.4. Top ranking features with p-values less than 0.05 and their importance
scores discovered by our microbes-only model (Left).

74

Table 3.5 Top ranking features with p-values less than 0.05 and their importance
scores discovered by our genes-and-microbes model (Left).

75

Supplementary
Table 3.1

Full version of Table 3.4. 89

Supplementary
Table 3.2

Full version of Table 3.5. 92

Supplementary
Table 3.3

Patient Demographics & Cancer Characteristics Stratified by Cancer
Side.

96

Table 4.1 A benchmarking analysis comparing memory usage and run time of
Random Forest model fitting across different datasets.

107

Table 4.2 K-fold cross-validation scoring metrics for the RF model, stratified by
dataset splits (Validation, Testing) on the CRC dataset from our case
study.

111

Supplementary
Table 4.1

A comparative analysis of feature identification approaches. 123

xvi


Supplementary
Table 4.2

Feature and Capability Comparison of RandomForest Implementations
for Genomic Data Analysis.

123

Supplementary
Table 4.3

The top-scoring model hyperparameters as optimized by the pyRforest
`tune_and_train_rf_model` function which makes use of scikit-learn
`GridSearchCV`.

125

xvii


Chapter 1 Introduction

1.1 Colorectal Cancer

1.1.1 Epidemiology

Worldwide, colorectal cancer (CRC) is the second most prevalent cancer in women, and

the third most prevalent cancer in men, with approximately 2 million newly diagnosed cases

occurring every year (WCRF International, 2022). CRC represents 10% of all global cancers

(IARC, 2019). Age and family history are the greatest risk factors for CRC, with over 99% of

cases occurring in people over 40. CRC is known to arise from polyps, which are groups of cells

that begin as benign and then later become malignant (Ballinger & Anggiansah, 2007). Due to

the invasive nature of colonoscopy, which is the primary diagnostic method for colorectal cancer,

and the non-specific symptoms of CRC, many cases go undiagnosed until the disease has

progressed to an advanced stage.

1.1.2 Colorectal Cancer Nomenclature

Colorectal cancer refers to cancers that can arise in the colon, rectosigmoid junction or

rectum (IARC, 2019; Mayo Clinic, 2022). The term CRC has become the preferred term by

many clinicians and researchers because it encompasses all malignancies across the lower

gastrointestinal tract. However, historically, there has been a lack of standardization in the

nomenclature for CRC, which presents an issue when reconciling data from older studies with

modern research (Paschke et al., 2018). Some clinicians prefer to refer to specific anatomical

locations, leading to distinctions such as ‘colon cancer’ versus ‘rectal cancer’, while others

continue to use the more general terms ‘bowel cancer’ or ‘colorectal cancer’. Recently, further

distinctions within CRC have become an important topic in the research community. Terms such

1


as ‘proximal’ or ‘right-sided’ and ‘distal’ or ‘left-sided’ colorectal cancer are now becoming

more widely used to describe tumours based on their specific location within the colon. This is

due to emerging evidence that right-sided and left-sided CRCs exhibit distinct molecular

characteristics, clinical presentations, and responses to treatment (Lee et al., 2015).

1.1.3 CRC Sidedness and Clinical Implications

Cancers located in the proximal part of the colon, comprising cancers of the cecum,

ascending colon, hepatic flexure, and proximal part of the transverse colon are referred to as

right-sided colorectal cancers (RCC) (Mukund et al., 2020, Stintzing et al., 2017). Cancers

located in the distal portion of the colon (beyond the splenic flexure), descending colon, sigmoid

colon, and rectum, are referred to as left-sided colorectal cancers (LCC). These two portions of

the colon, while linked in sequence, play different roles in the digestive process and have

different developmental origins, with the right-sided colon developing from the embryonic

midgut, and the left-sided colon developing from the embryonic hindgut (Kostouros et al., 2020).

Figure 1.1 illustrates several differences in the pathology and treatment of right versus left-sided

colorectal cancer (Baran et al., 2018; Lee et al., 2015).

In terms of epidemiology, RCC is more common in women than in men, and typically

presents with larger tumours, higher tumour node metastasis (TNM) classification stage and

poorer overall survival outcomes (Liang et al., 2018; Yang et al., 2016; Zhao et al., 2020). In

contrast, LCC is more common overall than RCC, especially in men, and it is often identified at

a lower TNM (Tumour Node Metastasis) stage, with the tumours being smaller. LCCs generally

have a more favourable prognosis than RCCs.

2


Figure 1.1

Summary of differences between left and right-side colorectal cancer.

Note. This figure is a derivative work based on information from Lee et al. 2015, with additional
information added from Baran et al. 2018. Permission to create a derivative work was obtained
from Elsevier under license number 5896941135490.

1.1.4 Molecular and Genetic Differences between RCC and LCC

The differences between RCC and LCC are not merely anatomical but extend to their

molecular and genetic landscapes. RCC is characterized by a higher frequency of mutations

caused by defects in DNA repair called microsatellite instability (MSI), BRAF (B-Raf

proto-oncogene) mutations, and CpG island methylator phenotype positivity, all of which

contribute to its rapid progression and poorer prognosis (Baran et al., 2018; Lee et al., 2015).

RCC tumours tend to be more resistant to standard chemotherapy and respond better to

immunotherapy. In contrast, LCC tumours typically exhibit chromosomal instability and

3


mutations in genes like the proto-oncogenes APC (Adenomatous Polyposis Coli) and KRAS

(Kirsten RAt Sarcoma), which are more commonly targeted by conventional treatments like

chemotherapy and EGFR (Epidermal Growth Factor Receptor) inhibitors (Natsume et al., 2018).

Understanding the molecular genetic distinctions between RCC and LCC is critical as it impacts

prognosis and shapes the therapeutic strategies available to patients.

1.1.5 CRC Treatments

Treatment options for colorectal cancer can include surgery, chemotherapy,

immunotherapy, and less commonly, radiotherapy (Kuipers et al., 2015). Early-stage CRC can

often be treated with minimally invasive surgical techniques such as mucosal resection or

submucosal dissection during colonoscopy. For later stage cancers, more radical surgical options

such as partial colectomy are common.

Chemotherapy is a cornerstone of treatment for CRC, and may be delivered in a

neoadjuvant (before surgery) or adjuvant (after surgery) setting. Frequently used chemotherapy

drugs include 5-flurouracil, capacitabine, oxaliplatin, irinotecan, and targeted biologics such as

bevacizumab, aflibercept, cetuximab, and panitumumab, depending on the genetic profile of the

tumour (Kumar et al., 2023). Radiation therapy is not commonly used, except for low-stage

rectal cancers, due to the sensitivity of the bowel to radiation (Häfner & Debus, 2016).

Among the most promising treatments in colorectal cancer are the immunotherapeutic

options. A recent study on 12 patients with left-sided CRC showed that treatment with

dostarlamib, an anti-PD-1 monoclonal antibody, achieved complete clinical remission for 12

months after their last dose (Cercek et al., 2022). While the clinically complete response in this

trial is promising, there are two key caveats that these patients specifically had rectal

adenocarcinoma (a form of LCC) with a genetic mutations for loss of expression of the MLH1,

4


MSH1, MSH6, and PMS2 genes. This study placed an emphasis on anatomic location, the

microbiome, and gene expression in shaping treatment decisions.

1.1.6 The Gut Microbiome in CRC

The human gastrointestinal tract hosts a diverse community of microorganisms,

collectively known as the gut microbiome (Thursby & Juge, 2017). This microbial ecosystem

plays an important role in maintaining homeostasis by aiding digestion, synthesizing vitamins,

and modulating the immune system. The balance of microbial populations is crucial for health,

and disruptions can lead to various diseases. In the past twenty years, projects such as The

Human Microbiome Project (Human Microbiome Project Consortium, 2012) have started to

characterize the importance of microbial interactions and colon epithelial cells in the human gut.

However, the roles that specific microbes play in the development and progression of colorectal

cancer remain poorly understood (Zou et al., 2018). Therapeutic interventions targeting the gut

microbiome have shown significant success in other conditions; for instance, fecal microbiota

transplants (FMT) are now a standard of care for recurrent Clostridium difficile infections, with

cure rates of up to 90% (Rohlke & Stollman, 2012). This success raises the possibility of

exploring similar microbiome-targeted therapies for CRC. Despite this success, stool transplants

have not been extensively trialed for treatment of CRC, although some early studies such as

those conducted in mice are showing promising results (Yu et al., 2023). In this study, FMT

inhibited CRC progression in mice by reversing intestinal microbial dysbiosis, reducing tumour

size and number, and significantly prolonging survival rates when compared to the group that did

not receive FMT. The study also noted an increase in infiltration of CD8+ T cells, and

inflammatory cytokines, suggesting an enhanced anti-cancer immune response in the FMT

group.

5


The composition of the gut microbiome is unique to every individual, which presents a

significant challenge when studying microbe-gene interactions (Gilbert et al., 2018). In addition

to cancer, numerous other diseases, such as Crohn’s disease, irritable bowel syndrome (IBS),

obesity and autism, have been suggested to have links to the gut microbiome; however,

therapeutic options targeting the microbiome remain largely unsuccessful at treating these

diseases, and the specific microbes involved and their roles in influencing gene expression

remain somewhat of a mystery (Gilbert et al., 2018). In colorectal cancer, one of the leading

theories postulates that a dysbiosis of the gut microbiome results in inflammation which then

leads to tumour initiation, progression, and proliferation (Kostic et al., 2013).

Inflammation is considered a hallmark of cancer development (Hanahan & Weinberg,

2011), and microbes have been shown to produce substances that induce cellular inflammation,

contributing to tumourigenesis (Brennan et al., 2016; Grivennikov et al., 2010). In CRC, the gut

microbiome plays a significant role by modulating inflammation, epithelial cell proliferation, and

apoptosis (Gao et al., 2015). Certain microbial species have been implicated in CRC

development through their ability to produce pro-inflammatory and genotoxic substances (Kostic

et al., 2013).

One such microbe is Fusobacterium nucleatum, an anaerobic Gram-negative bacterium

found in higher abundance in CRC tissues compared to normal tissues (Kostic et al., 2013; Wang

et al., 2022). F. nucleatum has been shown to promote carcinogenesis by adhering to and

invading epithelial cells via FadA adhesin to E-cadherin binding and activation of β-catenin

signalling pathways which promote inflammatory responses (Rubinstein et al., 2013).

Additionally, certain strains of Escherichia coli can produce colibactin, a genotoxin that induces

6


DNA double-strand breaks, potentially promoting tumour initiation (Rosendahl Huber et al.,

2024).

Ruminococcus and Akkermansia species have also been associated with CRC. Both of

these bacterial species are involved in the degradation of mucin and can disrupt the mucus layer,

potentially facilitating contact between luminal contents and the epithelium, thereby promoting

inflammation and subsequent carcinogenesis (Coleman et al., 2021).

Identifying specific microbes associated with CRC can aid in early diagnosis and risk

assessment (Zackular et al., 2014). Exploring microbial biomarkers enhances our understanding

of disease mechanisms and opens up possibilities for microbiome-targeted interventions, such as

dietary modifications, probiotics, prebiotics, antibiotics, and fecal transplants. There remains a

significant gap in our understanding of how complex microbial communities interact with host

factors to influence CRC development and progression (Brennan et al., 2016). Current research

often focuses on individual bacterial species or specific pathways, but the synergistic effects of

the microbiome as a whole and its interplay with gene expression have not been fully elucidated,

especially while also considering the implications of right-side versus left-side CRC tumour

location. This demonstrates the growing recognition of the importance of integrating host

genomic data with not only tumour location, but also microbiome profiles to understand the

complex interactions between the tumour, host, and microbiome (Garza et al., 2020).

1.1.7 Advances in Genomic Technologies

1.1.7.1 High-Throughput Sequencing Technologies

The advent of next-generation sequencing (NGS) technologies has revolutionized

genomic research by enabling rapid and cost-effective large-scale sequencing of DNA and RNA

(Goodwin et al., 2016). Techniques such as whole-genome sequencing (WGS), whole-exome

7


sequencing (WES), and RNA sequencing (RNA-seq) have become indispensable tools in cancer

genomics, facilitating the identification of genetic mutations, gene expression profiles, and

alternative splicing events (Metzker, 2010).

Metagenomic sequencing allows for characterization of microbial communities,

providing insights into microbial diversity, function, and interactions with the host without the

use of sample cultures (Sharpton, 2014).

1.1.7.2 Challenges in Data Normalization and Scaling

The generation of large-scale genomic and metagenomic datasets presents significant

analytical challenges. These datasets often contain tens of thousands of features (e.g. genes,

microbial taxa) and require computational methods that can manage their complexity while

minimizing potential sources of error. Data normalization is a critical step to adjust for technical

variation such as sequencing depth, gene length, and library size, ensuring that comparisons

reflect true biological differences (Conesa et al., 2016).

For RNA-seq data, normalization approaches like transcripts per million (TPM) and

fragments per kilobase of transcript per million mapped reads (FPKM) are commonly used to

account for sequencing depth and gene length (Conesa et al., 2016). TPM normalization

facilitates comparison of gene expression levels across samples by scaling for both the length of

the gene and the total number of reads across samples, thus providing a consistent framework for

downstream analysis.

In metagenomic analyses, normalization techniques must account for differences in

sequencing effort, sample composition, and the inherently uneven distribution of microbial taxa.

As used in Chapter 2, microbial abundances can be expressed as percentages, calculated by

dividing the count of each taxon by the total read counts mapped to all taxa in a sample and

8


multiplying by 100 (Sulit et al., 2023). This method provides a straightforward interpretation of

the proportion of each microbial taxon within a sample. In Chapters 3 and 4, we utilize counts

per million (CPM) normalization which adjusts for differences in library sizes by scaling raw

counts to a common scale based on total counts, enabling comparison of relative abundances

across samples, and ensuring consistency of scaling with the TPM normalized genomic data.

1.2 Computational Approaches in Genomic Research

1.2.1 Use of Applications for Exploring Metagenomic Data

The analysis of metagenomic data presents significant challenges due to the complexity

and vast diversity present in microbial communities. Metagenomic datasets typically contain a

mixture of sequences from different bacterial, viral, and eukaryotic organisms (Lapidus et al.,

2021). These sequences can be analyzed to identify genes, proteins, and infer potential functions,

providing insights into the functional profiles of the gut microbiome (Sharpton, 2014). Extracting

meaningful biological insights from these datasets requires appropriate analytical tools. When it

comes to building custom pipelines for genomic data analysis, R Shiny has become a popular

tool amongst bioinformaticians for building applications that make it possible for researchers

without coding expertise to display and interact with biological data.

Currently, there is an absence of integrated solutions that allow researchers to perform

gene ontology-based functional microbiome analyses within a single platform. No pre-existing R

Shiny tools enable the exploration of host and microbial abundances and their taxonomic profiles

while linking these to gene ontology annotations. Consequently, researchers must employ

multiple separate tools, resulting in fragmented workflows that take considerable time to set up.

This lack of an integrated solution poses challenges for conducting holistic studies on

9


host-microbiome interactions, particularly in the context of diseases like colorectal cancer, where

understanding the functional roles of microbial communities is essential.

1.2.2 Existing Applications for Exploring Metagenomic Data

Despite the lack of integrated solutions, some applications have been developed to make

certain aspects of microbiome analysis more accessible and interactive. Phyloseq (McMurdie &

Holmes, 2012) is a widely used R shiny app that has tools for analysis and display of

microbiome census data. It enables the production of graphical plots for ecological analyses.

However, Phyloseq is primarily focused on phylogenetic analyses and lacks comprehensive

functional annotation capabilities. It does not provide functionalities for displaying host gene

expression data or performing gene ontology analyses, which limits its utility in studies aiming

to understand host-microbiome interactions at the functional level.

Animalcules (Zhao et al., 2020) is another R Shiny application designed for microbiome

data analysis and visualization. It offers an interactive interface for exploring microbiome

datasets, including functionalities for differential abundance analysis, diversity analysis, and

visualization of microbiome community compositions. Animalcules does not offer functional

insights into gene ontology analysis and requires that users preprocess their data from existing

tools.

While these applications have made certain aspects of microbiome analysis more

accessible and interactive, they lack the ability to perform gene ontology based functional

microbiome analysis. In addition to accessible tools for host-microbiome exploration, new

methods that can lead to candidate biomarker discovery and new biological insights are required.

10


1.2.3 Application of Machine Learning Approaches in CRC Research

Machine learning (ML) is a broad range of statistical techniques that enable computers to

learn from data and fit models that can be used to make predictions, classify data, or to support

decision-making (Davenport et al., 2019). Rather than being explicitly programmed to perform

specific tasks, ML algorithms identify patterns from training data and can generalize to make

predictions on unseen data. In biomedical research, ML excels at analyzing complex datasets to

reveal insights that can improve diagnosis, prognosis, and inform treatment strategies. ML

techniques have been applied to many aspects of CRC research, including early detection,

classification of tumour subtypes, prediction of patient outcomes, and identification of candidate

biomarkers (Alboaneen et al., 2023).

 1.2.3.1 Historical use of ML for CRC Research

ML algorithms have been utilized in CRC research to analyze many types of data,

including genomic, transcriptomic, proteomic, and imaging data (Kourou et al., 2015). ML

algorithms fall into different non-exclusive categories including unsupervised, supervised, deep

learning and ensemble, each with their own strengths and weaknesses (Jordan & Mitchell, 2015).

Supervised algorithms learn from labeled training data, where each input is associated

with a known output label to predict labels on new unseen data. This includes approaches such as

support vector machines (SVM) (Cortes et al., 1995), k-nearest neighbours (KNN) (Cover &

Hart, 1967), decision trees (Quinlan, 1986), and Random Forests (Breiman, 2001). In

unsupervised approaches, algorithms learn from data identifying patterns, data structures and

relationships without any predefined outcome labels. Unsupervised techniques include clustering

(e.g. hierarchical clustering (Lloyd, 1982) or k-means clustering (Johnson, 1967)), and

dimensionality reduction techniques like principal component analysis (PCA) (Jolliffe, 2013).

11


Deep learning is another subset of machine learning which involves neural networks with

multiple layers (called deep architectures) that can learn hierarchical representations of data

(LeCun et al., 2015). This includes convolutional neural networks (CNNs) (O’Shea et al., 2015),

recurrent neural networks (RNNs) (Sherstinsky, 2018). Lastly, ensemble approaches are another

method which combine predictions from multiple models to improve overall performance, this

includes methods like gradient boosting machines (Friedman, 2001), and Random Forests

(Breiman, 2001), which are a primary topic of this thesis work.

 1.2.3.2 Random Forest Models

The Random Forest (RF) model is a supervised ensemble ML method developed by

Breiman (2001) which constructs multiple decision trees during training. Each tree is built from

a random subset of the data and features, and the final prediction is made by aggregating the

outcomes (i.e., taking the majority vote) of all trees. This approach enhances the stability and

accuracy of the model while reducing overfitting. RF models are particularly well-suited for

genomic research due to their ability to handle large, complex datasets with many features

(genes, microbial taxa, etc.) (Chen & Ishwaran, 2012). RFs are non-parametric, making no

assumptions about data distribution, and are robust to outliers and noise typical of genomic data

(Miniati et al., 2016).

Historically, one of the most widely cited studies that makes use of RF models for

colorectal cancer is the CMS stratification study titled “The consensus molecular subtypes of

colorectal cancer” (Guinney et al., 2015). In this study, DNA microarray and RNA-sequencing

data were used to develop a Random Forest classifier for colorectal cancer which classified it

into four subtypes, now known as the consensus molecular subtypes (CMS).

12


This CMS classification is integrated into an R package, CMScaller (Eide et al., 2017),

and provides a framework for understanding the molecular diversity of CRC by associating each

sample with a subtype associated with specific molecular characteristics and clinical

implications. Table 1.1 summarizes the observed pathophysiological differences amongst the

CMS subtypes.

Table 1.1

CMS Subtypes and their observed pathophysiological differences.

CMS Subtype Molecular Characteristics Clinical Implications

CMS 1
“MSI Immune”

MSI High, BRAF &
TGFBR2 mutations; Immune
infiltration & activation

Worse survival after relapse;
better response to
immunotherapy

CMS 2
“Canonical”

Chromosomal Instability,
WNT/MYC activation; APC
& TP53 mutations

Good prognosis; responsive
to conventional chemotherapy

CMS 3
“Metabolic”

Metabolic dysregulation,
KRAS & APC mutations

Variable prognosis; potential
metabolic therapy targets

CMS 4
“Mesenchymal”

TGF-β activation, stromal
invasion, EMT activation,
angiogenesis

Poor prognosis; high
recurrence; prone to
metastasis

Note. This table is a derivative work based Guinney et al. 2015, with additional information from
Inamura 2018 (Creative Commons Attribution License). Permission to create a derivative work
from Guinney et al. 2015 was obtained from Springer Nature under license number
5896940821640.

While the CMS classification offers valuable insights into the molecular heterogeneity of

CRC, it has recently come under scrutiny for its lack of reproducible results and poor prognostic

values (Eide et al., 2021). Multiple samples from the same CRC tumours frequently have

different CMS subtype classifications (Mouillet-Richard et al., 2024). In addition, no CMS

subtypes consistently correlate with LCC and RCC locations, while other studies have

demonstrated that RCC and LCC exhibit distinct molecular and clinical features that impact

13


prognosis and treatment response (Lee et al., 2017). This discrepancy highlights a potential gap

in the CMS subtype classification. Therefore, there is an additional need for research that

explicitly focuses and accounts for tumour location to develop more precise classifiers and

improve personalized treatment strategies for CRC patients.

Moreover, while Guinney et al. (2015) utilized a RF classifier to develop the CMS

subtypes, they provided little detail on the inner workings of their model. Specifically, the study

treated the model as a “black box”, and did not extensively explain the feature importance scores

or the specific genes that contributed most significantly to the classification. This lack of model

interpretability makes it challenging to understand the biological drivers of each subtype and

limits the ability to translate these findings in clinical practice. Therefore, new studies in the area

must not only account for tumour location, but also emphasize the interpretability of the ML

model used.

One way to interpret RF models is to use feature importance, a type of scoring metric for

each feature which can offer insights into which variables contribute most to the model

predictions. The built-in feature importance scoring allows RF models to address some of the

black-box issues by highlighting the most relevant features driving the predictions. This makes

RF more interpretable compared to other ML methods such as Support Vector Machines (SVM)

which relies on complex kernel functions, and deep learning which utilizes complex internal

representations, limiting the ability to easily interpret how specific features influence final

predictions (Breiman, 2001; Cortes et al., 1995; LeCun et al., 2015).

However, even with these advantages, RF models, like many other ML models, can

suffer from overfitting. Overfitting occurs when a model learns not only the underlying patterns

in the training data but also the random noise specific to the training set, leading to poor

14


generalization on new, unseen data (Hastie et al., 2009). Overfitted models perform well on

training data but fail to predict accurately on testing or real-world data.

To mitigate overfitting and ensure model robustness, cross-validation techniques such as

k-fold cross validation (as illustrated in Figure 1.2) can be applied. Cross-validation is a

statistical method used to evaluate performance of a classifier by partitioning the data into

multiple subsets (Hastie et al., 2009). In k-fold cross validation, the dataset is divided into k

subsets (or folds). The model is trained on k-1 subsets and tested on the remaining one, and this

process is repeated k times, with each subset serving as the test set once.

Figure 1.2

A visualization of the flow of testing and training data in 5-fold cross validation.

Note. This figure is a derivative work based on information from scikit-learn (2022) which is
licensed under the Copyright BSD License 2007-2024, scikit-learn developers.

This method also allows for the calculation of an estimate of the model’s performance on

unseen data. Evaluating RF model performance typically employs six common statistical

metrics: accuracy, precision, recall (sensitivity), specificity, F1 score, and the area under the

15


Receiver Operating Characteristic curve (ROC AUC). Table 1.2 summarizes these metrics and

their definitions.

Table 1.2

Common statistical metrics for evaluating performance of machine learning classifiers and their
definitions.

Metric Formula Description

Accuracy  𝑇𝑃 + 𝑇𝑁
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁

The percentage of overall
correct classification.

Precision  𝑇𝑃 
𝑇𝑃 + 𝐹𝑃

The proportion of positive
cases which were classified
correctly out of all positive

cases.

Recall (Sensitivity) 𝑇𝑃 
𝑇𝑃 + 𝐹𝑁

The proportion of correctly
classified positive cases.

Specificity 𝑇𝑁 
𝑇𝑁 + 𝐹𝑃

The proportion of correctly
classified true negatives in

negative cases.

F1 Score 2 × 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙 
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙

Conveys the balance between
precision and recall.

ROC AUC (Area Under the
Receiver Operating

Characteristic Curve)

Calculated by taking the area
under the receiver operating
characteristic curve using the

trapezoidal rule.

An accuracy-like
measurement for classifier

output quality across all
possible classification

thresholds.

 Note. Where TP: True Positives; TN: True Negatives; FP: False Positives; FN: False Negatives.
This table contains information derived from Pellegrino et al. 2021 with permission obtained
under a Creative Commons Attribution 4.0 International License, with additional information
added from Baratloo et al. 2015.
 

In developing RF models, the data are typically partitioned into training, validation, and

testing sets. Performance metrics (Table 1.2) are often recalculated on the testing and/or

16


validation set many times during model optimization, and then calculated a single time on the

testing set for final performance reporting.

 1.2.3.3 Features in Random Forests

Improving the explainability of RF models fit to genomic data is crucial for translating

the models into actionable biological insights. Feature selection plays a role in enhancing model

interpretability by identifying the most relevant features (e.g., genes, proteins, microbes, or

metabolites) that are significantly associated with patient outcomes, disease progression, or

treatment responses. By focusing on the most important features to the RF model, it is possible to

identify features that may have important biological mechanisms.

Feature selection can be categorized into pre-hoc and post-hoc methods, each with unique

advantages and challenges. Pre-hoc feature selection involves selecting or reducing features

before model training, based on certain criteria, such as variance, correlation with the target

variable, or domain knowledge. This approach can help in reducing dimensionality, minimizing

noise, and improving model performance, however pre-hoc methods are prone to erroneously

excluding important features (Saeys et al., 2007).

In contrast, post-hoc feature selection involves evaluating the importance of features after

the model has been trained. This method leverages the model’s inherent ability to assess feature

importance, allowing for a more informed selection of relevant candidate biomarkers based on

their contribution to the model’s predictive power. Unlike pre-hoc methods, post-hoc

feature-selection methods can capture complex interactions and non-linear relationships between

features and the target variable.

Despite the prevalence of pre-hoc feature-selection techniques in machine-learning based

genomics research, post-hoc methods, particularly in the form of feature-importance scoring in

17


RF models can be a powerful alternative for identifying important features without risking

erroneously excluding features prior to model building.

 1.2.3.4 Feature Importance Scoring in Random Forests

RF models have the ability to provide feature-importance scores, indicating the

contribution of each feature to the model’s predictive power. A popular choice for feature

importance calculation in RF models is based on the Gini impurity, which was first pioneered as

a measure of statistical dispersion to quantify income inequality by Corrado Gini (1912), but

made popular for use in ML models by Breiman et al. 2001 & 2017. Gini impurity measures the

likelihood of an incorrect classification of a new instance if it were randomly labeled according

to the distribution of labels in the dataset.

Gini impurity (Breiman et al. 2001, 2017; Homola 2017), at a node n is calculated by:

IG(n)=1-
𝑖=1

𝐽

∑ (𝑃
𝑖
 )2

Where:

IG(n): Gini Impurity at a node (n)

J: Number of classification labels

Pi: Proportion of samples belonging to class i

This is first calculated for the root node, and then recursively for all subsequent child

nodes. The further down the tree, the less the total weighted Gini Impurity, and the last layer of

nodes has a Gini impurity of 0.

The decrease in Gini impurity when a feature is used to split a node contributes to that feature’s

importance. The importance nij of node j is calculated as:

18


nij= wj Cj-wleft(j)Cleft(j)–Wright(j)Cright(j)

Where:

wj: Weighted number of samples reaching node j

Cj: Impurity value of node j

leftj, rightj: Child node from split on node j

This formula calculates the importance for each individual node in the tree. This must

then be summarized to the level of each feature (i.e., gene or microbe in our dataset) in the tree,

as multiple nodes may base their decision on information from the same feature. Individual

feature importance can then be calculated by:

𝑓𝑖
𝑖

=
∑𝑗:𝑛𝑜𝑑𝑒 𝑗 𝑠𝑝𝑙𝑖𝑡𝑠 𝑜𝑛 𝑓𝑒𝑎𝑡𝑢𝑟𝑒 𝑖𝑛𝑖

𝑗

Σ𝑘ϵ 𝑎𝑙𝑙 𝑛𝑜𝑑𝑒𝑠 𝑛𝑖
𝑘

fii: The importance of feature i

nij: The importance of node j

nik: The importance of node k

These values are then scaled to a value between 0 and 1 by dividing by the sum of all

feature importance values:

𝑛𝑜𝑟𝑚𝑓𝑖
𝑖

=
𝑓𝑖

𝑖

Σ𝑗ϵ 𝑎𝑙𝑙 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 𝑓𝑖
𝑗

The final task is to calculate the feature’s importance over the entire Random Forest. This

is calculated as the sum of the normalized features importance in each individual tree, divided by

the total number of trees.

𝑅𝐹𝑓𝑖
𝑖

=
Σ𝑗ϵ 𝑎𝑙𝑙 𝑡𝑟𝑒𝑒𝑠 𝑛𝑜𝑟𝑚𝑓𝑖

𝑗

𝑇

Where:
RFfi: Importance of feature i calculated from all trees in the RF model

19


normfiij: Normalized feature importance for i in tree j

T: Total number of trees

This provides a normalized feature importance score between 0 and 1, indicating the

relative importance of each feature across the entire forest. By identifying the most important

features and quantifying their relative importance, where these features represent biological

inputs (e.g., genes, microbes), researchers can gain insights into their biological relevance at

predicting the target (outcome) variable (Breiman et al., 2001, 2017).

1.2.3.5 Permutation Importance

Despite the inherent feature importance scores included with most RF model

implementations, there is often a need for more robust approaches to assessing the significance

of each feature in the model. Unlike base feature importance scores, permutation importance

offers an unbiased estimate of the feature’s true contribution to the model’s predictive

performance. The traditional approach to permutation importance for RF models, as included in

many implementations such as scikit-learn, involves randomly shuffling the values of each

feature and observing the effect on the model’s predictive performance (Pedregosa et al., 2011).

This disrupts any existing relationships between the feature and the target variable, but also

disrupts the relationships between features themselves. In cases where features may be correlated

(such as genomics), this disruption can lead to inappropriate null values of importance scores,

thus this type of importance is not suitable for such applications (Altmann et al., 2010).

Rank-based permutation importance approaches are more suitable for cases of complex

interactions and correlations between features (Altmann et al., 2010). Rather than shuffling

individual feature values, it ranks the features by their importance scores and compares the

observed feature at each rank with the distribution of importance scores obtained through

20


permutations. Existing rank-based permutation implementations tend to evaluate the importance

of each feature individually, comparing the observed importance to feature-specific

null-distributions. While this is effective in many contexts, this approach may overlook the

interactions between correlated features, which is not ideal for genomic datasets. There remains a

need for rank-based permutation methods that better account for the interdependencies between

features and consider the shared importance of correlated features. In addition to determining

feature importance, it is critical to understand how these features interact and contribute to the

predictions of the model, in terms of directionality, magnitude, and on an individual and

combined basis.

1.2.3.6 SHAP values

SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) values have emerged as

a powerful tool for enhancing the interpretability of ML models. Originating from game theory,

SHAP values provide a framework for interpreting ML model output by assigning each feature

an importance value for a particular prediction, reflecting both the magnitude and directionality

of its effect. In the context of RF models, SHAP values provide global and local interpretability,

allowing researchers to understand how each feature contributes to the prediction for individual

samples or the sample set. Unlike traditional feature importance metrics, SHAP values offer

more detailed insights into the specific influence of each feature which may facilitate the

identification of candidate biomarkers in the context of genomics research.

 1.2.3.7 Functional Analysis

Understanding the functional roles of identified genes (or candidate biomarkers) is

crucial for elucidating their biological significance. Gene Set Enrichment Analysis (GSEA)

(Subramanian et al., 2005) is a technique that can assess whether predefined sets of genes show

21


statistically significant differences between two biological states (e.g. tumour versus normal).

This allows researchers to determine whether specific biological pathways or processes are

overrepresented among the identified genes.

Gene Ontology (GO) (Ashburner et al., 2000) is another technique for functional

annotation, GO is a bioinformatics initiative for the creation of a standardized vocabulary to

describe genes and gene products across different species. GO serves as a database resource that

provides standardized annotations that classify gene function into three main categories:

biological processes, cellular components, and molecular functions. Biological processes

describe the pathways and larger processes made up of the activities of multiple gene products,

such as cell cycle, signal transduction, or metabolism. Cellular components describe the

locations relative to cellular structures in which a gene product performs a function, such as the

nucleus, mitochondrion, or cell membrane. Molecular functions are the elemental activities of a

gene product at the molecular level, such as binding or catalysis. By integrating GO and GSEA

researchers can attribute biological function to a list of candidate biomarkers to attribute them

with meaningful biological insights, assess worthiness for future study, and aid in the

identification of therapeutic targets and the development of targeted intervention.

 1.2.3.8 Biomarker Validation

An ultimate goal of interpreting ML models in a cancer genomics setting is the

identification of biomarkers which can inform diagnosis, prognosis, and treatment strategies.

Biomarkers are biological indicators that can be objectively measured, such as genes, proteins, or

microbes, that can influence specific disease states or outcomes (Strimbu & Tavel, 2010). By

leveraging the feature importance scores from RF models, researchers can pinpoint the most

influential features that contribute to the prediction of clinical end points (often called targets),

22


such as tumour location or patient prognosis. Biomarker identification can involve many steps,

from identifying features of interest through feature selection, attributing statistical significance,

comparisons or rankings with other features, biological validation through cross-referencing with

existing literature, functional assays, and experimental studies. The ultimate test of the utility of

a biomarker lies in its clinical validation during clinical trials, where its effectiveness and

reliability are confirmed in studies on patient populations.

1.3 Overview and Structure of this Thesis

Understanding the complex interplay between host genetics and the gut microbiome is

crucial for advancing diagnosis, prognosis, and personalized treatment strategies in CRC.

Despite significant advancements in genomic technologies and computational methods,

challenges remain in effectively integrating and interpreting multi-omics data. The overarching

goal of this thesis is to enhance the analysis and interpretation of genomic and metagenomic data

within the context of CRC research by creating interactive applications for exploring results,

integrating multi-omics datasets, developing code for efficient machine learning techniques in R,

and facilitating biomarker discovery through functional analyses.

As discussed above, while numerous studies have investigated the genomic and microbial

aspects of CRC, many have not fully integrated host genomic data with microbiome profiles.

Additionally, tumour location, an important factor influencing CRC biology and patient

outcomes, was historically overlooked and is still often not adequately considered in existing

analyses. Furthermore, there is a lack of user-friendly, integrated tools for exploring complex

metagenomic and metatranscriptomic data related to host-microbe interactions in CRC.

An opportunity to address these challenges is presented by our RNA-seq datasets which

encompass both host gene expression and microbial profiles from CRC patient samples. By

23


developing interactive applications for data exploration, and code packages for uncovering

candidate biomarkers via explainable RF models, this thesis seeks to facilitate the exploration of

such complex datasets and uncover novel biomarkers associated with CRC tumour sidedness.

The main aims of this thesis are to:

Aim 1: Develop an interactive application for genomic and metagenomic data exploration,

linking microbial profiles with gene ontology to enhance understanding of host-microbiome

interactions in colorectal cancer;

Aim 2: Apply Random Forest Models to identify candidate biomarkers that distinguish right-

and left-sided colorectal cancer, and employ rank-based permutation testing to improve model

interpretability by quantifying the contributions of individual features;

Aim 3: Develop a code package to facilitate building efficient RF models in R for use in

genomic biomarker discovery, while incorporating SHAP analysis and functional gene ontology

assessment.

This thesis is structured into five main chapters, each contributing to the objective of

enhancing the analysis and interpretation of colorectal cancer genomic and metagenomic data

through a combination of software development, discovery and hypothesis generation, and

hypothesis testing.

Chapter 2 focuses on software development, detailing the creation of the MetaFunc App,

an interactive R Shiny application designed to facilitate the exploration of data generated by the

MetaFunc pipeline. This application addresses the need for integrated tools that can analyze

microbial abundances, taxonomic profiles, and link them to gene ontology annotations within a

single platform. By streamlining the MetaFunc workflow, the MetaFunc App supports

researchers in conducting comprehensive studies on host-microbiome interactions.

24


Chapter 3 involves discovery and hypothesis generation as well as hypothesis testing. By

using RF machine learning models, this study aims to identify and explain the contributions of

significant genomic and microbial candidate biomarkers that differentiate right-sided and

left-sided CRC. This chapter also introduces a rank-based feature permutation test for post-hoc

feature selection which is later integrated into the R package introduced in Chapter 4.

Chapter 4 is dedicated to software development and hypothesis testing. This chapter

introduces pyRforest, an R package that integrates Python’s optimized RF algorithms into the R

environment. This code package provides users with an end-to-end workflow for building and

analyzing RF models for genomic research. pyRforest features memory-efficient algorithms,

advanced feature selection methods, SHAP analysis for model interpretability, and gene ontology

analysis for functional assessment.

Chapter 5 summarizes the findings, discusses their implications for genomic research and

suggests directions for future work. This includes exploring additional machine learning

techniques, integration with additional multi-omics datasets, and further validation of identified

candidate biomarkers.

25


Chapter 2

The MetaFunc App: An R Shiny Application for Joint Exploration of
Microbial Diversity, Function, and Host Gene Expression

The MetaFunc App is featured in the following publication (Appendix A):

Sulit, A. K., Kolisnik, T., Frizelle, F. A., Purcell, R., & Schmeier, S. (2023). MetaFunc:

taxonomic and functional analyses of high throughput sequencing for microbiomes. Gut

Microbiome, 4(4), 1-21. https://doi.org/10.1017/gmb.2022.12.

Authors’ contributions to the above publication:
Conceptualisation: A.K.S, R.P., S.S.; Formal analysis: A.K.S, F.A.F., R.P., S.S.;
Funding Acquisition: R.P.; Investigation: A.K.S., R.P., S.S.; Methodology: A.K.S, T.K., R.P.,
S.S.; Software: A.K.S, T.K., S.S.; Supervision: R.P., S.S.; Validation: A.K.S, S.S.; Visualisation:
A.K.S; Writing – original draft: A.K.S; Writing – review and editing: A.K.S, T.K., F.A.F, R.P.,
S.S. A.K.S. and S.S. developed and co-wrote the pipeline which ultimately led to MetaFunc, and
were involved with the majority of the design. T.K. developed the shiny application that is
integrated in the pipeline. A.K.S. wrote the manuscript with
editorial input from T.K., S.S. and R.P. R.P. further contributed to the design of the pipeline.
F.A.F. provided guidance about all clinical aspects of the manuscript.

Author’s Note:
While I hold a secondary authorship on the aforementioned manuscript, my contributions to the
MetaFunc App are substantial and significant to the MetaFunc project, and include over 2000
lines of code. This chapter focuses on the development of the MetaFunc App, which I led, and
presents novel, unpublished content that does not overlap with the Sulit et al. 2023 paper
(Appendix A). I have full permissions from all MetaFunc project authors to include this chapter
in my thesis work.

Creative Commons Attribution License CC BY 4.0 International:
https://creativecommons.org/licenses/by/4.0/.

26

https://doi.org/10.1017/gmb.2022.12
https://creativecommons.org/licenses/by/4.0/


27


2.1 Introduction

2.1.1 Overview

Metatranscriptomic RNA-sequencing has transformed the study of complex interactions

between host organisms and their associated microbiomes and pathogens; yet, the vast scale and

complexity of these datasets presents analytical challenges. Extracting meaningful insights from

such data requires tools that can integrate microbial and host-gene functions to provide a deeper

understanding of biological processes. MetaFunc (Sulit et al., 2023) (Appendix A) is a command

line-implemented pipeline tool designed to facilitate this process, allowing functional annotation

of microbial and host metagenomic and metatranscriptomic data by integrating the processes of

protein prediction, functional annotation, and taxonomic classification (Appendix A).

In order to ease the interpretation and exploration of MetaFunc results, we developed a

user-friendly R Shiny application (the MetaFunc App). This Shiny app allows interactive

exploration of host and microbial abundances, as well as taxonomic and gene ontology tables

generated by MetaFunc. The MetaFunc App allows users to pinpoint specific Gene Ontology

(GO) (Harris et al., 2004) terms and track the species linked with taxonomic annotations. It also

features the ability to invert this process by keying in a species to obtain all associated GO terms.

These features are particularly valuable for researchers aiming to understand the functional

repertoire of a certain microbial species in the host environment, or compare functional profiles

across different species. By linking GO terms to species, researchers can pinpoint specific

biological processes within microbial species and track functional traits across them. Conversely,

by linking species to GO terms, researchers can explore the functional contributions of individual

species across broader phylogenetic contexts.

28


In this chapter, I present the motivation and development of MetaFunc, detail the

methodologies used in the development and application of the MetaFunc App, and provide

benchmarking results and a case study. Additionally, I discuss the significance of the MetaFunc

App, compare it to existing applications, address its scope and limitations, and suggest future

directions.

2.1.2 Development History and Specific Contributions

In order to effectively analyze complex metatranscriptomic data, interactive data tools

that can integrate multiple data types are crucial. One effective solution is the use of an R shiny

application, a web-browser based framework that enables dynamic data displays. Visual

inspection of genomic and metagenomic output is useful, including the ability to sort, search, and

filter columns (e.g., counts of taxa or functional categories). Previous examples of similar

applications that can rank, sort, and rearrange tabular data have proven useful in other contexts,

such as the Pavian application (Breitwieser & Salzberg, 2020), which was designed for the

exploration of microbial taxa in metagenomic or metabarcoding results. In addition, data

exploration via Shiny negates the need for more complicated manipulation steps, for example,

sorting or selecting taxa via R.

29


Figure 2.1.

The MetaFunc Workflow.

Note. Reproduced in part from Sulit et al. 2023 (includes Kolisnik) under the Creative Common
Attribution License.

Before detailing the design and behaviour of the MetaFunc App, I outline the MetaFunc

pipeline itself. Briefly, the MetaFunc pipeline (Figure 2.1) begins by aligning reads to the human

reference genome using the STAR aligner (Dobin et al., 2013). Human-mapped reads are used

for host gene expression and functional analysis, while the remaining human unmapped reads are

processed using the metagenomic classifier Kaiju (Menzel et al., 2016) for protein-based

taxonomic classification. Kaiju classifies microbial sequences by translating them and matching

them with protein sequences and associated accession numbers. These protein accession numbers

are subsequently mapped to functional GO (Harris et al., 2004) terms via a custom database

30


(designated nrgo). The pipeline also calculates microbial species abundances per sample or

condition and performs Spearman correlation analysis between host gene expression and

microbial species abundances.

The resulting data, including microbial abundances, functional annotations, and

host-microbe interactions, are stored in an SQL database for efficient access and retrieval. Users

can explore these results interactively with the MetaFunc App, which allows for the visualization

of taxonomy and GO tables, facilitating deeper and more detailed investigation into the

biological processes, molecular functions, and cellular components present in various species

across different datasets and conditions.

2.1.3 Motivation and Research Questions

The motivation for the development of the MetaFunc App was the need for a more

intuitive way for the user to sort, search, and interact with the complex taxonomic and functional

results generated by the MetaFunc pipeline. The output of this pipeline is more easily interpreted

with interactive data displays. The design of the MetaFunc App was centered on fulfilling the

following aims:

1. To optimize the accessibility and usability of the complex outputs of MetaFunc for users

with varying levels of bioinformatics expertise.

2. To enable users to effectively explore and derive meaningful insights from host and

microbial abundances, as well as taxonomic and gene ontology tables produced by the

MetaFunc pipeline.

3. To minimise computational overhead and ensure the MetaFunc App remains accessible

on platforms with varying compute power and scales to handle large datasets.

31


In the following section I discuss this development process, including the selection of R

Shiny as the development framework, and the methods I implemented to fulfill the above aims.

2.2 Background

2.2.1 Shiny Applications

R Shiny (Chang et al., 2024) was used to create an application to support data display and

exploration with integration to the MetaFunc pipeline. Developed by Posit (formerly Rstudio),

Shiny was chosen because of its versatility and accessibility, particularly for users without

extensive computational backgrounds. It offers an interactive, user-friendly interface, which

allows researchers to display and explore their data without needing advanced programming

skills in R (R Core Team, 2023) or SQL (Gilmore, 2008). This enables users to focus on deriving

meaningful insights from datasets rather than navigating complex computational tasks.

Shiny addresses traditional limitations for data manipulation and analysis steps in R,

which typically rely on static tools like ggplot2 and dplyr (Wickham 2016, 2019). Instead, Shiny

allows for dynamic interactivity in a browser-based application interface in which users can

interact with data in real time using features such as drop-down menus, sliders, and buttons,

triggering actions such as recalculating results or refreshing data tables. Shiny integrates a

server-side back end with a front-end user interface (UI) that enables interactive data

manipulation and display. Additionally, Shiny’s server-side processing supports dynamic

rendering, enabling the application to display only relevant portions of the data at any given

time, improving both speed and performance especially when visualizing large data tables. Shiny

apps can be hosted on local or on servers, allowing users to access them remotely without the

need for reinstallation or complex configuration.

32


For more experienced users, Shiny offers a range of advanced functionalities and

integration with other tools. For example, it can connect with tools such shinyjs (Attali, 2022),

which allows users to store cookies for session management or enhance displays through

JavaScript. Shiny also interfaces with R packages such as DBI (R Special Interest Group on

Databases (R-SIG-DB) et al., 2024) to manage database connectivity with SQL databases. This

versatility has contributed to Shiny’s widespread adoption in the development of bioinformatics

applications as a powerful framework and useful tool for data analysis.

2.2.2 Shiny Apps in Bioinformatics

Due to its versatility and usefulness, Shiny is used for a number of bioinformatics

applications, including the following examples.

shinyGEO (Dumas et al., 2016) imports data directly from the Gene Expression

Omnibus (GEO) database to perform differential expression analysis and visualize results. This

allows the user to perform complicated biological analyses without having an extensive

computational or bioinformatic background.

pcaExplorer (Ludt et al., 2022; Marini & Binder, 2019) helps users perform principal

component analysis (PCA) on RNA-seq data to visualize and explain the variance within their

dataset, helping to identify patterns and outliers. It provides an interactive graphical interface that

allows users to adjust parameters, filter data points, and investigate how specific features

contribute to overall variance. explain the variance in their data and identify patterns and outliers.

This interactivity enhances data exploration by allowing researchers to assess the relationships

between variables and samples, which is particularly useful in the early stages of RNA-seq data

analysis. The popularity and utility of this platform is indicated by its number of citations: more

than 200 over four years.

33


shinyGO (Ge et al., 2020) provides users with a tool for gene ontology (GO) enrichment

analysis for plants and animals. It provides interactive bar charts of GO terms, enrichment plots

for GO-term enrichment, and pathway diagrams that link to KEGG pathways. Gene ontology is

useful for associating gene lists to underlying biological functions, cellular components, and

molecular components. Again, the utility of the Shiny approach is indicated by the high citation

rate of ShinyGO, almost 2,500 since its publication in 2020.

These popular examples illustrate how Shiny apps have enhanced bioinformatics research

by making complex analyses more accessible and interactive, and highlight the growing demand

for such applications.

 2.3 Methodologies

2.3.1 Workflow Integration and Application Launch

The MetaFunc App is integrated with the rest of the MetaFunc pipeline, which allows the

app to function as part of a cohesive workflow, rather than as a standalone tool. MetaFunc relies

on snakemake (Köster & Rahmann, 2012), a workflow management system that automates the

execution of data processing pipelines.

Upon execution, the MetaFunc pipeline automatically generates an SQLite database and

launches the MetaFunc App as the final step of workflow, or it can be initiated manually by

running the app.R file. This setup accommodates new datasets as they become available; this

means that once the pipeline is set up once, any new databases from subsequent pipeline runs

will automatically become selectable for viewing within the app, provided they have been

correctly placed in the proper directory.

34


2.3.2 Database Development & Design

The SQL database which serves as the backbone of the MetaFunc App is constructed by

the database build script that is integrated into the snakemake workflow. The MetaFunc pipeline

results directory consists of a complex list of 290 output files of varying formats. This script

locates and processes 8 .tsv files into a unified database structure with the following tables.

Microbial Taxonomy Table: This table catalogs microbial % abundances indexed by

taxonomy identifier (TaxID), and is the back end for the Microbiome-Abundances tab in the app.

This table contains (1) TaxIDs which are used as a primary key to join with the mapping tables

and (2) microbial abundance data for each sample on a per-Tax ID level; (3) the URL for the

corresponding NCBI taxonomy browser page for the TaxID.

Gene Ontology Tables: The Microbiome-Gene Ontology tab depends on two separate

gene ontology tables, one for storing individual data and one for storing grouped analysis data.

The GO individual table stores the following information: (1) The GO ID number; (2) The GO

term description; (3) The GO category; (4) the URL of the corresponding AmiGO page for the

GO term; (5) the microbial abundance data for each sample on a per-GO term level. The GO

grouped table features the same columns, although microbial abundance data is stored for each

experimental group.

Mapped Gene Ontology and TaxID Tables: The GO to TaxIDs and TaxIDs to GO tabs

depend on two tables, one for individual and one for grouped data. These tables link GO terms to

microbial TaxIDs, enabling researchers to trace functional annotations to specific microbial taxa

and vice versa. Each table stores the (1) GO ID, (2) the description of the GO term (3)

abundances for each sample or group and (4) details on the TaxIDs associated with each GO

term. The TaxIDs are mapped to additional information via the mapping tables.

35


Human Gene Expression Table: This table is the back-end of the Host-Abundances tab,

and contains normalized host gene expression data as Transcripts Per Million (TPM) values. The

table also includes a URL column that links each gene to its Ensembl entry, allowing users to

explore additional gene-specific information.

Mapping Tables: The database includes mapping tables to ensure proper association

between taxonomic identifiers, GO identifiers, and sample metadata. The sample mapping table

contains metadata for each sample, including (1) sample name, (2) sample type, and (3) internal

sample id. The tax id mapping table maps the primary key of TaxIDs to fully parsed taxonomic

name data such as Kingdom and lower taxa.

The database is designed using standard database concepts such as referential integrity

and normalization, to ensure efficient data management and retrieval. Primary keys such as Tax

ID, GO ID and sample names are used to enforce the uniqueness of records, and facilitate rapid

lookups and relational joins between tables. With the exception of the mapped GO to Tax ID

tables which store some duplicate data to reduce querying time, the database schema adheres to

the principles of third normal form, minimizing redundancy and promoting consistency.

 2.3.3 Downloading and Installation of the MetaFunc App

The MetaFunc App is distributed as part of the MetaFunc pipeline and can be

downloaded directly from the pipeline’s GitLab repository at

https://gitlab.com/schmeierlab/workflows/metafunc for example using `git clone

https://gitlab.com/schmeierlab/workflows/metafunc.git`. The MetaFunc App files can then be

found within the /metafunc-shiny/ subdirectory of the repository. This directory contains all

necessary files required to run the app.

36

https://gitlab.com/schmeierlab/workflows/metafunc


Users should follow the MetaFunc pipeline installation instructions at:

https://metafunc.readthedocs.io/en/latest/usage.html. Additional MetaFunc App instructions can

be found at https://metafunc.readthedocs.io/en/latest/rshiny.html#label-shiny.

MetaFunc App dependencies are the following R package: shiny, shinyjs, DT, DBI,

formattable, tidyverse, ggplot2, shinythemes, shinydashboard, data.table, dplyr, stringr, reshape2,

RSQLite, openxlsx, and splitstackshape. Exact versions are specified in the app .yaml file and

are automatically downloaded when the pipeline is installed.

 2.3.4 Technical implementation of the MetaFunc App

 2.3.4.1 Project Structure

To ensure ease of maintenance and scalability, the MetaFunc App follows best practices

for Shiny app development by separating server logic and UI definitions into distinct component

scripts. This modular approach supports updates and troubleshooting. The database querying

script and initialization variables are organized in separate files, which are integrated into the app

through a master script. Databases generated by the MetaFunc pipeline are stored in a

subdirectory that is read by the MetaFunc App to populate the internal database selection tool.

2.3.4.2 Front-End Development

The MetaFunc App user-interface (UI) is structured into three main tabs and four subtabs:

The Home tab is the initial entry point for the user, and features the database selection

and load buttons, usage instructions, and an embedded workflow diagram of the MetaFunc

pipeline. A left-justified panel houses the three main tabs and their respective subtabs. The

current selected tab is highlighted as a visual indicator for navigation.

The Microbiome tab is divided into four subtabs: “Abundances”, “Gene Ontology”, “GO

to TaxIDs”, and “TaxIDs to GO”. These subtabs are structured to allow dynamic display of data

37

https://metafunc.readthedocs.io/en/latest/usage.html
https://metafunc.readthedocs.io/en/latest/rshiny.html#label-shiny


tables. Key functionalities include filtering, sorting, and exporting data to Excel, TSV, or CSV

formats. The “Abundances” subtab displays a table of microbial abundance data for each sample

or sample group. The “Gene Ontology” subtab displays GO data for each sample or sample

group. subtabs with GO data feature a toggle to filter by one of three GO categories: Biological

Process (BP), Cellular Component (CC), or Molecular Function (MF). The “GO to TaxIDs” and

“TaxIDs to GO” subtabs effectively combine the “Abundances” and “Gene Ontology” subtabs,

featuring linked tables of abundance and taxonomy data that update dynamically based on user

selections. For example, in the “GO to TaxIDs” subtab, the user can select one or more rows of

GO information in the top table, and the bottom table will dynamically update to show only the

microbial taxa associated with those GO terms. Conversely, in the “TaxIDs to GO” tab users can

select one or more rows of taxonomic information in the top table, and the bottom table will

update to display the associated GO terms. A reset button is also included to allow users to clear

filters and return the tables to their original state.

The Host tab includes a single subtab titled “Abundances”, which provides users with

data displaying and filtering capabilities for host count (TPM) data. Example screenshots of the

user interface can be seen in the `screenshots` folder of the metafunc-shiny gitlab repository.

2.3.4.3 Back-End Development

The back-end development of the MetaFunc App centers around integrating reactive

programming and server-side processing to handle the exploration of large datasets. Custom

server-side processing functions are responsible for fetching and preprocessing the data from the

SQLite database, ensuring the data is formatted consistently for the data tables. Additionally, the

app uses shinyjs (Attali, 2022) to store session data for the database path in cookies, allowing the

38


app to recall which database was previously loaded and automatically reloads it without

requiring manual input from the user.

Central to the data table display functionality is the DT R package (Xie et al., 2022), an R

wrapper for the JavaScript library DataTables, which facilitates the creation of highly interactive

tables. These tables allow the user to sort, search, filter (including setting ranges for numeric

values), reorder columns, and paginate the data. Additionally, DT supports embedded HTML

URLs, allowing provision of direct links to external resources, such as Ensembl (Martin et al.,

2023), NCBI (Sayers et al., 2022), and AmiGO (Gene Ontology Consortium, 2024).

The “GO to TaxIDs” and “TaxIDs to GO” tabs leverage reactive programming to create a

dynamic linkage between two interconnected tables. In the “GO to TaxIDs” tab, the upper table

displays GO terms, and based on the selected rows, the app triggers a reactive update that fetches

and displays the corresponding taxonomic information in the lower table. This functionality is

enabled by server-side processing, ensuring that user selections in the first table are instantly

passed to the server, which retrieves the relevant taxonomic data and filters and re-renders the

second table in real time. Similarly, the “Tax IDs to GO” tab implements the reverse operation.

The MetaFunc App also incorporates CSS-based loading indicators to provide users with

visual feedback while data is being processed, allowing users to be aware of background

processes to avoid confusion during longer wait times.

Lastly, the app features comprehensive export capabilities, allowing users to download

entire tables, or specific rows in CSV, TSV, and XLSX formats. This functionality is achieved

through the combination of DT, base R, and openxlsx (Schauberger & Walker, 2024) packages.

Custom functions generate a timestamp and append it to the filename to provide users with a

unique filename and time record for each export.

39


2.3.5 Speed and Scalability Optimization

Ensuring scalability without compromising speed was one of the most challenging issues

encountered during the development of the MetaFunc App. Initial versions of the app relied on

large TSV files for data storage and retrieval, resulting in slow performance as the size of the

datasets scaled. Transitioning to an SQLite database considerably improved performance in data

joining and retrieval. To handle large datasets more efficiently, table partitioning and pagination

were also introduced. Partitioning divided larger tables into smaller, manageable segments

allowing only the data necessary for each tab of the app to be loaded, while pagination

minimized the memory footprint and client-side rendering time by loading data incrementally.

All calculations and database processing are server-side to offload the computational

burden from the client. and dependent only on the hardware of the hosting computer.

Benchmarking tests (in Results) were conducted to assess performance with datasets of

varying size on a Windows 10 system with 32GB RAM and an AMD Ryzen 9 5900X 12-Core

Processor, running at 3.7 GHz with 24 logical processors. The metrics represent an average of

five measurements. Response time was recorded using Shiny’s showcase mode to track the delay

between user input and app reaction. Memory usage was recorded using the profvis R package,

and CPU utilization was tracked using the system task manager. CPU usage was monitored using

Activity Monitor. Latency, the delay between input and output visualization was recorded using

Firefox Developer Tools. Database load times were tracked using a custom wrapper function to

measure response times for initial load time of the largest tables in the app, those of the GO to

TaxIDs tab. It is important to note that subsequent load times of the tables will always be shorter

than the initial load.

40


2.3.6 Data Availability

Datasets used for the development, validation, and demonstration sets are available at:

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA413956 and

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA788974.

2.3.7 Testing and Validation

We conducted several structured phases of testing to ensure the functionality and

performance of the app. Testing was led by myself and was repeated for validation by the other

members of the MetaFunc project. These tests focused on verifying data integrity, app

functionality, and performance in the standalone version of the app. Following pipeline

integration, additional integration and deployment testing was performed to ensure the app was

installable, handled real-world datasets and interacted with other components of the pipeline.

The first phase of testing was installation and deployment testing. In order to ensure the

app could reliably be installed on multiple operating system environments, installation was tested

on Windows 10 and 11, macOS Mojave and Catalina, and Ubuntu 20.4 LTS systems.

The second phase of testing focused on data integrity checks, with particular attention to

identifying errors in data linkage and display. In a worst-case scenario, mislinked or missing

samples, GO, or taxonomy information could have led to inaccuracies in the display and skewed

functional analysis. This phase involved verifying the accurate linkage and representation of

sample ids, GO IDs and Tax IDs, and data values (e.g. abundances, TPM values) within the

original TSV files, database, and the app display tables.

The third phase of testing involved functional testing, which aimed to ensure that all user

interface elements responded correctly to user inputs. Tests were conducted on every interactive

feature, including buttons, row selection, table search, filters, sliders, ability to add or remove

41

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA413956
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA788974


columns. Filtering edge-cases with unusual input values (e.g. negative values, characters) were

tested. Functional testing also included checking user experience under different browser

environments (Chrome, Firefox) to ensure consistency in app behaviour across platforms.

The final phase of testing focused on performance, specifically measuring how the core

functionalities of the MetaFunc App, including database loading, data filtering, and rendering of

interactive tables performed under varying database input sizes and user input loads. This was

aimed at assessing the ability of the app to handle both typical and extreme data loads (e.g. 2000

samples, 100,000 genes) and inputs. We also tested its resilience by simulating high-intensity

user interactions, such as rapidly switching tabs, applying multiple filters, and by pressing

multiple buttons at intervals of less than 3 seconds. These tests aimed to identify any bottlenecks

or points of failure, including attempts to make the app crash under extreme conditions. We also

evaluated stability over extended periods of use (more than one hour) by sending commands at

10-minute intervals, observing CPU usage, memory consumption, and system temperature, to

identify any signs of resource exhaustion. This phase provided a comprehensive understanding

of the stability of the MetaFunc App during heavy usage scenarios.

These structured testing phases allowed us to systematically identify and resolve issues,

thereby allowing us to improve the user experience. Additionally, the MetaFunc App, as part of

the broader MetaFunc project, has undergone further validation through peer review processes

associated with its publication in Gut Microbiome (Sulit et al., 2023).

42


 2.4 Results

 2.4.1 Overview

In this section, we demonstrate the functionality and performance of the MetaFunc App

using a step-by-step walkthrough with screenshots. We also present our benchmarking results for

performance evaluation of the MetaFunc App.

Note on Image Quality: Due to the constraints of the document format, some

screenshots included in this section may appear difficult to read. Higher resolution versions of

the screenshot images in this chapter can be accessed in the `/screenshots` folder of the

metafunc-shiny gitlab repository.

 2.4.2 User Interface and Case Study Exploration

 2.4.2.1 Home Page

Analysis begins on the Home Page of the MetaFunc App, which provides usage

instructions and access to the main functionalities. As shown in Figure 2.2, users are directed to

select and load a database from the list of databases created by the MetaFunc pipeline.

43


Figure 2.2 The Home Page of the MetaFunc App.

Note. This homepage showcases user navigation options and usage instructions.

44


2.4.2.2 Microbiome - Abundances

After loading the database, the user is directed by the documentation to choose a tab in

the left side panel, “Microbiome” or “Host” depending on what type of data they want to

explore. After clicking the “Microbiome” tab, the user is greeted with 4 subtabs; the

“Abundances” subtab, illustrated in Figure 2.3, indicates the % abundances for each taxon, and

can be further filtered, sorted, and searched. In this example the data has been filtered on the

‘Genus’ column by the name ‘Escherichia’. In this instance, the user can then identify which

Escherichia genus microbes are of high or low abundance, and in which samples or sample

groups. By highlighting the rows, this data can be exported just for the selected microbes.

45


Figure 2.3 The “Microbial - Abundances” subtab of the MetaFunc App.

Note. This tab allows users to display and filter microbial abundance data (% read count per taxonomy id) across various samples. The
interactive table enables extensive filtering and selection options. The table also includes functionality for resetting filters and
exporting the filtered data in .csv or Excel formats.

46


2.4.2.3 Microbiome - Gene Ontology

If the user wishes to explore GO terms present in their dataset these are accessible via the Gene

Ontology subtab, shown in Figure 2.4. Users can explore which GO terms have high or low %

read counts across one or more samples or sample groups to help with hypothesis generation, or

they can quantify abundances associated with predetermined GO terms for hypothesis testing. In

this example, the table has been filtered on the ‘Description’ column by the partial search term

‘polyamine’, which results in four GO terms. Polyamines are a target of interest in our case study

for their known association with tumour progression and cell growth (Sulit et al., 2023).

47


Figure 2.4 The “Gene Ontology” subtab of the MetaFunc App.

Note. Displayed are GO terms % reads mapped to each GO identifier).

48


2.4.2.4 Microbiome - GO to TaxIDs

In the next subtab, GO to TaxIDs, Figure 2.5, the top table mirrors the gene ontology table

shown in the Microbiome - Gene Ontology subtab (Figure 2.4), and the bottom table mirrors the

taxonomy table shown in the Microbiome - Abundances subtab (Figure 2.3), except it starts out

with no data shown. When the user selects one or more GO identifiers of interest in the top table,

the bottom table updates to display corresponding associated taxonomy information and

taxonomy abundances. This allows the user to see all microbes associated with a particular GO

biological process, cellular component, or molecular function. For example, in Figure 2.5, the

GO term ‘polyamine biosynthetic process’ has been selected, and the bottom table has populated

with the abundance information of 126 microbial taxa associated with this GO term, it has been

sorted by highest average abundance to lowest, showing Escherichia coli has the highest average

abundance of all microbes associated with this GO term.

49


Figure 2.5 The “GO to TaxIDs” subtab.

Note. This tab features two interactive tables, where the selection of one or more GO terms in the
top table updates the bottom table to display the microbial taxa associated with this GO term.

2.4.2.5 Microbiome - TaxIDs to GO

Similarly, in the TaxIDs to GO tab, shown in Figure 2.6, the inverse operation is shown. In this

example, the data has been filtered on the genus column for ‘Escherichia’, and the microbe

Escherichia coli has been selected in the top microbial abundances table. The bottom table has

rendered the list of 1268 GO biological processes associated with Escherichia coli. A filter has

50


subsequently been applied to the bottom GO table by filtering the description column for

‘polyamine’, showing four related GO terms. The % reads mapped to each GO term for the

different polyamine related biological processes associated with Escherichia coli are shown.

Figure 2.6 The “TaxIDs to GO” subtab.

Note. This tab features two interactive tables, where the selection of one or more rows of
microbial taxa in the top table updates the bottom table to reflect the GO terms associated with
the selected taxa.

51


2.4.2.6 Single Sample vs Grouped Analysis

All subtabs under the Microbiome tab of the MetaFunc App can display both individual

and grouped analyses, depending on the research objectives. Individual analysis focuses on the

exploration of single samples with no additional group context. Grouped analysis allows for the

comparison of microbial taxa and their GO terms between predefined groups, such as disease

states, treatment types, or molecular subtypes. By comparing Figures 2.7 and 2.8 below, we can

see the differences between the display of individual and grouped analyses on the same table of

microbial abundan