Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand

dc.confidentialEmbargo : No
dc.contributor.advisor: Reyes, Napoleon
dc.contributor.author: Brenner, Martin
dc.date.accessioned: 2026-05-03T21:40:55Z
dc.date.issued: 2025-12-29
dc.description.abstract: Multimodal sensor fusion has become increasingly vital in perception systems, enabling richer scene understanding by combining complementary information from diverse sensing modalities. As autonomous systems, robotics, and computer vision applications demand greater robustness across varying environmental conditions, the need for effective fusion strategies has never been more pressing. This thesis develops a comprehensive framework for robust multimodal perception, advancing through three key stages (evidence synthesis, dataset construction, and fusion architecture design), and is underpinned by four publications, three published in Q1 journals and one submitted. The work begins with a systematic literature review of RGB-D-T (visual, depth, and thermal) fusion, examining existing datasets, calibration techniques, fusion approaches, and evaluation methods. This review highlights critical gaps that justify the development of a new dataset, benchmarks, and methodologies. Building on these insights, the thesis introduces MM5, a multimodal dataset and processing pipeline combining five sensing modalities: RGB, depth, infrared intensity, thermal, and ultraviolet imaging. MM5 provides standardised capture, calibration, and preprocessing procedures, along with tools supporting data alignment and the labelling of unaligned data. The resource includes both raw and preprocessed data from the proposed depth and thermal preprocessing algorithms. The thesis then presents two fusion strategies operating at different architectural levels to address different alignment scenarios. The first approach fuses aligned features at the encoder level, combining information from all five modalities through lightweight enhancements at each processing stage and pixel-level gating mechanisms. This yields robust baseline results while revealing the distinct contributions of each modality to overall performance.
The second approach operates at the decoder level and is designed explicitly for unaligned data. It employs separate processing heads for thermal and ultraviolet modalities, each trained with its own ground-truth labels. Crucially, this design handles unaligned and optically distorted inputs without requiring explicit preprocessing or geometric alignment, making the system more practical and resilient to modality-specific issues such as artefacts, occlusions, and missing signals. Together, these contributions establish a five-modality benchmark and advance multimodal semantic segmentation through effective encoder-level and decoder-level fusion strategies.
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/74475
dc.publisher: Massey University
dc.rights: © The Author
dc.subject: multimodal semantic segmentation
dc.subject: sensor fusion
dc.subject: five-modality dataset
dc.subject: RGB-NIR-depth-thermal-UV perception
dc.subject: fusion architectures
dc.subject: data preprocessing
dc.subject: Multisensor data fusion
dc.subject: Semantic computing
dc.subject: Image segmentation
dc.subject: Computer vision
dc.subject: Data sets
dc.subject.anzsrc: 46 Information and computing sciences::4603 Computer vision and multimedia computation::460304 Computer vision
dc.subject.anzsrc: 46 Information and computing sciences::4603 Computer vision and multimedia computation::460306 Image processing
dc.title: Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand
thesis.degree.discipline: Mathematical and Computational Sciences
thesis.degree.name: Doctor of Philosophy (Ph.D.)
thesis.description.doctoral-citation-abridged: Mr Brenner investigated how combining five different types of sensor data could improve automatic object recognition. He created the first dataset of its kind and designed two architectures for combining sensor data. His findings revealed that early combination maximises accuracy, while late combination provides greater resilience to sensor failure.
thesis.description.doctoral-citation-long: Most computer vision systems rely on a single camera, which fails under poor lighting or when objects look visually similar. Mr Brenner investigated how combining five types of sensor data (colour, depth, infrared, thermal, and ultraviolet) could improve automatic object identification. He created the first publicly available dataset combining all five data types with original high-fidelity data, developed new methods to preserve sensor detail during data preparation, and designed two fusion architectures that combine sensor information at different processing stages. His research revealed a fundamental trade-off: combining sensors early maximises accuracy, while combining them late provides greater resilience when sensors fail.
thesis.description.name-pronounciation: Martin Brenner: MAR – TIN BREN – ER

Files

Original bundle

Name: BrennerPhDThesis.pdf
Size: 41.67 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 9.87 KB
Format: Item-specific license agreed upon to submission