Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand

Brenner, Martin

dc.confidential	Embargo : No
dc.contributor.advisor	Reyes, Napoleon
dc.contributor.author	Brenner, Martin
dc.date.accessioned	2026-05-03T21:40:55Z
dc.date.issued	2025-12-29
dc.description.abstract	Multimodal sensor fusion has become increasingly vital in perception systems, enabling richer scene understanding by combining complementary information from diverse sensing modalities. As autonomous systems, robotics, and computer vision applications demand greater robustness across varying environmental conditions, the need for effective fusion strategies has never been more pressing. This thesis develops a comprehensive framework for robust multimodal perception, advancing through three key stages—evidence synthesis, dataset construction, and fusion architecture design—and is underpinned by four publications, three published in Q1 journals and one submitted. The work begins with a systematic literature review of RGB-D-T (visual, depth, and thermal) fusion, examining existing datasets, calibration techniques, fusion approaches, and evaluation methods. This review highlights critical gaps that justify the development of a new dataset, benchmarks, and methodologies. Building on these insights, the thesis introduces MM5, a multimodal dataset and processing pipeline combining five sensing modalities: RGB, depth, infrared intensity, thermal, and ultraviolet imaging. MM5 provides standardised capture, calibration, and preprocessing procedures, along with tools supporting data alignment and the labelling of unaligned data. The resource includes both raw and preprocessed data from the proposed depth and thermal preprocessing algorithms. The thesis then presents two fusion strategies operating at different architectural levels to address different alignment scenarios. The first approach fuses aligned features at the encoder level, combining information from all five modalities through lightweight enhancements at each processing stage and pixel-level gating mechanisms. This yields robust baseline results while revealing the distinct contributions of each modality to overall performance. 
The second approach operates at the decoder level and is designed explicitly for unaligned data. It employs separate processing heads for thermal and ultraviolet modalities, each trained with its own ground-truth labels. Crucially, this design handles unaligned and optically distorted inputs without requiring explicit preprocessing or geometric alignment, making the system more practical and resilient to modality-specific issues such as artefacts, occlusions, and missing signals. Together, these contributions establish a five-modality benchmark and advance multimodal semantic segmentation through effective encoder-decoder fusion strategies.
dc.identifier.uri	https://mro.massey.ac.nz/handle/10179/74475
dc.publisher	Massey University
dc.rights	© The Author
dc.subject	multimodal semantic segmentation
dc.subject	sensor fusion
dc.subject	five-modality dataset
dc.subject	RGB-NIR-depth-thermal-UV perception
dc.subject	fusion architectures
dc.subject	data preprocessing
dc.subject	Multisensor data fusion
dc.subject	Semantic computing
dc.subject	Image segmentation
dc.subject	Computer vision
dc.subject	Data sets
dc.subject.anzsrc	46 Information and computing sciences::4603 Computer vision and multimedia computation::460304 Computer vision
dc.subject.anzsrc	46 Information and computing sciences::4603 Computer vision and multimedia computation::460306 Image processing
dc.title	Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand
thesis.degree.discipline	Mathematical and Computational Sciences
thesis.degree.name	Doctor of Philosophy (Ph.D.)
thesis.description.doctoral-citation-abridged	Mr Brenner investigated how combining five different types of sensor data could improve automatic object recognition. He created the first dataset of its kind and designed two architectures for combining sensor data. His findings revealed that early combination maximises accuracy, while late combination provides greater resilience to sensor failure.
thesis.description.doctoral-citation-long	Most computer vision systems rely on a single camera, which fails under poor lighting or when objects look visually similar. Mr Brenner investigated how combining five types of sensor data — colour, depth, infrared, thermal, and ultraviolet — could improve automatic object identification. He created the first publicly available dataset combining all five data types with original high-fidelity data, developed new methods to preserve sensor detail during data preparation, and designed two fusion architectures that combine sensor information at different processing stages. His research revealed a fundamental trade-off: combining sensors early maximises accuracy, while combining them late provides greater resilience when sensors fail.
thesis.description.name-pronounciation	Martin Brenner: MAR – TIN BREN – ER

Files

Original bundle

Name: BrennerPhDThesis.pdf
Size: 41.67 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 9.87 KB
Format: Item-specific license agreed upon to submission

Collections

Theses and Dissertations