Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand
| dc.confidential | Embargo : No | |
| dc.contributor.advisor | Reyes, Napoleon | |
| dc.contributor.author | Brenner, Martin | |
| dc.date.accessioned | 2026-05-03T21:40:55Z | |
| dc.date.issued | 2025-12-29 | |
| dc.description.abstract | Multimodal sensor fusion has become increasingly vital in perception systems, enabling richer scene understanding by combining complementary information from diverse sensing modalities. As autonomous systems, robotics, and computer vision applications demand greater robustness across varying environmental conditions, effective fusion strategies are increasingly needed. This thesis develops a comprehensive framework for robust multimodal perception, advancing through three key stages (evidence synthesis, dataset construction, and fusion architecture design), and is underpinned by four publications, three published in Q1 journals and one submitted. The work begins with a systematic literature review of RGB-D-T (visual, depth, and thermal) fusion, examining existing datasets, calibration techniques, fusion approaches, and evaluation methods. This review highlights critical gaps that justify the development of a new dataset, benchmarks, and methodologies. Building on these insights, the thesis introduces MM5, a multimodal dataset and processing pipeline combining five sensing modalities: RGB, depth, infrared intensity, thermal, and ultraviolet imaging. MM5 provides standardised capture, calibration, and preprocessing procedures, along with tools supporting data alignment and the labelling of unaligned data. The resource includes both raw data and data preprocessed with the proposed depth and thermal preprocessing algorithms. The thesis then presents two fusion strategies that operate at different architectural levels and address different alignment scenarios. The first approach fuses aligned features at the encoder level, combining information from all five modalities through lightweight enhancements at each processing stage and pixel-level gating mechanisms. This yields robust baseline results while revealing the distinct contributions of each modality to overall performance. The second approach operates at the decoder level and is designed explicitly for unaligned data. It employs separate processing heads for the thermal and ultraviolet modalities, each trained with its own ground-truth labels. Crucially, this design handles unaligned and optically distorted inputs without requiring explicit preprocessing or geometric alignment, making the system more practical and resilient to modality-specific issues such as artefacts, occlusions, and missing signals. Together, these contributions establish a five-modality benchmark and advance multimodal semantic segmentation through encoder-level and decoder-level fusion strategies. | |
| dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/74475 | |
| dc.publisher | Massey University | |
| dc.rights | © The Author | |
| dc.subject | multimodal semantic segmentation | |
| dc.subject | sensor fusion | |
| dc.subject | five-modality dataset | |
| dc.subject | RGB-NIR-depth-thermal-UV perception | |
| dc.subject | fusion architectures | |
| dc.subject | data preprocessing | |
| dc.subject | Multisensor data fusion | |
| dc.subject | Semantic computing | |
| dc.subject | Image segmentation | |
| dc.subject | Computer vision | |
| dc.subject | Data sets | |
| dc.subject.anzsrc | 46 Information and computing sciences::4603 Computer vision and multimedia computation::460304 Computer vision | |
| dc.subject.anzsrc | 46 Information and computing sciences::4603 Computer vision and multimedia computation::460306 Image processing | |
| dc.title | Five-modality semantic segmentation : dataset, encoder & decoder fusion, and per-pixel gating : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand | |
| thesis.degree.discipline | Mathematical and Computational Sciences | |
| thesis.degree.name | Doctor of Philosophy (Ph.D.) | |
| thesis.description.doctoral-citation-abridged | Mr Brenner investigated how combining five different types of sensor data could improve automatic object recognition. He created the first dataset of its kind and designed two architectures for combining sensor data. His findings revealed that early combination maximises accuracy, while late combination provides greater resilience to sensor failure. | |
| thesis.description.doctoral-citation-long | Most computer vision systems rely on a single camera, which fails under poor lighting or when objects look visually similar. Mr Brenner investigated how combining five types of sensor data — colour, depth, infrared, thermal, and ultraviolet — could improve automatic object identification. He created the first publicly available dataset combining all five data types with original high-fidelity data, developed new methods to preserve sensor detail during data preparation, and designed two fusion architectures that combine sensor information at different processing stages. His research revealed a fundamental trade-off: combining sensors early maximises accuracy, while combining them late provides greater resilience when sensors fail. | |
| thesis.description.name-pronounciation | Martin Brenner: MAR – TIN BREN – ER | |
