GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion

Authors: Brenner M; Reyes NH; Susnjak T; Barczak ALC
Dates: 2026-01-06; 2026-05-01
Citation: Brenner M, Reyes NH, Susnjak T, Barczak ALC. (2026). GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion. Information Fusion, 129, 103986.
Type: Journal article
ISSN: 1566-2535 (print); 1872-6305 (online)
DOI: 10.1016/j.inffus.2025.103986
Article number: 103986
PII: S1566253525010486
Repository record: https://mro.massey.ac.nz/handle/10179/73979
Rights: CC BY 4.0; (c) 2025 The Author(s); https://creativecommons.org/licenses/by/4.0/
Keywords: Multimodal fusion; Thermal imaging; UV imaging; Preprocessing; Sensor fusion; Semantic segmentation; Vision transformers; Real-time fusion

Abstract: We introduce GatedFusion-Net (GF-Net), built on the SegFormer Transformer backbone, as the first architecture to unify RGB, depth (D), infrared intensity (I), thermal (T), and ultraviolet (UV) imagery for dense semantic segmentation on the MM5 dataset. GF-Net departs from the CMX baseline via: (1) stage-wise RGB-intensity-depth enhancement that injects geometrically aligned D and I cues, together with surface normals (N), at each encoder stage, improving illumination invariance without adding parameters; (2) per-pixel sigmoid gating, where independent Sigmoid Gate blocks learn spatial confidence masks for T and UV and add their contributions to the RGB+DIN base, trimming computational cost while preserving accuracy; and (3) modality-wise normalisation using per-stream statistics computed on MM5 to stabilise training and balance cross-cue influence. An ablation study reveals that the five-modality configuration (RGB+DIN+T+UV) achieves a peak mean IoU of 88.3%, with the UV channel contributing a 1.7-percentage-point gain under optimal lighting (RGB3). Under challenging illumination it maintains comparable performance, indicating complementary but situational value. Modality-ablation experiments reveal strong sensitivity: removing RGB, T, DIN, or UV yields relative mean IoU reductions of 83.4%, 63.3%, 56.5%, and 30.1%, respectively. Sigmoid-Gate fusion behaves primarily as static, lighting-dependent weighting rather than adapting to sensor loss. Throughput on an RTX 3090 with a MiT-B0 backbone is real-time: 640 × 480 at 74 fps for RGB+DIN+T, 55 fps for RGB+DIN+T+UV, and 41 fps with five gated streams. These results establish the first RGB-D-I-T-UV segmentation baselines on MM5 and show that per-pixel sigmoid gating is a lightweight, effective alternative to heavier attention-based fusion.
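The per-pixel sigmoid gating described in the abstract can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function name, the use of a 1x1-convolution-style projection to produce the confidence logit, and all shapes are assumptions; the sketch only shows the general pattern of adding an auxiliary modality (e.g. T or UV) to the base stream weighted by a learned per-pixel mask in (0, 1).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_gate_fuse(base, aux, w, b):
    """Per-pixel sigmoid gating (illustrative sketch, not the paper's code).

    base: (C, H, W) features from the base stream (e.g. RGB+DIN)
    aux:  (C, H, W) features from an auxiliary modality (e.g. T or UV)
    w:    (C,) weights of a 1x1-conv-style projection to a confidence logit
    b:    scalar bias for that projection

    Returns base plus the auxiliary features weighted by an (H, W)
    confidence mask, so each pixel independently decides how much of
    the auxiliary cue to admit.
    """
    logit = np.tensordot(w, aux, axes=([0], [0])) + b  # (H, W) logits
    mask = sigmoid(logit)                              # per-pixel gate in (0, 1)
    return base + mask[None, :, :] * aux               # broadcast over channels


# Toy usage: gate a hypothetical thermal stream into the base features
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 4, 4))     # assumed C=8, H=W=4
thermal = rng.standard_normal((8, 4, 4))
w = 0.1 * rng.standard_normal(8)
fused = sigmoid_gate_fuse(base, thermal, w, 0.0)
```

Because the gate is a sigmoid, the fused output always lies between `base` (mask near 0) and `base + aux` (mask near 1), which matches the abstract's observation that this fusion acts as a soft, lighting-dependent weighting rather than a hard modality switch.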