GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion
| Field | Value |
| --- | --- |
| dc.citation.volume | 129 |
| dc.contributor.author | Brenner M |
| dc.contributor.author | Reyes NH |
| dc.contributor.author | Susnjak T |
| dc.contributor.author | Barczak ALC |
| dc.date.accessioned | 2026-01-06T00:24:15Z |
| dc.date.issued | 2026-05-01 |
| dc.description.abstract | We introduce GatedFusion-Net (GF-Net), built on the SegFormer Transformer backbone, as the first architecture to unify RGB, depth (D), infrared intensity (I), thermal (T), and ultraviolet (UV) imagery for dense semantic segmentation on the MM5 dataset. GF-Net departs from the CMX baseline via: (1) stage-wise RGB-intensity-depth enhancement that injects geometrically aligned D and I cues at each encoder stage, together with surface normals (N), improving illumination invariance without adding parameters; (2) per-pixel sigmoid gating, where independent Sigmoid Gate blocks learn spatial confidence masks for T and UV and add their contributions to the RGB+DIN base, trimming computational cost while preserving accuracy; and (3) modality-wise normalisation using per-stream statistics computed on MM5 to stabilise training and balance cross-cue influence. An ablation study shows that the five-modality configuration (RGB+DIN+T+UV) achieves a peak mean IoU of 88.3%, with the UV channel contributing a 1.7-percentage-point gain under optimal lighting (RGB3). Under challenging illumination, it maintains comparable performance, indicating complementary but situational value. Modality-ablation experiments reveal strong sensitivity: removing RGB, T, DIN, or UV yields relative mean IoU reductions of 83.4%, 63.3%, 56.5%, and 30.1%, respectively. Sigmoid-Gate fusion behaves primarily as static, lighting-dependent weighting rather than adapting to sensor loss. Throughput on an RTX 3090 with a MiT-B0 backbone is real-time: 640×480 at 74 fps for RGB+DIN+T, 55 fps for RGB+DIN+T+UV, and 41 fps with five gated streams. These results establish the first RGB-D-I-T-UV segmentation baselines on MM5 and show that per-pixel sigmoid gating is a lightweight, effective alternative to heavier attention-based fusion. |
| dc.description.confidential | false |
| dc.edition.edition | May 2026 |
| dc.identifier.citation | Brenner M, Reyes NH, Susnjak T, Barczak ALC. (2026). GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion. Information Fusion. 129. |
| dc.identifier.doi | 10.1016/j.inffus.2025.103986 |
| dc.identifier.eissn | 1872-6305 |
| dc.identifier.elements-type | journal-article |
| dc.identifier.issn | 1566-2535 |
| dc.identifier.number | 103986 |
| dc.identifier.pii | S1566253525010486 |
| dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/73979 |
| dc.language | English |
| dc.publisher | Elsevier B.V. |
| dc.publisher.uri | https://www.sciencedirect.com/science/article/pii/S1566253525010486 |
| dc.relation.isPartOf | Information Fusion |
| dc.rights | CC BY 4.0 |
| dc.rights | © 2025 The Author(s) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.subject | Multimodal fusion |
| dc.subject | Thermal imaging |
| dc.subject | UV imaging |
| dc.subject | Preprocessing |
| dc.subject | Sensor fusion |
| dc.subject | Semantic segmentation |
| dc.subject | Vision transformers |
| dc.subject | Real-time fusion |
| dc.title | GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion |
| dc.type | Journal article |
| pubs.elements-id | 608922 |
| pubs.organisational-group | Other |
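The per-pixel sigmoid gating summarised in the abstract — a learned spatial confidence mask that weights an auxiliary modality (e.g. T or UV) before adding it to the RGB+DIN base — can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the paper's implementation: the function name `gated_fuse`, the tensor shapes, and the use of precomputed gate logits (rather than a learned convolutional gate) are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, mapping logits to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(base, aux, gate_logits):
    """Per-pixel sigmoid gating (hypothetical sketch).

    base        : (C, H, W) fused RGB+DIN feature map
    aux         : (C, H, W) auxiliary-modality feature map (e.g. T or UV)
    gate_logits : (H, W)    per-pixel gate logits; in GF-Net these would
                            come from a learned Sigmoid Gate block
    Returns base plus the aux features weighted by a spatial confidence mask.
    """
    mask = sigmoid(gate_logits)           # (H, W), confidence in [0, 1]
    return base + mask[None, :, :] * aux  # broadcast mask over channels

# Toy example: 2 channels on a 2x2 grid, gate logits of zero
# so every pixel passes the auxiliary cue at half strength.
base = np.ones((2, 2, 2))
aux = np.full((2, 2, 2), 2.0)
fused = gated_fuse(base, aux, np.zeros((2, 2)))
print(fused[0, 0, 0])  # sigmoid(0) = 0.5, so 1 + 0.5 * 2 = 2.0
```

Because the gate is a cheap elementwise mask rather than cross-modal attention, each extra modality adds only a small per-pixel cost, which is consistent with the abstract's framing of sigmoid gating as a lightweight alternative to heavier attention-based fusion.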

