GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion

dc.citation.volume: 129
dc.contributor.author: Brenner M
dc.contributor.author: Reyes NH
dc.contributor.author: Susnjak T
dc.contributor.author: Barczak ALC
dc.date.accessioned: 2026-01-06T00:24:15Z
dc.date.issued: 2026-05-01
dc.description.abstract: We introduce GatedFusion-Net (GF-Net), built on the SegFormer Transformer backbone, as the first architecture to unify RGB, depth (D), infrared intensity (I), thermal (T), and ultraviolet (UV) imagery for dense semantic segmentation on the MM5 dataset. GF-Net departs from the CMX baseline via: (1) stage-wise RGB-intensity-depth enhancement that injects geometrically aligned D and I cues at each encoder stage, together with surface normals (N), improving illumination invariance without adding parameters; (2) per-pixel sigmoid gating, where independent Sigmoid Gate blocks learn spatial confidence masks for T and UV and add their contributions to the RGB+DIN base, trimming computational cost while preserving accuracy; and (3) modality-wise normalisation using per-stream statistics computed on MM5 to stabilise training and balance cross-cue influence. An ablation study reveals that the five-modality configuration (RGB+DIN+T+UV) achieves a peak mean IoU of 88.3%, with the UV channel contributing a 1.7-percentage-point gain under optimal lighting (RGB3). Under challenging illumination, it maintains comparable performance, indicating complementary but situational value. Modality-ablation experiments reveal strong sensitivity: removing RGB, T, DIN, or UV yields relative mean IoU reductions of 83.4%, 63.3%, 56.5%, and 30.1%, respectively. Sigmoid-Gate fusion behaves primarily as static, lighting-dependent weighting rather than adapting to sensor loss. Throughput on an RTX 3090 with a MiT-B0 backbone is real-time: 640 × 480 at 74 fps for RGB+DIN+T, 55 fps for RGB+DIN+T+UV, and 41 fps with five gated streams. These results establish the first RGB-D-I-T-UV segmentation baselines on MM5 and show that per-pixel sigmoid gating is a lightweight, effective alternative to heavier attention-based fusion.
dc.description.confidential: false
dc.edition.edition: May 2026
dc.identifier.citation: Brenner M, Reyes NH, Susnjak T, Barczak ALC. (2026). GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion. Information Fusion. 129.
dc.identifier.doi: 10.1016/j.inffus.2025.103986
dc.identifier.eissn: 1872-6305
dc.identifier.elements-type: journal-article
dc.identifier.issn: 1566-2535
dc.identifier.number: 103986
dc.identifier.pii: S1566253525010486
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/73979
dc.language: English
dc.publisher: Elsevier B V
dc.publisher.uri: https://www.sciencedirect.com/science/article/pii/S1566253525010486
dc.relation.isPartOf: Information Fusion
dc.rights: CC BY 4.0
dc.rights: (c) 2025 The Author/s
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Multimodal fusion
dc.subject: Thermal imaging
dc.subject: UV imaging
dc.subject: Preprocessing
dc.subject: Sensor fusion
dc.subject: Semantic segmentation
dc.subject: Vision transformers
dc.subject: Real-time fusion
dc.title: GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion
dc.type: Journal article
pubs.elements-id: 608922
pubs.organisational-group: Other
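
The abstract above describes the per-pixel Sigmoid Gate fusion and the modality-wise normalisation only at a high level. Below is a minimal sketch of one plausible reading, assuming a PyTorch-style implementation: the SigmoidGate and normalize_stream names, the 1x1-convolution gate, and the placeholder per-stream statistics are illustrative assumptions, not the authors' published code.

import torch
import torch.nn as nn


def normalize_stream(x: torch.Tensor, mean: float, std: float) -> torch.Tensor:
    # Modality-wise normalisation with per-stream statistics. The abstract
    # normalises each modality with statistics computed on MM5; the scalar
    # mean/std arguments here stand in for those values.
    return (x - mean) / std


class SigmoidGate(nn.Module):
    # Per-pixel sigmoid gate for one auxiliary modality (e.g. T or UV):
    # a 1x1 convolution followed by a sigmoid yields a spatial confidence
    # mask in [0, 1], and the gated auxiliary features are added to the
    # RGB+DIN base stream.
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, base: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        confidence = self.gate(aux)        # (B, 1, H, W) spatial confidence mask
        return base + confidence * aux     # additive, per-pixel weighted fusion


if __name__ == "__main__":
    b, c, h, w = 1, 64, 120, 160
    base = torch.randn(b, c, h, w)                       # RGB+DIN stage features
    thermal = normalize_stream(torch.randn(b, c, h, w), mean=0.35, std=0.18)
    uv = normalize_stream(torch.randn(b, c, h, w), mean=0.21, std=0.12)

    gate_t, gate_uv = SigmoidGate(c), SigmoidGate(c)     # independent gates per modality
    fused = gate_uv(gate_t(base, thermal), uv)
    print(fused.shape)                                    # torch.Size([1, 64, 120, 160])

Because each gate collapses to a single spatial confidence map, this style of fusion adds very little compute relative to cross-attention blocks, which is consistent with the real-time throughput figures reported in the abstract.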

Files

Original bundle

Name: 608922 PDF.pdf
Size: 15.43 MB
Format: Adobe Portable Document Format
Description: Evidence

License bundle

Name: license.txt
Size: 9.22 KB
Format: Plain Text
