Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand

dc.confidential: Embargo: No
dc.contributor.advisor: Jiang, Wang
dc.contributor.author: Jiang, Tao
dc.date.accessioned: 2025-09-23T23:56:27Z
dc.date.available: 2025-09-23T23:56:27Z
dc.date.issued: 2025-09-23
dc.description.abstract: Video Salient Object Detection (VSOD) is a fundamental task in video analysis, focusing on identifying and segmenting the most visually prominent objects in dynamic scenes. In this thesis, we propose three novel deep learning-based approaches to enhance VSOD performance in complex environments.

Firstly, we introduce an Inheritance Enhancement Network (IENet), a Transformer-based framework designed to improve the integration of long-term spatial-temporal dependencies. We propose a unidirectional cross-frame enhancement mechanism, ensuring consistent and orderly information propagation across frames while minimizing interference. Extensive experiments demonstrate that IENet significantly improves detection accuracy in complex video scenes.

Secondly, we present a Knowledge-sharing Hierarchical Memory Fusion Network (KHMF-Net) to address the challenges of scribble-supervised VSOD, where annotations are sparse and ambiguous. Our approach utilizes a memory bank-based hierarchical encoder-decoder architecture to reduce error accumulation and mitigate background distractions. Additionally, we introduce a dual-attention knowledge-sharing strategy, enabling the model to refine predictions by leveraging learned information across frames. This approach effectively enhances feature consistency and improves the separation of foreground and background objects.

Thirdly, we propose the Multimodal Energy Prompting Network (MEPNet), which leverages optical flow and depth as implicit prompts to fine-tune the Segment Anything Model (SAM) for improved VSOD performance. We first introduce a Spectrogram Energy Generator (SEG), which extracts energy-driven prompts that enhance the spatial-temporal representation of salient objects. Furthermore, the Modality-Energy Adapter (MEA) effectively integrates these prompts into SAM, improving the model's ability to capture motion and structural cues. Extensive evaluations show that MEPNet effectively incorporates multimodal information, resulting in more robust and precise VSOD outcomes.

In summary, we propose three innovative approaches to improve VSOD in complex scenes. Each method undergoes extensive evaluation on benchmark datasets, achieving superior performance compared to existing state-of-the-art models. Our contributions provide new insights into advancing video-based salient object detection, paving the way for more robust and efficient VSOD frameworks.
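The unidirectional cross-frame enhancement described in the abstract can be pictured as causal attention over per-frame features, where each frame draws information only from earlier frames. The sketch below is illustrative only: the module name, tensor shapes, and the use of a standard PyTorch multi-head attention layer are assumptions for exposition, not the architecture used in the thesis.

# Illustrative sketch: unidirectional (past-to-future) cross-frame attention.
# Names, shapes, and layer choices are assumptions, not taken from the thesis.
import torch
import torch.nn as nn


class UnidirectionalCrossFrameEnhancer(nn.Module):
    """Enhance each frame's features using only earlier frames, so that
    information propagates in one consistent temporal direction."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, dim), e.g. pooled per-frame features.
        b, t, d = frame_tokens.shape
        # Causal mask: frame i may attend to frames 0..i only, never to later
        # frames, keeping the propagation order consistent and interference-free.
        causal_mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=frame_tokens.device),
            diagonal=1,
        )
        enhanced, _ = self.attn(
            frame_tokens, frame_tokens, frame_tokens, attn_mask=causal_mask
        )
        return self.norm(frame_tokens + enhanced)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 256)  # 2 clips, 8 frames, 256-d features per frame
    print(UnidirectionalCrossFrameEnhancer()(clip).shape)  # torch.Size([2, 8, 256])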
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/73601
dc.publisher: Massey University
dc.rights: © The Author
dc.subject: Video Salient Object Detection, Feature fusion, Visual transformer, Frame-aware temporal relationships
dc.subject: Scribble-supervised, Weakly supervised, Knowledge-sharing
dc.subject: Prompt learning, Multimodality, Spectrogram energy
dc.subject.anzsrc: 460309 Video processing
dc.subject.anzsrc: 461106 Semi- and unsupervised learning
dc.title: Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
thesis.degree.discipline: Computer Vision
thesis.degree.name: Doctor of Philosophy (Ph.D.)
thesis.description.doctoral-citation-abridged: Dr. Tao Jiang, in his doctoral research at the forefront of artificial intelligence, pioneered groundbreaking approaches to video understanding. His work on multimodal prompting and hierarchical reasoning systems has significantly advanced the capability of machines to perceive and interpret dynamic visual environments. By innovatively integrating foundational models, he established new standards of efficiency and accuracy in video analysis, with profound implications for autonomous systems and intelligent computing. His dissertation, completed with distinction, represents a lasting contribution to the field of computer vision.
thesis.description.doctoral-citation-long: Dr. Tao Jiang's doctoral research significantly advanced the field of video salient object detection (VSOD) by developing AI systems that can intelligently identify and track the most perceptually relevant objects in complex video sequences. His work pioneered the integration of multimodal energy prompting mechanisms and hierarchical knowledge-sharing networks, effectively leveraging foundational models like SAM to enhance computational accuracy and efficiency in dynamic visual environments. These innovations have been rigorously validated across multiple benchmarks and published in high-impact venues, demonstrating their robustness for real-world applications such as autonomous navigation, intelligent video surveillance, and augmented reality systems. His research provides a transformative framework for next-generation video analysis technologies, bridging theoretical innovation with practical implementation.
thesis.description.name-pronounciation: TAO JIANG

Files

Original bundle: JiangPhDThesis.pdf (39.01 MB, Adobe Portable Document Format)