Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
| Field | Value |
| --- | --- |
| dc.confidential | Embargo: No |
| dc.contributor.advisor | Jiang, Wang |
| dc.contributor.author | Jiang, Tao |
| dc.date.accessioned | 2025-09-23T23:56:27Z |
| dc.date.available | 2025-09-23T23:56:27Z |
| dc.date.issued | 2025-09-23 |
| dc.description.abstract | Video Salient Object Detection (VSOD) is a fundamental task in video analysis, focusing on identifying and segmenting the most visually prominent objects in dynamic scenes. In this thesis, we propose three novel deep learning-based approaches to enhance VSOD performance in complex environments. Firstly, we introduce an Inheritance Enhancement Network (IENet), a Transformer-based framework designed to improve the integration of long-term spatial-temporal dependencies. We propose a unidirectional cross-frame enhancement mechanism, ensuring consistent and orderly information propagation across frames while minimizing interference. Extensive experiments demonstrate that IENet significantly improves detection accuracy in complex video scenes. Secondly, we present a Knowledge-sharing Hierarchical Memory Fusion Network (KHMF-Net) to address the challenges of scribble-supervised VSOD, where annotations are sparse and ambiguous. Our approach utilizes a memory bank-based hierarchical encoder-decoder architecture to reduce error accumulation and mitigate background distractions. Additionally, we introduce a dual-attention knowledge-sharing strategy, enabling the model to refine predictions by leveraging learned information across frames. This approach effectively enhances feature consistency and improves the separation of foreground and background objects. Thirdly, we propose the Multimodal Energy Prompting Network (MEPNet), which leverages optical flow and depth as implicit prompts to fine-tune a Segment Anything Model (SAM) for improved VSOD performance. We first introduce a Spectrogram Energy Generator (SEG), which extracts energy-driven prompts that enhance the spatial-temporal representation of salient objects. Furthermore, the Modality-Energy Adapter (MEA) effectively integrates these prompts into SAM, improving the model's ability to capture motion and structural cues. Extensive evaluations show that MEPNet effectively incorporates multimodal information, resulting in more robust and precise VSOD outcomes. In summary, we propose three innovative approaches to improve VSOD in complex scenes. Each method undergoes extensive evaluation on benchmark datasets, achieving superior performance compared to existing state-of-the-art models. Our contributions provide new insights into advancing video-based salient object detection, paving the way for more robust and efficient VSOD frameworks. |
| dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/73601 |
| dc.publisher | Massey University |
| dc.rights | © The Author |
| dc.subject | Video Salient Object Detection, Feature fusion, Visual transformer, Frame-aware temporal relationships |
| dc.subject | Scribble-supervised, Weakly supervised, Knowledge-sharing |
| dc.subject | Prompt learning, Multimodality, Spectrogram energy |
| dc.subject.anzsrc | 460309 Video processing |
| dc.subject.anzsrc | 461106 Semi- and unsupervised learning |
| dc.title | Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand |
| thesis.degree.discipline | Computer Vision |
| thesis.degree.name | Doctor of Philosophy (Ph.D.) |
| thesis.description.doctoral-citation-abridged | Dr. Tao Jiang, in his doctoral research at the forefront of artificial intelligence, pioneered groundbreaking approaches to video understanding. His work on multimodal prompting and hierarchical reasoning systems has significantly advanced the capability of machines to perceive and interpret dynamic visual environments. By innovatively integrating foundational models, he established new standards of efficiency and accuracy in video analysis, with profound implications for autonomous systems and intelligent computing. His dissertation, completed with distinction, represents a lasting contribution to the field of computer vision. |
| thesis.description.doctoral-citation-long | Dr. Tao Jiang's doctoral research significantly advanced the field of video salient object detection (VSOD) by developing AI systems that can intelligently identify and track the most perceptually relevant objects in complex video sequences. His work pioneered the integration of multimodal energy prompting mechanisms and hierarchical knowledge-sharing networks, effectively leveraging foundational models like SAM to enhance computational accuracy and efficiency in dynamic visual environments. These innovations have been rigorously validated across multiple benchmarks and published in high-impact venues, demonstrating their robustness for real-world applications such as autonomous navigation, intelligent video surveillance, and augmented reality systems. His research provides a transformative framework for next-generation video analysis technologies, bridging theoretical innovation with practical implementation. |
| thesis.description.name-pronounciation | TAO JIANG |
Files
Original bundle
- Name: JiangPhDThesis.pdf
- Size: 39.01 MB
- Format: Adobe Portable Document Format
- Description: JiangPhDThesis.pdf
