Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
| Field | Value |
| --- | --- |
| dc.confidential | Embargo: No |
| dc.contributor.advisor | Jiang, Wang |
| dc.contributor.author | Jiang, Tao |
| dc.date.accessioned | 2025-09-23T23:56:27Z |
| dc.date.available | 2025-09-23T23:56:27Z |
| dc.date.issued | 2025-09-23 |
| dc.description.abstract | Video Salient Object Detection (VSOD) is a fundamental task in video analysis, focusing on identifying and segmenting the most visually prominent objects in dynamic scenes. In this thesis, we propose three novel deep learning-based approaches to enhance VSOD performance in complex environments. Firstly, we introduce an Inheritance Enhancement Network (IENet), a Transformer-based framework designed to improve the integration of long-term spatial-temporal dependencies. We propose a unidirectional cross-frame enhancement mechanism, ensuring consistent and orderly information propagation across frames while minimizing interference. Extensive experiments demonstrate that IENet significantly improves detection accuracy in complex video scenes. Secondly, we present a Knowledge-sharing Hierarchical Memory Fusion Network (KHMF-Net) to address the challenges of scribble-supervised VSOD, where annotations are sparse and ambiguous. Our approach utilizes a memory bank-based hierarchical encoder-decoder architecture to reduce error accumulation and mitigate background distractions. Additionally, we introduce a dual-attention knowledge-sharing strategy, enabling the model to refine predictions by leveraging learned information across frames. This approach effectively enhances feature consistency and improves the separation of foreground and background objects. Thirdly, we propose the Multimodal Energy Prompting Network (MEPNet), which leverages optical flow and depth as implicit prompts to fine-tune a Segment Anything Model (SAM) for improved VSOD performance. We first introduce a Spectrogram Energy Generator (SEG), which extracts energy-driven prompts that enhance the spatial-temporal representation of salient objects. Furthermore, the Modality-Energy Adapter (MEA) effectively integrates these prompts into SAM, improving the model's ability to capture motion and structural cues. Extensive evaluations show that MEPNet effectively incorporates multimodal information, resulting in more robust and precise VSOD outcomes. In summary, we propose three innovative approaches to improve VSOD in complex scenes. Each method undergoes extensive evaluation on benchmark datasets, achieving superior performance compared to existing state-of-the-art models. Our contributions provide new insights into advancing video-based salient object detection, paving the way for more robust and efficient VSOD frameworks. |
| dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/73601 |
| dc.publisher | Massey University |
| dc.rights | © The Author |
| dc.subject | Video Salient Object Detection, Feature fusion, Visual transformer, Frame-aware temporal relationships |
| dc.subject | Scribble-supervised, Weakly supervised, Knowledge-sharing |
| dc.subject | Prompt learning, Multimodality, Spectrogram energy |
| dc.subject.anzsrc | 460309 Video processing |
| dc.subject.anzsrc | 461106 Semi- and unsupervised learning |
| dc.title | Deep learning for video salient object detection : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand |
| thesis.degree.discipline | Computer Vision |
| thesis.degree.name | Doctor of Philosophy (Ph.D.) |
| thesis.description.doctoral-citation-abridged | Dr. Tao Jiang, in his doctoral research at the forefront of artificial intelligence, pioneered groundbreaking approaches to video understanding. His work on multimodal prompting and hierarchical reasoning systems has significantly advanced the capability of machines to perceive and interpret dynamic visual environments. By innovatively integrating foundational models, he established new standards of efficiency and accuracy in video analysis, with profound implications for autonomous systems and intelligent computing. His dissertation, completed with distinction, represents a lasting contribution to the field of computer vision. |
| thesis.description.doctoral-citation-long | Dr. Tao Jiang's doctoral research significantly advanced the field of video salient object detection (VSOD) by developing AI systems that can intelligently identify and track the most perceptually relevant objects in complex video sequences. His work pioneered the integration of multimodal energy prompting mechanisms and hierarchical knowledge-sharing networks, effectively leveraging foundational models like SAM to enhance computational accuracy and efficiency in dynamic visual environments. These innovations have been rigorously validated across multiple benchmarks and published in high-impact venues, demonstrating their robustness for real-world applications such as autonomous navigation, intelligent video surveillance, and augmented reality systems. His research provides a transformative framework for next-generation video analysis technologies, bridging theoretical innovation with practical implementation. |
| thesis.description.name-pronounciation | TAO JIANG |
Files
Original bundle
- Name: JiangPhDThesis.pdf
- Size: 39.01 MB
- Format: Adobe Portable Document Format
- Description: JiangPhDThesis.pdf
