Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 3 of 3
  • Item
    Salient Object Detection for complex scenes : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2024-09-19) Wang, Yi
    Salient Object Detection (SOD), a primary objective in computer vision, aims to locate and segment the most visually striking region within an image. In this thesis, we present three novel deep learning methods to improve SOD performance in complex scenes. Firstly, we introduce the Multiple Enhancement Network (MENet), inspired by the boundary perception, gradual enhancement, frequency decomposition, and content integrity of the Human Visual System (HVS). We propose a flexible multi-scale feature enhancement module to aggregate and refine features, and use iterative training to improve boundary and adaptive features in the dual-branch decoder of MENet. A multi-level hybrid loss guides the network in learning pixel-, region-, and object-level features. Evaluations on benchmark datasets show that MENet outperforms other SOD models, especially when the salient region contains multiple objects with varied appearances or complex shapes. Secondly, we propose TFGNet, an effective Transformer-based frequency-guided network for saliency detection. TFGNet has a parallel two-branch decoder that leverages a pixel-wise decoder and a Transformer decoder to optimise high-spatial-frequency boundary details and low-spatial-frequency salient features. A novel loss based on frequency distribution similarity is also designed to further improve performance. The experimental results indicate that TFGNet can accurately locate salient objects with more complete and precise boundaries against various complex backgrounds. This framework also rekindles awareness of the advantages of exploiting images' spatial-frequency features in SOD. Thirdly, we design a multi-source weakly supervised SOD (WSOD) framework that effectively combines pseudo-background (non-salient region) labels with scribble labels to obtain more accurate salient features. We first create a comprehensive salient pseudo-mask generator from multiple self-learned features. We also pioneer the generation of salient pseudo-labels via point-prompted and box-prompted Segment Anything Models (SAM). Then, a Transformer-based WSOD network named WBNet is proposed, which combines a pixel decoder and a Transformer decoder with an auxiliary edge predictor and a multi-source loss function to handle complex saliency detection tasks. In summary, we contribute three novel approaches to salient object detection in complex scenes. Each model achieves state-of-the-art performance across widely used benchmark datasets, as validated through comprehensive experiments. (A minimal sketch of a multi-level hybrid loss of this kind appears after this listing.)
  • Item
    Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2024-07-25) Ma, Yujun
    Action recognition aims to identify human actions in videos through complete action execution. Current action recognition approaches are primarily based on convolutional neural networks (CNNs), Transformers, or hybrids of both. Despite their strong performance, several challenges persist: (i) insufficient disentangled modelling of spatio-temporal features, (ii) a lack of fine-grained motion modelling in action representation, and (iii) limited exploration of the positional embedding of spatial tokens. In this thesis, we introduce three novel deep-learning approaches that address these challenges and enhance spatial and temporal representation in diverse action recognition tasks, including RGB-D, coarse-grained, and fine-grained action recognition. Firstly, we develop a multi-stage factorized spatio-temporal model (MFST) for RGB-D action recognition. This model addresses the limitations of existing RGB-D approaches that rely on entangled spatio-temporal 3D convolution. The MFST employs a multi-stage hierarchical structure in which each stage independently constructs the spatial and temporal dimensions. This progression from low-level features to higher-order semantic primitives results in a robust spatio-temporal representation. Secondly, we introduce a relative-position-embedding-based spatially and temporally decoupled Transformer (RPE-STDT) for coarse-grained and fine-grained action recognition. RPE-STDT addresses the high computational costs of Vision Transformers in video data processing, particularly those due to the absolute-position embedding in frame patch tokenization. RPE-STDT utilises dual Transformer encoder series: spatial encoders for intra-temporal-index token interactions, and temporal encoders for inter-temporal-dimension interactions with a subsampling strategy. Thirdly, we propose a convolutional transformer network (CTN) for fine-grained action recognition. Traditional Transformer models require extensive training data and additional supervision to rival CNNs in learning capability. The proposed CTN merges CNN strengths (e.g., weight sharing and locality) with Transformer benefits (e.g., dynamic attention and long-range dependency learning), allowing for superior fine-grained motion representation. In summary, we contribute three deep-learning models for diverse action recognition tasks. Each model achieves state-of-the-art performance across multiple widely used benchmark datasets, as validated by thorough experimentation. (A minimal sketch of a factorized spatio-temporal block in the spirit of MFST appears after this listing.)
  • Item
    Grape yield analysis with 3D cameras and ultrasonic phased arrays : a thesis by publications presented in fulfillment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Albany, New Zealand
    (Massey University, 2024-01-18) Parr, Baden
    Accurate and timely estimation of vineyard yield is crucial for the profitability of vineyards. It enables better management of vineyard logistics, precise application of inputs, and optimization of grape quality at harvest for higher returns. However, the traditional manual process of yield estimation is prone to errors and subjectivity. Additionally, the financial burden of this manual process often leads to inadequate sampling, potentially resulting in sub-optimal insights for vineyard management. As such, there is growing interest in automating yield estimation using computer vision techniques and novel applications of technologies such as ultrasound. Computer vision has seen significant use in viticulture. Current state-of-the-art 2D approaches, powered by advanced object detection models, can accurately identify grape bunches and individual grapes. However, these methods are limited by the physical constraints of the vineyard environment. Challenges such as occlusions caused by foliage, estimating the hidden parts of grape bunches, and determining berry sizes and distributions still lack clear solutions. Capturing 3D information about the spatial size and position of grape berries has been presented as the next step towards addressing these issues. With 3D information, the size of individual grapes can be estimated, the surface curvature of berries can be used as an identifying feature, and the position of grape bunches with respect to occlusions can be used to compute alternative perspectives or estimate occlusion ratios. Researchers have demonstrated some of this value with 3D information captured through traditional means, such as photogrammetry and lab-based laser scanners; however, these face challenges in real-world environments due to processing time and cost. Efficiently capturing 3D information is a rapidly evolving field, with recent advancements in real-time 3D camera technologies being a significant driver. This thesis presents a comprehensive analysis of the performance of available 3D camera technologies for grape yield estimation. Of the technologies tested, we determined that individual berries and the concave details between neighbouring grapes were better represented by time-of-flight (ToF) based technologies. Furthermore, these worked well regardless of ambient lighting conditions, including direct sunlight. However, distortions of individual grapes were observed in both ToF and LiDAR 3D scans. This is due to subsurface scattering: the emitted light enters the grapes before returning, changing the propagation time and, by extension, the measured distance. We exploit these distortions as unique features and present a novel solution, working in synergy with state-of-the-art 2D object detection, to find and reconstruct, in 3D, grape bunches scanned in the field by a modern smartphone. An R² value of 0.946 and an average precision of 0.970 were achieved when comparing our results to manual counts. Furthermore, our novel size estimation algorithm was able to accurately estimate berry sizes when manually compared against matching colour images. This work represents a novel and objective yield estimation tool that can be used on modern smartphones equipped with 3D cameras. Occlusion of grape bunches by foliage remains a challenge for automating grape yield estimation using computer vision, and it is not always practical or possible to move or trim foliage prior to image capture. To this end, research has started investigating alternative techniques to see through foliage-based occlusions. This thesis introduces a novel ultrasonic-based approach that can volumetrically visualise grape bunches directly occluded by foliage. It is achieved through the use of a highly directional ultrasonic phased array and novel signal processing techniques to produce 3D convex hulls of foliage and grape bunches. We utilise a novel approach of agitating the foliage to enable spatial variance filtering, which removes leaves and highlights specific volumes that may belong to grape bunches (a minimal sketch of this filtering idea appears after this listing). This technique has wide-reaching potential, in viticulture and beyond.
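
For readers wanting a concrete starting point, the following is a minimal sketch of a multi-level hybrid loss in the spirit of the pixel-, region-, and object-level supervision described for MENet in the first item above. The abstract does not specify the individual terms, so the choices below (binary cross-entropy, soft IoU, and a global foreground/background consistency term), the function name, and the weights are illustrative assumptions, not the thesis formulation.

```python
# Sketch of a multi-level hybrid loss for salient object detection.
# The specific terms are assumptions: pixel level = BCE, region level = soft IoU,
# object level = a simple global foreground/background consistency term.
import torch
import torch.nn.functional as F

def hybrid_saliency_loss(logits: torch.Tensor, mask: torch.Tensor,
                         w_pixel: float = 1.0, w_region: float = 1.0,
                         w_object: float = 1.0) -> torch.Tensor:
    """logits, mask: (B, 1, H, W); mask is a binary ground-truth map (float)."""
    pred = torch.sigmoid(logits)

    # Pixel level: binary cross-entropy on every pixel.
    l_pixel = F.binary_cross_entropy_with_logits(logits, mask)

    # Region level: soft IoU computed per image, then averaged.
    inter = (pred * mask).sum(dim=(2, 3))
    union = (pred + mask - pred * mask).sum(dim=(2, 3))
    l_region = (1.0 - (inter + 1e-6) / (union + 1e-6)).mean()

    # Object level: push the mean prediction inside the object towards 1
    # and the mean prediction over the background towards 0.
    fg = (pred * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)
    bg = (pred * (1 - mask)).sum(dim=(2, 3)) / ((1 - mask).sum(dim=(2, 3)) + 1e-6)
    l_object = ((1.0 - fg) + bg).mean()

    return w_pixel * l_pixel + w_region * l_region + w_object * l_object
```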
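
Likewise, the sketch below illustrates the factorized ("disentangled") spatio-temporal convolution idea behind MFST in the second item: a spatial 2D step followed by a temporal 1D step rather than one entangled 3D kernel. The kernel sizes, the FactorizedSTBlock name, and the single-block structure are illustrative assumptions; the thesis describes a multi-stage hierarchy that is not reproduced here.

```python
# Sketch of a factorized spatio-temporal block on (B, C, T, H, W) video input.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Spatial 2D convolution followed by temporal 1D convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Spatial step: a 1 x 3 x 3 kernel touches only the H and W axes.
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        # Temporal step: a 3 x 1 x 1 kernel touches only the frame axis T.
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))
        self.norm = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.spatial(x))
        x = self.act(self.norm(self.temporal(x)))
        return x

# Usage: a batch of two clips, each 8 RGB frames at 112 x 112.
clip = torch.randn(2, 3, 8, 112, 112)
feat = FactorizedSTBlock(3, 64)(clip)   # -> (2, 64, 8, 112, 112)
```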
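
Finally, a minimal sketch of the spatial variance filtering described in the third item: while the canopy is agitated, leaves move between captures and grape bunches largely do not, so voxels whose echo intensity varies strongly over time can be suppressed. The array shapes, thresholds, intensity representation, and the variance_filter name are illustrative assumptions about how such a filter might be prototyped, not the thesis implementation.

```python
# Sketch of spatial variance filtering over a stack of beamformed ultrasonic volumes.
import numpy as np

def variance_filter(volumes: np.ndarray,
                    occupancy_thresh: float = 0.2,
                    variance_thresh: float = 0.05) -> np.ndarray:
    """
    volumes: (N, X, Y, Z) stack of N beamformed intensity volumes captured
             while the foliage is agitated, with values scaled to [0, 1].
    Returns a boolean (X, Y, Z) mask of stable, occupied voxels that are
    candidate grape-bunch volume.
    """
    mean_intensity = volumes.mean(axis=0)   # average echo strength per voxel
    temporal_var = volumes.var(axis=0)      # how much each voxel changes over time

    occupied = mean_intensity > occupancy_thresh   # something reflects here...
    stable = temporal_var < variance_thresh        # ...and it barely moves
    return occupied & stable

# Usage with synthetic data: 16 captures of a 64^3 voxel grid.
vols = np.random.rand(16, 64, 64, 64).astype(np.float32)
candidate_mask = variance_filter(vols)
```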