Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294


Search Results

Now showing 1 - 10 of 22
  • Item
    Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2024-07-25) Ma, Yujun
    Action recognition aims to identify human actions in videos through complete action execution. Current action recognition approaches are primarily based on convolutional neural networks (CNNs), Transformers, or hybrids of both. Despite their strong performance, several challenges persist: (i) insufficient disentangled modelling of spatio-temporal features, (ii) a lack of fine-grained motion modelling in action representation, and (iii) limited exploration of the positional embedding of spatial tokens. In this thesis, we introduce three novel deep-learning approaches that address these challenges and enhance spatial and temporal representation in diverse action recognition tasks, including RGB-D, coarse-grained, and fine-grained action recognition.

    Firstly, we develop a multi-stage factorized spatio-temporal model (MFST) for RGB-D action recognition. This model addresses the limitations of existing RGB-D approaches that rely on entangled spatio-temporal 3D convolution. The MFST employs a multi-stage hierarchical structure where each stage independently constructs spatio-temporal dimensions. This progression from low-level features to higher-order semantic primitives results in a robust spatio-temporal representation.

    Secondly, we introduce a relative-position embedding based spatially and temporally decoupled Transformer (RPE-STDT) for coarse-grained and fine-grained action recognition. RPE-STDT addresses the high computational costs of Vision Transformers in video data processing, particularly due to the absolute-position embedding in frame patch tokenization. RPE-STDT utilizes dual Transformer encoder series: spatial encoders for intra-temporal index token interactions, and temporal encoders for inter-temporal dimension interactions with a subsampling strategy.

    Thirdly, we propose a convolutional transformer network (CTN) for fine-grained action recognition. Traditional Transformer models require extensive training data and additional supervision to rival CNNs in learning capabilities. The proposed CTN merges CNN's strengths (e.g., weight sharing and locality) with Transformer benefits (e.g., dynamic attention and long-range dependency learning), allowing for superior fine-grained motion representation.

    In summary, we contribute three deep-learning models for diverse action recognition tasks. Each model achieves state-of-the-art performance across multiple benchmark datasets, as validated by thorough experimentation.
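
    The factorisation idea behind the first contribution can be illustrated with a minimal PyTorch sketch that replaces one entangled 3D convolution with a spatial convolution followed by a temporal convolution. The block structure, channel sizes and clip shape below are illustrative assumptions, not the thesis's actual MFST architecture.

        import torch
        import torch.nn as nn

        class FactorizedSTBlock(nn.Module):
            """Toy factorized spatio-temporal block: a spatial conv (1 x k x k)
            followed by a temporal conv (t x 1 x 1), rather than one entangled
            3D convolution. Channel sizes are illustrative only."""
            def __init__(self, in_ch, out_ch):
                super().__init__()
                self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
                self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x):                    # x: (batch, channels, time, height, width)
                x = self.relu(self.spatial(x))       # intra-frame appearance features
                return self.relu(self.temporal(x))   # inter-frame motion features

        clip = torch.randn(2, 3, 16, 112, 112)       # two clips of 16 RGB frames
        print(FactorizedSTBlock(3, 64)(clip).shape)  # torch.Size([2, 64, 16, 112, 112])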
  • Item
    Grape yield analysis with 3D cameras and ultrasonic phased arrays : a thesis by publications presented in fulfillment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Albany, New Zealand
    (Massey University, 2024-01-18) Parr, Baden
    Accurate and timely estimation of vineyard yield is crucial for the profitability of vineyards. It enables better management of vineyard logistics, precise application of inputs, and optimization of grape quality at harvest for higher returns. However, the traditional manual process of yield estimation is prone to errors and subjectivity. Additionally, the financial burden of this manual process often leads to inadequate sampling, potentially resulting in sub-optimal insights for vineyard management. As such, there is a growing interest in automating yield estimation using computer vision techniques and novel applications of technologies such as ultrasound.

    Computer vision has seen significant use in viticulture. Current state-of-the-art 2D approaches, powered by advanced object detection models, can accurately identify grape bunches and individual grapes. However, these methods are limited by the physical constraints of the vineyard environment. Challenges such as occlusions caused by foliage, estimating the hidden parts of grape bunches, and determining berry sizes and distributions still lack clear solutions. Capturing 3D information about the spatial size and position of grape berries has been presented as the next step towards addressing these issues. By using 3D information, the size of individual grapes can be estimated, the surface curvature of berries can be used as an identifying feature, and the position of grape bunches with respect to occlusions can be used to compute alternative perspectives or estimate occlusion ratios. Researchers have demonstrated some of this value with 3D information captured through traditional means, such as photogrammetry and lab-based laser scanners. However, these face challenges in real-world environments due to processing time and cost.

    Efficiently capturing 3D information is a rapidly evolving field, with recent advancements in real-time 3D camera technologies being a significant driver. This thesis presents a comprehensive analysis of the performance of available 3D camera technologies for grape yield estimation. Of the technologies tested, we determined that individual berries and concave details between neighbouring grapes were better represented by time-of-flight (ToF) based technologies. Furthermore, they worked well regardless of ambient lighting conditions, including direct sunlight. However, distortions of individual grapes were observed in both ToF and LiDAR 3D scans. This is due to subsurface scattering of the emitted light entering the grapes before returning, changing the propagation time and, by extension, the measured distance. We exploit these distortions as unique features and present a novel solution, working in synergy with state-of-the-art 2D object detection, to find and reconstruct, in 3D, grape bunches scanned in the field by a modern smartphone. An R² value of 0.946 and an average precision of 0.970 were achieved when comparing our results to manual counts. Furthermore, our novel size estimation algorithm was able to accurately estimate berry sizes when compared manually against matching colour images. This work represents a novel and objective yield estimation tool that can be used on modern smartphones equipped with 3D cameras. Occlusion of grape bunches due to foliage remains a challenge for automating grape yield estimation using computer vision. It is not always practical or possible to move or trim foliage prior to image capture.
    To this end, research has started investigating alternative techniques to see through foliage-based occlusions. This thesis introduces a novel ultrasonic-based approach that is able to volumetrically visualise grape bunches directly occluded by foliage. This is achieved through the use of a highly directional ultrasonic phased array and novel signal processing techniques to produce 3D convex hulls of foliage and grape bunches. We agitate the foliage to enable spatial variance filtering, which removes leaves and highlights specific volumes that may belong to grape bunches. This technique has wide-reaching potential in viticulture and beyond.
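
    The spatial variance filtering step described above can be sketched in a few lines of NumPy: voxels whose intensity varies strongly over time while the canopy is agitated are treated as foliage and suppressed, while temporally stable voxels are kept as candidate bunch volumes. The grid, threshold and synthetic data below are illustrative assumptions, not the thesis's ultrasonic processing pipeline.

        import numpy as np

        def variance_filter(frames, var_thresh=0.01):
            """Toy spatial-variance filter: given a stack of volumetric intensity
            frames (time, z, y, x) captured while the canopy is agitated, keep only
            voxels whose intensity varies little over time. Agitated leaves produce
            high temporal variance; rigid, occluded bunches stay comparatively
            stable. The threshold and grid are illustrative."""
            frames = np.asarray(frames, dtype=float)
            mean = frames.mean(axis=0)
            var = frames.var(axis=0)
            stable = var < var_thresh        # mask of temporally stable voxels
            return mean * stable             # suppress everything else

        # Synthetic example: a static "bunch" blob plus randomly flickering "leaves".
        rng = np.random.default_rng(0)
        vol = np.zeros((10, 32, 32, 32))
        vol[:, 10:15, 10:15, 10:15] = 1.0                         # static bunch
        vol[:, 20:25, 20:25, 20:25] = rng.random((10, 5, 5, 5))   # moving foliage
        filtered = variance_filter(vol)
        print(filtered[12, 12, 12], filtered[22, 22, 22])  # bunch voxel kept (~1.0), foliage voxel very likely suppressed (0.0)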
  • Item
    Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2021) Zong, Ming
    Video action recognition is a challenging task in video processing. In this thesis, we propose three novel deep learning approaches to improve the accuracy of action recognition.

    The first approach aims to learn multi-cue based spatiotemporal features by performing 3D convolutions. Previous 3D CNN models mainly perform 3D convolutions on individual cues (e.g., appearance and motion cues), which lacks an effective overall integration of the appearance and motion information of videos. To address this issue, we propose a novel multi-cue 3D convolutional neural network (named M3D for short), which directly integrates three individual cues (i.e., an appearance cue, a direct motion cue, and a salient motion cue). The proposed M3D model performs 3D convolutions on multiple cues instead of a single cue, which yields more discriminative and robust features by integrating the three cues as a whole. In particular, we propose a novel deep residual multi-cue 3D convolution model (named R-M3D for short) to enhance the representation ability by benefitting from the increasing depth of the model, which can obtain more representative spatiotemporal features.

    The second approach aims to utilize the motion saliency information to enhance the accuracy of action recognition. We propose a novel motion saliency based multi-stream multiplier ResNets (named MSM-ResNets for short) for action recognition. The proposed MSM-ResNets model consists of three interactive streams: the appearance stream, the motion stream and the motion saliency stream. The appearance stream is responsible for capturing the appearance information with RGB video frames as input. The motion stream is responsible for capturing the motion information with optical flow frames as input. The motion saliency stream is responsible for capturing the salient motion information with motion saliency frames as input. In particular, to utilize the complementary information between different streams over time, the proposed MSM-ResNets model establishes multiplicative connections between different streams. Two kinds of multiplicative connections are injected: the first transmits the motion cue from the motion stream to the appearance stream, and the second transmits the motion saliency cue from the motion saliency stream to the motion stream.

    The third approach aims to explore the salient spatiotemporal information over time evolution. We propose a novel spatial and temporal saliency based four-stream network with multi-task learning (named 3M model for short) for action recognition. The proposed 3M model comprises two parts. (i) The first part is a spatial and temporal saliency based four-stream network, which comprises four streams: an appearance stream, a motion stream, a novel spatial saliency stream and a novel temporal saliency stream. The spatial saliency stream is used to acquire spatial saliency information and the temporal saliency stream is used to acquire temporal saliency information. (ii) The second part is a multi-task learning based long short-term memory (LSTM) network, which adopts the feature representations obtained by the convolutional neural networks (CNNs) as input. The multi-task learning based LSTM can share the complementary knowledge between different streams and capture the long-term dependency relationships of consecutive frames.
    Experiments verify the effectiveness of the proposed models and show that they all achieve better performance than the state-of-the-art.
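
    The multiplicative cross-stream connections used in the second approach can be illustrated with a small PyTorch sketch in which a motion stream gates an appearance stream. The two-stream reduction and layer sizes are assumptions for illustration, not the MSM-ResNets architecture itself.

        import torch
        import torch.nn as nn

        class TwoStreamMultiplicative(nn.Module):
            """Toy two-stream block with a multiplicative connection: the motion
            stream's activations gate the appearance stream's activations, so
            appearance features are emphasised where motion is strong."""
            def __init__(self, ch=16):
                super().__init__()
                self.appearance = nn.Conv2d(3, ch, 3, padding=1)  # RGB frame input
                self.motion = nn.Conv2d(2, ch, 3, padding=1)      # optical-flow (u, v) input

            def forward(self, rgb, flow):
                a = torch.relu(self.appearance(rgb))
                m = torch.relu(self.motion(flow))
                return a * m   # multiplicative cross-stream connection

        rgb = torch.randn(1, 3, 56, 56)
        flow = torch.randn(1, 2, 56, 56)
        print(TwoStreamMultiplicative()(rgb, flow).shape)  # torch.Size([1, 16, 56, 56])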
  • Item
    Learning-based robotic manipulation for dynamic object handling : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Mechatronic Engineering at the School of Food and Advanced Technology, Massey University, Turitea Campus, Palmerston North, New Zealand
    (Massey University, 2021) Janse van Vuuren, Jacobus Petrus
    Recent trends have shown that the lifecycles and production volumes of modern products are shortening. Consequently, many manufacturers subject to frequent change prefer flexible and reconfigurable production systems. Such schemes are often achieved by means of manual assembly, as conventional automated systems are perceived as lacking flexibility. Production lines that incorporate human workers are particularly common within consumer electronics and small appliances. Artificial intelligence (AI) is a possible avenue to achieve smart robotic automation in this context. In this research it is argued that a robust, autonomous object handling process plays a crucial role in future manufacturing systems that incorporate robotics, and is key to further closing the gap between manual and fully automated production.

    Novel object grasping is a difficult task, confounded by many factors including object geometry, weight distribution, friction coefficients and deformation characteristics. Sensing and actuation accuracy can also significantly impact manipulation quality. Another challenge is understanding the relationship between these factors, a specific grasping strategy, the robotic arm and the employed end-effector. Manipulation has been a central research topic within robotics for many years. Some works focus on design, i.e. specifying a gripper-object interface such that the effects of imprecise gripper placement and other confounding control-related factors are mitigated. Many universal robotic gripper designs have been considered, including 3-fingered gripper designs, anthropomorphic grippers, granular jamming end-effectors and underactuated mechanisms. While such approaches have maintained some interest, contemporary works predominantly utilise machine learning in conjunction with imaging technologies and generic force-closure end-effectors. Neural networks that utilise supervised and unsupervised learning schemes with an RGB or RGB-D input make up the bulk of publications within this field. Though many solutions have been studied, automatically generating a robust grasp configuration for objects not known a priori remains an open problem. An element of this issue relates to a lack of objective performance metrics to quantify the effectiveness of a solution; such metrics have traditionally driven the direction of community focus by highlighting gaps in the state-of-the-art.

    This research employs monocular vision and deep learning to generate, and select from, a set of hypothesis grasps. A significant portion of this research relates to the process by which a final grasp is selected. Grasp synthesis is achieved by sampling the workspace using convolutional neural networks trained to recognise prospective grasp areas. Each potential pose is evaluated by the proposed method in conjunction with other input modalities, such as load cells and an alternate perspective. To overcome human bias and build upon traditional metrics, scores are established to objectively quantify the quality of an executed grasp trial. Learning frameworks that aim to maximise these scores are employed in the selection process to improve performance. The proposed methodology and associated metrics are empirically evaluated. A physical prototype system was constructed, employing a Dobot Magician robotic manipulator, vision enclosure, imaging system, conveyor, sensing unit and control system. Over 4,000 trials were conducted utilising 100 objects.
    Experimentation showed that robotic manipulation quality, quantified by a metric related to translational error, could be improved by 10.3% when selecting grasps that optimise for the proposed metrics. Trials further demonstrated a grasp success rate of 99.3% for known objects and 98.9% for objects for which a priori information is unavailable. For unknown objects, this equated to an improvement of approximately 10% relative to other similar methodologies in the literature. A 5.3% reduction in grasp rate was observed when the metrics were removed as selection criteria for the prototype system. The system operated at approximately 1 Hz when contemporary hardware was employed. Experimentation demonstrated that selecting a grasp pose based on the proposed metrics improved grasp rates by up to 4.6% for known objects and 2.5% for unknown objects, compared to selecting for grasp rate alone.

    This project was sponsored by the Richard and Mary Earle Technology Trust, the Ken and Elizabeth Powell Bursary and the Massey University Foundation. Without the financial support provided by these entities, it would not have been possible to construct the physical robotic system used for testing and experimentation. This research adds to the field of robotic manipulation, contributing to topics on grasp-induced error analysis, post-grasp error minimisation, grasp synthesis framework design and general grasp synthesis. Three journal publications and one IEEE Xplore paper have been published as a result of this research.
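
    As a rough illustration of selecting among hypothesis grasps using objective scores, the sketch below ranks CNN-scored candidates while penalising a predicted post-grasp translational error. The score weighting, the candidate fields and the stand-in error model are hypothetical and are not the metrics or learning framework developed in the thesis.

        from dataclasses import dataclass

        @dataclass
        class GraspCandidate:
            x: float           # gripper position (mm), illustrative fields
            y: float
            theta: float       # gripper rotation (rad)
            confidence: float  # CNN-predicted grasp-quality score, 0..1

        def select_grasp(candidates, predicted_error_mm, error_weight=0.05):
            """Toy selection rule: prefer candidates with high predicted quality,
            penalised by a predicted post-grasp translational error (mm)."""
            def score(c):
                return c.confidence - error_weight * predicted_error_mm(c)
            return max(candidates, key=score)

        # Hypothetical usage with a stand-in error model.
        cands = [GraspCandidate(10, 20, 0.0, 0.90),
                 GraspCandidate(12, 19, 0.3, 0.85)]
        best = select_grasp(cands, predicted_error_mm=lambda c: abs(c.theta) * 10)
        print(best)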
  • Item
    Deep representation learning for action recognition : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand
    (Massey University, 2019) Ren, Jun
    This research focuses on deep representation learning for human action recognition based on the emerging deep learning techniques using RGB and skeleton data. The output of such deep learning techniques is a parameterised hierarchical model, representing the knowledge learnt from the training dataset. It is similar to the knowledge stored in our brain, which is learned from our experience. Currently, the computer's ability to perform such abstraction is far behind the human level, perhaps due to the complex processing of spatio-temporal knowledge. The discriminative spatio-temporal representation of human actions is the key for human action recognition systems. Different feature encoding approaches and different learning models may lead to quite different output performances, and at present there is no approach that can accurately model the cognitive processing of human actions. This thesis presents several novel approaches that allow computers to learn discriminative, compact and representative spatio-temporal features for human action recognition from multiple input features, aiming at enhancing the performance of an automated system for human action recognition. The input features for the proposed approaches in this thesis are derived from signals captured by a depth camera, e.g., RGB video and skeleton data. In this thesis, I developed several geometric features and proposed the following models for action recognition: CVR-CNN, SKB-TCN, Multi-Stream CNN and STN. These proposed models are inspired by the visual attention mechanisms that are inherently present in human beings. In addition, I discussed the performance of the geometric features that I developed along with the proposed models. Superior experimental results for the proposed geometric features and models are obtained and verified on several benchmark human action recognition datasets. In the case of the most challenging benchmark dataset, NTU RGB+D, the accuracy of the results obtained surpassed the performance of the existing RNN-based and ST-GCN models. This study provides a deeper understanding of the spatio-temporal representation of human actions and has significant implications for explaining the inner workings of deep learning models that learn patterns from time-series data. The findings from these proposed models lay a solid foundation for further developments and for the guidance of future human action-related studies.
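
    As a generic example of the kind of geometric feature that can be computed from skeleton data, the sketch below builds a matrix of pairwise 3D joint distances for a single frame. This is a common illustrative descriptor, not one of the specific geometric features proposed in the thesis.

        import numpy as np

        def pairwise_joint_distances(skeleton):
            """Toy geometric feature for skeleton-based action recognition: the
            matrix of pairwise Euclidean distances between 3D joints in one frame."""
            skeleton = np.asarray(skeleton)              # (num_joints, 3)
            diff = skeleton[:, None, :] - skeleton[None, :, :]
            return np.linalg.norm(diff, axis=-1)         # (num_joints, num_joints)

        frame = np.random.rand(25, 3)                    # e.g. 25 Kinect-style joints
        print(pairwise_joint_distances(frame).shape)     # (25, 25)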
  • Item
    Plasma-arc cutting control : investigations into machine vision, modelling and cutting head kinematics : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Manawatu, New Zealand
    (Massey University, 2018) Flemmer, Mathew
    Plasma-arc cutting (PAC) is widely used in industry, but it is an under-researched fabrication tool. A review of the literature reveals that much study is needed to improve the PAC process regarding efficiency, quality, stability and accuracy. This research investigated a novel control method for PAC. The PAC process was investigated to identify the gaps and to develop feasible methods, methodologies and systems to improve PAC cutting quality and process control using machine vision. An automated visual-inspection algorithm was successfully developed. The algorithm uses NC code to plan its path and perform kerf width measurement. This visual inspection facilitated research into several aspects of PAC, such as the extent of radiative heat transfer, the significance of kerf asymmetry, and a model describing the slope of the leading edge of the kerf with respect to feed rate and material thickness. A kinematic investigation was conducted on three bevel-capable plasma heads to complete the elements of a novel control method.

    An automated visual-inspection (AVI) system for PAC was designed, consisting of a vision unit and a mounting rig. This system is able to perform real-time kerf width measurement with an accuracy of 0.1 mm. The methodology was validated by experiment, testing cuts on parts of varying size, shape and complexity. The outcomes of this research were published in the International Journal of Mechanical and Production Engineering and the proceedings of the 2017 Mechatronics and Machine Vision in Practice (M2VIP) international conference.

    With this vision rig, further research was conducted, such as an empirical investigation into the relationship between kerf angle and kerf width with respect to torch height, feed rate and material thickness. This investigation comprised 35 combinations of the process parameters with nine replicates of each. A relationship between the process parameters and quality measures was developed, and the magnitudes of kerf asymmetries were quantified.

    The understanding of the phenomenology of PAC is deficient in several areas. An experimental study was undertaken that reduced the effects of heat transfer by conduction and convection in order to estimate the contribution of radiative heat transfer. This experimental study maintained an arc between a water-cooled anode and the plasma torch for 15 seconds. A test piece was specifically designed with embedded resistance-temperature-detector thermometers positioned around the transferred arc, and the temperature was measured. This investigation was able to estimate the effects of radiation from the plasma-arc. The study found that radiative heat transfer accounts for less than 3% of the total power input. Another experimental study obtained information on the shape of the leading edge of the kerf. For this study, slots were cut into steel plates of 6, 8 and 10 mm thickness, at feed rates between 350 and 2000 mm/min with a torch height of 1.5 mm. Edge points for the centre axis of the leading profile were obtained. A relationship between surface angle, material thickness and feed rate was established and validated across the test range. A study on obtaining cutting profile data on the front face of the kerf was also undertaken. Slots were cut into plates of 6 and 10 mm thickness. Edge points were obtained for the front 180 degrees of the kerf face at sections in 2 mm increments. A 3D representation of the shape of the face could then be produced.
    Finally, the kinematics for three bevel-capable PAC heads were developed. Two of the heads are existing industrial heads, and the third head is being developed by Kerf Ltd. The kinematics investigation produced the DH parameters and transformation matrices for the forward kinematics. These were validated using MATLAB®. The resulting dynamics were also produced. In conclusion, PAC is a complicated process. This research carried out several studies and addressed several gaps in the literature with the proposed methods, methodologies and systems, developed through machine vision and a PAC head kinematic study. This research was funded by Callaghan Innovation PhD research funding and received financial support from Kerf Ltd. Callaghan Innovation is a New Zealand government research funding body. Kerf Ltd. is a New Zealand PAC machine manufacturer and distributor.
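
    The forward-kinematics machinery referred to above follows the standard Denavit-Hartenberg (DH) formulation, sketched below in Python rather than the MATLAB® used in the thesis. The two-row DH table in the example is a hypothetical bevel head, not one of the actual heads studied.

        import numpy as np

        def dh_transform(theta, d, a, alpha):
            """Standard Denavit-Hartenberg link transform (rotation about z by
            theta, translation d along z, translation a along x, rotation alpha
            about x). Parameter values used below are placeholders."""
            ct, st = np.cos(theta), np.sin(theta)
            ca, sa = np.cos(alpha), np.sin(alpha)
            return np.array([[ct, -st * ca,  st * sa, a * ct],
                             [st,  ct * ca, -ct * sa, a * st],
                             [0.0,      sa,       ca,      d],
                             [0.0,     0.0,      0.0,    1.0]])

        def forward_kinematics(dh_rows):
            """Base-to-tool pose as the product of the per-joint DH matrices."""
            pose = np.eye(4)
            for theta, d, a, alpha in dh_rows:
                pose = pose @ dh_transform(theta, d, a, alpha)
            return pose

        # Hypothetical two-axis bevel head: a rotation about Z, then a tilted link.
        print(forward_kinematics([(np.pi / 4, 0.10, 0.0, np.pi / 2),
                                  (np.pi / 6, 0.00, 0.05, 0.0)]))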
  • Item
    Low latency vision-based control for robotics : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Mechatronics at Massey University, Manawatu, New Zealand
    (Massey University, 2018) Lues, Joshua
    In this work, the problem of controlling a high-speed dynamic tracking and interception system using computer vision as the measurement unit was explored. High-speed control systems alone present many challenges, and these challenges are compounded when combined with the high volume of data processing required by computer vision systems. A semi-automated foosball table was chosen as the test-bed system because it combines all the challenges associated with a vision-based control system into a single platform. While computer vision is extremely useful and can solve many problems, it can also introduce problems such as latency, the need for lens and spatial calibration, potentially high power consumption, and high cost. The objective of this work was to explore how to implement computer vision as the measurement unit in a high-speed controller, while minimising latencies caused by the vision itself, communication interfaces, data processing/strategy, instruction execution, and actuator control. Another objective was to implement the solution in one low-latency, low-power, low-cost embedded system. A field programmable gate array (FPGA) system on chip (SoC), which combines programmable digital logic with a dual-core ARM hard processor system (HPS) on the same chip, was hypothesised to be capable of running the described vision-based control system. The FPGA was used to perform streamed image pre-processing and concurrent stepper motor control and to provide communication channels for user input, while the HPS performed the lens distortion mapping, intercept calculation and “strategy” control tasks, as well as controlling the overall function of the system.

    Individual vision systems were compared for latency performance. Interception performance of the semi-automated foosball table was then tested for straight, moderate-speed shots with limited view time; latency was then artificially added to the system and the interception results for the same centre-field shot were tested with a variety of added latencies. The FPGA-based system performed the best in both the steady-state latency and novel-event-detection latency tests. The developed stepper motor control modules performed well in terms of speed, smoothness, resource consumption, and versatility. They are capable of constant-velocity, constant-acceleration and variable-acceleration profiles, as well as being completely parameterisable. The interception modules on the foosball table achieved a 100% interception rate, with a confidence interval of 95% and a reliability of 98.4%. As artificial latency was added to the system, the performance dropped in terms of the overall number of successful intercepts. The decrease in performance was roughly linear, with a 60% reduction in performance caused by 100 ms of added latency. Performance dropped to 0% successful intercepts when 166 ms of latency was added. The implications of this work are that FPGA SoC technology may, in future, enable computer vision to be used as a general-purpose, high-speed measurement system for a wide variety of control problems.
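
    A minimal sketch of an intercept calculation of the kind the HPS performs is shown below: a constant-velocity prediction of where the ball will cross the defending rod, with a latency compensation term. The geometry, latency value and constant-velocity assumption are illustrative and much simpler than the strategy logic implemented on the foosball table.

        def predict_intercept(p0, p1, dt, rod_x, latency):
            """Toy constant-velocity intercept prediction: from two ball positions
            (m) taken dt seconds apart, estimate where the ball crosses the
            defending rod at x = rod_x, compensating for a known processing
            latency."""
            vx = (p1[0] - p0[0]) / dt
            vy = (p1[1] - p0[1]) / dt
            if vx <= 0:
                return None                   # ball not travelling toward the rod
            t_hit = (rod_x - p1[0]) / vx - latency
            if t_hit < 0:
                return None                   # too late to react
            return p1[1] + vy * t_hit         # lateral position to move the rod to

        # Ball seen at x=0.20 m then 0.22 m (10 ms apart), defending rod at x=0.60 m.
        print(predict_intercept((0.20, 0.10), (0.22, 0.11), 0.010, 0.60, latency=0.005))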
  • Item
    Lens distortion correction by analysing the shape of patterns in Hough transform space : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Electronics and Computer Engineering at Massey University, Manawatu, New Zealand
    (Massey University, 2018) Chang, Yuan
    Many low-cost, wide-angle lenses suffer from lens distortion resulting from a radial variation in the lens magnification. As a result, straight lines, particularly those in the periphery, appear curved. The Hough transform is a commonly used technique for detecting linear features within an image. In Hough transform space, straight lines and curved lines produce peaks of different shapes. This thesis proposes a lens distortion correction method, named SLDC, based on analysing the shape of patterns in Hough transform space. It works by reconstructing the distorted line from significant points on the smile-shaped Hough pattern. It then optimises the distortion parameter by mapping the reconstructed curved line into a straight line and minimising the RMSE. In both simulation and the correction of real-world images, SLDC provides encouraging results.
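
    A simplified sketch of the underlying idea, undistorting a reconstructed curve with a single radial parameter and choosing the parameter that makes it straightest by minimising an RMSE, is given below. The one-parameter division model, the grid search and the synthetic line are assumptions for illustration and do not reproduce the full SLDC pipeline.

        import numpy as np

        def undistort(points, k, centre):
            """One-parameter radial (division) model: r_u = r_d / (1 + k * r_d^2)."""
            p = np.asarray(points, dtype=float) - centre
            r2 = np.sum(p ** 2, axis=1, keepdims=True)
            return p / (1.0 + k * r2) + centre

        def straightness_rmse(points):
            """RMSE of perpendicular distances from the points to their best-fit line."""
            p = np.asarray(points, dtype=float)
            p = p - p.mean(axis=0)
            _, _, vt = np.linalg.svd(p, full_matrices=False)
            normal = vt[-1]                    # direction of least variance
            return np.sqrt(np.mean((p @ normal) ** 2))

        def estimate_k(curve, centre, ks=np.linspace(-1e-6, 1e-6, 201)):
            """Toy parameter search: pick the k that makes the curve straightest."""
            return min(ks, key=lambda k: straightness_rmse(undistort(curve, k, centre)))

        # Synthetic barrel-distorted horizontal line through a 640x480 image.
        centre = np.array([320.0, 240.0])
        xs = np.linspace(0, 640, 50)
        true_line = np.stack([xs, np.full_like(xs, 100.0)], axis=1)
        r2 = np.sum((true_line - centre) ** 2, axis=1, keepdims=True)
        distorted = (true_line - centre) * (1.0 + 3e-7 * r2) + centre
        print(estimate_k(distorted, centre))   # recovers a value close to the simulated 3e-7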
  • Item
    A systematic algorithm development for image processing feature extraction in automatic visual inspection : a thesis presented in partial fulfilment of the requirements for the degree of Master of Technology in the Department of Production Technology, Massey University
    (Massey University, 1990) Xing, G. X. (Guo Xin)
    Image processing techniques applied to modern quality control are described, together with the development of feature extraction algorithms for automatic visual inspection. A real-time image processing hardware system already available in the Department of Production Technology is described and has been tested systematically to establish an optimal threshold function. This systematic testing was concerned with edge strength and system noise information. With a priori information about the system signal and noise, non-linear threshold functions have been established for real-time edge detection. The performance of adaptive thresholding is described and the usefulness of this non-linear approach is demonstrated with results from machined test samples. Examinations and comparisons of thresholding techniques applied to several edge detection operators are presented. It is concluded that the Roberts operator with a non-linear thresholding function has the advantages of being simple, fast, accurate and cost-effective in automatic visual inspection.
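
    The following sketch shows a Roberts cross edge detector with an intensity-dependent (non-linear) threshold, as a generic stand-in for the noise-derived threshold functions described above. The square-root threshold model and its constants are assumptions, not the functions established in the thesis.

        import numpy as np
        from scipy.ndimage import uniform_filter

        def roberts_edges(image, base_thresh=0.05, noise_gain=0.5):
            """Roberts cross edge detection with a simple non-linear threshold:
            the threshold rises with the square root of local brightness (a
            shot-noise-like assumption). Assumes a non-negative intensity image."""
            img = np.asarray(image, dtype=float)
            gx = img[:-1, :-1] - img[1:, 1:]     # Roberts diagonal difference
            gy = img[:-1, 1:] - img[1:, :-1]     # Roberts anti-diagonal difference
            mag = np.hypot(gx, gy)
            local_mean = uniform_filter(img, size=5)[:-1, :-1]
            threshold = base_thresh + noise_gain * np.sqrt(np.clip(local_mean, 0, None))
            return mag > threshold

        img = np.zeros((64, 64))
        img[:, 32:] = 1.0                        # vertical step edge
        print(roberts_edges(img).sum())          # count of edge pixels along the step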
  • Item
    A novel approach to recognition of the detected moving objects in non-stationary background using heuristics and colour measurements : a thesis presented in partial fulfilment of the requirement for the degree of Master of Engineering at Massey University, Albany, New Zealand
    (Massey University, 2017) Lal, Kartikay
    Computer vision has become a growing area of research involving two fundamental steps: object detection and object recognition. These two steps have been implemented in real-world scenarios such as video surveillance systems, traffic cameras for counting cars, and more specific tasks such as detecting faces and recognising facial expressions. Humans have a vision system that provides sophisticated ways to detect and recognise objects. Colour detection, depth of view and our past experience help us determine the class of an object with respect to its size, shape and the context of the environment. Detecting moving objects on a non-stationary background and recognising the class of these detected objects are tasks that have been approached in many different ways. However, the accuracy and efficiency of current methods for object detection are still quite low, due to high computation time and memory-intensive approaches. Similarly, object recognition has been approached in many ways, but existing methods lack a perceptive methodology for recognising objects.

    This thesis presents an improved algorithm for the detection of moving objects on a non-stationary background. It also proposes a new method for object recognition. Detection of moving objects is initiated by detecting SURF features to identify unique keypoints in the first frame. These keypoints are then searched for individually in a subsequent frame using cross-correlation, yielding the optical flow. Rejection of outliers is performed by using the keypoints to compute the global shift of pixels due to camera motion, which helps isolate the points that belong to the moving objects. These points are grouped into clusters using the proposed improved clustering algorithm. The clustering function adapts the search radius around a feature point by taking into account the average Euclidean distance between all the feature points. The detected object is then processed through colour measurement and heuristics. Heuristics provide context about the surroundings to recognise the class of the object based upon the object's size, shape and the environment it is in. This gives object recognition a perceptive approach.

    Results from the proposed method have shown successful detection of moving objects in various scenes with dynamic backgrounds, achieving an object detection efficiency of over 95% for both indoor and outdoor scenes. The average processing time was computed to be around 16.5 seconds, which includes the time taken to detect objects as well as recognise them. The heuristic- and colour-based object recognition methodology achieved an efficiency of over 97%.
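
    The adaptive-radius clustering idea can be sketched as follows: the search radius is derived from the average pairwise Euclidean distance between feature points, and clusters are then grown within that radius. The scale factor and region-growing details are illustrative assumptions rather than the thesis's exact algorithm.

        import numpy as np

        def adaptive_radius_cluster(points, radius_scale=0.5):
            """Toy clustering with an adaptive search radius: the radius is a
            fixed fraction of the mean pairwise distance between all feature
            points; points are then grouped by simple region growing."""
            pts = np.asarray(points, dtype=float)
            dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
            radius = radius_scale * dists[np.triu_indices(len(pts), k=1)].mean()
            labels = np.full(len(pts), -1)
            current = 0
            for i in range(len(pts)):
                if labels[i] != -1:
                    continue
                stack = [i]
                labels[i] = current
                while stack:                  # grow the cluster within the radius
                    j = stack.pop()
                    for n in np.where((dists[j] < radius) & (labels == -1))[0]:
                        labels[n] = current
                        stack.append(n)
                current += 1
            return labels

        pts = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 15])
        print(adaptive_radius_cluster(pts))   # two well-separated groups of points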