Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 8 of 8
  • Item
    Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2024-07-25) Ma, Yujun
    Action recognition aims to identify human actions in videos through complete action execution. Current action recognition approaches are primarily based on convolutional neural networks (CNNs), Transformers, or hybrids of both. Despite their strong performance, several challenges persist: (i) insufficient disentangled modelling of spatio-temporal features, (ii) a lack of fine-grained motion modelling in action representation, and (iii) limited exploration of the positional embedding of spatial tokens. In this thesis, we introduce three novel deep-learning approaches that address these challenges and enhance spatial and temporal representation in diverse action recognition tasks, including RGB-D, coarse-grained, and fine-grained action recognition. Firstly, we develop a multi-stage factorized spatio-temporal model (MFST) for RGB-D action recognition. This model addresses the limitations of existing RGB-D approaches that rely on entangled spatio-temporal 3D convolution. The MFST employs a multi-stage hierarchical structure where each stage independently constructs spatio-temporal dimensions. This progression from low-level features to higher-order semantic primitives results in a robust spatio-temporal representation. Secondly, we introduce a relative-position embedding based spatially and temporally decoupled Transformer (RPE-STDT) for coarse-grained and fine-grained action recognition. RPE-STDT addresses the high computational costs of Vision Transformers in video data processing, particularly due to the absolute-position embedding in frame patch tokenization. RPE-STDT utilizes dual Transformer encoder series: spatial encoders for intra-temporal index token interactions, and temporal encoders for inter-temporal dimension interactions with a subsampling strategy. Thirdly, we propose a convolutional transformer network (CTN) for fine-grained action recognition.
Traditional Transformer models require extensive training data and additional supervision to rival CNNs in learning capabilities. The proposed CTN merges CNN strengths (e.g., weight sharing and locality) with Transformer benefits (e.g., dynamic attention and long-range dependency learning), allowing for superior fine-grained motion representation. In summary, we contribute three deep-learning models for diverse action recognition tasks. Each model achieves state-of-the-art performance across multiple prestigious datasets, as validated by thorough experimentation.
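To illustrate the general idea of a relative-position embedding mentioned in the abstract above, here is a minimal sketch in plain Python. It is not the RPE-STDT implementation: the function name `relative_position_bias`, the one-dimensional token layout, and the table values are assumptions made for illustration. The point it shows is that the bias between tokens i and j depends only on the offset i - j, so one small table serves every pair of positions.

```python
def relative_position_bias(table, num_tokens):
    """Expand a (2 * num_tokens - 1)-entry table into a num_tokens x num_tokens bias.

    Hypothetical helper: entry [i][j] is looked up by the offset i - j only.
    """
    if len(table) != 2 * num_tokens - 1:
        raise ValueError("table must have 2 * num_tokens - 1 entries")
    offset = num_tokens - 1  # shifts i - j from [-(n-1), n-1] into [0, 2n-2]
    return [[table[i - j + offset] for j in range(num_tokens)]
            for i in range(num_tokens)]


# Hypothetical "learned" values for a sequence of 3 tokens (5 possible offsets).
table = [0.1, 0.2, 0.3, 0.4, 0.5]
bias = relative_position_bias(table, 3)
```

In an attention layer, a matrix like `bias` would typically be added to the attention scores before the softmax; how the thesis combines it with the spatial and temporal encoders is not stated in the abstract.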
  • Item
    Representation learning for the graph data : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2022) Gan, Jiangzhang
    Graph data consist of the association information between complex entities and also contain diverse vertex information. Graph representation learning, the bridge between the original graph data and graph application tasks, makes graph data analysis simple and effective and has become a hot research topic in recent years. Previous representation learning methods for graph data may not reflect the intrinsic relationship between nodes due to the complexity of the graph data. Moreover, they do not preserve the topology of the graph data well, which affects the effectiveness of downstream tasks. To deal with these issues, the thesis studies effective graph representation learning methods in terms of graph construction and representation learning. First, we propose a traditional graph learning method under semi-supervised learning to explore parameter-free fusion of graph learning. Specifically, we first employ the Pearson correlation coefficient to obtain a fully connected functional connectivity brain network (FCN), and then learn a sparsely connected FCN for every subject. Finally, the ℓ1-SVM is employed to learn the important features and conduct disease diagnosis. Second, we propose an end-to-end deep graph learning method under semi-supervised learning to improve the quality of the initial graph. Specifically, the proposed method first extracts the common information and the complementary information among multiple local graphs to obtain a unified local graph, which is then fused with the global graph of the data to obtain the initial graph for the GCN model. As a result, the proposed method conducts the graph fusion process twice to simultaneously learn the low-dimensional space and the intrinsic graph structure of the data in a unified framework. Third, we propose a multi-view unsupervised graph learning method.
Specifically, the adaptive data augmentation first builds a feature graph from the feature space, and then designs a deep graph learning model on the original representation and the topology graph, respectively, to update the feature graph and the new representation. As a result, the adaptive data augmentation outputs multi-view information, which is fed into two GCNs to generate multi-view embedding features. Two kinds of contrastive losses are further designed on multi-view embedding features to explore the complementary information among the topology and feature graphs. Additionally, adaptive data augmentation and contrastive learning are embedded in a unified framework to form an end-to-end model. All proposed methods are evaluated on real-world data sets. Experimental results demonstrate that our methods outperform the state-of-the-art comparison methods.
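The first graph-construction step described in the abstract above (a fully connected functional connectivity network built from Pearson correlations) can be sketched in plain Python as follows. The function names and toy signals are assumptions for illustration. In particular, `sparsify` uses a fixed threshold purely as a stand-in: the thesis learns the sparse FCN rather than thresholding it.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fully_connected_fcn(signals):
    """Dense region-by-region correlation matrix (one row per brain region)."""
    k = len(signals)
    return [[pearson(signals[i], signals[j]) for j in range(k)] for i in range(k)]


def sparsify(graph, threshold):
    """Hypothetical stand-in for sparse graph learning: drop weak edges."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in graph]


# Toy time series for four regions (made-up numbers).
signals = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
]
dense = fully_connected_fcn(signals)
sparse = sparsify(dense, threshold=0.5)
```

Here the first two signals are perfectly correlated, the third is perfectly anti-correlated with the first, and the weak edge between the first and fourth regions is removed by the threshold.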
  • Item
    A platform for practical homomorphic encryption in neural network classification : a thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy (Ph.D.) in Information Technology, Massey University
    (Massey University, 2021) Baryalai, Mehmood
    Convolutional neural networks (CNNs) have become remarkably better at correctly identifying and classifying objects. Using CNNs, numerous online services now process our data to provide meaningful insight and value-added services. Not all of these services are reliable and trustworthy, which raises privacy concerns. To address the issue, the work presented in this research develops and optimises new techniques to use Homomorphic Encryption (HE) as a solution. Researchers have proposed solutions such as CryptoNets, Gazelle, and CryptoDL. However, homomorphic encryption is yet to see the limelight for real-world adoption, especially in neural networks. Each of these proposed solutions is seen as a solution only for a particular CNN model and lacks the generality to be extended to a different CNN model. Moreover, the solutions for HE-CNN integration are seen as unprepared for adoption in a practical and real-world environment. Additionally, the complex integration of hybrid approaches limits their utilization with privacy-preserving CNN models. For that reason, this research develops the mathematical and practical knowledge required to adopt HE within a CNN. This knowledge of performing encrypted classification for a CNN model is based on a careful selection of appropriate encryption parameters. Furthermore, this study succeeds in developing a dual-cloud system to mitigate many of the technical hurdles for evaluating an encrypted neural network without compromising privacy. Moreover, in the case of a single cloud, this study develops methods for overcoming technical issues in selecting encryption parameters for, and evaluating, a convolutional neural network. In the same context, a novel method of selecting and optimizing encryption parameters based on probability is given. The proposals and the knowledge from this research can aid and advance the strategies of HE-CNN integration in an efficient and easy way.
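To give a feel for what evaluating part of a network on encrypted data means, the sketch below computes a weighted sum (the core of a linear or convolutional layer) over encrypted inputs with plaintext weights. It deliberately swaps in the textbook Paillier scheme, which is only additively homomorphic, rather than whichever HE scheme and parameters the thesis uses. The tiny primes, function names, and fixed randomness values are assumptions for illustration and are not secure.

```python
import math


def keygen(p, q):
    """Toy Paillier key generation from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, m, r):
    """Encrypt integer m (0 <= m < n) with randomness r coprime to n."""
    n2 = n * n
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c using the private key."""
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n


def encrypted_dot(n, enc_inputs, weights):
    """Return Enc(sum(w_i * x_i)) given Enc(x_i) and plaintext integer weights."""
    n2 = n * n
    acc = 1  # a trivial encryption of 0
    for c, w in zip(enc_inputs, weights):
        # Raising a ciphertext to w multiplies its plaintext by w;
        # multiplying ciphertexts adds their plaintexts.
        acc = (acc * pow(c, w, n2)) % n2
    return acc


n, priv = keygen(17, 19)
inputs = [3, 5, 2]
randomness = [2, 3, 5]  # fixed here for reproducibility; must be random in practice
ciphertexts = [encrypt(n, x, r) for x, r in zip(inputs, randomness)]
enc_result = encrypted_dot(n, ciphertexts, [2, 1, 4])
result = decrypt(n, priv, enc_result)  # 2*3 + 1*5 + 4*2
```

The party computing `encrypted_dot` never sees the inputs or the result in the clear. Non-linear layers and multiplications between encrypted values need a richer scheme, which is where the choice of encryption parameters discussed in the abstract becomes important.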
  • Item
    Deep representation learning for action recognition : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand
    (Massey University, 2019) Ren, Jun
    This research focuses on deep representation learning for human action recognition based on emerging deep learning techniques using RGB and skeleton data. The output of such deep learning techniques is a parameterised hierarchical model, representing the learnt knowledge from the training dataset. It is similar to the knowledge stored in our brain, which is learned from our experience. Currently, the computer’s ability to perform such abstraction is far behind the human level, perhaps due to the complex processing of spatio-temporal knowledge. The discriminative spatio-temporal representation of human actions is the key to human action recognition systems. Different feature encoding approaches and different learning models may lead to quite different output performances, and at the present time there is no approach that can accurately model the cognitive processing for human actions. This thesis presents several novel approaches to allow computers to learn discriminative, compact and representative spatio-temporal features for human action recognition from multiple input features, aiming at enhancing the performance of an automated system for human action recognition. The input features for the proposed approaches in this thesis are derived from signals that are captured by a depth camera, e.g., RGB video and skeleton data. In this thesis, I developed several geometric features, and proposed the following models for action recognition: CVR-CNN, SKB-TCN, Multi-Stream CNN and STN. These proposed models are inspired by the visual attention mechanisms that are inherently present in human beings. In addition, I discussed the performance of the geometric features that I developed along with the proposed models. Superior experimental results for the proposed geometric features and models are obtained and verified on several benchmarking human action recognition datasets.
In the case of the most challenging benchmarking dataset, NTU RGB+D, the accuracy of the results obtained surpassed the performance of the existing RNN-based and ST-GCN models. This study provides a deeper understanding of the spatio-temporal representation of human actions, and it has significant implications for explaining the inner workings of deep learning models in learning patterns from time series data. The findings of these proposed models can set forth a solid foundation for further developments, and for the guidance of future human action-related studies.
  • Item
    Machine learning and audio processing : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, Auckland, New Zealand
    (Massey University, 2019) Ma, Junbo
    In this thesis, we addressed two important theoretical issues in deep neural networks and clustering, respectively. Also, we developed a new approach for polyphonic sound event detection, which is one of the most important applications in the audio processing area. The three novel approaches developed are: (i) The Large Margin Recurrent Neural Network (LMRNN), which improves the discriminative ability of original Recurrent Neural Networks by introducing a large margin term into the widely used cross-entropy loss function. The developed large margin term utilises the large margin discriminative principle as a heuristic term to navigate the convergence process during training, which fully exploits the information from data labels by considering both target category and competing categories. (ii) The Robust Multi-View Continuous Subspace Clustering (RMVCSC) approach, which performs clustering on a common view-invariant subspace learned from all views. The clustering result and the common representation subspace are simultaneously optimised by a single continuous objective function. In the objective function, a robust estimator is used to automatically clip specious inter-cluster connections while maintaining convincing intra-cluster correspondences. Thus, the developed RMVCSC can untangle heavily mixed clusters without pre-setting the number of clusters. (iii) The novel polyphonic sound event detection approach based on Relational Recurrent Neural Network (RRNN), which utilises the relational reasoning ability of RRNNs to untangle the overlapping sound events across audio recordings. Unlike previous works, which mixed and packed all historical information into a single common hidden memory vector, the developed approach allows pieces of historical information to interact with one another across an audio recording, which is effective and efficient in untangling the overlapping sound events.
All three approaches are tested on widely used datasets and compared with recently published works. The experimental results have demonstrated the effectiveness and efficiency of the developed approaches.
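The abstract above does not give the exact form of the LMRNN large margin term, so the following is only a sketch of the general idea: add to the cross-entropy loss a penalty that looks at both the target category and the strongest competing category. The hinge form, the `margin` and `weight` parameters, and the function names are all assumptions for illustration.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def margin_penalty(logits, target, margin=1.0):
    """Hypothetical hinge penalty on the gap between target and best competitor."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(0.0, margin - (logits[target] - best_other))


def large_margin_loss(logits, target, margin=1.0, weight=0.5):
    """Cross-entropy plus a weighted margin term (assumed form, not the thesis's)."""
    return cross_entropy(logits, target) + weight * margin_penalty(logits, target, margin)


close_call = [2.0, 1.5, 0.1]   # target wins, but only by 0.5
clear_win = [5.0, 1.0, 0.0]    # target wins by more than the margin
loss_close = large_margin_loss(close_call, target=0)
loss_clear = large_margin_loss(clear_win, target=0)
```

With these toy logits, the penalty is active for `close_call` (the gap of 0.5 is smaller than the margin of 1.0) and vanishes for `clear_win`, so the extra term only pushes on examples that are classified correctly but not confidently enough.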
  • Item
    Ensembles of neural networks for language modeling : a thesis presented in partial fulfilment of the requirements for the degree of Master of Philosophy in Information Technology at Massey University, Auckland, New Zealand
    (Massey University, 2018) Xiao, Yujie
    Language modeling has been widely used in natural language processing applications and has therefore gained a significant following in recent years. The objective of language modeling is to model the probability distribution over different linguistic units (e.g., characters, words, phrases and sentences) using traditional statistical methods or modern machine learning approaches. In this thesis, we first systematically study language models, including traditional discrete-space language models and the latest continuous-space neural network language models. Then, we focus on modern continuous-space language models, which embed elements of language into a continuous space, aiming at finding a proper word representation for the given dataset. By mapping the vocabulary space into a continuous space, a deep learning model can predict the probability of future words based on the historical presence of vocabulary more efficiently than traditional models. However, such models still suffer from various drawbacks, so we study a series of variants of the latest neural network architectures and propose a modified recurrent neural network for language modeling. Experimental results show that our modified model achieves competitive performance in comparison with existing state-of-the-art models, with a significant reduction in training time.
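For readers unfamiliar with the "traditional statistical" side of language modeling mentioned in the abstract above, here is a minimal count-based bigram model in plain Python. It is a generic textbook sketch, not a method from the thesis. The toy corpus, the function names, and the add-one smoothing constant `alpha` are assumptions for illustration.

```python
from collections import Counter


def train_bigram(tokens):
    """Count contexts and adjacent word pairs in a token sequence."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = sorted(set(tokens))
    return contexts, bigrams, vocab


def prob(model, prev, word, alpha=1.0):
    """Estimate P(word | prev) with additive (add-alpha) smoothing."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * len(vocab))


tokens = "the cat sat on the mat".split()
model = train_bigram(tokens)
p_cat_given_the = prob(model, "the", "cat")
total = sum(prob(model, "the", w) for w in model[2])
```

The discrete model treats every word as an unrelated symbol, so it cannot share statistics between similar words. Continuous-space models address this by mapping words to vectors, which is the direction the thesis pursues.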
  • Item
    Traffic flow modeling and forecasting using cellular automata and neural networks : a thesis presented in partial fulfillment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand
    (Massey University, 2006) Liu, Mingzhe
    In this thesis, fine grids are adopted in Cellular Automata (CA) models. The fine-grid models are able to describe traffic flow in detail, allowing the position, speed, acceleration and deceleration of vehicles to be simulated in a more realistic way. For urban straight roads, two types of traffic flow, free and car-following flow, have been simulated. A novel five-stage speed-changing CA model is developed to describe free flow. The 1.5-second headway, based on field data, is used to simulate car-following processes, which corrects the headway of 1 second used in all previous CA models. Novel and realistic CA models, based on the Normal Acceptable Space (NAS) method, are proposed to systematically simulate driver behaviour and interactions between drivers to enter single-lane Two-Way Stop-Controlled (TWSC) intersections and roundabouts. The NAS method is based on the following two Gaussian distributions. The distribution of space required for all drivers to enter intersections or roundabouts is assumed to follow a Gaussian distribution, which corresponds to heterogeneity of driver behaviour, while the distribution of space required for a single driver to enter an intersection or roundabout is assumed to follow another Gaussian distribution, which corresponds to inconsistency of driver behaviour. The effects of passing lanes on single-lane highway traffic are investigated using fine-grid CA. Vehicles entering, exiting from and changing lanes on passing lane sections are discussed in detail. In addition, a Genetic Algorithm-based Neural Network (GANN) method is proposed to predict Short-term Traffic Flow (STF) in urban networks, which is expected to be helpful for traffic control. Prediction accuracy and generalization ability of the NN are improved by optimizing the number of neurons in the hidden layer and the connection weights of the NN using genetic operations such as selection, crossover and mutation.
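The two-level Gaussian structure of the NAS method described in the abstract above can be sketched as follows. This is only an illustration of the sampling idea, not the thesis's CA model. The function name `entry_rate`, the use of grid cells as the unit of space, and every numeric parameter are assumptions.

```python
import random


def entry_rate(available_space, n_drivers, population_mean, population_sd,
               driver_sd, seed=0):
    """Fraction of simulated drivers who accept a gap of the given size (in cells)."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_drivers):
        # Heterogeneity: each driver has a personal typical required space,
        # drawn from the population-level Gaussian.
        driver_mean = rng.gauss(population_mean, population_sd)
        # Inconsistency: the same driver's requirement varies from one
        # decision to the next, drawn from a driver-level Gaussian.
        needed = rng.gauss(driver_mean, driver_sd)
        if available_space >= needed:
            accepted += 1
    return accepted / n_drivers


# Hypothetical parameters: mean required space of 8 cells.
rate_small_gap = entry_rate(6.0, 500, population_mean=8.0, population_sd=2.0, driver_sd=1.0)
rate_large_gap = entry_rate(12.0, 500, population_mean=8.0, population_sd=2.0, driver_sd=1.0)
```

Because both calls use the same seed, they draw the same sequence of drivers, so the larger gap can never be accepted less often than the smaller one. In a full CA model this accept-or-wait decision would be evaluated at each time step for the vehicle at the stop line.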
  • Item
    Mechatronic simulation & exploration of a mechanical context relevant to quadrupedal neuromorphic walking employing Nervous networks for control : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering, Mechatronics at Massey University, Albany, New Zealand
    (Massey University, 2008) Read, Matthew
    Neuromorphic engineering is the study and emulation of neural sensory and control structures found in the natural world. Currently a significant research focus in this field, and indeed, in engineering at large, is the research of robotic walking platforms - an ideal application for artificial neural controllers. To design such neuromorphic controllers, significant knowledge is needed of the robotic context to which they will be applied. The focus of this research is to explore the relationship between the mechanical design of a robot, and its resultant walking proficiency. A neuromorphic controller utilizing Nervous networks was constructed, and embedded into a typical & useful mechatronic context. This consists of a simple walking platform, of a type commonly used in Nervous network research. This robot was used to provide intuition and a reference point for development of a simulation for empirical testing. A physical simulation of the mechanical context was developed, allowing for the exploration of its behaviour, particularly with regard to the type of walking "caused" by the integration of an appropriate Nervous network controller. To evaluate the behavioural fitness of this context in various configurations, empirical simulations were run using the developed simulation, and heuristic results derived to develop optimized parameters for causing walking behaviours in the studied context. Further simulations were then run to evaluate the efficacy of these developed heuristics. From these simulations & explorations, the presence of an identifiable "critical point phenomenon" in the interaction between the robot's legs was demonstrated. This critical point was then used for parameter extraction; further simulation demonstrated that parameters extracted from this critical point provided near-optimal walking behaviour from the robot in a variety of leg topologies.
These results provide significant knowledge and intuition for designers of quadrupedal walking platforms, particularly those driven from Nervous network derived neuromorphic controllers. Implementation of these results in such a robotic platform will provide useful new "real world" data, allowing the developed models & heuristics to be further refined.