Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 6 of 6
  • Item
    Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2024-07-25) Ma, Yujun
    Action recognition aims to identify human actions in videos through complete action execution. Current action recognition approaches are primarily based on convolutional neural networks (CNNs), Transformers, or hybrids of both. Despite their strong performance, several challenges persist: (i) insufficient disentangled modelling of spatio-temporal features, (ii) a lack of fine-grained motion modelling in action representation, and (iii) limited exploration of the positional embedding of spatial tokens. In this thesis, we introduce three novel deep-learning approaches that address these challenges and enhance spatial and temporal representation in diverse action recognition tasks, including RGB-D, coarse-grained, and fine-grained action recognition. Firstly, we develop a multi-stage factorized spatio-temporal model (MFST) for RGB-D action recognition. This model addresses the limitations of existing RGB-D approaches that rely on entangled spatio-temporal 3D convolution. The MFST employs a multi-stage hierarchical structure in which each stage constructs the spatio-temporal dimensions independently; this progression from low-level features to higher-order semantic primitives yields a robust spatio-temporal representation. Secondly, we introduce a relative-position-embedding-based, spatially and temporally decoupled Transformer (RPE-STDT) for coarse-grained and fine-grained action recognition. RPE-STDT addresses the high computational cost of Vision Transformers in video data processing, particularly the cost arising from the absolute-position embedding in frame patch tokenization. RPE-STDT utilizes two series of Transformer encoders: spatial encoders for token interactions within each temporal index, and temporal encoders for interactions across the temporal dimension with a subsampling strategy. Thirdly, we propose a convolutional transformer network (CTN) for fine-grained action recognition. Traditional Transformer models require extensive training data and additional supervision to rival CNNs in learning capability. The proposed CTN merges the strengths of CNNs (e.g., weight sharing and locality) with the benefits of Transformers (e.g., dynamic attention and long-range dependency learning), allowing for superior fine-grained motion representation. In summary, we contribute three deep-learning models for diverse action recognition tasks. Each model achieves state-of-the-art performance across multiple prestigious datasets, as validated by thorough experimentation.
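    To make the factorization idea concrete, here is a minimal PyTorch sketch, assuming a standard (2+1)D-style decomposition rather than the thesis's actual MFST code: a 3D convolution is split into a spatial convolution within each frame followed by a temporal convolution across frames, so the two dimensions are modelled separately instead of being entangled in one 3D kernel.

        import torch
        import torch.nn as nn

        class FactorizedSTBlock(nn.Module):
            """Illustrative factorized block: spatial conv, then temporal conv."""
            def __init__(self, in_ch, out_ch):
                super().__init__()
                # Spatial convolution: a 1 x 3 x 3 kernel acts within each frame.
                self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                         padding=(0, 1, 1))
                # Temporal convolution: a 3 x 1 x 1 kernel acts across frames only.
                self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                          padding=(1, 0, 0))
                self.relu = nn.ReLU(inplace=True)

            def forward(self, x):  # x: (batch, channels, time, height, width)
                return self.relu(self.temporal(self.relu(self.spatial(x))))

        clip = torch.randn(2, 3, 16, 112, 112)       # dummy 16-frame RGB clip
        print(FactorizedSTBlock(3, 64)(clip).shape)  # (2, 64, 16, 112, 112)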
  • Item
    End-to-end automatic speech recognition for low-resource languages : a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
    (Massey University, 2023) Satwinder Singh
    Automatic speech recognition (ASR) for low-resource languages presents numerous challenges due to the lack of crucial linguistic resources, including annotated speech corpora, lexicons, and raw language text. In this thesis, we propose different approaches to improve fundamental frequency estimation and speech recognition for low-resource languages. Firstly, we propose DeepF0, a new deep learning technique for fundamental frequency (F0) estimation. Existing models have limited learning capability due to their shallow receptive fields; DeepF0 extends the receptive field by using dilated convolutional blocks. Additionally, we enhance training efficiency and speed by incorporating residual blocks with residual connections. DeepF0 achieves state-of-the-art results while using 77.4% fewer network parameters. Secondly, we introduce a new meta-learning framework for low-resource speech recognition that improves on the previous model-agnostic meta-learning (MAML) approach. Our framework addresses issues of MAML, such as training instability and slow convergence, by using a multi-step loss (MSL). MSL calculates losses at each step of MAML's inner loop and combines them using a weighted importance vector that prioritizes the loss at the last step. Thirdly, we propose an end-to-end ASR approach for low-resource languages that exploits synthesized datasets alongside real speech datasets. We evaluate our approach on Punjabi, a language spoken by millions of speakers across the globe that nevertheless lacks annotated speech datasets. Our empirical results show that our synthesized datasets (Google-synth and CMU-synth) can significantly improve the accuracy of our ASR model. Lastly, we introduce a self-training approach, also known as pseudo-labeling, to enhance the performance of low-resource speech recognition. While most self-training research has centered on high-resource languages such as English, our work focuses on the low-resource Punjabi language. To weed out low-quality pseudo-labels, we employ a length-normalized confidence score. Overall, our experimental evaluation validates the efficacy of our proposed approaches and shows that they outperform existing baselines for F0 estimation and low-resource speech recognition.
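    As an illustration of the pseudo-label filtering step, here is a minimal Python sketch of a length-normalized confidence score; the function names and the threshold are assumptions for illustration, not the thesis's exact formulation.

        def length_normalized_confidence(token_log_probs):
            """Average per-token log-probability (hypothetical helper).

            Dividing the sequence log-probability by its length stops long
            transcripts from being rejected merely for having more tokens."""
            return sum(token_log_probs) / max(len(token_log_probs), 1)

        def filter_pseudo_labels(hypotheses, threshold=-0.35):
            """Keep pseudo-labels whose normalized confidence clears the threshold."""
            return [(text, lps) for text, lps in hypotheses
                    if length_normalized_confidence(lps) >= threshold]

        # Dummy hypotheses: (transcript, per-token log-probabilities).
        hyps = [("sat sri akal", [-0.1, -0.2, -0.3]),
                ("noisy guess", [-1.2, -2.5])]
        print(filter_pseudo_labels(hyps))  # the low-confidence guess is dropped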
  • Item
    Multi-source multimodal deep learning to improve situation awareness : an application of emergency traffic management : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Emergency Management at Massey University, Wellington, New Zealand
    (Massey University, 2023) Hewa Algiriyage, Rangika Nilani
    Traditionally, disaster management has placed great emphasis on institutional warning systems, and people have been treated as victims rather than active participants. With the evolution of communication technology, however, the general public now contributes significantly to disaster management tasks, challenging traditional hierarchies in information distribution and acquisition. With mobile phones and Social Media (SM) platforms widely used, people at disaster scenes act as non-technical sensors that provide contextual information in multiple modalities (e.g., text, image, audio and video) through these content-sharing applications. Research has shown that the general public has extensively used SM applications during disasters to report injuries or deaths, damage to infrastructure and utilities, cautions, evacuation needs, and missing or trapped people. Disaster responders depend significantly on data for their Situation Awareness (SA), the dynamic understanding of “the big picture” in space and time for decision-making. Despite the benefits, however, processing SM data for disaster response brings multiple challenges. The most significant is that SM data contain rumours, fake information and false information, so responding agencies have concerns about utilising SM for disaster response. As a result, a high volume of important, real-time data that would be very useful for responders’ SA is wasted. In addition to SM, many other data sources produce information during disasters, including CCTV monitoring, emergency call centres, and online news. The data from these sources come in multiple modalities, such as text, images, video, audio and metadata. To date, researchers have investigated how such data can be automatically processed for disaster response using machine learning and deep learning approaches on a single source or single modality of data; only a few have investigated the use of multiple sources and modalities. Furthermore, no real-time system has yet been designed and tested for real-world scenarios that improves responder SA while cross-validating and exploiting SM data. This doctoral project, written in a “PhD-thesis-with-publication” format, addresses this gap by investigating the use of SM data for disaster response while improving reliability through validating data from multiple sources in real time. The research was guided by Design Science Research (DSR), which studies the creation of artefacts to solve practical problems of general interest. An artefact, a software prototype that integrates multi-source multimodal data for disaster response, was developed following the five-stage design science method framework proposed by Johannesson et al. [175] as the roadmap for design, development and evaluation. First, the initial research problem was clearly stated and positioned, and its root causes were identified. During this stage, the problem area was narrowed from all disaster types to emergency traffic management, considering the real-time nature of the problem and the data available for the artefact’s design, development and evaluation. Second, the requirements for the software artefact were captured through interviews with stakeholders from a number of disaster and emergency management and transport and traffic agencies in New Zealand. Moreover, domain knowledge and experimental information were captured by analysing the academic literature.
    Third, the artefact was designed and developed. The fourth and final step focused on the demonstration and evaluation of the artefact. The outcomes of this doctoral research underpin the potential of validated SM data to enhance responders’ SA. Furthermore, the research explored appropriate ways to fuse text, visual and voice data in real time to provide a comprehensive picture for disaster responders. Data integration was achieved through multiple components. First, methodologies and algorithms were developed to estimate traffic flow from CCTV images and footage by counting vehicle objects; these outcomes extend previous work by annotating a large New Zealand-based vehicle dataset for object detection and developing an algorithm for vehicle counting by vehicle class and movement direction. Second, a novel deep learning architecture is proposed for short-term traffic flow prediction using weather data; previous research has mostly used only traffic data for traffic flow prediction, and this research goes beyond it by including the correlation between traffic flow and weather conditions. Third, an event extraction system is proposed to extract event templates from online news and SM text data, answering What (semantic), Where (spatial) and When (temporal) questions. This doctoral project therefore makes several contributions to the body of knowledge in deep learning and disaster research. In addition, an important practical outcome is an extensible event extraction system, applicable to any disaster, capable of generating event templates by integrating text and visual formats from online news and SM data to assist disaster responders’ SA.
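    As a small illustration of the vehicle-counting component, the following Python sketch tallies tracked detections by class and movement direction; the detection format and field names are assumptions, not the thesis's actual pipeline.

        from collections import Counter

        def count_vehicles(detections, min_confidence=0.5):
            """Tally detections by (vehicle class, movement direction).

            `detections` is assumed to be the output of an object detector
            plus a tracker, one dict per tracked vehicle."""
            counts = Counter()
            for det in detections:
                if det["confidence"] >= min_confidence:
                    counts[(det["label"], det["direction"])] += 1
            return counts

        dets = [{"label": "car", "confidence": 0.91, "direction": "northbound"},
                {"label": "truck", "confidence": 0.78, "direction": "southbound"},
                {"label": "car", "confidence": 0.42, "direction": "northbound"}]
        print(count_vehicles(dets))  # the low-confidence car is excluded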
  • Item
    Deep learning for asteroid detection in large astronomical surveys : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
    (Massey University, 2022) Cowan, Preeti
    The MOA-II telescope has been operating at the Mt John Observatory since 2004 as part of a Japan/NZ collaboration looking for microlensing events. The telescope has a total field of view of 1.6 × 1.3 degrees and surveys the Galactic Bulge several times each night, which makes it particularly good for observing short-duration events. While it has been successful in discovering exoplanets, the full scientific potential of the data has not yet been realised. In particular, numerous known asteroids are hidden amongst the MOA data; these can be clearly seen upon visual inspection of selected images. There are also potentially many undiscovered asteroids captured by the telescope. As yet, no tool exists to effectively mine archival data from large astronomical surveys, such as MOA, for asteroids. The appeal of deep learning is its ability to learn useful representations from data without significant hand-engineering, making it an excellent tool for asteroid detection. Supervised learning, however, requires labelled datasets, which are also unavailable. The goal of this research is to develop datasets suitable for supervised learning and to apply several CNN-based techniques to identify asteroids in the MOA-II data. Asteroid tracklets can be clearly seen by combining all the observations from a given night, and these tracklets form the basis of the dataset. Known asteroids were identified within the composite images, forming the seed dataset for supervised learning. These images were used to train several CNNs to classify images as either containing asteroids or not. The top five networks were then configured as an ensemble that achieved a recall of 97.67%. Next, the YOLO object detector was trained to localise asteroid tracklets, achieving a mean average precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids observed by the telescope over that period. The methodologies developed can also be used by other surveys for asteroid recovery and discovery.
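    The ensemble step can be sketched in a few lines of PyTorch: average the asteroid probabilities of several trained classifiers and threshold the mean. The tiny stand-in networks below are hypothetical; they merely mimic the interface of the five trained CNNs.

        import torch
        import torch.nn as nn

        def ensemble_predict(models, images, threshold=0.5):
            """Average per-image asteroid probabilities across an ensemble."""
            with torch.no_grad():
                probs = torch.stack([torch.sigmoid(m(images)) for m in models])
            return probs.mean(dim=0) >= threshold  # True = "contains an asteroid"

        def tiny_cnn():  # stand-in for one trained network (one logit per image)
            return nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(4, 1))

        models = [tiny_cnn().eval() for _ in range(5)]
        cutouts = torch.randn(8, 1, 64, 64)             # dummy image cutouts
        print(ensemble_predict(models, cutouts).shape)  # (8, 1) boolean tensor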
  • Item
    Speech processing with deep learning for voice-based respiratory diagnosis : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
    (Massey University, 2022) Ma, Zhizhong
    Voice-based respiratory diagnosis research aims to automatically screen for and diagnose respiratory-related symptoms (e.g., smoking status, COVID-19 infection) from human-generated sounds (e.g., breath, cough, speech). It has the potential to be an objective, simple, reliable, and less time-consuming method than traditional biomedical diagnosis. In this thesis, we conduct a comprehensive literature review and propose three novel deep learning methods to enrich voice-based respiratory diagnosis research and improve its performance. Firstly, we conduct a comprehensive investigation of the effects of voice features on the detection of smoking status. Secondly, we propose a novel method that combines high-level and low-level acoustic features with deep neural networks for smoking status identification. Thirdly, we investigate various feature extraction/representation methods and propose a SincNet-based CNN method for feature representation to further improve the performance of smoking status identification. To the best of our knowledge, this is the first systematic study to apply speech processing with deep learning to voice-based smoking status identification. Moreover, we propose a novel transfer learning scheme and a task-driven feature representation method for diagnosing respiratory diseases (e.g., COVID-19) from human-generated sounds. We find that transfer learning methods using VGGish, wav2vec 2.0 and PASE+, as well as our proposed task-driven method Sinc-ResNet, achieve competitive performance compared with other work. The findings of this study provide a new perspective and insights for voice-based respiratory disease diagnosis, and the experimental results demonstrate the effectiveness of our proposed methods, showing that they achieve better performance than other existing methods.
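    The SincNet idea can be illustrated with a short PyTorch sketch: each filter of the first convolutional layer is a band-pass sinc function parametrized only by its two cutoff frequencies, which the network learns. The cutoffs and kernel size below are illustrative assumptions, not the thesis's configuration.

        import torch

        def sinc_bandpass(f1, f2, kernel_size=101, sample_rate=16000):
            """One band-pass filter of the kind a SincNet-style layer learns.

            Only the cutoffs f1 < f2 (Hz) parametrize the filter; a SincNet
            layer learns them by gradient descent."""
            n = torch.arange(kernel_size) - (kernel_size - 1) / 2
            t = n / sample_rate
            # The difference of two low-pass sinc filters is a band-pass filter.
            h = 2 * f2 * torch.sinc(2 * f2 * t) - 2 * f1 * torch.sinc(2 * f1 * t)
            return h * torch.hamming_window(kernel_size, periodic=False)

        filt = sinc_bandpass(300.0, 3400.0)  # telephone-band example
        audio = torch.randn(1, 1, 16000)     # one second of dummy audio
        out = torch.nn.functional.conv1d(audio, filt.view(1, 1, -1), padding=50)
        print(out.shape)                     # torch.Size([1, 1, 16000])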
  • Item
    Graph learning and its applications : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Albany, Auckland, New Zealand
    (Massey University, 2022) Hu, Rongyao
    Since graph features consider the correlations between pairs of data points, they provide high-order information, i.e., more complex correlations than the low-order information found in individual data points, and they have therefore attracted much attention in real applications. The key to graph feature extraction is graph construction. Previous studies have demonstrated that the quality of the graph usually determines the effectiveness of the graph feature. However, the graph is usually constructed from the original data, which often contain noise and redundancy. To address this issue, graph learning iteratively adjusts the graph and the model parameters, thereby improving the quality of the graph and outputting optimal model parameters. As a result, graph learning has become a very popular research topic in traditional machine learning and deep learning. Although previous graph learning methods have been applied in many fields by adding a graph regularization term to the objective function, they still have issues to be addressed. This thesis studies graph learning with the aim of overcoming the drawbacks of previous methods in different applications. The proposed methods are as follows.
    • We propose a traditional graph learning method under supervised learning that considers the robustness and interpretability of graph learning. Specifically, we utilize self-paced learning to assign large weights to important samples, conduct feature selection to remove redundant features, and learn a graph matrix from a low-dimensional representation of the original data to preserve its local structure. As a consequence, both important samples and useful features are used to select support vectors in the SVM framework.
    • We propose a traditional graph learning method under semi-supervised learning that explores parameter-free fusion in graph learning. Specifically, we first employ the discrete wavelet transform and the Pearson correlation coefficient to obtain multiple fully connected Functional Connectivity brain Networks (FCNs) for every subject, and then learn a sparsely connected FCN for every subject. Finally, an ℓ1-SVM is employed to learn the important features and conduct disease diagnosis.
    • We propose a deep graph learning method that considers graph fusion in graph learning. Specifically, we first employ the Simple Linear Iterative Clustering (SLIC) method to obtain multi-scale features for every image, and then design a new graph fusion method to fine-tune the features at every scale. As a result, multi-scale feature fine-tuning, graph learning, and feature learning are embedded in a unified framework.
    All proposed methods are evaluated on real-world datasets against state-of-the-art methods. Experimental results demonstrate that our methods outperform all comparison methods.
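    The graph construction these methods share can be sketched generically. The NumPy snippet below builds a k-nearest-neighbour affinity graph with heat-kernel weights and evaluates the Laplacian smoothness term tr(F^T L F) that graph-regularized objectives typically add; it is a generic illustration under assumed parameter choices, not any of the three proposed methods.

        import numpy as np

        def knn_graph(X, k=5, sigma=1.0):
            """Symmetric k-NN affinity graph with heat-kernel weights."""
            d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
            W = np.zeros_like(d2)
            for i in range(len(X)):
                nbrs = np.argsort(d2[i])[1:k + 1]  # skip self at index 0
                W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
            return np.maximum(W, W.T)              # symmetrize

        def laplacian_smoothness(W, F):
            """tr(F^T L F): small when connected samples have similar rows of F."""
            L = np.diag(W.sum(axis=1)) - W         # unnormalized graph Laplacian
            return np.trace(F.T @ L @ F)

        X = np.random.randn(30, 8)  # 30 samples, 8 features
        F = np.random.randn(30, 2)  # e.g., a low-dimensional embedding
        print(laplacian_smoothness(knn_graph(X), F))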