Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand

dc.confidentialEmbargo: No
dc.contributor.advisor: Wang, Ruili
dc.contributor.author: Ma, Yujun
dc.date.accessioned: 2024-08-13T03:28:52Z
dc.date.available: 2024-08-13T03:28:52Z
dc.date.issued: 2024-07-25
dc.description.abstract: Action recognition aims to identify human actions in videos by observing complete action executions. Current action recognition approaches are primarily based on convolutional neural networks (CNNs), Transformers, or hybrids of both. Despite their strong performance, several challenges persist: (i) insufficient disentangled modelling of spatio-temporal features, (ii) a lack of fine-grained motion modelling in action representation, and (iii) limited exploration of the positional embedding of spatial tokens. In this thesis, we introduce three novel deep-learning approaches that address these challenges and enhance spatial and temporal representation in diverse action recognition tasks, including RGB-D, coarse-grained, and fine-grained action recognition. Firstly, we develop a multi-stage factorized spatio-temporal model (MFST) for RGB-D action recognition. This model addresses the limitations of existing RGB-D approaches that rely on entangled spatio-temporal 3D convolution. The MFST employs a multi-stage hierarchical structure in which each stage constructs the spatial and temporal dimensions independently. This progression from low-level features to higher-order semantic primitives results in a robust spatio-temporal representation. Secondly, we introduce a relative-position-embedding-based, spatially and temporally decoupled Transformer (RPE-STDT) for coarse-grained and fine-grained action recognition. RPE-STDT addresses the high computational cost of Vision Transformers in video data processing, particularly the cost caused by the absolute-position embedding used in frame patch tokenization. RPE-STDT utilizes two series of Transformer encoders: spatial encoders that model token interactions within each temporal index, and temporal encoders that model interactions across the temporal dimension with a subsampling strategy. Thirdly, we propose a convolutional transformer network (CTN) for fine-grained action recognition. Traditional Transformer models require extensive training data and additional supervision to rival CNNs in learning capability. The proposed CTN merges the strengths of CNNs (e.g., weight sharing and locality) with the benefits of Transformers (e.g., dynamic attention and long-range dependency learning), allowing superior fine-grained motion representation. In summary, we contribute three deep-learning models for diverse action recognition tasks. Each model achieves state-of-the-art performance on multiple widely used benchmark datasets, as validated by thorough experimentation.
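The two sketches below are illustrative only: the thesis text is not part of this record, so every class name, layer size, and tensor shape is a hypothetical example of the general technique named in the abstract, not the author's implementation. Both are written in PyTorch. The first sketch shows factorized spatio-temporal convolution, i.e. constructing the spatial and temporal dimensions in separate steps rather than with a single entangled 3D convolution, the idea that MFST's multi-stage hierarchy builds on.

import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    # Hypothetical stage: a 2D spatial convolution followed by a 1D
    # temporal convolution, instead of one entangled 3D convolution.
    # Input and output shape: (batch, channels, frames, height, width).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.spatial(x))      # per-frame spatial features
        return self.relu(self.temporal(x))  # mix those features across frames

stages = nn.Sequential(FactorizedSTBlock(3, 64), FactorizedSTBlock(64, 128))
clip = torch.randn(2, 3, 16, 112, 112)   # (batch, RGB, frames, height, width)
features = stages(clip)                  # -> torch.Size([2, 128, 16, 112, 112])

The second sketch shows spatially and temporally decoupled self-attention in the spirit of RPE-STDT: attention runs first among the patch tokens within each frame, then among same-position tokens across frames, which is cheaper than joint spatio-temporal attention. Residual connections, normalization, the relative-position embedding, and the temporal subsampling strategy mentioned in the abstract are all omitted here.

class DecoupledSTAttention(nn.Module):
    # Hypothetical decoupled attention over (batch, frames, tokens, dim).
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                  # attend within each frame
        s, _ = self.spatial(s, s, s)
        x = s.reshape(b, t, n, d)
        u = x.transpose(1, 2).reshape(b * n, t, d)  # attend across frames
        u, _ = self.temporal(u, u, u)
        return u.reshape(b, n, t, d).transpose(1, 2)

attn = DecoupledSTAttention(dim=64)
tokens = torch.randn(2, 16, 49, 64)   # 16 frames of 7 x 7 = 49 patch tokens
out = attn(tokens)                    # -> torch.Size([2, 16, 49, 64])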
dc.identifier.uri: https://mro.massey.ac.nz/handle/10179/71268
dc.publisher: Massey University
dc.publisher: Listed in 2024 Dean's List of Exceptional Theses
dc.rights: Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.
dc.subject: Computer vision
dc.subject: Human activity recognition
dc.subject: Neural networks (Computer science)
dc.subject: Deep learning (Machine learning)
dc.subject: action recognition
dc.subject: spatio-temporal representation
dc.subject: vision transformer
dc.subject: Dean's List of Exceptional Theses
dc.subject.anzsrc: 460304 Computer vision
dc.subject.anzsrc: 461103 Deep learning
dc.title: Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand
thesis.degree.discipline: Computer Science
thesis.degree.name: Doctor of Philosophy (Ph.D.)
thesis.description.doctoral-citation-abridged: Dr. Yujun Ma's doctoral research in computer vision concentrated on developing deep learning methods for action recognition. She published 12 papers in esteemed journals and conferences, making substantial contributions that promise to influence broader computer vision applications.
thesis.description.doctoral-citation-long: Dr. Ma's research focused on computer science and computer vision, with a particular emphasis on developing deep learning-based action recognition approaches. Throughout her PhD journey, she made significant contributions by publishing 12 papers, including papers in top-tier conferences and journals. Her work has the potential to impact and advance other computer vision tasks, demonstrating her innovative approach and dedication to the field.
thesis.description.name-pronounciation: Yujun Ma

Files

Original bundle

Name: MaPhDThesis.pdf
Size: 11.89 MB
Format: Adobe Portable Document Format