Deep learning for action recognition in videos : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University, Albany, Auckland, New Zealand

Date
2021
Publisher
Massey University
Rights
The Author
Abstract
Video action recognition is a challenging task in video processing. In this thesis, we propose three novel deep learning approaches to improve the accuracy of action recognition.

The first approach aims to learn multi-cue spatiotemporal features by performing 3D convolutions. Previous 3D CNN models mainly perform 3D convolutions on individual cues (e.g., appearance or motion), and therefore lack an effective overall integration of the appearance and motion information in videos. To address this issue, we propose a novel multi-cue 3D convolutional neural network (the M3D model for short), which directly integrates three individual cues (i.e., an appearance cue, a direct motion cue, and a salient motion cue). The M3D model performs 3D convolutions on multiple cues rather than a single cue, obtaining more discriminative and robust features by integrating the three cues as a whole. In addition, we propose a deep residual multi-cue 3D convolution model (R-M3D for short), which benefits from increased model depth to obtain more representative spatiotemporal features.

The second approach utilizes motion saliency information to enhance the accuracy of action recognition. We propose a novel motion saliency based multi-stream multiplier ResNet model (MSM-ResNets for short). The MSM-ResNets model consists of three interacting streams: an appearance stream, a motion stream, and a motion saliency stream. The appearance stream captures appearance information from RGB video frames, the motion stream captures motion information from optical flow frames, and the motion saliency stream captures salient motion information from motion saliency frames. To exploit the complementary information between streams over time, the MSM-ResNets model establishes multiplicative connections between them: one set of connections transmits the motion cue from the motion stream to the appearance stream, and the other transmits the motion saliency cue from the motion saliency stream to the motion stream.

The third approach explores salient spatiotemporal information over time. We propose a novel spatial and temporal saliency based four-stream network with multi-task learning (the 3M model for short). The 3M model comprises two parts: (i) a spatial and temporal saliency based four-stream network, consisting of an appearance stream, a motion stream, a novel spatial saliency stream that acquires spatial saliency information, and a novel temporal saliency stream that acquires temporal saliency information; and (ii) a multi-task learning based long short-term memory (LSTM) network, which takes the feature representations produced by the convolutional neural networks (CNNs) as input. The multi-task learning based LSTM shares complementary knowledge between the streams and captures the long-term dependencies of consecutive frames.
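To make the multi-cue idea in the first approach concrete, the sketch below stacks an appearance cue, an optical-flow motion cue, and a motion saliency cue along the channel axis and applies a single joint 3D convolution. This is a minimal PyTorch illustration rather than the thesis implementation; the class name MultiCue3DBlock, the channel counts, kernel size, and input resolution are all assumptions.

```python
import torch
import torch.nn as nn


class MultiCue3DBlock(nn.Module):
    """Toy multi-cue 3D convolution block: the three cues (RGB appearance,
    optical-flow motion, motion saliency) are concatenated along the channel
    axis and convolved jointly, so each spatiotemporal filter sees all cues.
    Channel counts and kernel size are illustrative assumptions."""

    def __init__(self, out_channels=64):
        super().__init__()
        # 3 RGB channels + 2 optical-flow channels + 1 saliency channel = 6
        self.conv = nn.Conv3d(6, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, rgb, flow, saliency):
        # each tensor: (batch, channels, time, height, width)
        x = torch.cat([rgb, flow, saliency], dim=1)  # fuse cues before convolving
        return self.relu(self.bn(self.conv(x)))


# usage: 8-frame clips at 112x112 resolution (illustrative sizes)
rgb = torch.randn(2, 3, 8, 112, 112)
flow = torch.randn(2, 2, 8, 112, 112)
sal = torch.randn(2, 1, 8, 112, 112)
out = MultiCue3DBlock()(rgb, flow, sal)  # -> (2, 64, 8, 112, 112)
```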
Experiments verify the effectiveness of all the proposed models and show that they achieve better performance than state-of-the-art methods.
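Similarly, the following is a minimal sketch of the multiplicative cross-stream connections described in the second approach: the motion saliency stream gates the motion stream, and the motion stream gates the appearance stream, before each stream's residual addition. The module and variable names, channel counts, and the exact residual form are assumptions for illustration only, not the MSM-ResNets architecture itself.

```python
import torch
import torch.nn as nn


def conv_block(channels):
    """One illustrative residual-branch convolution shared by all streams."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )


class MultiplierBlock(nn.Module):
    """Toy three-stream block with multiplicative connections between streams."""

    def __init__(self, channels=64):
        super().__init__()
        self.appearance = conv_block(channels)
        self.motion = conv_block(channels)
        self.saliency = conv_block(channels)

    def forward(self, app, mot, sal):
        s = self.saliency(sal)
        m = self.motion(mot) * s          # motion saliency cue -> motion stream
        a = self.appearance(app) * m      # motion cue -> appearance stream
        return app + a, mot + m, sal + s  # residual connection per stream


# usage with dummy 64-channel feature maps (illustrative sizes)
app = torch.randn(2, 64, 28, 28)
mot = torch.randn(2, 64, 28, 28)
sal = torch.randn(2, 64, 28, 28)
app_out, mot_out, sal_out = MultiplierBlock()(app, mot, sal)
```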
Keywords
Human activity recognition, Computer vision, Machine learning