k-NN attention-based video vision transformer for action recognition

Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long sequences of tokens over the spatial and temporal dimensions of a video. The fully connected self-attention layer is the core component of the vanilla Transformer. However, this architecture is redundant: it ignores the locality of video frame patches, so it attends to non-informative tokens and potentially increases computational complexity. To address this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We apply k-NN attention to the Video Vision Transformer (ViViT) in place of the original self-attention, which can optimize the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.
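
The abstract does not spell out how k-NN attention is computed. As a rough illustration only, the sketch below assumes the common formulation in which each query keeps its k highest-scoring keys and masks out the rest before the softmax. The function name `knn_attention`, the single-head NumPy layout, and the toy token shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def knn_attention(q, k, v, top_k):
    """Single-head attention in which each query attends only to its
    top_k highest-scoring keys (illustrative sketch, not the paper's code)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (n_q, n_k) scaled dot products
    top_k = min(top_k, scores.shape[-1])
    # Per-row threshold: the top_k-th largest score in each row.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    # Keep scores at or above the threshold; exact ties may keep extra keys.
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax; masked entries become exactly zero.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy input: 8 tokens of dimension 16 (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
out, w = knn_attention(tokens, tokens, tokens, top_k=3)
```

Under these assumptions, setting `top_k` to the full sequence length recovers ordinary softmax self-attention, while smaller values zero out the attention weights of low-similarity tokens.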

Keywords

Action recognition, Vision transformer, Transformer, Attention mechanism

Citation

Sun W, Ma Y, Wang R. (2024). k-NN attention-based video vision transformer for action recognition. Neurocomputing. 574.

URI

https://mro.massey.ac.nz/handle/10179/71614

Collections

Journal Articles
