k-NN attention-based video vision transformer for action recognition

Date
2024-03-14
Publisher
Elsevier B.V.
Rights
(c) 2024 The Author/s
CC BY 4.0
Abstract
Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long sequences of tokens across the spatial and temporal dimensions of a video. The fully connected self-attention layer is the core component of the vanilla Transformer. However, the redundant architecture of the Vision Transformer ignores the locality of video frame patches, which brings in non-informative tokens and can increase computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We replace the original self-attention in the Video Vision Transformer (ViViT) with k-NN attention, which can optimize the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves superior accuracy compared with several state-of-the-art models on these action recognition datasets.
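As a rough illustration of the k-NN attention idea named in the abstract, the following minimal sketch lets each query token attend only to its highest-scoring keys and gives every other token zero weight. It is not the authors' implementation: it assumes a single head with no learned projections, and the names `knn_attention` and `top_k` are hypothetical. Note that this toy version still computes every query-key score before selecting, so it illustrates the token filtering only, not any efficiency gain.

```python
import math


def knn_attention(queries, keys, values, top_k):
    """Toy single-head attention where each query keeps only its top_k keys.

    queries, keys, values: lists of equal-length float vectors, one per token.
    top_k: number of keys each query may attend to (hypothetical name).
    Returns (outputs, weights); weights[i][j] is 0.0 for every dropped key j.
    """
    dim = len(queries[0])
    top_k = min(top_k, len(keys))
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Indices of the top_k highest-scoring keys; all others are dropped.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_k]
        # Softmax over the kept scores only (shifted by the max for stability).
        peak = max(scores[j] for j in keep)
        exps = {j: math.exp(scores[j] - peak) for j in keep}
        total = sum(exps.values())
        weights = [exps.get(j, 0.0) / total for j in range(len(keys))]
        # Weighted sum of the value vectors over the kept keys.
        out = [sum(weights[j] * values[j][c] for j in keep)
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Setting `top_k` to the number of tokens recovers ordinary dense softmax attention in this sketch, which makes the selection step easy to compare against the baseline.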
Keywords
Action recognition, Vision transformer, Transformer, Attention mechanism
Citation
Sun W, Ma Y, Wang R. (2024). k-NN attention-based video vision transformer for action recognition. Neurocomputing. 574.