k-NN attention-based video vision transformer for action recognition

Date

2024-03-14

Publisher

Elsevier B.V.

Rights

(c) 2024 The Author/s
CC BY 4.0

Abstract

Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT), which models a video as a long sequence of tokens indexed over space and time, has achieved remarkable performance on action recognition. The fully connected self-attention layer is the key component of the vanilla Transformer. However, this redundant architecture ignores the locality of video frame patches, so it attends to non-informative tokens and potentially increases computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We replace the original self-attention in the Video Vision Transformer (ViViT) with k-NN attention, which can improve the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves higher accuracy than several state-of-the-art models on these action recognition datasets.
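
The abstract does not spell out the k-NN attention computation, so the following is only a minimal sketch of the general idea under a common assumption: for each query token, keep the k keys with the highest scaled dot-product scores and apply the softmax over those alone, so every other token receives zero weight. The function name `knn_attention` and the single-head, list-based layout are illustrative choices, not the authors' implementation.

```python
import math


def knn_attention(queries, keys, values, k):
    """Single-head k-NN attention over lists of equal-length vectors.

    Assumption: "k-NN attention" means each query attends only to the
    k keys with the largest scaled dot-product scores.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        # Indices of the k largest scores (ties go to the lower index).
        top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
        # Numerically stable softmax restricted to the selected keys.
        peak = max(scores[i] for i in top)
        exps = {i: math.exp(scores[i] - peak) for i in top}
        total = sum(exps.values())
        # Weighted sum of the selected value vectors.
        out = [0.0] * len(values[0])
        for i, e in exps.items():
            weight = e / total
            for j, v in enumerate(values[i]):
                out[j] += weight * v
        outputs.append(out)
    return outputs
```

With k equal to the number of tokens this reduces to ordinary softmax self-attention; with smaller k, low-scoring tokens drop out of the weighted sum entirely, which is the filtering of irrelevant or noisy tokens that the abstract describes.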

Keywords

Action recognition, Vision transformer, Transformer, Attention mechanism

Citation

Sun W, Ma Y, Wang R. (2024). k-NN attention-based video vision transformer for action recognition. Neurocomputing. 574.

Creative Commons license

Except where otherwise noted, this item's license is described as (c) 2024 The Author/s