k-NN attention-based video vision transformer for action recognition
dc.citation.volume | 574 | |
dc.contributor.author | Sun W | |
dc.contributor.author | Ma Y | |
dc.contributor.author | Wang R | |
dc.date.accessioned | 2024-10-07T20:45:01Z | |
dc.date.available | 2024-10-07T20:45:01Z | |
dc.date.issued | 2024-03-14 | |
dc.description.abstract | Action Recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long token sequences over the spatial and temporal dimensions of a video. The fully-connected self-attention layer is the fundamental building block of the vanilla Transformer. However, this redundant architecture ignores the locality of video frame patches, involves non-informative tokens, and potentially leads to increased computational complexity. To address this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We apply k-NN attention within the Video Vision Transformer (ViViT) in place of the original self-attention, which optimizes the training process and discards irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results illustrate that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets. | |
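The abstract describes replacing full self-attention with k-NN attention, where each query attends only to its k most similar keys so irrelevant or noisy tokens receive zero weight. A minimal single-head sketch of that idea in NumPy (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def knn_attention(Q, K, V, k):
    """Scaled dot-product attention restricted to the top-k keys per query.

    Q, K, V: (n_tokens, d) arrays; k: number of neighbour tokens each
    query may attend to. This is a sketch of the general k-NN attention
    mechanism, not the authors' implementation.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n, n) scaled dot-product scores
    # Threshold at each row's k-th largest score; everything below it is
    # masked to -inf so softmax assigns it exactly zero weight.
    kth = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the surviving (top-k) entries.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With k equal to the sequence length this reduces to ordinary softmax attention; smaller k sparsifies each query's attention to its nearest neighbours in key space.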
dc.description.confidential | false | |
dc.edition.edition | March 2024 | |
dc.identifier.citation | Sun W, Ma Y, Wang R. (2024). k-NN attention-based video vision transformer for action recognition. Neurocomputing. 574. | |
dc.identifier.doi | 10.1016/j.neucom.2024.127256 | |
dc.identifier.eissn | 1872-8286 | |
dc.identifier.elements-type | journal-article | |
dc.identifier.issn | 0925-2312 | |
dc.identifier.number | 127256 | |
dc.identifier.uri | https://mro.massey.ac.nz/handle/10179/71614 | |
dc.language | English | |
dc.publisher | Elsevier B.V. | |
dc.publisher.uri | https://www.sciencedirect.com/science/article/pii/S0925231224000274 | |
dc.relation.isPartOf | Neurocomputing | |
dc.rights | (c) 2024 The Author/s | |
dc.rights | CC BY 4.0 | |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
dc.subject | Action recognition | |
dc.subject | Vision transformer | |
dc.subject | Transformer | |
dc.subject | Attention mechanism | |
dc.title | k-NN attention-based video vision transformer for action recognition | |
dc.type | Journal article | |
pubs.elements-id | 486357 | |
pubs.organisational-group | Other |