k-NN attention-based video vision transformer for action recognition

dc.citation.volume574
dc.contributor.authorSun W
dc.contributor.authorMa Y
dc.contributor.authorWang R
dc.date.accessioned2024-10-07T20:45:01Z
dc.date.available2024-10-07T20:45:01Z
dc.date.issued2024-03-14
dc.description.abstractAction recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long token sequences over the spatial and temporal dimensions of a video. The fully-connected self-attention layer is the fundamental component of the vanilla Transformer. However, this redundant architecture ignores the locality of video frame patches, involves non-informative tokens, and potentially increases computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We apply k-NN attention to the Video Vision Transformer (ViViT) in place of the original self-attention, which optimizes the training process and discards irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results illustrate that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.
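The k-NN attention described in the abstract keeps, for each query token, only its k most similar key tokens and masks out the rest before the softmax. The following is a minimal single-head NumPy sketch of that idea (not the authors' implementation; the function name, shapes, and masking strategy are illustrative assumptions):

```python
import numpy as np

def knn_attention(q, k, v, topk):
    """Single-head k-NN attention sketch: each query attends only to its
    top-k most similar keys; the remaining scores are masked to -inf so
    irrelevant or noisy tokens receive zero attention weight."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)               # (n_q, n_k) scaled similarities
    # Indices of the k largest scores in each query row.
    idx = np.argpartition(scores, -topk, axis=-1)[:, -topk:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)  # keep top-k, mask the rest
    scores = scores + mask
    # Softmax over the surviving scores; exp(-inf) = 0 for masked entries.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 query tokens, dim 8 (toy sizes)
k = rng.standard_normal((6, 8))   # 6 key tokens
v = rng.standard_normal((6, 8))
out = knn_attention(q, k, v, topk=2)
print(out.shape)  # (4, 8)
```

Setting `topk` to the full number of keys recovers ordinary softmax attention; smaller values sparsify the attention map, which is the mechanism the paper uses to ignore non-informative patch tokens.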
dc.description.confidentialfalse
dc.edition.editionMarch 2024
dc.identifier.citationSun W, Ma Y, Wang R. (2024). k-NN attention-based video vision transformer for action recognition. Neurocomputing. 574.
dc.identifier.doi10.1016/j.neucom.2024.127256
dc.identifier.eissn1872-8286
dc.identifier.elements-typejournal-article
dc.identifier.issn0925-2312
dc.identifier.number127256
dc.identifier.urihttps://mro.massey.ac.nz/handle/10179/71614
dc.languageEnglish
dc.publisherElsevier B.V.
dc.publisher.urihttps://www.sciencedirect.com/science/article/pii/S0925231224000274
dc.relation.isPartOfNeurocomputing
dc.rights© 2024 The Author(s)
dc.rightsCC BY 4.0
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectAction recognition
dc.subjectVision transformer
dc.subjectTransformer
dc.subjectAttention mechanism
dc.titlek-NN attention-based video vision transformer for action recognition
dc.typeJournal article
pubs.elements-id486357
pubs.organisational-groupOther