Abstract
Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which degrades performance when test videos present the sub-actions in a different order. In this work, we investigate whether shrinking the temporal receptive field of action recognition models improves their robustness to the sub-action order. To this end, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that models with short temporal receptive fields are robust to sub-action order changes, while models with larger temporal receptive fields are sensitive to the sub-action order.
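To make the notion of a limited temporal receptive field concrete, the sketch below computes how many input frames a single output unit can see after a stack of temporal convolutions. It uses the standard receptive-field recurrence for stacked convolutions. The function name `temporal_receptive_field` and the layer stacks are hypothetical illustrations, not the published Video BagNet configuration; they were chosen only because they happen to reproduce the sizes 1, 9, 17 and 33 mentioned in the abstract.

```python
from typing import Iterable, Tuple


def temporal_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Return the temporal receptive field, in frames, of a layer stack.

    Each layer is a (temporal_kernel_size, temporal_stride) pair.
    """
    rf = 1    # one output unit initially covers a single frame
    jump = 1  # distance, in input frames, between neighbouring units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # striding spreads later layers further apart
    return rf


# Hypothetical stacks (assumptions for illustration, not the paper's layers).
frame_wise = [(1, 1)] * 16      # temporal kernel 1 everywhere: no temporal mixing
four_temporal = [(3, 1)] * 4    # four unstrided temporal kernels of size 3
eight_temporal = [(3, 1)] * 8
sixteen_temporal = [(3, 1)] * 16

for name, stack in [
    ("frame_wise", frame_wise),
    ("four_temporal", four_temporal),
    ("eight_temporal", eight_temporal),
    ("sixteen_temporal", sixteen_temporal),
]:
    print(name, temporal_receptive_field(stack))
```

Under these assumptions, replacing a temporal kernel of size 3 with a kernel of size 1 removes that layer's contribution to the receptive field, which is one plausible way to cap how much sub-action ordering a model can observe at once.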
Original language | English |
---|---|
Title of host publication | Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) |
Editors | Cristina Ceballos |
Place of Publication | Piscataway |
Publisher | IEEE |
Pages | 159-166 |
Number of pages | 8 |
ISBN (Electronic) | 979-8-3503-0744-3 |
ISBN (Print) | 979-8-3503-0745-0 |
Publication status | Published - 2023 |
Event | 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) - Paris, France; Duration: 2 Oct 2023 → 6 Oct 2023 |
Conference
Conference | 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) |
---|---|
Country/Territory | France |
City | Paris |
Period | 2/10/23 → 6/10/23 |
Bibliographical note
Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
Datasets
Code underlying the publication: "Video BagNet: short temporal receptive fields increase robustness in long-term action recognition"
van Gemert, J. C. (Creator), Strafforello, O. (Creator), Liu, X. (Creator) & Schutte, K. (Creator), TU Delft - 4TU.ResearchData, 24 May 2024
DOI: 10.4121/DC5E2FB8-6005-40CD-9AFA-FF03C57D0A23
Dataset/Software: Software