Description
Previous work on long-term video action recognition relies on deep 3D-convolutional models with a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which decreases performance when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. To this end, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that models with short temporal receptive fields are robust to sub-action order changes, while models with larger temporal receptive fields are sensitive to the sub-action order. In this repository, we provide our code, including the implementation of Video BagNet.
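To make the notion of a limited temporal receptive field concrete, the sketch below computes the temporal receptive field of a stack of convolutions from their temporal kernel sizes and strides, using the standard receptive-field recurrence. The layer layout is a hypothetical illustration, not the configuration used in the repository: it assumes 16 convolutions with no temporal downsampling, of which a chosen number use a temporal kernel of 3 and the rest a temporal kernel of 1. The names `temporal_receptive_field` and `hypothetical_stack` are ours.

```python
from typing import Iterable, List, Tuple


def temporal_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Return the temporal receptive field, in frames, of stacked convolutions.

    `layers` lists (temporal kernel size, temporal stride) from input to output.
    """
    rf = 1    # a single output unit initially sees one frame
    jump = 1  # distance in input frames between neighbouring units
    for kernel, stride in layers:
        if kernel < 1 or stride < 1:
            raise ValueError("kernel and stride must be positive integers")
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def hypothetical_stack(num_temporal_layers: int, total_layers: int = 16) -> List[Tuple[int, int]]:
    """Build an assumed layout: some layers with temporal kernel 3, the rest with kernel 1.

    All temporal strides are 1 (an assumption made for this illustration only).
    """
    if not 0 <= num_temporal_layers <= total_layers:
        raise ValueError("num_temporal_layers must lie in [0, total_layers]")
    return [(3, 1)] * num_temporal_layers + [(1, 1)] * (total_layers - num_temporal_layers)


if __name__ == "__main__":
    for n in (0, 4, 8, 16):
        print(n, temporal_receptive_field(hypothetical_stack(n)))
```

Under these assumptions, using 0, 4, 8 or 16 layers with a temporal kernel of 3 gives receptive fields of 1, 9, 17 and 33 frames. Replacing a temporal kernel of 3 with a kernel of 1 removes that layer's contribution, which is one simple way to shrink the temporal receptive field while keeping network depth unchanged.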
| Field | Value |
|---|---|
| Date made available | 24 May 2024 |
| Publisher | TU Delft - 4TU.ResearchData |
| Date of data production | 2024 - |
Research output
- 1 Conference contribution
- Video BagNet: Short temporal receptive fields increase robustness in long-term action recognition
  Strafforello, O., Liu, X., Schutte, K. & van Gemert, J., 2023, Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Ceballos, C. (ed.). Piscataway: IEEE, p. 159-166, 8 p.
  Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review