No frame left behind: Full Video Action Recognition

Xin Liu; Silvia L. Pintea; Fatemeh Karimi Nejadasl; Olaf Booij; Jan C. van Gemert

doi:10.1109/CVPR46437.2021.01465

Pattern Recognition and Bioinformatics

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

17 Citations (Scopus)

Abstract

Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods.
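The general idea in the abstract (group temporally adjacent frames whose activations are similar under a Hamming distance, then aggregate each group into one representation) can be illustrated with a small sketch. This is not the authors' implementation: the function name `segment_and_aggregate`, the sign-based binarization, the choice of cutting at the largest consecutive-frame distances, and the mean aggregation are all assumptions made for illustration only.

```python
import numpy as np


def segment_and_aggregate(features, num_segments):
    """Illustrative sketch (hypothetical helper, not the paper's method).

    features: array of shape (T, D), one activation vector per frame.
    num_segments: number of aggregated representations to produce.
    Returns (aggregated, labels): a (num_segments, D) array of segment
    means and a length-T array with each frame's segment index.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    if not 1 <= num_segments <= num_frames:
        raise ValueError("num_segments must be between 1 and the number of frames")

    # Assumption: binarize activations by sign to obtain binary codes.
    codes = features > 0

    # Hamming distance between each frame and its successor, shape (T - 1,).
    dists = np.count_nonzero(codes[1:] != codes[:-1], axis=1)

    # Assumption: place segment boundaries at the largest jumps.
    # A stable sort breaks ties in favor of earlier frames.
    order = np.argsort(-dists, kind="stable")
    cuts = np.sort(order[: num_segments - 1]) + 1

    # Split along time and average the frames inside each segment.
    segments = np.split(features, cuts)
    aggregated = np.stack([seg.mean(axis=0) for seg in segments])
    labels = np.repeat(np.arange(num_segments), [len(seg) for seg in segments])
    return aggregated, labels
```

Because boundaries are only placed between consecutive frames, every segment is contiguous in time, which is one simple way to keep the grouping temporally localized. A learned, end-to-end trainable variant as described in the abstract would differ in how similarity is defined and optimized.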

Original language	English
Title of host publication	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Subtitle of host publication	Proceedings
Editors	L. O'Conner
Place of Publication	Piscataway
Publisher	IEEE
Pages	14887-14896
Number of pages	10
ISBN (Electronic)	978-1-6654-4509-2
ISBN (Print)	978-1-6654-4510-8
DOIs	https://doi.org/10.1109/CVPR46437.2021.01465
Publication status	Published - 2021
Event	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Virtual at Nashville, United States
Duration	20 Jun 2021 → 25 Jun 2021

Conference

Conference	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Abbreviated title	CVPR 2021
Country/Territory	United States
City	Virtual at Nashville
Period	20/06/21 → 25/06/21

Access to Document

10.1109/CVPR46437.2021.01465

Cite this

@inproceedings{36a98ed0ad254691a6c398ac48311104,

title = "No frame left behind: Full Video Action Recognition",

abstract = "Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods.",

author = "Xin Liu and Pintea, {Silvia L.} and Nejadasl, {Fatemeh Karimi} and Olaf Booij and {van Gemert}, {Jan C.}",

year = "2021",

doi = "10.1109/CVPR46437.2021.01465",

language = "English",

isbn = "978-1-6654-4510-8",

pages = "14887--14896",

editor = "L. O'Conner",

booktitle = "2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)",

publisher = "IEEE",

address = "United States",

note = "2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021; Conference date: 20-06-2021 Through 25-06-2021",

}

Liu, X, Pintea, SL, Nejadasl, FK, Booij, O & van Gemert, JC 2021, No frame left behind: Full Video Action Recognition. in L O'Conner (ed.), 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Proceedings., 9578276, IEEE, Piscataway, pp. 14887-14896, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual at Nashville, United States, 20/06/21. https://doi.org/10.1109/CVPR46437.2021.01465

No frame left behind: Full Video Action Recognition. / Liu, Xin; Pintea, Silvia L.; Nejadasl, Fatemeh Karimi et al.
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Proceedings. ed. / L. O'Conner. Piscataway: IEEE, 2021. p. 14887-14896 9578276.

TY - GEN

T1 - No frame left behind

T2 - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition

AU - Liu, Xin

AU - Pintea, Silvia L.

AU - Nejadasl, Fatemeh Karimi

AU - Booij, Olaf

AU - van Gemert, Jan C.

PY - 2021

Y1 - 2021

N2 - Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods.

AB - Not all video frames are equally informative for recognizing an action. It is computationally infeasible to train deep networks on all video frames when actions develop over hundreds of frames. A common heuristic is uniformly sampling a small number of video frames and using these to recognize the action. Instead, here we propose full video action recognition and consider all video frames. To make this computationally tractable, we first cluster all frame activations along the temporal dimension based on their similarity with respect to the classification task, and then temporally aggregate the frames in the clusters into a smaller number of representations. Our method is end-to-end trainable and computationally efficient as it relies on temporally localized clustering in combination with fast Hamming distances in feature space. We evaluate on UCF101, HMDB51, Breakfast, and Something-Something V1 and V2, where we compare favorably to existing heuristic frame sampling methods.

UR - http://www.scopus.com/inward/record.url?scp=85121202112&partnerID=8YFLogxK

U2 - 10.1109/CVPR46437.2021.01465

DO - 10.1109/CVPR46437.2021.01465

M3 - Conference contribution

SN - 978-1-6654-4510-8

SP - 14887

EP - 14896

BT - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

A2 - O'Conner, L.

PB - IEEE

CY - Piscataway

Y2 - 20 June 2021 through 25 June 2021

ER -
