TY - JOUR
T1 - Tubelets
T2 - Unsupervised Action Proposals from Spatiotemporal Super-Voxels
AU - Jain, Mihir
AU - van Gemert, Jan
AU - Jégou, Hervé
AU - Bouthemy, Patrick
AU - Snoek, Cees G.M.
PY - 2017
Y1 - 2017
N2 - This paper considers the problem of localizing actions in videos as sequences of bounding boxes. The objective is to generate action proposals that are likely to include the action of interest, ideally achieving high recall with few proposals. Our contributions are threefold. First, inspired by selective search for object proposals, we introduce an approach to generate action proposals from spatiotemporal super-voxels in an unsupervised manner; we call these proposals Tubelets. Second, along with static features from individual frames, our approach advantageously exploits motion. We introduce independent motion evidence as a feature to characterize how the action deviates from the background, and we explicitly incorporate such motion information in various stages of the proposal generation. Finally, we introduce spatiotemporal refinement of Tubelets for more precise localization of actions, and pruning to keep the number of Tubelets limited. We demonstrate the suitability of our approach by extensive experiments on action proposal quality and action localization on three public datasets: UCF Sports, MSR-II and UCF101. For action proposal quality, our unsupervised proposals beat all other existing approaches on the three datasets. For action localization, we show top performance on the trimmed videos of UCF Sports and UCF101 as well as the untrimmed videos of MSR-II.
KW - Action classification
KW - Action localization
KW - Video representation
UR - http://www.scopus.com/inward/record.url?scp=85020422221&partnerID=8YFLogxK
UR - http://resolver.tudelft.nl/uuid:65788802-2a35-429b-887a-40e5bff26663
U2 - 10.1007/s11263-017-1023-9
DO - 10.1007/s11263-017-1023-9
M3 - Article
AN - SCOPUS:85020422221
SN - 0920-5691
VL - 124
SP - 287
EP - 311
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 3
ER -