Long-term behaviour recognition in videos with actor-focused region attention

Luca Ballan, Ombretta Strafforello, Klamer Schutte

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Long-term activities involve humans performing complex, minutes-long actions. Unlike the actions studied in traditional action recognition, complex activities are normally composed of a set of sub-actions that can vary in order, duration, and quantity. These aspects introduce a large intra-class variability that can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a specially fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in the videos are centered around the human who is carrying out the activity, we introduce an Actor Focus mechanism to enhance the feature extraction in both the training and inference phases. Our experiments show that the Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, substantially improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
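The idea of embedding features from a variable number of video regions into one fixed-length representation can be illustrated with generic softmax attention pooling. The sketch below is not the authors' architecture: the function names, the dot-product scoring, and the toy feature values are all assumptions made for illustration, and the scoring vector stands in for parameters that would normally be learned.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention_pool(
    region_features: Sequence[Sequence[float]],
    scoring_vector: Sequence[float],
) -> List[float]:
    """Collapse N region features of size D into one D-sized vector.

    Hypothetical scoring rule: a region's score is the dot product of its
    feature with `scoring_vector` (a stand-in for learned parameters).
    """
    if not region_features:
        raise ValueError("need at least one region")
    dim = len(scoring_vector)
    if any(len(f) != dim for f in region_features):
        raise ValueError("feature size mismatch")
    # One scalar score per region, normalised into attention weights.
    scores = [
        sum(x * w for x, w in zip(f, scoring_vector)) for f in region_features
    ]
    weights = softmax(scores)
    # Attention-weighted sum over regions, computed per feature dimension.
    return [
        sum(wt * f[d] for wt, f in zip(weights, region_features))
        for d in range(dim)
    ]


# Toy inputs (made up): videos with different numbers of regions.
short_video = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
long_video = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
]
w = [0.1, -0.2, 0.3]
pooled_short = region_attention_pool(short_video, w)
pooled_long = region_attention_pool(long_video, w)
```

Both pooled vectors have the same length regardless of how many regions went in, which is the property that lets a downstream classifier handle activities whose sub-actions differ in number and duration.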

Original language: English
Title of host publication: VISAPP
Editors: Giovanni Maria Farinella, Petia Radeva, Jose Braz, Kadi Bouatouch
Publisher: SciTePress
Pages: 362-369
Number of pages: 8
ISBN (Electronic): 9789897584886
DOIs
Publication status: Published - 2021
Event: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021 - Virtual, Online
Duration: 8 Feb 2021 to 10 Feb 2021

Publication series

Name: VISIGRAPP 2021 - Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Volume: 5

Conference

Conference: 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021
City: Virtual, Online
Period: 8/02/21 to 10/02/21

Keywords

  • 3D convolutional neural networks
  • Action recognition
  • Region attention
  • Video classification
