TY - GEN
T1 - Long-term behaviour recognition in videos with actor-focused region attention
AU - Ballan, Luca
AU - Strafforello, Ombretta
AU - Schutte, Klamer
PY - 2021
N2 - Long-term activities involve humans performing complex, minutes-long actions. Unlike traditional action recognition, complex activities are normally composed of sub-actions that can vary in order, duration, and quantity. These aspects introduce a large intra-class variability that can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a specially fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in a video are centered around the human carrying out the activity, we introduce an Actor Focus mechanism to enhance feature extraction in both the training and inference phases. Our experiments show that the Multi-Regional fine-tuned 3D-CNN, combined with Actor Focus and Region Attention, largely improves the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
KW - 3D convolutional neural networks
KW - Action recognition
KW - Region attention
KW - Video classification
UR - http://www.scopus.com/inward/record.url?scp=85102967224&partnerID=8YFLogxK
DO - 10.5220/0010215803620369
M3 - Conference contribution
AN - SCOPUS:85102967224
T3 - VISIGRAPP 2021 - Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
SP - 362
EP - 369
BT - VISAPP
A2 - Farinella, Giovanni Maria
A2 - Radeva, Petia
A2 - Braz, Jose
A2 - Bouatouch, Kadi
PB - SciTePress
T2 - 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021
Y2 - 8 February 2021 through 10 February 2021
ER -