Few-shot Learning for Fine-grained Emotion Recognition using Physiological Signals

Tianyi Zhang; Abdallah El Ali; Alan Hanjalic; Pablo Cesar

doi:10.1109/TMM.2022.3165715

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN), which can rapidly converge on a small amount of training data, typically fewer than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets collected in three different environments: desktop, mobile, and HMD-based virtual reality. Our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We obtain an average accuracy of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, using only 5 shots of training data. Our experiments also show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching.
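
The abstract describes EmoDSN only at a high level: a Siamese network that pushes apart signal segments with different V-A labels and is trained on a handful of samples per class. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes a standard pairwise contrastive loss (a common choice for Siamese networks, not confirmed by the abstract) and a nearest-support rule for few-shot classification. The function names, the margin value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Pairwise contrastive loss (assumed form, not taken from the paper).

    Same-label pairs are penalized by their squared distance.
    Different-label pairs are penalized only while closer than `margin`,
    which pushes them apart during training.
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


def classify_few_shot(query, support):
    """Return the label of the support embedding nearest to `query`.

    `support` maps each label to a short list of embeddings (the "shots").
    """
    best_label, best_dist = None, float("inf")
    for label, shots in support.items():
        for shot in shots:
            d = euclidean(query, shot)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


# Toy embeddings standing in for the output of a trained network
# (hypothetical values, two shots per class).
support = {
    "high_arousal": [[0.9, 0.8], [1.0, 0.7]],
    "low_arousal": [[0.1, 0.2], [0.0, 0.3]],
}

print(classify_few_shot([0.8, 0.9], support))  # high_arousal
```

In a real pipeline the embeddings would come from a shared-weight network applied to physiological signal segments; here they are hard-coded so that the loss and the classification rule can be read in isolation.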

Original language	English
Article number	9751421
Pages (from-to)	3773-3787
Number of pages	15
Journal	IEEE Transactions on Multimedia
Volume	25
DOIs	https://doi.org/10.1109/TMM.2022.3165715
Publication status	E-pub ahead of print - 2022

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

emotion recognition
deep siamese network
physiological signals
small data

Access to Document

10.1109/TMM.2022.3165715

Few-Shot_Learning_for_Fine-Grained_Emotion_Recognition_Using_Physiological_Signals: Final published version, 4.25 MB

Cite this

@article{099b5b2fef8a4593bd858b188312a7b7,

title = "Few-shot Learning for Fine-grained Emotion Recognition using Physiological Signals",

abstract = "Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN), which can rapidly converge on a small amount of training data, typically fewer than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets collected in three different environments: desktop, mobile, and HMD-based virtual reality. Our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We obtain an average accuracy of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, using only 5 shots of training data. Our experiments also show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching.",

keywords = "emotion recognition, deep siamese network, physiological signals, small data",

author = "Tianyi Zhang and {El Ali}, Abdallah and Alan Hanjalic and Pablo Cesar",

note = "Green Open Access added to TU Delft Institutional Repository {\textquoteleft}You share, we take care!{\textquoteright} -- Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.",

year = "2022",

doi = "10.1109/TMM.2022.3165715",

language = "English",

volume = "25",

pages = "3773--3787",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "IEEE",

}

TY - JOUR

T1 - Few-shot Learning for Fine-grained Emotion Recognition using Physiological Signals

AU - Zhang, Tianyi

AU - El Ali, Abdallah

AU - Hanjalic, Alan

AU - Cesar, Pablo

N1 - Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

PY - 2022

Y1 - 2022

N2 - Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN), which can rapidly converge on a small amount of training data, typically fewer than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets collected in three different environments: desktop, mobile, and HMD-based virtual reality. Our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We obtain an average accuracy of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, using only 5 shots of training data. Our experiments also show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching.

AB - Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN), which can rapidly converge on a small amount of training data, typically fewer than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets collected in three different environments: desktop, mobile, and HMD-based virtual reality. Our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We obtain an average accuracy of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, using only 5 shots of training data. Our experiments also show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching.

KW - emotion recognition

KW - deep siamese network

KW - physiological signals

KW - small data

UR - http://www.scopus.com/inward/record.url?scp=85128278546&partnerID=8YFLogxK

U2 - 10.1109/TMM.2022.3165715

DO - 10.1109/TMM.2022.3165715

M3 - Article

AN - SCOPUS:85128278546

SN - 1520-9210

VL - 25

SP - 3773

EP - 3787

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

M1 - 9751421

ER -
