Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals

Tianyi Zhang; Abdallah El Ali; Chen Wang; Alan Hanjalic; Pablo Cesar

doi:10.1109/TAFFC.2022.3158234

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Instead of predicting a single emotion for an entire activity (e.g., video watching), fine-grained emotion recognition enables more temporally precise recognition. Previous work on fine-grained emotion recognition requires segment-by-segment, fine-grained emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with collecting only one emotion label after the user has watched the stimulus (i.e., the post-stimuli emotion label). To recognize emotions at a finer granularity when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of being trained with full supervision, the instances are weakly supervised by the post-stimuli labels in the training stage. The V-A of each instance is estimated from its instance gain, which indicates the probability that the instance predicts the post-stimuli label. We tested EDMIL on three datasets, CASE, MERCA and CEAP-360VR, collected in three different environments: desktop, mobile and HMD-based Virtual Reality, respectively. Recognition results validated against the fine-grained V-A self-reports show that for subject-independent 3-class classification (high/neutral/low), EDMIL obtains promising recognition accuracies: 75.63% and 79.73% for valence and arousal, respectively, on CASE, 70.51% and 67.62% on MERCA and 65.04% and 67.05% on CEAP-360VR. Our ablation study shows that all components of EDMIL contribute to both the classification and regression tasks.
Our experiments also show that (1) compared with fully-supervised learning, weakly-supervised learning can reduce the overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals, (2) instance segment lengths of 1-2 s yield the highest recognition accuracies and (3) EDMIL performs best when the post-stimuli annotations account for less than 30% or more than 60% of the entire video-watching duration.
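
The multiple-instance framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' EDMIL implementation: the sampling rate, the synthetic signal, the random weights, and the helper names `make_bag` and `attention_pool` are all assumptions made for illustration, and the pooling shown is a generic attention-based MIL pooling rather than the paper's instance gain. The sketch only shows how one recording becomes a bag of short instances that share a single post-stimuli label, and how per-instance weights can be read off after pooling.

```python
import numpy as np


def make_bag(signal, fs, instance_seconds):
    """Split a 1-D signal into non-overlapping, fixed-length instances.

    Trailing samples that do not fill a whole instance are dropped.
    """
    n = int(round(fs * instance_seconds))
    count = len(signal) // n
    return signal[: count * n].reshape(count, n)


def attention_pool(instances, w_embed, w_attn):
    """Generic attention-based MIL pooling (hypothetical stand-in, not instance gain).

    Returns one bag-level embedding and one normalized weight per instance.
    """
    h = np.tanh(instances @ w_embed)      # (num_instances, d) instance embeddings
    scores = h @ w_attn                   # (num_instances,) unnormalized scores
    scores = scores - scores.max()        # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    bag_embedding = weights @ h           # (d,) weighted sum of embeddings
    return bag_embedding, weights


rng = np.random.default_rng(0)
fs = 50                                   # assumed sampling rate in Hz
signal = rng.standard_normal(fs * 60)     # synthetic 60 s recording
bag = make_bag(signal, fs, 2.0)           # 2 s instances, within the 1-2 s range

# Untrained random parameters, used only to make the sketch runnable.
w_embed = rng.standard_normal((bag.shape[1], 8)) * 0.1
w_attn = rng.standard_normal(8)
bag_embedding, weights = attention_pool(bag, w_embed, w_attn)
```

In a trained model, the bag-level embedding would be fed to a classifier supervised only by the post-stimuli label, and the per-instance weights would indicate which segments contribute most to that prediction.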

Original language	English
Pages (from-to)	2304-2322
Number of pages	19
Journal	IEEE Transactions on Affective Computing
Volume	14
Issue number	3
DOIs	https://doi.org/10.1109/TAFFC.2022.3158234
Publication status	Published - 2023

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

Annotations
deep multiple instance learning
Emotion recognition
Feature extraction
physiological signals
Physiology
Solid modeling
Task analysis
temporal ambiguity
Training

Access to Document

10.1109/TAFFC.2022.3158234

Weakly-Supervised_Learning_for_Fine-Grained_Emotion_Recognition_Using_Physiological_Signals (Final published version, 1.49 MB)

Cite this

@article{e978f7eb09db4c1aa89748dacb1bf57d,

title = "Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals",

keywords = "Annotations, deep multiple instance learning, Emotion recognition, emotion recognition, Feature extraction, physiological signals, Physiology, Solid modeling, Task analysis, temporal ambiguity, Training",

author = "Tianyi Zhang and {El Ali}, Abdallah and Chen Wang and Alan Hanjalic and Pablo Cesar",

note = "Green Open Access added to TU Delft Institutional Repository {\textquoteleft}You share, we take care!{\textquoteright} – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public. ",

year = "2023",

doi = "10.1109/TAFFC.2022.3158234",

language = "English",

volume = "14",

pages = "2304--2322",

journal = "IEEE Transactions on Affective Computing",

issn = "1949-3045",

publisher = "Institute of Electrical and Electronics Engineers (IEEE)",

number = "3",

}

TY - JOUR

T1 - Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals

AU - Zhang, Tianyi

AU - El Ali, Abdallah

AU - Wang, Chen

AU - Hanjalic, Alan

AU - Cesar, Pablo

N1 - Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

PY - 2023

Y1 - 2023

KW - Annotations

KW - deep multiple instance learning

KW - Emotion recognition

KW - emotion recognition

KW - Feature extraction

KW - physiological signals

KW - Physiology

KW - Solid modeling

KW - Task analysis

KW - temporal ambiguity

KW - Training

UR - http://www.scopus.com/inward/record.url?scp=85126288132&partnerID=8YFLogxK

U2 - 10.1109/TAFFC.2022.3158234

DO - 10.1109/TAFFC.2022.3158234

M3 - Article

AN - SCOPUS:85126288132

SN - 1949-3045

VL - 14

SP - 2304

EP - 2322

JO - IEEE Transactions on Affective Computing

JF - IEEE Transactions on Affective Computing

IS - 3

ER -
