Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals

Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, Pablo Cesar

Research output: Contribution to journalArticleScientificpeer-review

8 Citations (Scopus)
71 Downloads (Pure)


Instead of predicting just one emotion for one activity (e.g., video watching), fine-grained emotion recognition enables more temporally precise recognition. Previous works on fine-grained emotion recognition require segment-by-segment, fine-grained emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with only collecting one emotion label after the user watched that stimulus (i.e., the post-stimuli emotion labels). To recognize emotions at a finer granularity level when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of fully-supervised training, the instances are weakly-supervised by the post-stimuli labels in the training stage. The V-A of instances are estimated by the instance gains, which indicate the probability of instances to predict the post-stimuli labels. We tested EDMIL on three different datasets, CASE, MERCA and CEAP-360VR, collected in three different environments: desktop, mobile and HMD-based Virtual Reality, respectively. Recognition results validated with the fine-grained V-A self-reports show that for subject-independent 3-class classification (high/neutral/low), EDMIL obtains promising recognition accuracies: 75.63% and 79.73% for V-A on CASE, 70.51% and 67.62% for V-A on MERCA and 65.04% and 67.05% for V-A on CEAP-360VR. Our ablation study shows that all components of EDMIL contribute to both the classification and regression tasks. Our experiments also show that (1) compared with fully-supervised learning, weakly-supervised learning can reduce the problem of overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals, (2) instance segment lengths between 1-2 s result in the highest recognition accuracies and (3) EDMIL performs best if post-stimuli annotations consist of less than 30% or more than 60% of the entire video watching.

Original languageEnglish
Pages (from-to)2304-2322
Number of pages19
JournalIEEE Transactions on Affective Computing
Issue number3
Publication statusPublished - 2023

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.


  • Annotations
  • deep multiple instance learning
  • Emotion recognition
  • emotion recognition
  • Feature extraction
  • physiological signals
  • Physiology
  • Solid modeling
  • Task analysis
  • temporal ambiguity
  • Training


Dive into the research topics of 'Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals'. Together they form a unique fingerprint.

Cite this