TY - JOUR
T1 - Few-shot Learning for Fine-grained Emotion Recognition using Physiological Signals
AU - Zhang, Tianyi
AU - El Ali, Abdallah
AU - Hanjalic, Alan
AU - Cesar, Pablo
PY - 2022
N2 - Fine-grained emotion recognition models the temporal dynamics of emotions, making it temporally more precise than predicting a single emotion for an entire activity (e.g., watching a video clip). Previous work requires large amounts of continuously annotated data to train an accurate recognition model. However, experiments to collect large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose EmoDSN, a few-shot learning algorithm that converges rapidly on small amounts of training data (typically fewer than 10 samples per class, i.e., < 10-shot) for fine-grained emotion recognition. EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three datasets, CASE, MERCA, and CEAP-360VR, collected in three different environments: desktop, mobile, and HMD-based virtual reality, respectively. Our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of the V-A space plus neutral, 2D-5C) classification. Using only 5 shots of training data, we obtain average accuracies of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively. We also find that EmoDSN achieves better recognition results with fewer annotated samples if the training samples are selected from the changing points of emotion and the ending moments of video watching.
KW - emotion recognition
KW - deep siamese network
KW - physiological signals
KW - small data
UR - http://www.scopus.com/inward/record.url?scp=85128278546&partnerID=8YFLogxK
DO - 10.1109/TMM.2022.3165715
M3 - Article
AN - SCOPUS:85128278546
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
SN - 1520-9210
M1 - 9751421
ER -