Multimodal fusion of body movement signals for no-audio speech detection

Xinsheng Wang, Jihua Zhu, Odette Scharenborg

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


Abstract

No-audio Multimodal Speech Detection is one of the tasks in MediaEval 2020, whose goal is to automatically detect whether someone is speaking in a social interaction on the basis of body movement signals. In this paper, we propose a multimodal fusion method that combines signals obtained by an overhead camera and a wearable accelerometer to determine whether someone is speaking. The proposed system takes the accelerometer signals directly as input, while the video is first converted into features by a pre-trained 3D convolutional network. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.
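The fusion idea described in the abstract, that accelerometer signals enter the model directly while video is first reduced to features by a pre-trained 3D CNN, can be sketched as a late-fusion pipeline. The sketch below is illustrative only: the feature dimensions, the toy accelerometer encoder, and the linear scorer are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy inputs for one time window (shapes are assumptions, not from the paper):
accel = rng.standard_normal((60, 3))   # tri-axial accelerometer samples
video_feat = rng.standard_normal(512)  # features from a pre-trained 3D CNN

def accel_branch(x):
    """Toy accelerometer encoder: per-axis mean and std as a 6-dim feature."""
    return np.concatenate([x.mean(axis=0), x.std(axis=0)])

def fuse_and_score(accel_feat, video_feat, w, b):
    """Concatenate both modalities, then apply a linear speaking/no-speaking scorer."""
    fused = np.concatenate([accel_feat, video_feat])
    return 1.0 / (1.0 + np.exp(-(fused @ w + b)))  # sigmoid probability

a_feat = accel_branch(accel)
w = rng.standard_normal(a_feat.size + video_feat.size) * 0.01
p_speaking = fuse_and_score(a_feat, video_feat, w, b=0.0)
```

In a real system the toy encoder and linear scorer would be replaced by learned networks trained on the task's labeled windows; the key point is that the two modalities are mapped to feature vectors and combined before the final speaking/no-speaking decision.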

Original language: English
Title of host publication: MediaEval 2020: Multimedia Benchmark Workshop 2020
Subtitle of host publication: Working Notes Proceedings of the MediaEval 2020 Workshop
Editors: Steven Hicks, Debesh Jha, Konstantin Pogorelov
Number of pages: 3
Volume: 2882
Publication status: Published - 2020
Event: Multimedia Evaluation Benchmark Workshop 2020, MediaEval 2020 - Virtual, Online
Duration: 14 Dec 2020 - 15 Dec 2020

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS
ISSN (Print): 1613-0073

Conference

Conference: Multimedia Evaluation Benchmark Workshop 2020, MediaEval 2020
City: Virtual, Online
Period: 14/12/20 - 15/12/20
