Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis

Hengshun Zhou; Jun Du; Gongzhen Zou; Zhaoxu Nian; Chin Hui Lee; Sabato Marco Siniscalchi; Shinji Watanabe; Odette Scharenborg; Jingdong Chen; More Authors

doi:10.21437/Interspeech.2022-10650

Hengshun Zhou, Jun Du^*, Gongzhen Zou, Zhaoxu Nian, Chin Hui Lee, Sabato Marco Siniscalchi, Shinji Watanabe, Odette Scharenborg, Jingdong Chen, More Authors

^*Corresponding author for this work

Multimedia Computing

Research output: Contribution to journal › Conference article › Scientific › peer-review

Abstract

In this paper, we describe and release publicly the audio-visual wake word spotting (WWS) database in the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays, and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated the different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analysis were conducted to observe the assistance from visual information to acoustic information based on different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.

Original language	English
Pages (from-to)	1111-1115
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2022-September
DOIs	https://doi.org/10.21437/Interspeech.2022-10650
Publication status	Published - 2022
Event	23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea. Duration: 18 Sept 2022 → 22 Sept 2022

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

analysis
audio-visual database
data augmentation
Wake word spotting

Access to Document

10.21437/Interspeech.2022-10650

zhou22g_interspeech: Final published version, 1.92 MB

Cite this

Zhou, H., Du, J., Zou, G., Nian, Z., Lee, C. H., Siniscalchi, S. M., Watanabe, S., Scharenborg, O., Chen, J., & More Authors (2022). Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 1111-1115. https://doi.org/10.21437/Interspeech.2022-10650

@article{74fc08160901423db917365574bf24a3,

title = "Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis",

abstract = "In this paper, we describe and release publicly the audio-visual wake word spotting (WWS) database in the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays, and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated the different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analysis were conducted to observe the assistance from visual information to acoustic information based on different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.",

keywords = "analysis, audio-visual database, data augmentation, Wake word spotting",

author = "Hengshun Zhou and Jun Du and Gongzhen Zou and Zhaoxu Nian and Lee, {Chin Hui} and Siniscalchi, {Sabato Marco} and Shinji Watanabe and Odette Scharenborg and Jingdong Chen and {More Authors}",

note = "Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.; 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 ; Conference date: 18-09-2022 Through 22-09-2022",

year = "2022",

doi = "10.21437/Interspeech.2022-10650",

language = "English",

volume = "2022-September",

pages = "1111--1115",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

Zhou, H, Du, J, Zou, G, Nian, Z, Lee, CH, Siniscalchi, SM, Watanabe, S, Scharenborg, O, Chen, J & More Authors 2022, 'Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2022-September, pp. 1111-1115. https://doi.org/10.21437/Interspeech.2022-10650

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis. / Zhou, Hengshun; Du, Jun; Zou, Gongzhen et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2022-September, 2022, p. 1111-1115.

TY - JOUR

T1 - Audio-Visual Wake Word Spotting in MISP2021 Challenge

T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022

AU - Zhou, Hengshun

AU - Du, Jun

AU - Zou, Gongzhen

AU - Nian, Zhaoxu

AU - Lee, Chin Hui

AU - Siniscalchi, Sabato Marco

AU - Watanabe, Shinji

AU - Scharenborg, Odette

AU - Chen, Jingdong

AU - More Authors

N1 - Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

PY - 2022

Y1 - 2022

N2 - In this paper, we describe and release publicly the audio-visual wake word spotting (WWS) database in the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays, and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated the different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analysis were conducted to observe the assistance from visual information to acoustic information based on different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.

AB - In this paper, we describe and release publicly the audio-visual wake word spotting (WWS) database in the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays, and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated the different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analysis were conducted to observe the assistance from visual information to acoustic information based on different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.

KW - analysis

KW - audio-visual database

KW - data augmentation

KW - Wake word spotting

UR - http://www.scopus.com/inward/record.url?scp=85140071120&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2022-10650

DO - 10.21437/Interspeech.2022-10650

M3 - Conference article

AN - SCOPUS:85140071120

SN - 2308-457X

VL - 2022-September

SP - 1111

EP - 1115

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Y2 - 18 September 2022 through 22 September 2022

ER -
