The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition

Zhe Wang; Shilong Wu; Hang Chen; Mao-Kui He; Jun Du; Chin-Hui Lee; Jingdong Chen; Shinji Watanabe; Sabato Marco Siniscalchi; Odette Scharenborg; Diyuan Liu; More Authors

doi:10.1109/ICASSP49357.2023.10094836

Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du^*, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Diyuan Liu, More Authors

^*Corresponding author for this work

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" with audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.

Original language	English
Title of host publication	Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication	Piscataway
Publisher	IEEE
Number of pages	5
ISBN (Electronic)	978-1-7281-6327-7
ISBN (Print)	978-1-7281-6328-4
DOIs	https://doi.org/10.1109/ICASSP49357.2023.10094836
Publication status	Published - 2023
Event	48th IEEE International Conference on Acoustics, Speech and Signal Processing 2023 - Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2023-June
ISSN (Print)	1520-6149

Conference

Conference	48th IEEE International Conference on Acoustics, Speech and Signal Processing 2023
Abbreviated title	ICASSP 2023
Country/Territory	Greece
City	Rhodes Island
Period	4/06/23 → 10/06/23

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

MISP challenge
speaker diarization
speech recognition
multimodality

Access to Document

10.1109/ICASSP49357.2023.10094836

The_Multimodal_Information_Based_Speech_Processing_Misp_2022_Challenge_Audio-Visual_Diarization_And_Recognition (Final published version, 1.12 MB)

Cite this

Wang, Z., Wu, S., Chen, H., He, M.-K., Du, J., Lee, C.-H., Chen, J., Watanabe, S., Siniscalchi, S. M., Scharenborg, O., Liu, D., & More Authors (2023). The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition. In Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). IEEE. https://doi.org/10.1109/ICASSP49357.2023.10094836

@inproceedings{3e54aaa046f84411a5ca351a314d73ce,
title = "The Multimodal Information Based Speech Processing (Misp) 2022 Challenge: Audio-Visual Diarization And Recognition",
abstract = "The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve {"}who spoke when{"} using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing {"}who spoke what when{"} with audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.",
keywords = "MISP challenge, speaker diarization, speech recognition, multimodality",
author = "Zhe Wang and Shilong Wu and Hang Chen and Mao-Kui He and Jun Du and Chin-Hui Lee and Jingdong Chen and Shinji Watanabe and Siniscalchi, {Sabato Marco} and Odette Scharenborg and Diyuan Liu and {More Authors}",
note = "Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.; 48th IEEE International Conference on Acoustics, Speech and Signal Processing 2023, ICASSP 2023 ; Conference date: 04-06-2023 Through 10-06-2023",
year = "2023",
doi = "10.1109/ICASSP49357.2023.10094836",
language = "English",
isbn = "978-1-7281-6328-4",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "IEEE",
booktitle = "Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
address = "United States",
}

TY - GEN
T1 - The Multimodal Information Based Speech Processing (Misp) 2022 Challenge
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing 2023
AU - Wang, Zhe
AU - Wu, Shilong
AU - Chen, Hang
AU - He, Mao-Kui
AU - Du, Jun
AU - Lee, Chin-Hui
AU - Chen, Jingdong
AU - Watanabe, Shinji
AU - Siniscalchi, Sabato Marco
AU - Scharenborg, Odette
AU - Liu, Diyuan
AU - More Authors
N1 - Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
PY - 2023
Y1 - 2023
AB - The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" with audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.
KW - MISP challenge
KW - speaker diarization
KW - speech recognition
KW - multimodality
UR - http://www.scopus.com/inward/record.url?scp=85177603071&partnerID=8YFLogxK
U2 - 10.1109/ICASSP49357.2023.10094836
DO - 10.1109/ICASSP49357.2023.10094836
M3 - Conference contribution
SN - 978-1-7281-6328-4
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - Proceedings of the ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PB - IEEE
CY - Piscataway
Y2 - 4 June 2023 through 10 June 2023
ER -

Prizes

Top 3% best paper ICASSP 2023
