TY - GEN
T1 - Improving child speech recognition with augmented child-like speech
AU - Zhang, Y.
AU - Yue, Z.
AU - Patel, T.B.
AU - Scharenborg, O.E.
PY - 2024
Y1 - 2024
N2 - State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced WERs by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best FT models.
KW - Child speech recognition
KW - Child-to-child voice conversion
KW - Cross-lingual voice conversion
KW - Data augmentation
UR - http://www.scopus.com/inward/record.url?scp=85205815413&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-485
DO - 10.21437/Interspeech.2024-485
M3 - Conference contribution
VL - 2024
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 5183
EP - 5187
BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
PB - Interspeech
T2 - INTERSPEECH 2024
Y2 - 1 September 2024 through 5 September 2024
ER -