How phonotactics affect multilingual and zero-shot ASR performance

Siyuan Feng; Piotr Żelasko; Laureano Moro-Velázquez; Ali Abavisani; Mark Hasegawa-Johnson; Odette Scharenborg; Najim Dehak

doi:10.1109/ICASSP39728.2021.9414478

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.

Original language	English
Title of host publication	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication	Piscataway
Publisher	IEEE
Pages	7238-7242
Number of pages	5
ISBN (Electronic)	978-1-7281-7605-5
ISBN (Print)	978-1-7281-7606-2
DOIs	https://doi.org/10.1109/ICASSP39728.2021.9414478
Publication status	Published - 2021
Event	ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada; Duration: 6 Jun 2021 → 11 Jun 2021

Conference

Conference	ICASSP 2021
Country/Territory	Canada
City	Virtual Conference/Toronto
Period	6/06/21 → 11/06/21

Bibliographical note

Accepted author manuscript

Keywords

Automatic Speech Recognition
Multilingual
Phonotactics
Zero-shot learning

Access to Document

10.1109/ICASSP39728.2021.9414478

ICASSP2021_discophone (Accepted author manuscript, 383 KB)

Cite this

Feng, S., Żelasko, P., Moro-Velázquez, L., Abavisani, A., Hasegawa-Johnson, M., Scharenborg, O., & Dehak, N. (2021). How phonotactics affect multilingual and zero-shot ASR performance. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7238-7242). Article 9414478. IEEE. https://doi.org/10.1109/ICASSP39728.2021.9414478

@inproceedings{c035edc6688f4d16aa7745351266dba2,

title = "How phonotactics affect multilingual and zero-shot {ASR} performance",

abstract = "The idea of combining multiple languages{\textquoteright} recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system{\textquoteright}s performance, and retaining only the target language{\textquoteright}s phonotactic data in LM training is preferable.",

keywords = "Automatic Speech Recognition, Multilingual, Phonotactics, Zero-shot learning",

author = "Siyuan Feng and Piotr {\.Z}elasko and Laureano Moro-Vel{\'a}zquez and Ali Abavisani and Mark Hasegawa-Johnson and Odette Scharenborg and Najim Dehak",

note = "Accepted author manuscript; ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing; Conference date: 06-06-2021 through 11-06-2021",

year = "2021",

doi = "10.1109/ICASSP39728.2021.9414478",

language = "English",

isbn = "978-1-7281-7606-2",

pages = "7238--7242",

booktitle = "ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",

publisher = "IEEE",

address = "Piscataway",

}

Feng, S, Żelasko, P, Moro-Velázquez, L, Abavisani, A, Hasegawa-Johnson, M, Scharenborg, O & Dehak, N 2021, How phonotactics affect multilingual and zero-shot ASR performance. in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9414478, IEEE, Piscataway, pp. 7238-7242, ICASSP 2021, Virtual Conference/Toronto, Canada, 6/06/21. https://doi.org/10.1109/ICASSP39728.2021.9414478

How phonotactics affect multilingual and zero-shot ASR performance. / Feng, Siyuan; Żelasko, Piotr; Moro-Velázquez, Laureano et al.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2021. p. 7238-7242 9414478.

TY - GEN

T1 - How phonotactics affect multilingual and zero-shot ASR performance

AU - Feng, Siyuan

AU - Żelasko, Piotr

AU - Moro-Velázquez, Laureano

AU - Abavisani, Ali

AU - Hasegawa-Johnson, Mark

AU - Scharenborg, Odette

AU - Dehak, Najim

N1 - Accepted author manuscript

PY - 2021

Y1 - 2021

N2 - The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.

AB - The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.

KW - Automatic Speech Recognition

KW - Multilingual

KW - Phonotactics

KW - Zero-shot learning

UR - http://www.scopus.com/inward/record.url?scp=85106073237&partnerID=8YFLogxK

U2 - 10.1109/ICASSP39728.2021.9414478

DO - 10.1109/ICASSP39728.2021.9414478

M3 - Conference contribution

SN - 978-1-7281-7606-2

SP - 7238

EP - 7242

BT - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

PB - IEEE

CY - Piscataway

T2 - ICASSP 2021

Y2 - 6 June 2021 through 11 June 2021

ER -
