How phonotactics affect multilingual and zero-shot ASR performance

Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

The idea of combining recordings from multiple languages to train a single automatic speech recognition (ASR) model holds the promise of a universal speech representation emerging. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and that retaining only the target language’s phonotactic data in LM training is preferable.
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway
Publisher: IEEE
Pages: 7238-7242
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
ISBN (Print): 978-1-7281-7606-2
Publication status: Published - 2021
Event: ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada
Duration: 6 Jun 2021 → 11 Jun 2021

Conference

Conference: ICASSP 2021
Country: Canada
City: Virtual Conference/Toronto
Period: 6/06/21 → 11/06/21

Bibliographical note

Accepted author manuscript

Keywords

  • Automatic Speech Recognition
  • Multilingual
  • Phonotactics
  • Zero-shot learning
