TY - GEN
T1 - That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages
T2 - INTERSPEECH 2020
AU - Żelasko, Piotr
AU - Moro-Velázquez, Laureano
AU - Hasegawa-Johnson, Mark
AU - Scharenborg, Odette
AU - Dehak, Najim
PY - 2020
Y1 - 2020
AB - Only a handful of the world’s languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are getting improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages — an encouraging result for the low-resource speech community.
KW - Crosslingual
KW - Multilingual
KW - Phone recognition
KW - Speech recognition
KW - Transfer learning
KW - Zero-shot
UR - http://www.scopus.com/inward/record.url?scp=85092138568&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2513
DO - 10.21437/Interspeech.2020-2513
M3 - Conference contribution
T3 - Interspeech 2020
SP - 3705
EP - 3709
BT - Proceedings of Interspeech 2020
PB - ISCA
Y2 - 25 October 2020 through 29 October 2020
ER -