TY - GEN
T1 - Unsupervised acoustic unit discovery by leveraging a language-independent subword discriminative feature representation
AU - Feng, Siyuan
AU - Zelasko, Piotr
AU - Moro-Velázquez, Laureano
AU - Scharenborg, Odette
PY - 2021
Y1 - 2021
N2 - This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation, and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method in the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus show that our approach outperforms state-of-the-art AUD in both normalized mutual information (NMI) and F-score. The multilingual ASR improved upon the monolingual ASR in providing OOD phone labels and in estimating the phone boundaries. A comparison of our systems with and without knowing the ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that the current approach can significantly benefit from improved phone boundary estimation.
AB - This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation, and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method in the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus show that our approach outperforms state-of-the-art AUD in both normalized mutual information (NMI) and F-score. The multilingual ASR improved upon the monolingual ASR in providing OOD phone labels and in estimating the phone boundaries. A comparison of our systems with and without knowing the ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that the current approach can significantly benefit from improved phone boundary estimation.
KW - Acoustic unit discovery
KW - Unsupervised subword modeling
KW - Zero-resource
UR - http://www.scopus.com/inward/record.url?scp=85119168788&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1664
DO - 10.21437/Interspeech.2021-1664
M3 - Conference contribution
AN - SCOPUS:85119168788
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1534
EP - 1538
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -