The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

Siyuan Feng; Odette Scharenborg

doi:10.1109/OJSP.2021.3076914

Multimedia Computing

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive with or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme and articulatory feature (AF) levels showed that our approach was better at capturing diphthong than monophthong vowel information, and that the amount of information captured differed across types of consonants. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to the phoneme. The AF-level analysis together with t-SNE visualization results showed that the proposed approach is better than MFCC and APC features at capturing manner and place of articulation, vowel height, and backness information. Taken together, the analyses showed that both stages in our approach are effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving the capture of monophthong vowel information.
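
The ABX subword discriminability task mentioned in the abstract can be illustrated with a toy sketch. Given a triple (A, B, X) in which X belongs to the same category as A and a different category from B, a representation makes an error when X lies closer to B than to A. The sketch below is not the evaluation code used in the paper: it assumes each token is a single fixed-length vector compared by cosine distance, whereas evaluations of this kind typically align variable-length frame sequences before comparing them. The function names, the half-error rule for ties, and the toy data are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_error_rate(triples, distance=cosine_distance):
    """Fraction of (a, b, x) triples in which x is closer to b than to a.

    Each triple is assumed to be ordered so that x shares its category
    with a. Ties count as half an error (an assumption of this sketch).
    """
    errors = 0.0
    total = 0
    for a, b, x in triples:
        d_ax = distance(a, x)
        d_bx = distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
        total += 1
    if total == 0:
        raise ValueError("at least one triple is required")
    return errors / total


# Hypothetical 2-D "features": in each triple x is meant to match a.
toy_triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # x is closer to a: correct
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),  # x is closer to b: error
]

if __name__ == "__main__":
    print(abx_error_rate(toy_triples))
```

A lower error rate indicates that the representation separates the two categories better, which is the sense in which feature types such as MFCC, APC, and the back-end output are compared.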

Original language	English
Article number	9420327
Pages (from-to)	230 - 247
Number of pages	18
Journal	IEEE Open Journal of Signal Processing
Volume	2
DOIs	https://doi.org/10.1109/OJSP.2021.3076914
Publication status	Published - 2021

Keywords

Unsupervised subword modeling
zero-resource
cross-lingual modeling
phoneme analysis
articulatory feature analysis

Access to Document

10.1109/OJSP.2021.3076914

09420327: Final published version, 5.06 MB, Licence: CC BY

Cite this

@article{4e07f84d91ea4811a53a39721746481a,

title = "The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks",

abstract = "This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme- and articulatory feature (AF)-level showed that our approach was better at capturing diphthong than monophthong vowel information, while also differences in the amount of information captured for different types of consonants were observed. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to the phoneme. The AF-level analysis together with t-SNE visualization results showed that the proposed approach is better than MFCC and APC features in capturing manner and place of articulation information, vowel height, and backness information. Taken together, the analyses showed that the two stages in our approach are both effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving capturing monophthong vowel information.",

keywords = "Unsupervised subword modeling, zero-resource, cross-lingual modeling, phoneme analysis, articulatory feature analysis",

author = "Siyuan Feng and Odette Scharenborg",

year = "2021",

doi = "10.1109/OJSP.2021.3076914",

language = "English",

volume = "2",

pages = "230--247",

journal = "IEEE Open Journal of Signal Processing",

issn = "2644-1322",

publisher = "Institute of Electrical and Electronics Engineers (IEEE)",

}

TY - JOUR

T1 - The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

AU - Feng, Siyuan

AU - Scharenborg, Odette

PY - 2021

Y1 - 2021

N2 - This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme- and articulatory feature (AF)-level showed that our approach was better at capturing diphthong than monophthong vowel information, while also differences in the amount of information captured for different types of consonants were observed. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to the phoneme. The AF-level analysis together with t-SNE visualization results showed that the proposed approach is better than MFCC and APC features in capturing manner and place of articulation information, vowel height, and backness information. Taken together, the analyses showed that the two stages in our approach are both effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving capturing monophthong vowel information.

KW - Unsupervised subword modeling

KW - zero-resource

KW - cross-lingual modeling

KW - phoneme analysis

KW - articulatory feature analysis

U2 - 10.1109/OJSP.2021.3076914

DO - 10.1109/OJSP.2021.3076914

M3 - Article

SN - 2644-1322

VL - 2

SP - 230

EP - 247

JO - IEEE Open Journal of Signal Processing

JF - IEEE Open Journal of Signal Processing

M1 - 9420327

ER -
