The effectiveness of self-supervised representation learning in zero-resource subword modeling

Siyuan Feng, Odette Scharenborg

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


Abstract

For a language with no transcribed speech available (the zero-resource scenario), conventional acoustic modeling algorithms are not applicable. Recently, zero-resource acoustic modeling has gained much interest. One research problem is unsupervised subword modeling (USM), i.e., learning a feature representation that can distinguish subword units and is robust to speaker variation. Previous studies showed that self-supervised learning (SSL) has the potential to separate speaker and phonetic information in speech in an unsupervised manner, which is highly desirable for USM. This paper compares two representative SSL algorithms, contrastive predictive coding (CPC) and autoregressive predictive coding (APC), as the front-end of a recently proposed, state-of-the-art two-stage approach, in which the front-end learns a representation that is fed to a back-end cross-lingual DNN. Experiments show that the bottleneck features extracted by the back-end achieve state-of-the-art performance on the subword ABX task on the Libri-light and ZeroSpeech databases. In general, CPC is more effective than APC as the front-end in our approach, independently of both the choice of out-of-domain language for the back-end cross-lingual DNN and the amount of training data. With very limited training data, APC is found to be as effective as or more effective than CPC when the test data consists of long utterances.
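To make the contrast between the two SSL front-ends concrete, the following PyTorch-style sketch shows the core training objectives typically used for APC (L1 regression on a future frame) and CPC (InfoNCE contrastive loss). It is a minimal illustration, not the authors' code: all tensor shapes, module choices, and the simplification of drawing CPC negatives from other time steps of the same utterance are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def apc_loss(predictor, x, shift=3):
    # APC: autoregressively predict the frame `shift` steps ahead,
    # minimizing the L1 distance to the true future frame.
    pred, _ = predictor(x[:, :-shift])   # (B, T-shift, D)
    return F.l1_loss(pred, x[:, shift:])

def cpc_infonce_loss(z, c, proj, k=2):
    # CPC: from context c_t, score the true latent z_{t+k} against
    # latents at other time steps (InfoNCE contrastive objective).
    pred = proj(c[:, :-k])                               # (B, T-k, D)
    target = z[:, k:]                                    # (B, T-k, D)
    logits = torch.einsum('btd,bsd->bts', pred, target)  # (B, T-k, T-k)
    n = logits.size(1)
    labels = torch.arange(n, device=z.device).expand(logits.size(0), n).reshape(-1)
    return F.cross_entropy(logits.reshape(-1, n), labels)

# Toy usage on random data standing in for log-Mel frames.
B, T, D, H = 4, 100, 40, 64
x = torch.randn(B, T, D)
apc_rnn = nn.GRU(D, D, batch_first=True)   # toy autoregressive model
print('APC loss:', apc_loss(apc_rnn, x).item())

z = torch.randn(B, T, H)                   # stand-in encoder latents
ctx_rnn = nn.GRU(H, H, batch_first=True)   # toy context network
c, _ = ctx_rnn(z)
proj = nn.Linear(H, H)                     # toy step-k prediction head
print('CPC loss:', cpc_infonce_loss(z, c, proj).item())

In the two-stage approach described above, features produced by such a front-end would be passed to a cross-lingual DNN back-end, whose bottleneck-layer activations serve as the final subword representation.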
Original language: English
Title of host publication: 55th Asilomar Conference on Signals, Systems and Computers, ACSSC 2021
Subtitle of host publication: Proceedings
Editors: Michael B. Matthews
Publisher: IEEE
Pages: 1414-1418
Number of pages: 5
ISBN (Electronic): 978-1-6654-5828-3
ISBN (Print): 978-1-6654-5829-0
DOIs
Publication status: Published - 2021
Event: 2021 55th Asilomar Conference on Signals, Systems, and Computers - Pacific Grove, United States
Duration: 31 Oct 2021 - 3 Nov 2021
Conference number: 55th

Publication series

Name: Conference Record - Asilomar Conference on Signals, Systems and Computers
Volume: 2021-October
ISSN (Print): 1058-6393

Conference

Conference: 2021 55th Asilomar Conference on Signals, Systems, and Computers
Country/Territory: United States
City: Pacific Grove
Period: 31/10/21 - 3/11/21

Bibliographical note

Accepted author manuscript

Keywords

  • zero-resource
  • unsupervised subword learning
  • contrastive predictive coding
  • autoregressive predictive coding
  • cross-lingual modeling

