Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Liming Wang; Xinsheng Wang; Mark Hasegawa-Johnson; Odette Scharenborg; Najim Dehak

doi:10.1109/ICASSP39728.2021.9414418

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

7 Citations (Scopus)

Abstract

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.

Original language	English
Title of host publication	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication	Piscataway
Publisher	IEEE
Pages	7603-7607
Number of pages	5
ISBN (Electronic)	978-1-7281-7605-5
ISBN (Print)	978-1-7281-7606-2
DOIs	https://doi.org/10.1109/ICASSP39728.2021.9414418
Publication status	Published - 2021
Event	ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada; Duration: 6 Jun 2021 → 11 Jun 2021

Conference

Conference	ICASSP 2021
Country/Territory	Canada
City	Virtual Conference/Toronto
Period	6/06/21 → 11/06/21

Keywords

Language acquisition
Low-resource speech technology
Multimodal learning
Spoken term discovery

Cite this

Wang, L., Wang, X., Hasegawa-Johnson, M., Scharenborg, O., & Dehak, N. (2021). Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7603-7607). Article 9414418 IEEE. https://doi.org/10.1109/ICASSP39728.2021.9414418

@inproceedings{043e055f194b4f66a3998e76e141429a,

title = "Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval",

abstract = "Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.",

keywords = "Language acquisition, Low-resource speech technology, Multimodal learning, Spoken term discovery",

author = "Liming Wang and Xinsheng Wang and Mark Hasegawa-Johnson and Odette Scharenborg and Najim Dehak",

year = "2021",

doi = "10.1109/ICASSP39728.2021.9414418",

language = "English",

isbn = "978-1-7281-7606-2",

pages = "7603--7607",

booktitle = "ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",

publisher = "IEEE",

address = "United States",

note = "ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing; Conference date: 06-06-2021 through 11-06-2021",

}

Wang, L, Wang, X, Hasegawa-Johnson, M, Scharenborg, O & Dehak, N 2021, Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval. in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9414418, IEEE, Piscataway, pp. 7603-7607, ICASSP 2021, Virtual Conference/Toronto, Canada, 6/06/21. https://doi.org/10.1109/ICASSP39728.2021.9414418

Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval. / Wang, Liming; Wang, Xinsheng; Hasegawa-Johnson, Mark et al.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2021. p. 7603-7607 9414418.

TY - GEN

T1 - Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

T2 - ICASSP 2021

AU - Wang, Liming

AU - Wang, Xinsheng

AU - Hasegawa-Johnson, Mark

AU - Scharenborg, Odette

AU - Dehak, Najim

PY - 2021

Y1 - 2021

N2 - Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.

KW - Language acquisition

KW - Low-resource speech technology

KW - Multimodal learning

KW - Spoken term discovery

UR - http://www.scopus.com/inward/record.url?scp=85112704804&partnerID=8YFLogxK

U2 - 10.1109/ICASSP39728.2021.9414418

DO - 10.1109/ICASSP39728.2021.9414418

M3 - Conference contribution

SN - 978-1-7281-7606-2

SP - 7603

EP - 7607

BT - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

PB - IEEE

CY - Piscataway

Y2 - 6 June 2021 through 11 June 2021

ER -