Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Research output: Chapter in Book/Conference proceedings/Edited volume > Conference contribution > Scientific > peer-review

Abstract

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some form of alignment or attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway
Publisher: IEEE
Pages: 7603-7607
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
ISBN (Print): 978-1-7281-7606-2
DOIs
Publication status: Published - 2021
Event: ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada
Duration: 6 Jun 2021 to 11 Jun 2021

Conference

Conference: ICASSP 2021
Country: Canada
City: Virtual Conference/Toronto
Period: 6/06/21 to 11/06/21

Keywords

  • Multimodal learning
  • spoken term discovery
  • language acquisition
  • low-resource speech technology
