Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

7 Citations (Scopus)


Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
ISBN (Print): 978-1-7281-7606-2
Publication status: Published - 2021
Event: ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada
Duration: 6 Jun 2021 to 11 Jun 2021


Conference: ICASSP 2021
City: Virtual Conference/Toronto


Keywords:
  • Language acquisition
  • Low-resource speech technology
  • Multimodal learning
  • Spoken term discovery

