Abstract
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.
Original language | English |
---|---|
Title of host publication | ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Place of Publication | Piscataway |
Publisher | IEEE |
Pages | 7603-7607 |
Number of pages | 5 |
ISBN (Electronic) | 978-1-7281-7605-5 |
ISBN (Print) | 978-1-7281-7606-2 |
Publication status | Published - 2021 |
Event | ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada; Duration: 6 Jun 2021 → 11 Jun 2021 |
Conference
Conference | ICASSP 2021 |
---|---|
Country/Territory | Canada |
City | Virtual Conference/Toronto |
Period | 6/06/21 → 11/06/21 |
Keywords
- Language acquisition
- Low-resource speech technology
- Multimodal learning
- Spoken term discovery