Show and speak: Directly synthesize spoken description of images

Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

2 Citations (Scopus)
21 Downloads (Pure)

Abstract

This paper proposes a new model, referred to as the show and speak (SAS) model, which, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of the speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
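The abstract describes an encoder-decoder that maps an image directly to a speech spectrogram, with a WaveNet vocoder turning the spectrogram into audio. The sketch below is a minimal, hypothetical PyTorch illustration of that image-to-spectrogram idea only: the small CNN encoder, additive attention, GRU decoder, and all layer sizes are assumptions made for illustration and are not the SAS architecture reported in the paper; the WaveNet vocoder stage is omitted.

# Minimal, illustrative image-to-spectrogram encoder-decoder sketch.
# Assumptions (not the paper's model): a small CNN encoder, additive
# attention, a GRU decoder, and 80-dimensional mel frames.
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Encode an image into a sequence of spatial feature vectors."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.cnn(images)                     # (B, C, H', W')
        b, c, h, w = feats.shape
        return feats.view(b, c, h * w).transpose(1, 2)   # (B, H'*W', C)


class SpectrogramDecoder(nn.Module):
    """Autoregressively predict spectrogram frames while attending to image features."""
    def __init__(self, feat_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden, 1)
        self.rnn = nn.GRUCell(mel_dim + feat_dim, hidden)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, enc_feats, n_frames, mel_dim=80):
        b = enc_feats.size(0)
        h = enc_feats.new_zeros(b, self.rnn.hidden_size)
        frame = enc_feats.new_zeros(b, mel_dim)       # all-zero "go" frame
        outputs = []
        for _ in range(n_frames):
            # Additive attention over the spatial feature sequence.
            scores = self.attn(torch.cat(
                [enc_feats,
                 h.unsqueeze(1).expand(-1, enc_feats.size(1), -1)], dim=-1))
            weights = torch.softmax(scores, dim=1)    # (B, H'*W', 1)
            context = (weights * enc_feats).sum(dim=1)
            h = self.rnn(torch.cat([frame, context], dim=-1), h)
            frame = self.proj(h)
            outputs.append(frame)
        return torch.stack(outputs, dim=1)            # (B, n_frames, mel_dim)


# Usage: predict 100 spectrogram frames for a batch of two 128x128 images.
# A neural vocoder (WaveNet in the paper) would then convert these frames to audio.
encoder, decoder = ImageEncoder(), SpectrogramDecoder()
images = torch.randn(2, 3, 128, 128)
mel = decoder(encoder(images), n_frames=100)          # (2, 100, 80)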
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway
Publisher: IEEE
Pages: 4190-4194
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
ISBN (Print): 978-1-7281-7606-2
DOIs
Publication status: Published - 2021
Event: ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada
Duration: 6 Jun 2021 - 11 Jun 2021

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: ICASSP 2021
Country/Territory: Canada
City: Virtual Conference/Toronto
Period: 6/06/21 - 11/06/21

Bibliographical note

Accepted author manuscript

Keywords

  • Encoder-decoder
  • Image captioning
  • Image-to-speech
  • Sequence-to-sequence
  • Speech synthesis
