Generating Images from Spoken Descriptions

Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

Research output: Contribution to journal › Article › Scientific › peer-review



Text-based technologies, such as translation of text from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form, so these languages cannot benefit from text-based technologies. This paper presents a new speech technology task, speech-to-image generation (S2IG), together with a framework that translates speech descriptions into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model, conditioned on the speech embeddings produced by the speech embedding network, synthesizes images that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two that are commonly used in text-to-image generation tasks, i.e., CUB-200 and Oxford-102, for which we created synthesized speech descriptions, and two with natural speech descriptions that are often used in cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality and semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
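The two-stage data flow described in the abstract (speech features to an embedding, then an embedding plus noise to an image) can be illustrated with a minimal sketch. This is not the S2IGAN architecture: the mean-pooling encoder, the single linear generator layer, the random weights, and all names and sizes (`embed_speech`, `generate_image`, `FEATS`, `EMB`, `NOISE`) are assumptions chosen only to show how the two components connect without any text in between.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def embed_speech(frames, weights):
    """Toy speech encoder: mean-pool the frames over time, then project."""
    n_feats = len(frames[0])
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(n_feats)]
    return matvec(weights, pooled)

def generate_image(embedding, noise, weights, height, width):
    """Toy conditional generator: maps [embedding; noise] to pixel values."""
    flat = matvec(weights, embedding + noise)
    pixels = [1.0 / (1.0 + math.exp(-v)) for v in flat]  # squash with a sigmoid
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

rng = random.Random(0)
FEATS, EMB, NOISE, H, W = 40, 16, 8, 4, 4  # hypothetical sizes
enc_w = make_matrix(EMB, FEATS, rng)
gen_w = make_matrix(H * W, EMB + NOISE, rng)

# 50 frames of made-up acoustic features stand in for a spoken description.
frames = [[rng.gauss(0.0, 1.0) for _ in range(FEATS)] for _ in range(50)]
embedding = embed_speech(frames, enc_w)
noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE)]
image = generate_image(embedding, noise, gen_w, H, W)
```

In this sketch the generator only ever sees the speech embedding and a noise vector, which mirrors the text-free conditioning the abstract describes; the visual supervision of the encoder and the adversarial training of the generator are omitted entirely.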

Original language: English
Article number: 9333641
Pages (from-to): 850-865
Number of pages: 16
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication status: Published - 2021


  • adversarial learning
  • Birds
  • Databases
  • Electronic mail
  • Image synthesis
  • multimodal modelling
  • Semantics
  • speech embedding
  • Speech processing
  • Speech-to-image generation
  • Task analysis

