Synthesizing Spoken Descriptions of Images

Xinsheng Wang, Justin van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages because these languages lack a written form. To solve this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images while bypassing text, using an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study of the image-to-speech task in which 1) several representative image-to-text generation methods are implemented for the image-to-phoneme task, 2) objective metrics are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model that can synthesize spoken descriptions of images while bypassing both text and phonemes is proposed. Extensive experiments are conducted on the public benchmark database Flickr8k. The results demonstrate that 1) state-of-the-art image-to-text models can perform well on the image-to-phoneme task, 2) several evaluation metrics, including BLEU3, BLEU4, BLEU5, and ROUGE-L, can be used to evaluate image-to-phoneme performance, and 3) end-to-end image-to-speech bypassing text and phonemes is feasible.
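
As an illustration of how a text metric such as ROUGE-L can be applied to phoneme output, the sketch below scores a generated phoneme sequence against a reference using the longest common subsequence (LCS). It is a minimal sketch, not the authors' evaluation code: the function names, the `beta` weighting parameter, and the ARPAbet-style phoneme symbols are assumptions made for the example, and the abstract does not state the exact scoring configuration used in the experiments.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str], beta: float = 1.0) -> float:
    """ROUGE-L F-score over phoneme tokens (beta is an assumed weighting)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Hypothetical phoneme sequences (ARPAbet-style symbols, chosen for illustration).
reference = ["AH", "D", "AO", "G", "R", "AH", "N", "Z"]
candidate = ["AH", "D", "AO", "G", "R", "AE", "N"]
score = rouge_l(candidate, reference)
```

Treating each phoneme as a token lets the metric run unchanged on languages without a written form, which is the setting the abstract targets.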
Original language: English
Article number: 9581052
Pages (from-to): 3242-3254
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 29
Publication status: Published - 2021

Bibliographical note


Accepted author manuscript

Keywords

  • Speech processing
  • Image-to-speech generation
  • Multimodal modelling
  • Speech synthesis
  • Cross-modal captioning
