Generating Images from Spoken Descriptions

Xinsheng Wang; Tingting Qiao; Jihua Zhu; Alan Hanjalic; Odette Scharenborg

doi:10.1109/TASLP.2021.3053391

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Text-based technologies, such as text translation from one language to another and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to lack a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, speech-to-image generation (S2IG), and 2) a framework for this task that translates speech descriptions into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings under the supervision of the corresponding visual information from images. The relation-supervised densely-stacked generative model, conditioned on the speech embeddings produced by the speech embedding network, synthesizes images that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two that are commonly used in text-to-image generation, CUB-200 and Oxford-102, for which we created synthesized speech descriptions, and two with natural speech descriptions that are often used in cross-modal learning of speech and images, Flickr8k and Places. Results on these databases demonstrate the effectiveness of S2IGAN in synthesizing high-quality, semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.

Original language	English
Article number	9333641
Pages (from-to)	850-865
Number of pages	16
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	29
DOIs	https://doi.org/10.1109/TASLP.2021.3053391
Publication status	Published - 2021

Keywords

adversarial learning
Birds
Databases
Electronic mail
Image synthesis
multimodal modelling
Semantics
speech embedding
Speech processing
Speech-to-image generation
Task analysis

Cite this

@article{b1ba78ac15c94e79b978d58c720810b2,
title = "Generating Images from Spoken Descriptions",
abstract = "Text-based technologies, such as text translation from one language to another, and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to be lacking a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model synthesizes images, conditioned on the speech embeddings produced by the speech embedding network, that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases that are commonly used in text-to-image generation tasks, i.e., CUB-200 and Oxford-102 for which we created synthesized speech descriptions, and two databases with natural speech descriptions which are often used in the field of cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN on synthesizing high-quality and semantically-consistent images from the speech signal, yielding a good performance and a solid baseline for the S2IG task.",
keywords = "adversarial learning, Birds, Databases, Electronic mail, Image synthesis, multimodal modelling, Semantics, speech embedding, Speech processing, Speech-to-image generation, Task analysis",
author = "Xinsheng Wang and Tingting Qiao and Jihua Zhu and Alan Hanjalic and Odette Scharenborg",
year = "2021",
doi = "10.1109/TASLP.2021.3053391",
language = "English",
volume = "29",
pages = "850--865",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE",
}

TY - JOUR
T1 - Generating Images from Spoken Descriptions
AU - Wang, Xinsheng
AU - Qiao, Tingting
AU - Zhu, Jihua
AU - Hanjalic, Alan
AU - Scharenborg, Odette
PY - 2021
Y1 - 2021
N2 - Text-based technologies, such as text translation from one language to another, and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to be lacking a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model synthesizes images, conditioned on the speech embeddings produced by the speech embedding network, that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases that are commonly used in text-to-image generation tasks, i.e., CUB-200 and Oxford-102 for which we created synthesized speech descriptions, and two databases with natural speech descriptions which are often used in the field of cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN on synthesizing high-quality and semantically-consistent images from the speech signal, yielding a good performance and a solid baseline for the S2IG task.
KW - adversarial learning
KW - Birds
KW - Databases
KW - Electronic mail
KW - Image synthesis
KW - multimodal modelling
KW - Semantics
KW - speech embedding
KW - Speech processing
KW - Speech-to-image generation
KW - Task analysis
UR - http://www.scopus.com/inward/record.url?scp=85100448681&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2021.3053391
DO - 10.1109/TASLP.2021.3053391
M3 - Article
AN - SCOPUS:85100448681
SN - 2329-9290
VL - 29
SP - 850
EP - 865
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9333641
ER -
