TY - GEN
T1 - S2IGAN: Speech-to-Image Generation via Adversarial Learning
T2 - Interspeech 2020
AU - Wang, Xinsheng
AU - Qiao, Tingting
AU - Zhu, Jihua
AU - Hanjalic, Alan
AU - Scharenborg, Odette
PY - 2020
Y1 - 2020
AB - An estimated half of the world’s languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed that translates speech descriptions into photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding under the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on the CUB and Oxford-102 datasets demonstrate the effectiveness of the proposed S2IGAN in synthesizing high-quality and semantically consistent images from the speech signal, yielding good performance and a solid baseline for the S2IG task.
KW - Adversarial learning
KW - Multimodal modelling
KW - Speech embedding
KW - Speech-to-image generation
UR - http://www.scopus.com/inward/record.url?scp=85098124905&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1759
DO - 10.21437/Interspeech.2020-1759
M3 - Conference contribution
T3 - Interspeech 2020
SP - 2292
EP - 2296
BT - Proceedings of Interspeech 2020
PB - ISCA
Y2 - 25 October 2020 through 29 October 2020
ER -