Show and speak: Directly synthesize spoken description of images

Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

2 Citations (Scopus)
21 Downloads (Pure)

Abstract

This paper proposes a new model, referred to as the show and speak (SAS) model, which, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of the speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
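The abstract describes an encoder-decoder that maps an image directly to a speech spectrogram, with a WaveNet vocoder turning the spectrogram into audio. The sketch below is a minimal, hypothetical PyTorch illustration of that image-to-spectrogram idea only: the small CNN encoder, additive attention, GRU decoder, and all layer sizes are assumptions made for illustration and are not the SAS architecture reported in the paper; the WaveNet vocoder stage is omitted.

# Minimal, illustrative image-to-spectrogram encoder-decoder sketch.
# Assumptions (not the paper's model): a small CNN encoder, additive
# attention, a GRU decoder, and 80-dimensional mel frames.
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Encode an image into a sequence of spatial feature vectors."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.cnn(images)                     # (B, C, H', W')
        b, c, h, w = feats.shape
        return feats.view(b, c, h * w).transpose(1, 2)   # (B, H'*W', C)


class SpectrogramDecoder(nn.Module):
    """Autoregressively predict spectrogram frames while attending to image features."""
    def __init__(self, feat_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden, 1)
        self.rnn = nn.GRUCell(mel_dim + feat_dim, hidden)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, enc_feats, n_frames, mel_dim=80):
        b = enc_feats.size(0)
        h = enc_feats.new_zeros(b, self.rnn.hidden_size)
        frame = enc_feats.new_zeros(b, mel_dim)       # all-zero "go" frame
        outputs = []
        for _ in range(n_frames):
            # Additive attention over the spatial feature sequence.
            scores = self.attn(torch.cat(
                [enc_feats,
                 h.unsqueeze(1).expand(-1, enc_feats.size(1), -1)], dim=-1))
            weights = torch.softmax(scores, dim=1)    # (B, H'*W', 1)
            context = (weights * enc_feats).sum(dim=1)
            h = self.rnn(torch.cat([frame, context], dim=-1), h)
            frame = self.proj(h)
            outputs.append(frame)
        return torch.stack(outputs, dim=1)            # (B, n_frames, mel_dim)


# Usage: predict 100 spectrogram frames for a batch of two 128x128 images.
# A neural vocoder (WaveNet in the paper) would then convert these frames to audio.
encoder, decoder = ImageEncoder(), SpectrogramDecoder()
images = torch.randn(2, 3, 128, 128)
mel = decoder(encoder(images), n_frames=100)          # (2, 100, 80)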
Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Place of Publication: Piscataway
Publisher: IEEE
Pages: 4190-4194
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
ISBN (Print): 978-1-7281-7606-2
DOIs
Publication status: Published - 2021
Event: ICASSP 2021: The IEEE International Conference on Acoustics, Speech, and Signal Processing - Virtual Conference/Toronto, Canada
Duration: 6 Jun 2021 - 11 Jun 2021

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: ICASSP 2021
Country/Territory: Canada
City: Virtual Conference/Toronto
Period: 6/06/21 - 11/06/21

Bibliographical note

Accepted author manuscript

Keywords

  • Encoder-decoder
  • Image captioning
  • Image-to-speech
  • Sequence-to-sequence
  • Speech synthesis
