Evaluating automatically generated phoneme captions for images

Justin van der Hout; Zoltán D’Haese; Mark Hasegawa-Johnson; Odette Scharenborg

doi:10.21437/Interspeech.2020-2870

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead.
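
As a rough illustration of the evaluation procedure described in the abstract, the sketch below scores phoneme captions with a sentence-level BLEU4 and correlates the scores with human ratings. It is not the authors' implementation. The function names, the unsmoothed single-reference BLEU4 variant, the choice of Pearson correlation, and all phoneme strings and ratings in the demo are assumptions made up for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU4 with one reference and no smoothing.

    Tokens can be words or phonemes; the metric only sees opaque symbols.
    Returns 0.0 when any n-gram order (1 to 4) has no match.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / 4)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up phoneme sequences and ratings, for illustration only.
    reference = "DH AH0 D AO1 G R AH1 N Z".split()
    candidates = [
        "DH AH0 D AO1 G R AH1 N Z".split(),
        "DH AH0 D AO1 G S IH1 T S".split(),
        "AH0 K AE1 T".split(),
    ]
    human_ratings = [5, 3, 1]  # hypothetical scores, not from the paper
    scores = [bleu4(c, reference) for c in candidates]
    print("BLEU4 scores:", scores)
    print("Pearson r:", pearson(scores, human_ratings))
```

Because this BLEU4 variant is unsmoothed, short captions that share no 4-gram with the reference score exactly zero, so a smoothed variant may be preferable for short phoneme sequences.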

Original language	English
Title of host publication	Proceedings of Interspeech 2020
Publisher	ISCA
Pages	2317 - 2321
Number of pages	5
DOIs	https://doi.org/10.21437/Interspeech.2020-2870
Publication status	Published - 2020
Event	INTERSPEECH 2020 - Shanghai, China, 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Interspeech 2020
Publisher	ISCA
ISSN (Print)	1990-9772

Conference

Conference	INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Keywords

Image captioning
Speech
Unwritten languages

Access to Document

10.21437/Interspeech.2020-2870

vanderHout_Interspeech2020 (Accepted author manuscript, 280 KB)

Cite this

@inproceedings{d1373fda1ee342c096d2a2f44d668d82,

title = "Evaluating automatically generated phoneme captions for images",

abstract = "Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead.",

keywords = "Image captioning, Speech, Unwritten languages",

author = "{van der Hout}, Justin and D{\textquoteright}Haese, Zolt{\'a}n and Hasegawa-Johnson, Mark and Scharenborg, Odette",

year = "2020",

doi = "10.21437/Interspeech.2020-2870",

language = "English",

series = "Interspeech 2020",

publisher = "ISCA",

pages = "2317--2321",

booktitle = "Proceedings of Interspeech 2020",

note = "INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",

}

TY - GEN

T1 - Evaluating automatically generated phoneme captions for images

AU - van der Hout, Justin

AU - D’Haese, Zoltán

AU - Hasegawa-Johnson, Mark

AU - Scharenborg, Odette

PY - 2020

Y1 - 2020

AB - Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead.

KW - Image captioning

KW - Speech

KW - Unwritten languages

UR - http://www.scopus.com/inward/record.url?scp=85098111607&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-2870

DO - 10.21437/Interspeech.2020-2870

M3 - Conference contribution

T3 - Interspeech 2020

SP - 2317

EP - 2321

BT - Proceedings of Interspeech 2020

PB - ISCA

T2 - INTERSPEECH 2020

Y2 - 25 October 2020 through 29 October 2020

ER -
