Abstract
A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or on classification: the former maps image and text instances into a common embedding space for distance measurement, while the latter treats image-text matching as a binary classification problem. Neither approach, however, balances matching accuracy and model complexity well. We propose a novel framework that achieves strong matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) that explicitly learns an accurate image-text similarity function with rank-based tensor fusion, rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring any additional training procedure. Extensive experiments on two datasets demonstrate that our MTFN-RR consistently achieves state-of-the-art matching performance with much lower time complexity.
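To make the two-stage idea concrete, below is a minimal sketch of (i) a rank-r bilinear fusion head that scores an image-text pair directly, and (ii) a training-free cross-modal re-ranking pass that refines an existing similarity matrix. This is not the authors' released code: the feature dimensions, the rank `r`, the tanh/sigmoid choices, and the names `RankTensorFusionSimilarity` and `rerank_i2t` are illustrative assumptions; the exact MTFN-RR formulation is in the paper.

```python
import torch
import torch.nn as nn


class RankTensorFusionSimilarity(nn.Module):
    """Illustrative rank-r bilinear fusion head (assumed form, not the
    paper's exact architecture): score(v, t) = sigmoid(w^T (Uv * Wt))."""

    def __init__(self, img_dim=2048, txt_dim=1024, rank=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, rank)  # U: image factor projection
        self.txt_proj = nn.Linear(txt_dim, rank)  # W: text factor projection
        self.score = nn.Linear(rank, 1)           # w: scores the fused factors

    def forward(self, img_feat, txt_feat):
        # Elementwise product of the rank-r factors fuses the two modalities
        # without materialising a full (img_dim x txt_dim) bilinear tensor.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.txt_proj(txt_feat))
        return torch.sigmoid(self.score(fused)).squeeze(-1)  # similarity in (0, 1)


def rerank_i2t(sim, k=10):
    """One plausible training-free re-ranking pass: re-order each image's
    top-k sentences by how highly the image itself ranks when each candidate
    sentence is used as a text-to-image query."""
    t2i_rank = (-sim).argsort(dim=0).argsort(dim=0)  # rank of each image per sentence
    order = (-sim).argsort(dim=1)                    # initial i2t ordering per image
    for i in range(sim.size(0)):
        topk = order[i, :k].clone()
        order[i, :k] = topk[torch.argsort(t2i_rank[i, topk])]
    return order


if __name__ == "__main__":
    model = RankTensorFusionSimilarity()
    scores = model(torch.randn(8, 2048), torch.randn(8, 1024))  # 8 paired scores
    print(scores.shape)
    full_sim = torch.rand(100, 500)  # stand-in image x sentence similarity matrix
    print(rerank_i2t(full_sim, k=10)[:2, :5])
```

The elementwise fusion keeps the bilinear interaction at O(rank) parameters per modality rather than a full interaction tensor, which is one way such a similarity head can stay lightweight; the re-ranking step merely permutes an existing similarity matrix, so it adds no trainable parameters, consistent with the abstract's claim that refinement needs no extra training.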
Original language | English |
---|---|
Title of host publication | MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia |
Publisher | ACM |
Pages | 12-20 |
Number of pages | 9 |
ISBN (Electronic) | 9781450368896 |
DOIs | |
Publication status | Published - 15 Oct 2019 |
Event | 27th ACM International Conference on Multimedia, MM 2019 - Nice, France. Duration: 21 Oct 2019 → 25 Oct 2019 |
Conference
Conference | 27th ACM International Conference on Multimedia, MM 2019 |
---|---|
Country/Territory | France |
City | Nice |
Period | 21/10/19 → 25/10/19 |
Bibliographical note
Accepted author manuscript
Keywords
- Cross-modal re-ranking
- Image-text matching
- Tensor fusion