Adversarial Cross-Modal Retrieval

Bokun Wang; Yang Yang; Xu Xing; Alan Hanjalic; Heng Tao Shen

doi:10.1145/3123266.3123326

Bokun Wang, Yang Yang^*, Xu Xing, Alan Hanjalic, Heng Tao Shen

^*Corresponding author for this work

Multimedia Computing

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

617 Citations (Scopus)

470 Downloads (Pure)

Abstract

Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with the same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representations and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.
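
The interplay described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption made for illustration: the Euclidean distance, the hinge form of the triplet term, the `margin` and `weight` parameters, and all function names are hypothetical and are not taken from the paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet term (assumed form).

    Zero once the same-label item from the other modality is closer to the
    anchor than the different-label item by at least `margin`.
    """
    return max(0.0, l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin)


def modality_classifier_loss(prob_is_image, is_image):
    """Binary cross-entropy of the classifier's 'this is an image' probability."""
    p = min(max(prob_is_image, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -math.log(p) if is_image else -math.log(1.0 - p)


def projector_objective(triplet_terms, classifier_terms, weight=1.0):
    """Quantity the feature projector would minimize in this sketch.

    The classifier loss enters with a minus sign: the projector is rewarded
    when the modality classifier is confused.
    """
    return sum(triplet_terms) - weight * sum(classifier_terms)


# Toy embeddings in a 2-D common subspace (made-up numbers).
image_emb = [0.0, 0.0]
text_same_label = [0.0, 1.0]
text_diff_label = [0.0, 1.5]

t = triplet_loss(image_emb, text_same_label, text_diff_label)
c = modality_classifier_loss(0.5, is_image=True)  # classifier is maximally unsure
print(t, c, projector_objective([t], [c]))
```

In this toy setting the triplet term is positive because the differently labelled text lies within the margin of the matching one, while a classifier output of 0.5 corresponds to the modality-invariant case the projector is trying to reach.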

Original language	English
Title of host publication	MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
Place of Publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	154-162
Number of pages	9
ISBN (Electronic)	978-1-4503-4906-2
DOIs	https://doi.org/10.1145/3123266.3123326
Publication status	Published - 2017
Event	MM'17: 25th ACM Multimedia Conference, Computer History Museum, Mountain View, CA, United States. Duration: 23 Oct 2017 → 27 Oct 2017. Conference number: 25. http://www.acmmm.org/2017/

Conference

Conference	MM'17
Abbreviated title	ACM MM 2017
Country/Territory	United States
City	Mountain View, CA
Period	23/10/17 → 27/10/17
Internet address	http://www.acmmm.org/2017/

Keywords

Adversarial learning
Cross-modal retrieval
Modality gap

Access to Document

10.1145/3123266.3123326

acmr author copy (Accepted author manuscript, 783 KB)

Cite this

@inproceedings{fef219f615764202b29433f5271e8e5a,

title = "Adversarial Cross-Modal Retrieval",

abstract = "Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with the same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representations and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.",

keywords = "Adversarial learning, Cross-modal retrieval, Modality gap",

author = "Bokun Wang and Yang Yang and Xu Xing and Alan Hanjalic and Shen, {Heng Tao}",

year = "2017",

doi = "10.1145/3123266.3123326",

language = "English",

pages = "154--162",

booktitle = "MM 2017 - Proceedings of the 2017 ACM Multimedia Conference",

publisher = "Association for Computing Machinery (ACM)",

address = "New York",

note = "MM'17 : 25th ACM Multimedia Conference, ACM MM 2017 ; Conference date: 23-10-2017 Through 27-10-2017",

url = "http://www.acmmm.org/2017/",

}

TY - GEN

T1 - Adversarial Cross-Modal Retrieval

AU - Wang, Bokun

AU - Yang, Yang

AU - Xing, Xu

AU - Hanjalic, Alan

AU - Shen, Heng Tao

N1 - Conference code: 25

PY - 2017

Y1 - 2017

AB - Cross-modal retrieval aims to enable a flexible retrieval experience across different modalities (e.g., texts vs. images). The core of cross-modal retrieval research is to learn a common subspace where the items of different modalities can be directly compared to each other. In this paper, we present a novel Adversarial Cross-Modal Retrieval (ACMR) method, which seeks an effective common subspace based on adversarial learning. Adversarial learning is implemented as an interplay between two processes. The first process, a feature projector, tries to generate a modality-invariant representation in the common subspace and to confuse the other process, a modality classifier, which tries to discriminate between different modalities based on the generated representation. We further impose triplet constraints on the feature projector in order to minimize the gap among the representations of all items from different modalities with the same semantic labels, while maximizing the distances among semantically different images and texts. Through the joint exploitation of the above, the underlying cross-modal semantic structure of multimedia data is better preserved when this data is projected into the common subspace. Comprehensive experimental results on four widely used benchmark datasets show that the proposed ACMR method is superior in learning effective subspace representations and that it significantly outperforms the state-of-the-art cross-modal retrieval methods.

KW - Adversarial learning

KW - Cross-modal retrieval

KW - Modality gap

UR - http://www.scopus.com/inward/record.url?scp=85035233555&partnerID=8YFLogxK

U2 - 10.1145/3123266.3123326

DO - 10.1145/3123266.3123326

M3 - Conference contribution

AN - SCOPUS:85035233555

SP - 154

EP - 162

BT - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference

PB - Association for Computing Machinery (ACM)

CY - New York

T2 - MM'17

Y2 - 23 October 2017 through 27 October 2017

ER -
