Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes

Fuyang Yu; Zhen Wang; Dongyuan Li; Peide Zhu; Xiaohui Liang; Xiaochuan Wang; Manabu Okumura

doi:10.1007/978-3-031-53302-0_7

Fuyang Yu, Zhen Wang, Dongyuan Li, Peide Zhu, Xiaohui Liang^*, Xiaochuan Wang, Manabu Okumura

^*Corresponding author for this work

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Cross-modal retrieval, as an important emerging foundational information retrieval task, benefits from recent advances in multimodal technologies. However, current cross-modal retrieval methods mainly focus on the interaction between textual information and 2D images, lacking research on 3D data, especially point clouds at scene level, despite the increasing role point clouds play in daily life. Therefore, in this paper, we proposed a cross-modal point cloud retrieval benchmark that focuses on using text or images to retrieve point clouds of indoor scenes. Given the high cost of obtaining point clouds compared to text and images, we first designed a pipeline to automatically generate a large number of indoor scenes and their corresponding scene graphs. Based on this pipeline, we collected a balanced dataset called CRISP, which contains 10K point cloud scenes along with their corresponding scene images and descriptions. We then used state-of-the-art models to design baseline methods on CRISP. Our experiments demonstrated that point cloud retrieval accuracy is much lower than cross-modal retrieval of 2D images, especially for textual queries. Furthermore, we proposed ModalBlender, a tri-modal framework that can greatly improve the Text-PointCloud retrieval performance. Through extensive experiments, CRISP proved to be a valuable dataset and worth researching. (Dataset can be downloaded at https://github.com/CRISPdataset/CRISP.)

Original language	English
Title of host publication	MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings
Editors	Stevan Rudinac, Marcel Worring, Cynthia Liem, Alan Hanjalic, Björn Pór Jónsson, Yoko Yamakata, Bei Liu
Place of Publication	Cham
Publisher	Springer
Pages	89-102
Number of pages	14
ISBN (Electronic)	978-3-031-53302-0
ISBN (Print)	978-3-031-53301-3
DOIs	https://doi.org/10.1007/978-3-031-53302-0_7
Publication status	Published - 2024
Event	30th International Conference on MultiMedia Modeling, MMM 2024 - Amsterdam, Netherlands Duration: 29 Jan 2024 → 2 Feb 2024

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14557 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	30th International Conference on MultiMedia Modeling, MMM 2024
Country/Territory	Netherlands
City	Amsterdam
Period	29/01/24 → 2/02/24

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

Cross-modal Retrieval
Indoor Scene
Point Cloud

Access to Document

10.1007/978-3-031-53302-0_7

Cite this

Yu, F., Wang, Z., Li, D., Zhu, P., Liang, X., Wang, X., & Okumura, M. (2024). Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. In S. Rudinac, M. Worring, C. Liem, A. Hanjalic, B. P. Jónsson, Y. Yamakata, & B. Liu (Eds.), MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings (pp. 89-102). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14557 LNCS). Springer. https://doi.org/10.1007/978-3-031-53302-0_7

Yu, Fuyang ; Wang, Zhen ; Li, Dongyuan et al. / Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings. editor / Stevan Rudinac ; Marcel Worring ; Cynthia Liem ; Alan Hanjalic ; Björn Pór Jónsson ; Yoko Yamakata ; Bei Liu. Cham : Springer, 2024. pp. 89-102 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{3c4e3e23f4bc44feb737ce1e0fcdfed0,

title = "Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes",

abstract = "Cross-modal retrieval, as an important emerging foundational information retrieval task, benefits from recent advances in multimodal technologies. However, current cross-modal retrieval methods mainly focus on the interaction between textual information and 2D images, lacking research on 3D data, especially point clouds at scene level, despite the increasing role point clouds play in daily life. Therefore, in this paper, we proposed a cross-modal point cloud retrieval benchmark that focuses on using text or images to retrieve point clouds of indoor scenes. Given the high cost of obtaining point cloud compared to text and images, we first designed a pipeline to automatically generate a large number of indoor scenes and their corresponding scene graphs. Based on this pipeline, we collected a balanced dataset called CRISP, which contains 10K point cloud scenes along with their corresponding scene images and descriptions. We then used state-of-the-art models to design baseline methods on CRISP. Our experiments demonstrated that point cloud retrieval accuracy is much lower than cross-modal retrieval of 2D images, especially for textual queries. Furthermore, we proposed ModalBlender, a tri-modal framework which can greatly improve the Text-PointCloud retrieval performance. Through extensive experiments, CRISP proved to be a valuable dataset and worth researching. (Dataset can be downloaded at https://github.com/CRISPdataset/CRISP.)",

keywords = "Cross-modal Retrieval, Indoor Scene, Point Cloud",

author = "Fuyang Yu and Zhen Wang and Dongyuan Li and Peide Zhu and Xiaohui Liang and Xiaochuan Wang and Manabu Okumura",

note = "Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.; 30th International Conference on MultiMedia Modeling, MMM 2024 ; Conference date: 29-01-2024 Through 02-02-2024",

year = "2024",

doi = "10.1007/978-3-031-53302-0_7",

language = "English",

isbn = "978-3-031-53301-3",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer",

pages = "89--102",

editor = "Stevan Rudinac and Marcel Worring and Cynthia Liem and Alan Hanjalic and J{\'o}nsson, {Bj{\"o}rn P{\'o}r} and Yoko Yamakata and Bei Liu",

booktitle = "MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings",

}

Yu, F, Wang, Z, Li, D, Zhu, P, Liang, X, Wang, X & Okumura, M 2024, Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. in S Rudinac, M Worring, C Liem, A Hanjalic, BP Jónsson, Y Yamakata & B Liu (eds), MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14557 LNCS, Springer, Cham, pp. 89-102, 30th International Conference on MultiMedia Modeling, MMM 2024, Amsterdam, Netherlands, 29/01/24. https://doi.org/10.1007/978-3-031-53302-0_7

Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. / Yu, Fuyang; Wang, Zhen; Li, Dongyuan et al.
MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings. ed. / Stevan Rudinac; Marcel Worring; Cynthia Liem; Alan Hanjalic; Björn Pór Jónsson; Yoko Yamakata; Bei Liu. Cham: Springer, 2024. p. 89-102 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14557 LNCS).

TY - GEN

T1 - Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes

AU - Yu, Fuyang

AU - Wang, Zhen

AU - Li, Dongyuan

AU - Zhu, Peide

AU - Liang, Xiaohui

AU - Wang, Xiaochuan

AU - Okumura, Manabu

N1 - Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

PY - 2024

Y1 - 2024

AB - Cross-modal retrieval, as an important emerging foundational information retrieval task, benefits from recent advances in multimodal technologies. However, current cross-modal retrieval methods mainly focus on the interaction between textual information and 2D images, lacking research on 3D data, especially point clouds at scene level, despite the increasing role point clouds play in daily life. Therefore, in this paper, we proposed a cross-modal point cloud retrieval benchmark that focuses on using text or images to retrieve point clouds of indoor scenes. Given the high cost of obtaining point cloud compared to text and images, we first designed a pipeline to automatically generate a large number of indoor scenes and their corresponding scene graphs. Based on this pipeline, we collected a balanced dataset called CRISP, which contains 10K point cloud scenes along with their corresponding scene images and descriptions. We then used state-of-the-art models to design baseline methods on CRISP. Our experiments demonstrated that point cloud retrieval accuracy is much lower than cross-modal retrieval of 2D images, especially for textual queries. Furthermore, we proposed ModalBlender, a tri-modal framework which can greatly improve the Text-PointCloud retrieval performance. Through extensive experiments, CRISP proved to be a valuable dataset and worth researching. (Dataset can be downloaded at https://github.com/CRISPdataset/CRISP.)

KW - Cross-modal Retrieval

KW - Indoor Scene

KW - Point Cloud

UR - http://www.scopus.com/inward/record.url?scp=85184797734&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-53302-0_7

DO - 10.1007/978-3-031-53302-0_7

M3 - Conference contribution

AN - SCOPUS:85184797734

SN - 978-3-031-53301-3

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 89

EP - 102

BT - MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings

A2 - Rudinac, Stevan

A2 - Worring, Marcel

A2 - Liem, Cynthia

A2 - Hanjalic, Alan

A2 - Jónsson, Björn Pór

A2 - Yamakata, Yoko

A2 - Liu, Bei

PB - Springer

CY - Cham

T2 - 30th International Conference on MultiMedia Modeling, MMM 2024

Y2 - 29 January 2024 through 2 February 2024

ER -

Yu F, Wang Z, Li D, Zhu P, Liang X, Wang X et al. Towards Cross-Modal Point Cloud Retrieval for Indoor Scenes. In Rudinac S, Worring M, Liem C, Hanjalic A, Jónsson BP, Yamakata Y, Liu B, editors, MultiMedia Modeling - 30th International Conference, MMM 2024, Proceedings. Cham: Springer. 2024. p. 89-102. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-53302-0_7