Automatic Table Union Search with Tabular Representation Learning

Xuming Hu, Shen Wang, Xiao Qin, Chuan Lei, Zhengyuan Shen, Christos Faloutsos, Asterios Katsifodimos, George Karypis, Lijie Wen, Philip S. Yu

Research output: Chapter in Book/Conference proceedings/Edited volumeConference contributionScientificpeer-review

Abstract

Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation as it enables data scientists to navigate massive open data repositories. Existing methods identify uniability based on column representations (word surface forms or token embeddings) and column relation represented by column representation similarity. However, the semantic similarity obtained between column representations is often insufficient to reveal latent relational features to describe the column relation between pair of columns and not robust to the table noise. To address these issues, in this paper, we propose a multi-stage self-supervised table union search framework called AUTOTUS, which represents column relation as a vector- column relational representation and learn column relational representation in a multi-stage manner that can better describe column relation for table unionability prediction. In particular, the large language model powered contextualized column relation encoder is updated by adaptive clustering and pseudo label classification iteratively so that the better column relational representation can be learned. Moreover, to improve the robustness of the model against table noises, we propose table noise generator to add table noise to the training table data. Experiments on real-world datasets and synthetic test set augmented with table noise show that AUTOTUS achieves 5.2% performance gain over the SOTA baseline.

Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics, ACL 2023
PublisherAssociation for Computational Linguistics (ACL)
Pages3786-3800
Number of pages15
ISBN (Electronic)9781959429623
Publication statusPublished - 2023
Externally publishedYes
Event61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada
Duration: 9 Jul 202314 Jul 2023

Publication series

NameProceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print)0736-587X

Conference

Conference61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Country/TerritoryCanada
CityToronto
Period9/07/2314/07/23

Fingerprint

Dive into the research topics of 'Automatic Table Union Search with Tabular Representation Learning'. Together they form a unique fingerprint.

Cite this