TY - GEN
T1 - Automatic Table Union Search with Tabular Representation Learning
AU - Hu, Xuming
AU - Wang, Shen
AU - Qin, Xiao
AU - Lei, Chuan
AU - Shen, Zhengyuan
AU - Faloutsos, Christos
AU - Katsifodimos, Asterios
AU - Karypis, George
AU - Wen, Lijie
AU - Yu, Philip S.
PY - 2023
Y1 - 2023
N2 - Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation, as it enables data scientists to navigate massive open data repositories. Existing methods identify unionability based on column representations (word surface forms or token embeddings) and on column relations, which are represented by the similarity between column representations. However, the semantic similarity obtained between column representations is often insufficient to reveal the latent relational features that describe the relation between a pair of columns, and it is not robust to table noise. To address these issues, we propose a multi-stage self-supervised table union search framework called AUTOTUS, which represents a column relation as a vector (the column relational representation) and learns this representation in a multi-stage manner so that it better describes column relations for table unionability prediction. In particular, a contextualized column relation encoder powered by a large language model is updated iteratively by adaptive clustering and pseudo-label classification, so that a better column relational representation can be learned. Moreover, to improve the robustness of the model against table noise, we propose a table noise generator that adds table noise to the training table data. Experiments on real-world datasets and on a synthetic test set augmented with table noise show that AUTOTUS achieves a 5.2% performance gain over the SOTA baseline.
UR - http://www.scopus.com/inward/record.url?scp=85168822251&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85168822251
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 3786
EP - 3800
BT - Findings of the Association for Computational Linguistics, ACL 2023
PB - Association for Computational Linguistics (ACL)
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Y2 - 9 July 2023 through 14 July 2023
ER -