Valentine in Action: Matching Tabular Data at Scale

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of the schema matching methods in use. However, despite the wealth of research, evaluation of schema matching methods is still a daunting task: there is a lack of openly-available datasets with ground truth, reference method implementations, and comprehensible GUIs that would facilitate development of both novel state-of-the-art schema matching techniques and novel data discovery methods. Our recently proposed Valentine is the first system to offer an open-source experiment suite to organize, execute and orchestrate large-scale matching experiments. In this demonstration we present its functionalities and enhancements: i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery.
Original language: English
Pages (from-to): 2871–2874
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 12
Publication status: Published - 2021
