Valentine: Evaluating Matching Techniques for Dataset Discovery

Christos Koutras; George Siachamis; Andra Ionescu; Kyriakos Psarakis; Jerry Brons; Marios Fragkoulis; Christoph Lofi; Angela Bonifati; Asterios Katsifodimos

doi:10.1109/ICDE51399.2021.00047

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

24 Citations (Scopus)

4 Downloads (Pure)

Abstract

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.
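As a rough illustration of what "finding matching pairs of columns between a source and a target schema" and "ranking" can mean in practice, the sketch below scores every column pair by the Jaccard overlap of the column values and sorts the pairs best first. This is a generic instance-based baseline chosen for brevity, not a method taken from Valentine; the function names, the dictionary-of-lists table representation, and the toy data are all assumptions made for the example.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # two empty columns carry no evidence of a match
    return len(sa & sb) / len(sa | sb)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair by value overlap.

    `source` and `target` are hypothetical tables given as
    {column_name: list_of_values}. Returns a list of
    ((source_col, target_col), score) tuples, highest score first.
    """
    scored = [
        ((s_col, t_col), jaccard(s_vals, t_vals))
        for (s_col, s_vals), (t_col, t_vals) in product(source.items(), target.items())
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy tables (made up for illustration only).
source = {"name": ["ann", "bob", "eve"], "city": ["delft", "lyon", "chania"]}
target = {"full_name": ["ann", "bob", "dan"], "town": ["delft", "lyon", "chania"]}

ranked = rank_column_matches(source, target)
for (s_col, t_col), score in ranked:
    print(f"{s_col} -> {t_col}: {score:.2f}")
```

A ranked list like this, rather than a single one-to-one mapping, is closer to the use the abstract describes, where matching serves as a building block for indicating and ranking relationships between datasets.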

Original language	English
Title of host publication	Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Place of Publication	Chania, Greece
Publisher	IEEE
Pages	468-479
Number of pages	12
ISBN (Electronic)	9781728191843
DOIs	https://doi.org/10.1109/ICDE51399.2021.00047
Publication status	Published - 2021
Event	37th IEEE International Conference on Data Engineering - Virtual/online event Duration: 19 Apr 2021 → 22 Apr 2021

Publication series

Name	Proceedings - International Conference on Data Engineering
Volume	2021-April
ISSN (Print)	1084-4627

Conference

Conference	37th IEEE International Conference on Data Engineering
Abbreviated title	ICDE2021
Period	19/04/21 → 22/04/21

Access to Document

10.1109/ICDE51399.2021.00047

Cite this

Koutras, C., Siachamis, G., Ionescu, A., Psarakis, K., Brons, J., Fragkoulis, M., Lofi, C., Bonifati, A., & Katsifodimos, A. (2021). Valentine: Evaluating Matching Techniques for Dataset Discovery. In Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021 (pp. 468-479). Article 9458921 (Proceedings - International Conference on Data Engineering; Vol. 2021-April). IEEE. https://doi.org/10.1109/ICDE51399.2021.00047

@inproceedings{1335cd6180764327923a12ea5cfb6eed,

title = "Valentine: Evaluating Matching Techniques for Dataset Discovery",

abstract = "Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method{\textquoteright}s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods. ",

author = "Christos Koutras and George Siachamis and Andra Ionescu and Kyriakos Psarakis and Jerry Brons and Marios Fragkoulis and Christoph Lofi and Angela Bonifati and Asterios Katsifodimos",

year = "2021",

doi = "10.1109/ICDE51399.2021.00047",

language = "English",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE",

pages = "468--479",

booktitle = "Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021",

address = "United States",

note = "37th IEEE International Conference on Data Engineering, ICDE2021 ; Conference date: 19-04-2021 Through 22-04-2021",

}

Koutras, C, Siachamis, G, Ionescu, A, Psarakis, K, Brons, J, Fragkoulis, M, Lofi, C, Bonifati, A & Katsifodimos, A 2021, Valentine: Evaluating Matching Techniques for Dataset Discovery. in Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021, 9458921, Proceedings - International Conference on Data Engineering, vol. 2021-April, IEEE, Chania, Greece, pp. 468-479, 37th IEEE International Conference on Data Engineering, 19/04/21. https://doi.org/10.1109/ICDE51399.2021.00047

Valentine: Evaluating Matching Techniques for Dataset Discovery. / Koutras, Christos ; Siachamis, George ; Ionescu, Andra et al.
Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021. Chania, Greece: IEEE, 2021. p. 468-479 9458921 (Proceedings - International Conference on Data Engineering; Vol. 2021-April).

TY - GEN

T1 - Valentine: Evaluating Matching Techniques for Dataset Discovery

AU - Koutras, Christos

AU - Siachamis, George

AU - Ionescu, Andra

AU - Psarakis, Kyriakos

AU - Brons, Jerry

AU - Fragkoulis, Marios

AU - Lofi, Christoph

AU - Bonifati, Angela

AU - Katsifodimos, Asterios

PY - 2021

Y1 - 2021

AB - Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.

UR - http://www.scopus.com/inward/record.url?scp=85112864391&partnerID=8YFLogxK

U2 - 10.1109/ICDE51399.2021.00047

DO - 10.1109/ICDE51399.2021.00047

M3 - Conference contribution

T3 - Proceedings - International Conference on Data Engineering

SP - 468

EP - 479

BT - Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021

PB - IEEE

CY - Chania, Greece

T2 - 37th IEEE International Conference on Data Engineering

Y2 - 19 April 2021 through 22 April 2021

ER -

Koutras C, Siachamis G, Ionescu A, Psarakis K, Brons J, Fragkoulis M et al. Valentine: Evaluating Matching Techniques for Dataset Discovery. In Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021. Chania, Greece: IEEE. 2021. p. 468-479. 9458921. (Proceedings - International Conference on Data Engineering). doi: 10.1109/ICDE51399.2021.00047
