TY - GEN
T1 - Relaxed functional dependency discovery in heterogeneous data lakes
AU - Hai, Rihan
AU - Quix, Christoph
AU - Wang, Dan
PY - 2019
Y1 - 2019
AB - Functional dependencies are important for the definition of constraints and relationships that have to be satisfied by every database instance. Relaxed functional dependencies (RFDs) can be used for data exploration and profiling in datasets with lower data quality. In this work, we present an approach for RFD discovery in heterogeneous data lakes. More specifically, the goal of this work is to find RFDs from structured, semi-structured, and graph data. Our solution brings novelty to this problem in the following aspects: (1) We introduce a generic metamodel to the problem of RFD discovery, which allows us to define and detect RFDs for data stored in heterogeneous sources in an integrated manner. (2) We apply clustering techniques during RFD discovery for partitioning and pruning. (3) We performed an intensive evaluation with nine datasets, which shows that our approach is effective for discovering meaningful RFDs, reducing redundancy, and detecting inconsistent data.
UR - http://www.scopus.com/inward/record.url?scp=85076141961&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-33223-5_19
DO - 10.1007/978-3-030-33223-5_19
M3 - Conference contribution
SN - 9783030332228
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 225
EP - 239
BT - Conceptual Modeling - 38th International Conference, ER 2019, Proceedings
A2 - Laender, Alberto H.F.
A2 - Pernici, Barbara
A2 - Lim, Ee-Peng
A2 - de Oliveira, José Palazzo M.
ER -