Query rewriting for heterogeneous data lakes

Rihan Hai, Christoph Quix, Chen Zhou

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, storing the data in its original structure, and providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface that can rewrite queries expressed in a general data model into a union of queries over data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach is based on a flexible and extensible architecture that currently supports the major data structures such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real-world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.
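The abstract describes rewriting a query over a unified model into source-specific sub-queries, executing them on Apache Spark, and reconciling the results into one integrated dataset. The following Scala sketch illustrates only that last step with Spark's DataFrame API; it is not the authors' implementation, and the connection URL, file path, table, and column names are hypothetical placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}

object RewrittenQueryUnion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HeterogeneousDataLakeQuery")
      .getOrCreate()

    // Hypothetical rewritten sub-query against a relational source (via JDBC).
    val relational: DataFrame = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host/db")              // placeholder connection
      .option("dbtable", "(SELECT id, name FROM persons) AS p") // placeholder sub-query
      .load()

    // Hypothetical rewritten sub-query against a semi-structured JSON source,
    // with its attributes mapped to the unified schema (id, name).
    val json: DataFrame = spark.read
      .json("s3://datalake/persons.json")                       // placeholder path
      .selectExpr("person_id AS id", "full_name AS name")

    // Reconcile the per-source results into a single integrated dataset.
    val integrated = relational.unionByName(json).dropDuplicates("id")

    integrated.show()
    spark.stop()
  }
}

In this sketch, the declarative mappings of the framework would determine how each source's attributes are renamed to the unified schema before the union; here that mapping is hard-coded for illustration.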
Original language: English
Title of host publication: European Conference on Advances in Databases and Information Systems
Pages: 35-49
Number of pages: 15
DOIs
Publication status: Published - 2018
Externally published: Yes
