Abstract
Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, while incurring modest memory overhead.
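The combination of ideas named in the abstract (an index for lookups and joins, plus appends that leave earlier versions readable) can be illustrated with a toy, single-process sketch. This is not the paper's implementation and does not use Spark: the class `IndexedTable`, its methods, and the choice of "row count after an append" as the version number are all hypothetical names and assumptions made only for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


class IndexedTable:
    """Toy append-only table with a hash index on one column (illustrative only)."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = []                   # append-only row storage
        self.index = defaultdict(list)   # key -> ascending row positions

    def append(self, new_rows):
        """Append rows; return the new version (assumed: row count after the append)."""
        for row in new_rows:
            self.index[row[self.key_column]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key, version=None):
        """Return rows matching `key` that were visible at `version`."""
        if version is None:
            version = len(self.rows)
        positions = self.index.get(key, [])
        # Positions are ascending, so a binary search finds the visible prefix.
        visible = bisect_left(positions, version)
        return [self.rows[p] for p in positions[:visible]]

    def join(self, probe_rows, probe_column, version=None):
        """Index-based join: one index probe per probe row, no full scan."""
        return [
            {**match, **probe}
            for probe in probe_rows
            for match in self.lookup(probe[probe_column], version)
        ]


table = IndexedTable("id")
v1 = table.append([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
v2 = table.append([{"id": 1, "name": "c"}])

old_view = table.lookup(1, version=v1)   # reader pinned to the first version
new_view = table.lookup(1)               # reader at the latest version
joined = table.join([{"id": 1, "score": 9}], "id", version=v1)
```

Because rows are never modified in place, a reader that holds `v1` keeps seeing a consistent snapshot while later appends proceed, which is the basic intuition behind multi-version reads over append-only data.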
Original language | English |
---|---|
Title of host publication | Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) |
Editors | L. O'Conner |
Place of Publication | Piscataway |
Publisher | IEEE |
Pages | 104-114 |
Number of pages | 11 |
ISBN (Electronic) | 978-1-6654-8106-9 |
ISBN (Print) | 978-1-6654-8107-6 |
DOIs | |
Publication status | Published - 2022 |
Event | 2022 IEEE 36th International Parallel and Distributed Processing Symposium - Virtual at Lyon, France; Duration: 30 May 2022 → 3 Jun 2022; Conference number: 36th |
Conference
Conference | 2022 IEEE 36th International Parallel and Distributed Processing Symposium |
---|---|
Abbreviated title | IPDPS 2022 |
Country/Territory | France |
City | Virtual at Lyon |
Period | 30/05/22 → 3/06/22 |
Bibliographical note
Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.