In-Memory Indexed Caching for Distributed Data Processing

Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


Abstract

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with built-in indexing for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
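The ideas named in the abstract (an index for fast lookups and joins, plus appends under multi-version concurrency control) can be illustrated with a small single-process sketch. This is not the authors' library or its API: the class `IndexedCache` and its methods `append`, `lookup`, and `join` are hypothetical names chosen for illustration, and the versioning scheme (one version per appended batch, readers pinned to a row-count boundary) is an assumption about one simple way such a cache could behave.

```python
from collections import defaultdict


class IndexedCache:
    """Toy append-only table with a hash index and versioned reads.

    Hypothetical illustration only; not the Indexed DataFrame API.
    """

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = []                  # append-only row storage
        self.index = defaultdict(list)  # key value -> positions in self.rows
        self.boundaries = [0]           # boundaries[v] = rows visible at version v

    @property
    def version(self):
        """Latest committed version (0 means the cache is empty)."""
        return len(self.boundaries) - 1

    def append(self, new_rows):
        """Append a batch of rows, index them, and return the new version."""
        for row in new_rows:
            self.index[row[self.key_column]].append(len(self.rows))
            self.rows.append(row)
        self.boundaries.append(len(self.rows))
        return self.version

    def lookup(self, key, version=None):
        """Return rows matching `key` as seen at `version` (default: latest)."""
        limit = self.boundaries[self.version if version is None else version]
        # Positions at or past the boundary belong to later appends and are hidden.
        return [self.rows[p] for p in self.index.get(key, []) if p < limit]

    def join(self, probe_rows, probe_key, version=None):
        """Index-based equi-join: probe the index once per incoming row."""
        joined = []
        for probe in probe_rows:
            for match in self.lookup(probe[probe_key], version):
                joined.append({**match, **probe})
        return joined
```

A reader that captured an earlier version number keeps seeing the same rows even after later appends, which is the property that versioned appends are meant to provide. The join probes the index per incoming row instead of scanning the cached table, which is the general reason an index can speed up lookups and joins.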
Original language: English
Title of host publication: Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Editors: L. O'Conner
Place of Publication: Piscataway
Publisher: IEEE
Pages: 104-114
Number of pages: 11
ISBN (Electronic): 978-1-6654-8106-9
ISBN (Print): 978-1-6654-8107-6
DOIs
Publication status: Published - 2022
Event: 2022 IEEE 36th International Parallel and Distributed Processing Symposium - Virtual at Lyon, France
Duration: 30 May 2022 → 3 Jun 2022
Conference number: 36th

Conference

Conference: 2022 IEEE 36th International Parallel and Distributed Processing Symposium
Abbreviated title: IPDPS 2022
Country/Territory: France
City: Virtual at Lyon
Period: 30/05/22 → 3/06/22

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care

Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
