In-Memory Indexed Caching for Distributed Data Processing

Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
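
The ideas named in the abstract (a key index for lookups and joins, plus appends that do not disturb readers holding an older version) can be illustrated with a toy, single-process sketch. It is not the authors' implementation and does not use Spark. The class `IndexedTable`, its methods, and the choice to use the row count as the version number are all assumptions made for illustration, and the versioning shown is a simplified snapshot scheme rather than full multi-version concurrency control.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[Any, ...]


class IndexedTable:
    """Hypothetical toy stand-in for an indexed, append-only in-memory cache."""

    def __init__(self, key_col: int) -> None:
        self.key_col = key_col
        self.rows: List[Row] = []  # append-only row log
        # Hash index: key value -> positions of matching rows in the log.
        self.index: Dict[Any, List[int]] = defaultdict(list)

    def append(self, new_rows: Iterable[Row]) -> int:
        """Append rows; return the new version (row count after the append)."""
        for row in new_rows:
            self.index[row[self.key_col]].append(len(self.rows))
            self.rows.append(row)
        return len(self.rows)

    def lookup(self, key: Any, version: Optional[int] = None) -> List[Row]:
        """Return rows with this key that are visible at `version` (default: latest)."""
        limit = len(self.rows) if version is None else version
        return [self.rows[i] for i in self.index.get(key, []) if i < limit]

    def join(self, probe_rows: Iterable[Row], probe_key_col: int,
             version: Optional[int] = None) -> Iterator[Row]:
        """Index join: probe the index once per probe row instead of scanning."""
        for probe in probe_rows:
            for match in self.lookup(probe[probe_key_col], version):
                yield probe + match


table = IndexedTable(key_col=0)
v1 = table.append([(1, "a"), (2, "b")])
v2 = table.append([(1, "c")])
old_view = table.lookup(1, version=v1)  # reader pinned to v1 does not see the later append
new_view = table.lookup(1)              # latest version sees both rows for key 1
```

In this sketch a lookup costs time proportional to the number of matching rows rather than to the table size, which is the general reason an index helps point lookups and joins. How the Indexed DataFrame achieves this inside Spark is described in the paper itself.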
Original language: English
Title of host publication: Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Editors: L. O'Conner
Place of Publication: Piscataway
Number of pages: 11
ISBN (Electronic): 978-1-6654-8106-9
ISBN (Print): 978-1-6654-8107-6
Publication status: Published - 2022
Event: 2022 IEEE 36th International Parallel and Distributed Processing Symposium - Virtual at Lyon, France
Duration: 30 May 2022 to 3 Jun 2022
Conference number: 36th


Conference: 2022 IEEE 36th International Parallel and Distributed Processing Symposium
Abbreviated title: IPDPS 2022
City: Virtual at Lyon

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project

Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

