Characterization of a Big Data Storage Workload in the Cloud

Sacheendra Talluri, Cristina L. Abad, Alicja Łuszczak, Alexandru Iosup

Research output: Chapter in Book/Conference proceedings/Edited volume; Conference contribution; Scientific; peer-review

5 Citations (Scopus)
17 Downloads (Pure)


The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest daily and weekly load imbalances, heavy-tailed and bursty behaviour, a relative rarity of modifications, and a proliferation of big-data-specific formats.

Original language: English
Title of host publication: ICPE '19
Subtitle of host publication: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Number of pages: 12
ISBN (Print): 978-1-4503-6239-9
Publication status: Published - 2019
Event: 10th International Conference on Power Electronics - ECCE Asia, ICPE 2019 - ECCE Asia: 10th International Conference on Power Electronics - Busan, Korea, Republic of
Duration: 27 May 2019 to 30 May 2019
Conference number: 10th



