Courier: Real-Time Optimal Batch Size Prediction for Latency SLOs in BigDL

Diego Albo Martínez, Sharwin Bobde, Tomasz Motyka, Lydia Chen

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

Abstract

Distributed machine learning has seen an immense rise in popularity in recent years. Many companies and universities use computational clusters to train and run machine learning models. Unfortunately, operating such a cluster imposes large costs, so it is crucial to attain as high a system utilization as possible. Moreover, those who offer computational clusters as a service must not only keep utilization high but also meet the required Service Level Agreements (SLAs) for system response time. This becomes increasingly complex in multi-tenant scenarios, where the time dedicated to each task has to be limited to achieve fairness. In this work, we analyze how different parameters of a machine learning job influence response time and system utilization, and we propose Courier. Courier is a model that, based on the type of machine learning job, selects a batch size such that the response time adheres to the specified Service Level Objectives (SLOs) while also yielding the highest possible accuracy. We gather the data by conducting real-world experiments on a BigDL cluster. We then study the influence of these factors and build several predictive models, which lead us to the proposed Courier model.

Original language: English
Title of host publication: ICPE 2021 - Proceedings of the ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery (ACM)
Pages: 133-144
Number of pages: 12
ISBN (Electronic): 9781450381949
DOIs
Publication status: Published - 2021
Event: 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021 - Virtual, Online, France
Duration: 19 Apr 2021 to 21 Apr 2021

Publication series

Name: ICPE 2021 - Proceedings of the ACM/SPEC International Conference on Performance Engineering

Conference

Conference: 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021
Country/Territory: France
City: Virtual, Online
Period: 19/04/21 to 21/04/21

Keywords

  • deep learning
  • distributed systems
  • hyperparameter optimization
  • provisioning
  • resource management
  • scheduling
