An Intermediate Representation for Optimizing Machine Learning Pipelines

Andreas Kunft; Asterios Katsifodimos; Sebastian Schelter; Sebastian Bress; Tilmann Rabl; Volker Markl

doi:10.14778/3342263.3342633

Web Information Systems

Research output: Contribution to journal › Conference article › Scientific › peer-review

23 Citations (Scopus)

100 Downloads (Pure)

Abstract

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.

Original language	English
Pages (from-to)	1553-1567
Number of pages	15
Journal	Proceedings of the VLDB Endowment
Volume	12
Issue number	11
DOIs	https://doi.org/10.14778/3342263.3342633
Publication status	Published - Jul 2019

Access to Document

10.14778/3342263.3342633

end-to-end-ml-pipelines (Final published version, 737 KB, Licence: CC BY)

Cite this

@article{3970f98fbcf14ead93f72a2b20968bf8,
title = "An Intermediate Representation for Optimizing Machine Learning Pipelines",
abstract = "Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.",
author = "Andreas Kunft and Asterios Katsifodimos and Sebastian Schelter and Sebastian Bress and Tilmann Rabl and Volker Markl",
year = "2019",
month = jul,
doi = "10.14778/3342263.3342633",
language = "English",
volume = "12",
pages = "1553--1567",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "VLDB Endowment",
number = "11",
}

TY - JOUR
T1 - An Intermediate Representation for Optimizing Machine Learning Pipelines
AU - Kunft, Andreas
AU - Katsifodimos, Asterios
AU - Schelter, Sebastian
AU - Bress, Sebastian
AU - Rabl, Tilmann
AU - Markl, Volker
PY - 2019/7
Y1 - 2019/7
N2 - Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
AB - Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
UR - http://www.scopus.com/inward/record.url?scp=85082818643&partnerID=8YFLogxK
U2 - 10.14778/3342263.3342633
DO - 10.14778/3342263.3342633
M3 - Conference article
SN - 2150-8097
VL - 12
SP - 1553
EP - 1567
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
ER -
