Abstract
The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control ow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataows and (ii) hides the notion of dataparallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
Original language | English |
---|---|
Title of host publication | SIGMOD'15 |
Subtitle of host publication | Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data |
Publisher | Association for Computing Machinery (ACM) |
Pages | 47-61 |
Number of pages | 15 |
ISBN (Print) | 978-1-4503-2758-9 |
DOIs | |
Publication status | Published - 27 May 2015 |
Externally published | Yes |
Event | ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia Duration: 31 May 2015 → 4 Jun 2015 |
Conference
Conference | ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 |
---|---|
Country/Territory | Australia |
City | Melbourne |
Period | 31/05/15 → 4/06/15 |