Abstract
How can we perform similarity joins of multi-dimensional streams in a distributed fashion, achieving low latency? Can we adaptively repartition those streams in order to retain high performance under concept drifts? Current approaches to similarity joins are either restricted to single-node deployments or focus on set-similarity joins, failing to cover the ubiquitous case of metric-space similarity joins. In this paper, we propose the first adaptive distributed streaming similarity join approach that gracefully scales with variable velocity and distribution of multi-dimensional data streams. Our approach can adaptively rebalance the load of nodes in the case of concept drifts, allowing for similarity computations in the general metric space. We implement our approach on top of Apache Flink and evaluate its data partitioning and load balancing schemes on a set of synthetic datasets in terms of latency, comparisons ratio, and data duplication ratio.
Original language | English |
---|---|
Title of host publication | DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems |
Editors | Marcelo Pasin |
Pages | 25-36 |
ISBN (Electronic) | 979-8-4007-0122-1 |
Publication status | Published - 2023 |
Event | 17th ACM International Conference on Distributed and Event-based Systems (DEBS '23), Neuchatel, Switzerland. Duration: 27 Jun 2023 → 30 Jun 2023. Conference number: 17 |
Conference
Conference | 17th ACM International Conference on Distributed and Event-based Systems |
---|---|
Abbreviated title | DEBS '23 |
Country/Territory | Switzerland |
City | Neuchatel |
Period | 27/06/23 → 30/06/23 |