Amalur: Data Integration Meets Machine Learning

Research output: Chapter in Book/Conference proceedings/Edited volume, Conference contribution, Scientific, peer-review


Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos, feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning.

Original language: English
Title of host publication: Proceedings of the 2023 IEEE 39th International Conference on Data Engineering, ICDE 2023
Place of Publication: Piscataway
Number of pages: 11
ISBN (Electronic): 979-8-3503-2227-9
ISBN (Print): 979-8-3503-2228-6
Publication status: Published - 2023
Event: 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States
Duration: 3 Apr 2023 – 7 Apr 2023

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627


Conference: 39th IEEE International Conference on Data Engineering, ICDE 2023
Country/Territory: United States

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project. Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

