Amalur: The Convergence of Data Integration and Machine Learning

Research output: Contribution to journal, Article, Scientific, peer-review

Abstract

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.

Original language: English
Pages (from-to): 7353-7367
Number of pages: 15
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 36
Issue number: 12
Publication status: Published - 2024

Keywords

  • Data integration
  • Data privacy
  • Federated learning
  • Machine learning
  • Metadata
  • Soft sensors
  • Training
  • Training data
