Amalur: Data Integration Meets Machine Learning

Ziyu Li; Wenbo Sun; Danning Zhan; Yan Kang; Lydia Chen; Alessandro Bozzon; Rihan Hai

doi:10.1109/TKDE.2024.3357389

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.

Original language	English
Pages (from-to)	1-14
Number of pages	14
Journal	IEEE Transactions on Knowledge and Data Engineering
DOIs	https://doi.org/10.1109/TKDE.2024.3357389
Publication status	E-pub ahead of print - 2024

Keywords

Data integration
Data privacy
Federated learning
Machine learning
Metadata
Soft sensors
Training
Training data

Access to Document

10.1109/TKDE.2024.3357389

Cite this

@article{88dde3a2cdc84384b845501045c6def4,

title = "Amalur: Data Integration Meets Machine Learning",

abstract = "Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.",

keywords = "Data integration, Data privacy, Federated learning, Machine learning, Metadata, Soft sensors, Training, Training data",

author = "Ziyu Li and Wenbo Sun and Danning Zhan and Yan Kang and Lydia Chen and Alessandro Bozzon and Rihan Hai",

year = "2024",

doi = "10.1109/TKDE.2024.3357389",

language = "English",

pages = "1--14",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE",

}

TY - JOUR

T1 - Amalur

T2 - Data Integration Meets Machine Learning

AU - Li, Ziyu

AU - Sun, Wenbo

AU - Zhan, Danning

AU - Kang, Yan

AU - Chen, Lydia

AU - Bozzon, Alessandro

AU - Hai, Rihan

PY - 2024

Y1 - 2024

N2 - Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.

AB - Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.

KW - Data integration

KW - Data privacy

KW - Federated learning

KW - Machine learning

KW - Metadata

KW - Soft sensors

KW - Training

KW - Training data

UR - http://www.scopus.com/inward/record.url?scp=85183980623&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2024.3357389

DO - 10.1109/TKDE.2024.3357389

M3 - Article

AN - SCOPUS:85183980623

SN - 1041-4347

SP - 1

EP - 14

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

ER -
