SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models

Research output: Chapter in Book/Conference proceedings/Edited volumeConference contributionScientificpeer-review

Abstract

Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed across multiple silos, necessitating on-premise data storage. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion architecture. Through autoencoders, latent representations are learned for each client's features, masking their actual values. We employ stacked dis-tributed training to improve communication efficiency, reducing the number of rounds to a single step. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets showcase SiloFuse's competence against centralized diffusion-based synthesizers. Notably, SiloFuse achieves 43.8 and 29.8 higher percentage points over GANs in resemblance and utility. Experiments on communication show stacked training's fixed cost compared to the growing costs of end-to-end training as the number of training iterations increases. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.

Original languageEnglish
Title of host publicationProceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
PublisherIEEE
Pages110-123
Number of pages14
ISBN (Electronic)9798350317152
DOIs
Publication statusPublished - 2024
Event40th IEEE International Conference on Data Engineering, ICDE 2024 - Utrecht, Netherlands
Duration: 13 May 202417 May 2024

Publication series

NameProceedings - International Conference on Data Engineering
ISSN (Print)1084-4627
ISSN (Electronic)2375-0286

Conference

Conference40th IEEE International Conference on Data Engineering, ICDE 2024
Country/TerritoryNetherlands
CityUtrecht
Period13/05/2417/05/24

Bibliographical note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Keywords

  • Data privacy
  • Distributed databases
  • Distributed training
  • Synthetic data

Fingerprint

Dive into the research topics of 'SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models'. Together they form a unique fingerprint.

Cite this