When Machine Learning Models Leak: An Exploration of Synthetic Training Data

Manel Slokom*, Peter Paul de Wolf, Martha Larson

*Corresponding author for this work

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review


Abstract

We investigate an attack on a machine learning classifier that predicts the propensity of a person or household to move (i.e., relocate) in the next two years. The attack assumes that the classifier has been made publicly available and that the attacker has access to information about a certain number of target individuals. The attacker might also have information about another set of people with which to train an auxiliary classifier. We show that the attack is possible for target individuals regardless of whether they were contained in the original training set of the classifier, although it is somewhat less successful for individuals who were not. Based on this observation, we investigate whether training the classifier on a data set synthesized from the original training data, rather than on the original training data directly, would help to mitigate the attack. Our experimental results show that it does not, leading us to conclude that new approaches to data synthesis must be developed if synthesized data is to resemble “unseen” individuals closely enough to help block machine learning model attacks.
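To make the attack setting concrete, the sketch below illustrates one common form of attribute inference of the kind the abstract describes: an attacker who knows a target's non-sensitive attributes queries the public model with each candidate value of the sensitive attribute and trains an auxiliary classifier on the resulting scores. This is a minimal sketch under stated assumptions, not the authors' experimental code; the synthetic data-generating process, feature layout, and use of scikit-learn are all illustrative.

```python
# Hypothetical attribute inference sketch; variable names, features, and
# data generation are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# --- A "public" target model predicting propensity to move ---
n, d = 2000, 8                       # records and non-sensitive features
X = rng.normal(size=(n, d))          # non-sensitive attributes
s = rng.integers(0, 2, size=n)       # sensitive attribute the attacker wants
# The move label depends on s, so the trained model can leak it.
y = ((X[:, 0] + s + rng.normal(scale=0.5, size=n)) > 0.5).astype(int)

target_model = RandomForestClassifier(random_state=0)
target_model.fit(np.column_stack([X, s]), y)  # trained with the sensitive attribute

# --- Attack features: query the model with both candidate sensitive values ---
def attack_features(model, X_part):
    p0 = model.predict_proba(np.column_stack([X_part, np.zeros(len(X_part))]))[:, 1]
    p1 = model.predict_proba(np.column_stack([X_part, np.ones(len(X_part))]))[:, 1]
    return np.column_stack([X_part, p0, p1])

# Auxiliary data: another set of people whose sensitive values are known.
X_aux, s_aux = X[:1000], s[:1000]
attack_model = RandomForestClassifier(random_state=0)
attack_model.fit(attack_features(target_model, X_aux), s_aux)

# --- Infer the sensitive attribute for held-out target individuals ---
X_tgt, s_tgt = X[1000:], s[1000:]
guess = attack_model.predict(attack_features(target_model, X_tgt))
print("attack accuracy:", (guess == s_tgt).mean())
```

In the mitigation experiment the paper reports on, the target model would instead be fit on records synthesized from the original training data; in this sketch that corresponds to replacing the inputs of the first fit call with synthetic records while leaving the attack unchanged.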

Original language: English
Title of host publication: Privacy in Statistical Databases - International Conference, PSD 2022, Proceedings
Editors: Josep Domingo-Ferrer, Maryline Laurent
Publisher: Springer
Pages: 283-296
Number of pages: 14
ISBN (Print): 9783031139444
DOIs
Publication status: Published - 2022
Event: International Conference on Privacy in Statistical Databases, PSD 2022 - Paris, France
Duration: 21 Sept 2022 – 23 Sept 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13463 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Conference on Privacy in Statistical Databases, PSD 2022
Country/Territory: France
City: Paris
Period: 21/09/22 – 23/09/22

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care
Otherwise, as indicated in the copyright section: the publisher is the copyright holder of this work, and the author uses Dutch legislation to make this work public.

Keywords

  • Attribute inference
  • Machine learning
  • Propensity to move
  • Synthetic data

