Federated Geometric Monte Carlo Clustering to Counter Non-IID Datasets

Federico Lucchetti; Maria Fernandes; Lydia Y. Chen; J.E.A.P. Decouchant; Marcus Völp

Data-Intensive Systems

Research output: Working paper/Preprint › Preprint

Abstract

Federated learning allows clients to collaboratively
train models on datasets that are acquired in different locations
and that cannot be exchanged because of their size or regulations.
Such collected data is increasingly non-independent and
non-identically distributed (non-IID), negatively affecting training
accuracy. Previous works tried to mitigate the effects of non-IID
datasets on training accuracy, focusing mainly on non-IID
labels; however, practical datasets often also contain non-IID
features. To address both non-IID labels and features, we propose
FedGMCC, a novel framework where a central server aggregates
client models that it can cluster together. FedGMCC clustering relies
on a Monte Carlo procedure that samples the output space of
client models, infers their position in the weight space on a loss
manifold and computes their geometric connection via an affine
curve parametrization. FedGMCC aggregates connected models
along their path connectivity to produce a richer global model,
incorporating knowledge of all connected client models. FedGMCC
outperforms FedAvg and FedProx in terms of convergence rates
on the EMNIST62 dataset and a genomic sequence classification dataset
(by up to +63%). FedGMCC yields an improved accuracy (+4%)
on the genomic dataset with respect to CFL, in settings with highly
non-IID feature spaces and label incongruency.

Original language	English
Publication status	Published - 23 Apr 2022

Access to Document

https://arxiv.org/pdf/2204.11017.pdf

Cite this

@techreport{1aabf1c546c34217a20ec45c6506c491,

title = "Federated Geometric Monte Carlo Clustering to Counter Non-IID Datasets",

abstract = "Federated learning allows clients to collaboratively train models on datasets that are acquired in different locations and that cannot be exchanged because of their size or regulations. Such collected data is increasingly non-independent and non-identically distributed (non-IID), negatively affecting training accuracy. Previous works tried to mitigate the effects of non-IID datasets on training accuracy, focusing mainly on non-IID labels; however, practical datasets often also contain non-IID features. To address both non-IID labels and features, we propose FedGMCC, a novel framework where a central server aggregates client models that it can cluster together. FedGMCC clustering relies on a Monte Carlo procedure that samples the output space of client models, infers their position in the weight space on a loss manifold and computes their geometric connection via an affine curve parametrization. FedGMCC aggregates connected models along their path connectivity to produce a richer global model, incorporating knowledge of all connected client models. FedGMCC outperforms FedAvg and FedProx in terms of convergence rates on the EMNIST62 dataset and a genomic sequence classification dataset (by up to +63%). FedGMCC yields an improved accuracy (+4%) on the genomic dataset with respect to CFL, in settings with highly non-IID feature spaces and label incongruency.",

author = "Federico Lucchetti and Maria Fernandes and Chen, {Lydia Y.} and J.E.A.P. Decouchant and Marcus V{\"o}lp",

year = "2022",

month = apr,

day = "23",

language = "English",

type = "WorkingPaper",

}

TY - UNPB

T1 - Federated Geometric Monte Carlo Clustering to Counter Non-IID Datasets

AU - Lucchetti, Federico

AU - Fernandes, Maria

AU - Chen, Lydia Y.

AU - Decouchant, J.E.A.P.

AU - Völp, Marcus

PY - 2022/4/23

Y1 - 2022/4/23

N2 - Federated learning allows clients to collaboratively train models on datasets that are acquired in different locations and that cannot be exchanged because of their size or regulations. Such collected data is increasingly non-independent and non-identically distributed (non-IID), negatively affecting training accuracy. Previous works tried to mitigate the effects of non-IID datasets on training accuracy, focusing mainly on non-IID labels; however, practical datasets often also contain non-IID features. To address both non-IID labels and features, we propose FedGMCC, a novel framework where a central server aggregates client models that it can cluster together. FedGMCC clustering relies on a Monte Carlo procedure that samples the output space of client models, infers their position in the weight space on a loss manifold and computes their geometric connection via an affine curve parametrization. FedGMCC aggregates connected models along their path connectivity to produce a richer global model, incorporating knowledge of all connected client models. FedGMCC outperforms FedAvg and FedProx in terms of convergence rates on the EMNIST62 dataset and a genomic sequence classification dataset (by up to +63%). FedGMCC yields an improved accuracy (+4%) on the genomic dataset with respect to CFL, in settings with highly non-IID feature spaces and label incongruency.

AB - Federated learning allows clients to collaboratively train models on datasets that are acquired in different locations and that cannot be exchanged because of their size or regulations. Such collected data is increasingly non-independent and non-identically distributed (non-IID), negatively affecting training accuracy. Previous works tried to mitigate the effects of non-IID datasets on training accuracy, focusing mainly on non-IID labels; however, practical datasets often also contain non-IID features. To address both non-IID labels and features, we propose FedGMCC, a novel framework where a central server aggregates client models that it can cluster together. FedGMCC clustering relies on a Monte Carlo procedure that samples the output space of client models, infers their position in the weight space on a loss manifold and computes their geometric connection via an affine curve parametrization. FedGMCC aggregates connected models along their path connectivity to produce a richer global model, incorporating knowledge of all connected client models. FedGMCC outperforms FedAvg and FedProx in terms of convergence rates on the EMNIST62 dataset and a genomic sequence classification dataset (by up to +63%). FedGMCC yields an improved accuracy (+4%) on the genomic dataset with respect to CFL, in settings with highly non-IID feature spaces and label incongruency.

M3 - Preprint

BT - Federated Geometric Monte Carlo Clustering to Counter Non-IID Datasets

ER -
