TY - CONF
T1 - Spyker
T2 - 25th ACM International Middleware Conference, Middleware 2024
AU - Zuo, Yuncong
AU - Cox, Bart
AU - Chen, Lydia Y.
AU - Decouchant, Jérémie
PY - 2024
AB - Federated learning (FL) systems enable multiple clients to iteratively train a machine learning model by synchronously exchanging intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time due to synchronous communication and the risk of the single server becoming a bottleneck. In this paper, we propose a new FL architecture, Spyker, the first multi-server FL system that is entirely asynchronous and therefore addresses both limitations simultaneously. Spyker keeps servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, which ensures that their updates are integrated into the model efficiently. Unlike previous methods, however, servers also update one another periodically and asynchronously, and never postpone their interactions with clients. We compare Spyker to three representative baselines (FedAvg, FedAsync, and HierFAVG) on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Spyker converges to similar or higher accuracy levels than the baselines and, in geo-distributed settings, requires 61% less time to do so.
KW - Asynchronous Learning
KW - Byzantine Learning
KW - Resource Heterogeneity
UR - http://www.scopus.com/inward/record.url?scp=85215536658&partnerID=8YFLogxK
DO - 10.1145/3652892.3700778
M3 - Conference contribution
AN - SCOPUS:85215536658
SP - 367
EP - 378
BT - Middleware 2024 - Proceedings of the 25th ACM International Middleware Conference
PB - ACM
Y2 - 2 December 2024 through 6 December 2024
ER -