Accelerating local SGD for non-IID data using variance reduction

Bibliographic Details
Published in: Frontiers of Computer Science 2023-04, Vol. 17 (2), p. 172311, Article 172311
Main Authors: Liang, Xianfeng; Shen, Shuheng; Chen, Enhong; Liu, Jinchang; Liu, Qi; Cheng, Yifei; Pan, Zhen
Format: Article
Language:English
Summary: Distributed stochastic gradient descent and its variants, which run multiple workers in parallel, have been widely adopted for training machine learning models. Among them, local-based algorithms, including Local SGD and FedAvg, have attracted much attention due to their superior properties, such as low communication cost and privacy preservation. Nevertheless, when the data distribution across workers is non-identical, local-based algorithms suffer a significant degradation in convergence rate. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to deal with heterogeneous data. Without extra communication cost, VRL-SGD reduces the gradient variance among workers caused by heterogeneous data, and thus prevents the slow convergence that local-based algorithms otherwise exhibit. Moreover, we present VRL-SGD-W, which adds an effective warm-up mechanism for scenarios where the data among workers are highly diverse. Benefiting from eliminating the impact of such heterogeneous data, we theoretically prove that VRL-SGD achieves a linear iteration speedup with lower communication complexity even if workers access non-identical datasets. We conduct experiments on three machine learning tasks. The results demonstrate that VRL-SGD performs significantly better than Local SGD on heterogeneous data and that VRL-SGD-W is much more robust under high data variance among workers.
ISSN: 2095-2228; 2095-2236
DOI: 10.1007/s11704-021-1018-0
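
The summary describes the method only at a high level. As a rough illustration, the sketch below shows the general variance-reduction idea on a toy heterogeneous least-squares problem; it is hypothetical and not the authors' exact VRL-SGD update rule, and all names and constants in it are illustrative. Each worker runs local steps with a corrected gradient, and its correction term is refreshed from the worker's own drift at each synchronization, so nothing beyond the averaged model needs to be communicated.

    # Hypothetical illustration -- not the authors' exact VRL-SGD update rule.
    # Variance-reduced local SGD on a toy heterogeneous least-squares problem:
    # worker k runs local steps with the corrected gradient g - c[k], and c[k]
    # is refreshed from the worker's drift at each synchronization, so only
    # the averaged model is communicated.
    import numpy as np

    rng = np.random.default_rng(0)
    n_workers, dim, local_steps, rounds, lr = 4, 10, 20, 50, 0.05

    # Non-IID data: worker k minimizes 0.5 * ||x - b[k]||^2 with its own b[k],
    # so local optima disagree; the global optimum is the mean of the b[k].
    b = [rng.normal(loc=3.0 * k, size=dim) for k in range(n_workers)]

    def local_grad(k, x):
        return x - b[k]

    x_global = np.zeros(dim)
    c = [np.zeros(dim) for _ in range(n_workers)]   # per-worker corrections

    for _ in range(rounds):
        x_local = [x_global.copy() for _ in range(n_workers)]
        for _ in range(local_steps):
            for k in range(n_workers):
                x_local[k] -= lr * (local_grad(k, x_local[k]) - c[k])
        new_global = np.mean(x_local, axis=0)
        # Refresh each correction from the gap between the new average and
        # the worker's final local iterate; both are already known locally,
        # so this costs no extra communication.
        for k in range(n_workers):
            c[k] += (new_global - x_local[k]) / (lr * local_steps)
        x_global = new_global

    print("distance to optimum:", np.linalg.norm(x_global - np.mean(b, axis=0)))

With the corrections held at zero this reduces to plain Local SGD, whose local iterates drift toward each worker's own optimum. The corrections converge toward each worker's gradient heterogeneity and cancel that drift, which is the effect the summary attributes to VRL-SGD.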