Differentially Quantized Gradient Methods
Published in: IEEE Transactions on Information Theory, 2022-09, Vol. 68 (9), pp. 6078-6097
Main Authors:
Format: Article
Language: English
Summary: Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients, while a server decides when to stop the iterative computation based on its target accuracy or delay constraints. The server receives all of its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call differential quantization (DQ), which prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any rate $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$ bits per dimension, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD, in which the worker directly quantizes the gradient, attains only the contraction factor $\sigma_{\mathrm{GD}} + \rho_n 2^{-R}$. The principle of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy ball method.
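To make the DQ principle concrete, below is a minimal Python sketch contrasting differentially quantized GD with naively quantized GD on a strongly convex quadratic. Everything in it is an illustrative assumption rather than the paper's construction: the uniform scalar quantizer, the geometrically shrinking dynamic range, the quadratic objective, and all parameter values are stand-ins for the covering-efficient quantizers and the exact DQ-GD recursion analyzed in the article. The sketch only shows the stated principle, i.e., how feeding back past quantization errors keeps the server's iterates close to the unquantized GD trajectory.

```python
import numpy as np

# Hypothetical illustration of the differential quantization (DQ) principle:
# compensate past quantization errors so that the server's iterates track the
# unquantized GD trajectory. The scalar quantizer and shrinking dynamic range
# are simplified stand-ins for the quantizers analyzed in the paper.

def quantize(v, radius, bits_per_dim):
    """Uniform scalar quantizer: clip each coordinate to [-radius, radius]
    and round it to one of 2**bits_per_dim levels."""
    levels = 2 ** bits_per_dim
    step = 2.0 * radius / levels
    clipped = np.clip(v, -radius, radius - 1e-12)
    return (np.floor((clipped + radius) / step) + 0.5) * step - radius

def dq_gd(grad, x0, eta, steps, bits, radius, shrink):
    """DQ-GD sketch: the worker also runs the unquantized GD trajectory x_bar
    and quantizes the message that would move the server's iterate x onto it,
    so past quantization errors are compensated at every step."""
    x, x_bar = x0.copy(), x0.copy()
    for _ in range(steps):
        x_bar_next = x_bar - eta * grad(x_bar)   # unquantized GD step
        u = (x - x_bar_next) / eta               # ideal (unquantized) message
        x = x - eta * quantize(u, radius, bits)  # server's rate-limited update
        x_bar = x_bar_next
        radius *= shrink                         # shrink the quantizer range
    return x

def naive_q_gd(grad, x0, eta, steps, bits, radius, shrink):
    """Naively quantized GD: the worker quantizes the raw gradient, so past
    quantization errors are never compensated."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * quantize(grad(x), radius, bits)
        radius *= shrink
    return x

if __name__ == "__main__":
    # f(x) = 0.5 * x^T A x, smooth and strongly convex with condition number 10;
    # the minimizer is x* = 0.
    n = 20
    A = np.diag(np.linspace(1.0, 10.0, n))
    grad = lambda x: A @ x
    x0 = np.ones(n)
    eta = 2.0 / (1.0 + 10.0)                     # standard GD step size 2/(mu + L)
    for name, algo in [("DQ-GD", dq_gd), ("naive Q-GD", naive_q_gd)]:
        x = algo(grad, x0, eta, steps=60, bits=6, radius=20.0, shrink=0.9)
        print(f"{name:10s}  ||x - x*|| = {np.linalg.norm(x):.2e}")
```

The key line in the sketch is the construction of `u`: it equals the gradient along the unquantized trajectory plus the accumulated quantization error scaled by `1/eta`, which is one natural instantiation of the error compensation described in the summary; in the naive scheme that correction term is absent, so quantization errors pile up across iterations.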
ISSN: 0018-9448, 1557-9654
DOI: 10.1109/TIT.2022.3171173