Differentially Quantized Gradient Methods

Bibliographic Details
Published in: IEEE Transactions on Information Theory, 2022-09, Vol. 68 (9), pp. 6078-6097
Main Authors: Lin, Chung-Yi; Kostina, Victoria; Hassibi, Babak
Format: Article
Language: English
Description
Summary: Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients, while a server decides when to stop the iterative computation based on its target accuracy or delay constraints. The server receives all its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call differential quantization (DQ), which prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD, where the worker directly quantizes the gradient, barely attains $\sigma_{\mathrm{GD}} + \rho_n 2^{-R}$. The principle of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy ball method.
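
The error-compensation idea behind differential quantization can be illustrated with a minimal sketch. The Python snippet below is not the paper's construction: the uniform scalar quantizer, step size, and toy quadratic are illustrative assumptions standing in for the lattice covering quantizer and the exact DQ-GD scheme analyzed by the authors, and a single loop plays both the worker and server roles for simplicity.

```python
import numpy as np

def quantize(v, rate_bits, v_max=10.0):
    """Uniform scalar quantizer with 2**rate_bits levels per coordinate on
    [-v_max, v_max]; a simple stand-in for the covering-efficient quantizer
    assumed in the paper."""
    step = 2.0 * v_max / (2 ** rate_bits)
    return np.clip(np.round(v / step) * step, -v_max, v_max)

def dq_gd(grad, x0, eta, rate_bits, iters):
    """Differentially quantized GD sketch: the worker adds the carried-over
    quantization error before quantizing, steering the quantized trajectory
    toward the unquantized one."""
    x = x0.copy()
    err = np.zeros_like(x0)                   # accumulated quantization error
    for _ in range(iters):
        g = grad(x)                           # worker computes the true gradient
        q = quantize(g + err, rate_bits)      # compensate past error, then quantize
        err = (g + err) - q                   # carry the new error forward
        x = x - eta * q                       # server updates with the quantized message
    return x

def naive_qgd(grad, x0, eta, rate_bits, iters):
    """Naively quantized GD: each gradient is quantized independently."""
    x = x0.copy()
    for _ in range(iters):
        x = x - eta * quantize(grad(x), rate_bits)
    return x

if __name__ == "__main__":
    # Illustrative smooth, strongly convex quadratic f(x) = 0.5 * x^T A x (minimizer at 0).
    rng = np.random.default_rng(0)
    A = np.diag(np.linspace(1.0, 4.0, 8))
    grad = lambda x: A @ x
    x0 = rng.standard_normal(8)
    eta = 2.0 / (1.0 + 4.0)                   # standard GD step size 2/(mu + L)
    for R in (2, 4):
        xd = dq_gd(grad, x0, eta, R, iters=50)
        xn = naive_qgd(grad, x0, eta, R, iters=50)
        print(f"R={R}: |x_DQ|={np.linalg.norm(xd):.2e}  |x_naive|={np.linalg.norm(xn):.2e}")
```

Running this at a small bitrate typically shows the error-compensated loop continuing to contract toward the minimizer while the naive loop stalls at a quantization-limited floor, which is the qualitative gap between $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$ and $\sigma_{\mathrm{GD}} + \rho_n 2^{-R}$ described in the summary.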
ISSN: 0018-9448, 1557-9654
DOI: 10.1109/TIT.2022.3171173