Differentially Quantized Gradient Methods
Published in: IEEE Transactions on Information Theory, 2022-09, Vol. 68 (9), pp. 6078-6097
Main Authors:
Format: Article
Language: English
Summary: Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients, while a server decides when to stop the iterative computation based on its target accuracy or delay constraints. The server receives all of its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce the principle we call differential quantization (DQ), which prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that differentially quantized gradient descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any rate $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$ bits per dimension, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. In contrast, naively quantized GD, in which the worker directly quantizes the gradient, attains only the contraction factor $\sigma_{\mathrm{GD}} + \rho_n 2^{-R}$. The principle of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy ball method.
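To make the DQ principle concrete, below is a minimal Python sketch contrasting differentially quantized GD with naively quantized GD on a strongly convex quadratic. Everything in it is an illustrative assumption rather than the paper's construction: the uniform scalar quantizer, the geometrically shrinking dynamic range, the quadratic objective, and all parameter values are stand-ins for the covering-efficient quantizers and the exact DQ-GD recursion analyzed in the article. The sketch only shows the stated principle, i.e., how feeding back past quantization errors keeps the server's iterates close to the unquantized GD trajectory.

```python
import numpy as np

# Hypothetical illustration of the differential quantization (DQ) principle:
# compensate past quantization errors so that the server's iterates track the
# unquantized GD trajectory. The scalar quantizer and shrinking dynamic range
# are simplified stand-ins for the quantizers analyzed in the paper.

def quantize(v, radius, bits_per_dim):
    """Uniform scalar quantizer: clip each coordinate to [-radius, radius]
    and round it to one of 2**bits_per_dim levels."""
    levels = 2 ** bits_per_dim
    step = 2.0 * radius / levels
    clipped = np.clip(v, -radius, radius - 1e-12)
    return (np.floor((clipped + radius) / step) + 0.5) * step - radius

def dq_gd(grad, x0, eta, steps, bits, radius, shrink):
    """DQ-GD sketch: the worker also runs the unquantized GD trajectory x_bar
    and quantizes the message that would move the server's iterate x onto it,
    so past quantization errors are compensated at every step."""
    x, x_bar = x0.copy(), x0.copy()
    for _ in range(steps):
        x_bar_next = x_bar - eta * grad(x_bar)   # unquantized GD step
        u = (x - x_bar_next) / eta               # ideal (unquantized) message
        x = x - eta * quantize(u, radius, bits)  # server's rate-limited update
        x_bar = x_bar_next
        radius *= shrink                         # shrink the quantizer range
    return x

def naive_q_gd(grad, x0, eta, steps, bits, radius, shrink):
    """Naively quantized GD: the worker quantizes the raw gradient, so past
    quantization errors are never compensated."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * quantize(grad(x), radius, bits)
        radius *= shrink
    return x

if __name__ == "__main__":
    # f(x) = 0.5 * x^T A x, smooth and strongly convex with condition number 10;
    # the minimizer is x* = 0.
    n = 20
    A = np.diag(np.linspace(1.0, 10.0, n))
    grad = lambda x: A @ x
    x0 = np.ones(n)
    eta = 2.0 / (1.0 + 10.0)                     # standard GD step size 2/(mu + L)
    for name, algo in [("DQ-GD", dq_gd), ("naive Q-GD", naive_q_gd)]:
        x = algo(grad, x0, eta, steps=60, bits=6, radius=20.0, shrink=0.9)
        print(f"{name:10s}  ||x - x*|| = {np.linalg.norm(x):.2e}")
```

The key line in the sketch is the construction of `u`: it equals the gradient along the unquantized trajectory plus the accumulated quantization error scaled by `1/eta`, which is one natural instantiation of the error compensation described in the summary; in the naive scheme that correction term is absent, so quantization errors pile up across iterations.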
ISSN: 0018-9448, 1557-9654
DOI: 10.1109/TIT.2022.3171173