Low-Rank Gradient Descent
Published in: IEEE Open Journal of Control Systems, 2023, Vol. 2, pp. 380-395
Format: Article
Language: English
Summary: Several recent empirical studies demonstrate that important machine learning tasks, such as training deep neural networks, exhibit low-rank structure: most of the variation in the loss function occurs along only a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed Low-Rank Gradient Descent (LRGD) algorithm finds an ε-approximate stationary point of a p-dimensional function by first identifying r ≤ p significant directions, and then estimating the true p-dimensional gradient at every iteration by computing directional derivatives only along those r directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are O(r log(1/ε) + rp) and O(r/ε² + rp), respectively. Therefore, when r ≪ p, LRGD significantly improves on the known complexities of O(p log(1/ε)) and O(p/ε²) for GD in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, on real and synthetic data, LRGD provides significant gains over GD when the data has low-rank structure, and in the absence of such structure, LRGD does not degrade performance relative to GD. This suggests that LRGD could be used in place of GD in any practical setting.
ISSN: 2694-085X
DOI: 10.1109/OJCSYS.2023.3315088
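The two-phase procedure described in the summary can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the direction-finding heuristic (SVD of a few finite-difference gradients at random probe points) and all names and parameters here are assumptions, since the abstract does not specify how the r significant directions are chosen.

```python
import numpy as np

def fd_grad(f, x, eps=1e-6):
    """Full gradient estimate via p coordinate directional derivatives
    (forward finite differences) -- p directional-oracle calls."""
    p = x.size
    fx = f(x)
    return np.array([(f(x + eps * np.eye(p)[i]) - fx) / eps for i in range(p)])

def lrgd(f, x0, r, lr=0.1, iters=200, n_probe=4, fd_eps=1e-6, seed=0):
    """Sketch of Low-Rank Gradient Descent (LRGD) per the abstract:
    identify r significant directions, then at each iteration estimate
    the p-dimensional gradient from only r directional derivatives."""
    rng = np.random.default_rng(seed)
    p = x0.size
    # Phase 1 (roughly the O(rp) term): estimate an active subspace from
    # gradients at a few random probe points. Heuristic choice, assumed here.
    G = np.stack([fd_grad(f, x0 + rng.standard_normal(p)) for _ in range(n_probe)])
    U = np.linalg.svd(G, full_matrices=False)[2][:r].T  # p x r orthonormal basis
    x = x0.copy()
    for _ in range(iters):
        # Phase 2: r directional derivatives along the columns of U,
        # reassembled into an approximate gradient in span(U).
        d = np.array([(f(x + fd_eps * U[:, j]) - f(x)) / fd_eps for j in range(r)])
        x = x - lr * (U @ d)
    return x
```

For a function whose variation is confined to a few coordinates, e.g. f(x) = x₀² + 3x₁² in p = 10 dimensions, `lrgd(f, np.ones(10), r=2)` drives the gradient to near zero while paying only 2 directional derivatives per iteration instead of 10.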