
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Bibliographic Details
Published in: arXiv.org 2024-02
Main Authors: Lau, Tim Tsz-Kit, Liu, Han, Kolar, Mladen
Format: Article
Language: English
Description
Summary: The choice of batch sizes in stochastic gradient optimizers is critical for model training. However, the practice of varying batch sizes throughout the training process is less explored compared to other hyperparameters. We investigate adaptive batch size strategies derived from adaptive sampling methods, traditionally applied only in stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a rate of \(\mathscr{O}(1/K)\) for finding a first-order stationary point of smooth nonconvex functions within \(K\) iterations. AdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. Our theoretical claims are supported by numerical experiments on various image classification tasks, highlighting the enhanced adaptability of progressive batching protocols in deep learning and the potential of such adaptive batch size strategies with adaptive gradient optimizers in large-scale model training.
ISSN: 2331-8422
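
To make the summary above concrete, the following is a minimal sketch of the general idea it describes: scalar AdaGradNorm updates combined with an adaptive-sampling test that gradually grows the batch size. The sketch uses a synthetic least-squares problem, a norm-test-style criterion, and illustrative constants (eta, theta, the batch-doubling rule); these are assumptions for illustration only and do not reproduce the paper's AdAdaGrad/AdAdaGradNorm criteria or their coordinate-wise variant.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)

def per_sample_grads(x, idx):
    # Gradients of 0.5 * (a_i^T x - y_i)^2 for each sampled index i.
    residuals = A[idx] @ x - y[idx]        # shape (batch,)
    return residuals[:, None] * A[idx]     # shape (batch, d)

x = np.zeros(d)
batch, batch_max = 8, 1024   # initial / maximum batch size (illustrative values)
eta = 1.0                    # AdaGradNorm base step size (illustrative)
theta = 1.0                  # norm-test tolerance (illustrative)
accum = 1e-8                 # running sum of squared stochastic gradient norms

for k in range(500):
    idx = rng.choice(n, size=batch, replace=False)
    G = per_sample_grads(x, idx)
    g = G.mean(axis=0)                     # mini-batch gradient estimate

    # AdaGradNorm update: a single scalar adaptive step size driven by the
    # accumulated squared norms of the stochastic gradients.
    accum += g @ g
    x -= eta / np.sqrt(accum) * g

    # Norm-test-style check (a classical adaptive sampling criterion):
    # enlarge the batch when the sample variance of the gradient estimate is
    # large relative to its squared norm.
    variance = G.var(axis=0, ddof=1).sum() / batch
    if variance > theta * (g @ g) and batch < batch_max:
        batch = min(2 * batch, batch_max)  # geometric increase (an assumption)

print("final batch size:", batch)
print("final loss:", 0.5 * np.mean((A @ x - y) ** 2))

Running the sketch shows the qualitative behavior the abstract describes: the batch size starts small and increases as training progresses, while the AdaGradNorm step size shrinks automatically with the accumulated gradient norms.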