
Variance-Reduced Deep Actor-Critic With an Optimally Subsampled Actor Recursion

Bibliographic Details
Published in: IEEE Transactions on Artificial Intelligence, 2024-07, Vol. 5 (7), p. 3607-3623
Main Authors: Mandal, Lakshmi; Diddigi, Raghuram Bharadwaj; Bhatnagar, Shalabh
Format: Article
Language: English
Description
Summary: Reinforcement learning (RL) algorithms combined with deep learning architectures have achieved tremendous success in many practical applications. However, the policies obtained by many deep reinforcement learning (DRL) algorithms are seen to suffer from high variance, making them less useful in safety-critical applications. In general, it is desirable to have algorithms that give a low iterate variance while providing a high long-term reward. In this work, we consider the actor-critic (AC) paradigm, where the critic is responsible for evaluating the policy while the feedback from the critic is used by the actor for updating the policy. The updates of both the critic and the actor in the standard AC procedure are run concurrently until convergence. It has previously been observed that updating the actor only once after every L > 1 steps of the critic reduces the iterate variance. In this article, we address the question of which value of L is optimal to use in the recursions and propose a data-driven L-update rule that runs concurrently with the AC algorithm, with the objective of minimizing the variance of the infinite-horizon discounted reward. This update is based on a random search (discrete) parameter optimization procedure that incorporates smoothed functional (SF) estimates. We prove the convergence of our proposed multi-timescale scheme to the optimal (L, policy) pair. Subsequently, through numerical evaluations on benchmark RL tasks, we demonstrate the advantages of our proposed algorithm over multiple state-of-the-art algorithms in the literature.
ISSN: 2691-4581
DOI: 10.1109/TAI.2024.3379109
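
The subsampled actor recursion described in the summary above can be made concrete with a small sketch. The following is a minimal illustration, not the authors' implementation: it runs tabular TD(0) actor-critic on an assumed toy two-state MDP and updates the actor only once every L critic steps, then compares the average reward and the actor-iterate variance for a few fixed values of L. The environment, step sizes, and the run_actor_critic helper are illustrative assumptions, and the article's data-driven SF-based search over L is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9

# Toy MDP (assumed for illustration): P[s, a] is the next-state distribution,
# R[s, a] the immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_actor_critic(L, n_steps=20000, alpha=0.05, beta=0.01):
    """TD(0) actor-critic in which the actor is updated once every L critic steps.

    Returns the average reward over the run and the mean variance of the
    actor (policy-parameter) iterates, the quantity the subsampling targets.
    """
    theta = np.zeros((N_STATES, N_ACTIONS))  # actor: softmax policy parameters
    V = np.zeros(N_STATES)                   # critic: state-value estimates
    s = 0
    rewards, actor_iterates = [], []
    for t in range(1, n_steps + 1):
        pi = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=pi)
        s_next = rng.choice(N_STATES, p=P[s, a])
        r = R[s, a]
        # Critic recursion: TD(0) update at every step.
        delta = r + GAMMA * V[s_next] - V[s]
        V[s] += alpha * delta
        # Actor recursion: policy-gradient step only once every L critic steps.
        if t % L == 0:
            grad_log_pi = -pi
            grad_log_pi[a] += 1.0          # gradient of log softmax policy at (s, a)
            theta[s] += beta * delta * grad_log_pi
            actor_iterates.append(theta.copy())
        rewards.append(r)
        s = s_next
    iterate_var = np.var(np.stack(actor_iterates), axis=0).mean()
    return np.mean(rewards), iterate_var

if __name__ == "__main__":
    for L in (1, 5, 10):
        avg_r, var = run_actor_critic(L)
        print(f"L={L:2d}  avg reward={avg_r:.3f}  actor-iterate variance={var:.6f}")
```

For L = 1 this reduces to the standard concurrent actor-critic; larger L typically lowers the actor-iterate variance at the cost of slower policy improvement, which is the trade-off the article's data-driven L-update rule is designed to balance online.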