
Online Model-Free n-Step HDP With Stability Analysis


Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-04, Vol. 31 (4), p. 1255-1269
Main Authors: Al-Dabooni, Seaar; Wunsch, Donald C.
Format: Article
Language:English
Description
Summary: Building on the powerful temporal-difference (TD) learning method with λ [TD(λ)], this paper presents a novel n-step adaptive dynamic programming (ADP) architecture that combines TD(λ) with regular TD learning to solve optimal control problems in fewer iterations. In contrast with the backward-view learning of TD(λ), which requires an extra parameter, the eligibility trace, to update at the end of each episode (offline training), the new design uses forward-view learning, which updates at each time step (online training) without the eligibility trace parameter and can be used in various applications without mathematical models. The new design is therefore called online model-free n-step action-dependent (AD) heuristic dynamic programming [NSHDP(λ)]. NSHDP(λ) has three neural networks: the critic network (CN) with regular one-step TD [TD(0)], the CN with n-step TD learning [or TD(λ)], and the actor network (AN). Because forward-view learning does not require extra eligibility traces associated with each state, the NSHDP(λ) architecture has low computational cost and is memory efficient. Furthermore, stability is proven for NSHDP(λ) under certain conditions by using Lyapunov analysis to obtain the uniformly ultimately bounded (UUB) property. We compare the results with the performance of HDP and traditional action-dependent HDP(λ) [ADHDP(λ)] with different
ISSN: 2162-237X, 2162-2388
DOI:10.1109/TNNLS.2019.2919614
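
The summary contrasts one-step TD [TD(0)] targets with n-step and forward-view TD(λ) targets. As a rough illustration only, the sketch below computes these targets in the generic textbook form; it is not the NSHDP(λ) update from the paper, whose exact equations the summary does not give. The function names, the `rewards`/`values` arguments, and the truncation of the λ-return at n steps are all assumptions made for this example.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of the given rewards plus a discounted bootstrap estimate.

    With a single reward this is the ordinary one-step TD(0) target.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def truncated_lambda_return(rewards, values, gamma, lam):
    """Forward-view lambda-return truncated at n = len(rewards) steps.

    values[k] is a (hypothetical) critic estimate for the state reached
    after k + 1 steps. No eligibility traces are stored; the target is
    built only from the next n rewards and critic estimates.
    """
    n = len(rewards)
    total = 0.0
    for k in range(1, n + 1):
        g_k = n_step_return(rewards[:k], values[k - 1], gamma)
        # Weights (1 - lam) * lam**(k - 1) for k < n; the last term takes
        # the remaining mass lam**(n - 1) so that the weights sum to 1.
        weight = (1.0 - lam) * lam ** (k - 1) if k < n else lam ** (n - 1)
        total += weight * g_k
    return total


rewards = [1.0, 1.0, 1.0]   # made-up rewards for three steps
values = [0.0, 0.0, 0.0]    # made-up critic estimates
one_step = truncated_lambda_return(rewards, values, gamma=0.5, lam=0.0)
three_step = truncated_lambda_return(rewards, values, gamma=0.5, lam=1.0)
blended = truncated_lambda_return(rewards, values, gamma=0.5, lam=0.5)
```

Setting `lam=0.0` recovers the one-step target and `lam=1.0` recovers the plain n-step return, which is one way to see how a single forward-view target can interpolate between the two critics' learning signals described in the summary.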