Uncorrected Least-Squares Temporal Difference with Lambda-Return

Bibliographic Details
Main Author: Osogami, Takayuki
Format: Conference Proceeding
Language: English
Description
Summary: Temporal difference, TD(λ), learning is a foundation of reinforcement learning and is also of interest in its own right for the task of prediction. Recently, true online TD(λ) has been shown to closely approximate the “forward view” at every step, while conventional TD(λ) does so only at the end of an episode. We re-examine least-squares temporal difference, LSTD(λ), which has been derived from conventional TD(λ). We design Uncorrected LSTD(λ) in such a way that, when λ = 1, Uncorrected LSTD(1) is equivalent to the least-squares method for the linear regression of the Monte Carlo (MC) return at every step, while conventional LSTD(1) has this equivalence only at the end of an episode, since the MC return is corrected to be unbiased. We prove that Uncorrected LSTD(λ) can have smaller variance than conventional LSTD(λ), which allows Uncorrected LSTD(λ) to sometimes outperform conventional LSTD(λ) in practice. When λ = 0, however, Uncorrected LSTD(0) is not equivalent to LSTD. We therefore also propose Mixed LSTD(λ), which mixes the two LSTD(λ)s in such a way that it matches conventional LSTD(λ) at λ = 0 and Uncorrected LSTD(λ) at λ = 1. In numerical experiments, we study how the three LSTD(λ)s behave under limited training data.
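For readers who want the baseline the abstract builds on, the following is a minimal sketch of conventional LSTD(λ) with linear features, in Python with NumPy. The function name, the (phi, r, phi_next) data layout, and the ridge term eps are illustrative assumptions, not the paper's notation; the Uncorrected and Mixed variants proposed in the paper differ in how the λ-return target is formed, and their exact recursions are given in the full text.

import numpy as np

def lstd_lambda(episodes, gamma=0.99, lam=0.9, eps=1e-6):
    # Conventional LSTD(lambda) with linear features, V(s) ~ theta @ phi(s).
    # episodes: list of trajectories, each a list of (phi, r, phi_next),
    # with phi_next the zero vector on terminal transitions.
    d = len(episodes[0][0][0])
    A = eps * np.eye(d)              # small ridge term so A is invertible
    b = np.zeros(d)
    for episode in episodes:
        z = np.zeros(d)              # eligibility trace, reset each episode
        for phi, r, phi_next in episode:
            z = gamma * lam * z + phi                  # accumulate trace
            A += np.outer(z, phi - gamma * phi_next)   # TD feature difference
            b += z * r
    return np.linalg.solve(A, b)     # theta solving A theta = b

With λ = 1 and full episodes processed to termination, solving this system coincides with regressing features onto MC returns at the end of each episode, which is the end-of-episode equivalence the abstract contrasts with the every-step equivalence of Uncorrected LSTD(1).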
ISSN: 2159-5399, 2374-3468
DOI: 10.1609/aaai.v34i04.5979