Sublinear regret for learning POMDPs

Bibliographic Details
Published in: Production and Operations Management, 2022-09, Vol. 31 (9), pp. 3491-3504
Main Authors: Xiong, Yi; Chen, Ningyuan; Gao, Xuefeng; Zhou, Xiang
Format: Article
Language: English
Description
Summary: We study model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
ISSN: 1059-1478, 1937-5956
DOI: 10.1111/poms.13778
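The regret notion in the summary above can be written out as a short worked formulation. This is a minimal sketch under standard average-reward conventions; the symbols $\rho^{\ast}$ (the optimal long-run average reward of the known-environment oracle) and $r_t$ (the reward collected by the learning algorithm at step $t$) are assumptions of this note, not notation taken from the article itself:

$$R_T \;=\; T\rho^{\ast} \;-\; \sum_{t=1}^{T} r_t, \qquad \mathbb{E}[R_T] \;=\; O\!\big(T^{2/3}\sqrt{\log T}\big).$$

Dividing by $T$, the per-step gap $R_T/T = O\!\big(T^{-1/3}\sqrt{\log T}\big)$ vanishes as $T$ grows, which is what "sublinear regret" in the title refers to: the algorithm's average reward approaches that of the oracle policy.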