Sublinear regret for learning POMDPs

We study model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimation for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
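The abstract names three ingredients: spectral method-of-moments estimation of the hidden-Markov-model parameters, a Bayes-filter belief update whose error under the estimated model must be controlled, and optimism via upper confidence bounds. In this literature, regret against an average-reward oracle is typically measured as $\mathrm{Regret}(T) = T\rho^* - \mathbb{E}[\sum_{t=1}^{T} r_t]$, where $\rho^*$ is the oracle's long-run average reward. The sketch below (Python, not from the paper) illustrates the belief update and a UCB-style action rule on a toy POMDP; the spectral estimator and the paper's episodic schedule are omitted, and `T_hat`, `O_hat`, `R_hat`, and the constant `bonus` are hypothetical stand-ins for its estimates and confidence widths.

```python
import numpy as np

# A minimal sketch, not the authors' implementation: the Bayes-filter
# belief update under an estimated model, plus a myopic optimistic
# (UCB-style) action rule. The paper instead plans for the average-reward
# optimal policy of an optimistic model; a one-step rule keeps this short.

def belief_update(belief, action, obs, T_hat, O_hat):
    """One Bayes-filter step under the estimated model.

    belief: (S,) distribution over hidden states
    T_hat:  (A, S, S) estimated transitions, T_hat[a][s, s']
    O_hat:  (S, Z) estimated observation probabilities, O_hat[s, z]
    """
    predicted = belief @ T_hat[action]      # propagate through dynamics
    unnorm = predicted * O_hat[:, obs]      # weight by observation likelihood
    return unnorm / unnorm.sum()            # renormalize (Bayes' rule)


def optimistic_action(belief, R_hat, bonus):
    """Act greedily on estimated reward inflated by a confidence bonus."""
    return int(np.argmax(belief @ (R_hat + bonus)))  # (S,)@(S,A) -> (A,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, Z = 3, 2, 4                       # toy sizes: states, actions, observations

    # Random ground-truth POMDP (rows normalized to probability simplices).
    T = rng.random((A, S, S)); T /= T.sum(axis=2, keepdims=True)
    O = rng.random((S, Z));    O /= O.sum(axis=1, keepdims=True)
    R = rng.random((S, A))

    # Pretend these came from the spectral estimator; in the paper the
    # bonus shrinks as the estimates sharpen over episodes.
    T_hat, O_hat, R_hat = T, O, R
    bonus = 0.1 * np.ones((S, A))

    belief = np.full(S, 1.0 / S)            # uninformative prior
    state, total = 0, 0.0
    for t in range(100):
        a = optimistic_action(belief, R_hat, bonus)
        total += R[state, a]
        state = rng.choice(S, p=T[a, state])   # hidden state transition
        obs = rng.choice(Z, p=O[state])        # observation from new state
        belief = belief_update(belief, a, obs, T_hat, O_hat)
    print(f"average reward over 100 steps: {total / 100:.3f}")
```

The optimistic rule trades exploration (the bonus) against estimated reward through the belief state; in the paper's algorithm the confidence widths shrink as the spectral estimates improve across episodes, which is what yields the sublinear $O(T^{2/3}\sqrt{\log T})$ regret.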

Bibliographic Details
Published in: Production and Operations Management, 2022-09, Vol. 31 (9), pp. 3491-3504
Main Authors: Xiong, Yi; Chen, Ningyuan; Gao, Xuefeng; Zhou, Xiang
Format: Article
Language: English
Subjects: exploration–exploitation; online learning; partially observable MDP; spectral estimator
ISSN: 1059-1478
EISSN: 1937-5956
DOI: 10.1111/poms.13778
Publisher: Los Angeles, CA: SAGE Publications