Sublinear regret for learning POMDPs
We study model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimations for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
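As a hedged illustration of the regret notion described in the abstract: the oracle is the best long-run average reward attainable in the known POMDP, and regret is the shortfall of accumulated reward against that benchmark. The symbols $\rho^{*}$, $\pi$, and $r_t$ below are assumed for illustration and are not notation quoted from the article.

```latex
% Sketch of an average-reward oracle and the regret bound stated in the
% abstract. Notation (\rho^*, \pi, r_t) is assumed, not taken from the paper.
\[
  \rho^{*} \;=\; \sup_{\pi}\; \liminf_{T \to \infty}\;
      \frac{1}{T}\,\mathbb{E}^{\pi}\!\Bigl[\textstyle\sum_{t=1}^{T} r_t\Bigr],
  \qquad
  \operatorname{Regret}(T) \;=\; T\rho^{*}
      \;-\; \mathbb{E}\!\Bigl[\textstyle\sum_{t=1}^{T} r_t\Bigr]
      \;=\; O\!\bigl(T^{2/3}\sqrt{\log T}\bigr).
\]
```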
Published in: | Production and Operations Management, 2022-09, Vol. 31 (9), pp. 3491-3504 |
---|---|
Main Authors: | Xiong, Yi; Chen, Ningyuan; Gao, Xuefeng; Zhou, Xiang |
Format: | Article |
Language: | English |
Subjects: | exploration–exploitation; online learning; partially observable MDP; spectral estimator |
DOI: | 10.1111/poms.13778 |
ISSN: | 1059-1478 (print); 1937-5956 (online) |
Publisher: | SAGE Publications, Los Angeles, CA |