
Learning in a Changing World: Restless Multiarmed Bandit With Unknown Dynamics

Bibliographic Details
Published in: IEEE Transactions on Information Theory, March 2013, Vol. 59, No. 3, pp. 1902-1916
Main Authors: Liu, Haoyang; Liu, Keqin; Zhao, Qing
Format: Article
Language: English
Summary: We consider the restless multiarmed bandit problem with unknown dynamics, in which a player chooses one out of N arms to play at each time. The reward state of each arm transitions according to an unknown Markovian rule when the arm is played and evolves according to an arbitrary unknown random process when it is passive. The performance of an arm selection policy is measured by regret, defined as the reward loss with respect to the case where the player knows which arm is the most rewarding and always plays that best arm. We construct a policy with an epoch structure that interleaves exploration and exploitation and achieves a regret of logarithmic order. We further extend the problem to a decentralized setting in which multiple distributed players share the arms without information exchange. Under both an exogenous restless model and an endogenous restless model, we show that a decentralized extension of the proposed policy preserves the logarithmic regret order of the centralized setting. The results apply to adaptive learning in various dynamic systems and communication networks, as well as to financial investment.
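The interleaved epoch structure described in the summary is easy to illustrate. Below is a minimal Python sketch of a policy in that spirit: constant-length exploration epochs (each arm played a fixed number of times, round-robin) alternate with exploitation epochs whose lengths grow geometrically, so exploration consumes only O(log T) of a horizon of T plays. This is a sketch under stated assumptions, not the authors' exact construction; the constants c and base, the function names, and the i.i.d. Bernoulli simulator are illustrative stand-ins for the unknown restless dynamics.

```python
import random


def epoch_policy(play, n, horizon, c=10, base=2):
    """Epoch-based bandit policy interleaving exploration and exploitation.

    play(i) -- returns the reward of arm i when it is played
    n       -- number of arms
    horizon -- total number of plays T
    c       -- plays per arm in each exploration epoch (assumed constant)
    base    -- growth factor of the exploitation epoch lengths

    The k-th exploitation epoch lasts base**k slots, so over a horizon T
    only O(log T) slots are spent exploring; this geometric growth is the
    structural reason a logarithmic regret order is achievable (given a
    sufficiently large c and suitable conditions on the arm dynamics).
    """
    sums, counts = [0.0] * n, [0] * n   # exploration statistics per arm
    total, t, k = 0.0, 0, 0
    while t < horizon:
        # Exploration epoch: refresh each arm's sample mean with c plays.
        for i in range(n):
            for _ in range(c):
                if t >= horizon:
                    return total
                r = play(i)
                sums[i] += r
                counts[i] += 1
                total += r
                t += 1
        # Exploitation epoch: commit to the empirically best arm.
        best = max(range(n), key=lambda i: sums[i] / counts[i])
        for _ in range(base ** k):
            if t >= horizon:
                return total
            total += play(best)
            t += 1
        k += 1
    return total


if __name__ == "__main__":
    # Toy simulator: i.i.d. Bernoulli arms stand in for the unknown
    # restless Markovian dynamics (purely illustrative).
    means = [0.3, 0.5, 0.7]
    T = 100_000
    total = epoch_policy(lambda i: float(random.random() < means[i]),
                         n=3, horizon=T)
    print(f"average reward per play: {total / T:.3f}")
```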
ISSN: 0018-9448 (print); 1557-9654 (electronic)
DOI: 10.1109/TIT.2012.2230215