An Adaptive Sampling Algorithm for Solving Markov Decision Processes

Bibliographic Details
Published in:	Operations research 2005-01, Vol.53 (1), p.126-139
Main Authors:	Chang, Hyeong Soo; Fu, Michael C; Hu, Jiaqiao; Marcus, Steven I
Format:	Article
Language:	English
Subjects:	Algorithms; Applied sciences; Business orders; Decision making; Dynamic programming; dynamic programming/optimal control:Markov finite state; Estimation bias; Estimators; Exact sciences and technology; Inventory control; Markov analysis; Markov processes; Mathematical programming; Mathematics; Methods; Modeling; Operational research and scientific management; Operational research. Management science; Optimal policy; Probability and statistics; Sampling; Sampling bias; Sampling theory, sample surveys; Sciences and techniques of general use; Statistical sampling; Statistics; Studies; Unbiased estimators

Description
Summary:	Based on recent results for multiarmed bandit problems, we propose an adaptive sampling algorithm that approximates the optimal value of a finite-horizon Markov decision process (MDP) with finite state and action spaces. The algorithm adaptively chooses which action to sample as the sampling process proceeds and generates an asymptotically unbiased estimator, whose bias is bounded by a quantity that converges to zero at rate $(\ln N)/N$, where $N$ is the total number of samples that are used per state sampled in each stage. The worst-case running-time complexity of the algorithm is $O((|A|N)^H)$, independent of the size of the state space, where $|A|$ is the size of the action space and $H$ is the horizon length. The algorithm can be used to create an approximate receding horizon control to solve infinite-horizon MDPs. To illustrate the algorithm, computational results are reported on simple examples from inventory control.
ISSN:	0030-364X 1526-5463
DOI:	10.1287/opre.1040.0145
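
Illustrative sketch (not part of the catalog record): the summary describes a recursive sampler that picks actions with a bandit rule and spends $N$ samples per sampled state per stage. The Python below is a minimal sketch of that idea, not the authors' published pseudocode. It assumes a UCB1-style index (sample mean plus $\sqrt{2 \ln n / N_a}$), one-stage rewards scaled to $[0, 1]$, and a value estimate formed by weighting each action's sample mean by its share of the samples. The names `adaptive_sample_value`, `step`, and `toy_inventory_step`, and all numbers in the toy inventory model, are hypothetical and chosen only for illustration.

```python
import math
import random


def adaptive_sample_value(state, stage, horizon, actions, step, n_samples, rng):
    """Estimate the optimal value-to-go from `state` at `stage` by sampling.

    `step(state, action, rng)` is an assumed simulator interface that returns
    (reward, next_state), with rewards assumed to lie in [0, 1].
    """
    if stage == horizon:
        return 0.0  # no reward is collected after the final stage

    counts = {a: 0 for a in actions}    # times each action was sampled here
    q_sums = {a: 0.0 for a in actions}  # running sums of sampled Q-values

    def sample(action):
        reward, next_state = step(state, action, rng)
        future = adaptive_sample_value(
            next_state, stage + 1, horizon, actions, step, n_samples, rng
        )
        q_sums[action] += reward + future
        counts[action] += 1

    # Initialization: sample every action once.
    for a in actions:
        sample(a)

    # Adaptive phase: sample the action with the largest UCB1-style index.
    for n in range(len(actions), n_samples):
        best = max(
            actions,
            key=lambda a: q_sums[a] / counts[a]
            + math.sqrt(2.0 * math.log(n) / counts[a]),
        )
        sample(best)

    # Weight each action's sample mean by the fraction of samples it received.
    total = sum(counts.values())
    return sum((counts[a] / total) * (q_sums[a] / counts[a]) for a in actions)


def toy_inventory_step(level, order, rng):
    """Hypothetical inventory model: all costs and the capacity are made up."""
    capacity = 4
    demand = rng.randint(0, 2)               # uniform demand on {0, 1, 2}
    stocked = min(level + order, capacity)   # stock after the order arrives
    shortage = max(demand - stocked, 0)
    next_level = max(stocked - demand, 0)
    cost = 1.0 * order + 1.0 * next_level + 4.0 * shortage
    max_cost = 1.0 * 2 + 1.0 * capacity + 4.0 * 2  # loose upper bound on cost
    return 1.0 - cost / max_cost, next_level       # reward scaled into [0, 1]


if __name__ == "__main__":
    rng = random.Random(0)
    estimate = adaptive_sample_value(
        state=0, stage=0, horizon=3, actions=[0, 1, 2],
        step=toy_inventory_step, n_samples=8, rng=rng,
    )
    print(f"estimated 3-stage value from empty inventory: {estimate:.3f}")
```

Each call draws roughly $N$ samples and recurses once per sample, so the number of simulator calls grows with the horizon in line with the $O((|A|N)^H)$ worst-case bound quoted in the summary, and it does not depend on how many states the simulator can reach. For a receding horizon control of the kind the summary mentions, one would presumably rerun the sampler from the current state at every decision epoch and apply the first-stage action that looks best; that step is not shown here.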