Learning Unknown Service Rates in Queues: A Multiarmed Bandit Approach
Published in: Operations Research, 2021-01, Vol. 69(1), pp. 315-330
Main Authors: Krishnasamy, Sen, Johari, Shakkottai
Format: Article
Language: English
Summary: Traditional scheduling problems in stochastic queueing systems assume that the statistical parameters are known a priori. In "Learning Unknown Service Rates in Queues: A Multiarmed Bandit Approach," Krishnasamy, Sen, Johari, and Shakkottai consider the problem of online scheduling in a parallel-server system when the statistical parameters are unknown. They study this question in the stochastic multiarmed bandits framework with the queue length as the performance objective. In contrast to the classic stochastic multiarmed bandits problem, where the regret scales logarithmically with time, the authors show that the queue regret (the difference in expected queue length between a bandit algorithm and a genie policy) exhibits more complex behavior: it grows logarithmically in the initial stage and eventually decays almost inversely with time. This remarkable behavior is explained through the analysis of regenerative cycle lengths, which shorten with time as the bandit algorithm learns to stabilize the queues.
Consider a queueing system consisting of multiple servers. Jobs arrive over time and enter a queue for service; the goal is to minimize the size of this queue. At each opportunity for service, at most one server can be chosen, and at most one job can be served. Service is successful with a probability (the service probability) that is a priori unknown for each server. An algorithm that knows the service probabilities (the "genie") can always choose the server of highest service probability. We study algorithms that learn the unknown service probabilities. Our goal is to minimize queue regret: the (expected) difference between the queue lengths obtained by the algorithm and those obtained by the "genie." Because queue regret cannot be larger than classical regret, results for the standard multiarmed bandit problem give algorithms for which queue regret increases no more than logarithmically in time. Our paper shows surprisingly more complex behavior. In particular, as long as the bandit algorithm's queues have relatively long regenerative cycles, queue regret is similar to cumulative regret and scales (essentially) logarithmically. However, we show that this "early stage" of the queueing bandit eventually gives way to a "late stage," where the optimal queue-regret scaling is O(1/t). We demonstrate an algorithm that (orderwise) achieves this asymptotic queue regret in the late stage. Our results are developed in a more general model that allows for mu…
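The setup described in the abstract can be illustrated with a small simulation: a discrete-time queue with Bernoulli arrivals, where each slot one of K servers is chosen and the head-of-line job departs with that server's service probability. The sketch below is an illustrative assumption, not the authors' algorithm; it pairs a standard UCB1 index policy against the genie that knows the service probabilities, so that averaging the per-slot queue-length difference over many independent runs estimates the queue regret. All names and parameter values (`simulate`, `UCB`, `Genie`, arrival rate, service probabilities) are hypothetical.

```python
import math
import random

def simulate(T, lam, mus, policy, seed=0):
    """Run a single-queue, multi-server system for T slots.
    Arrivals are Bernoulli(lam); each slot, `policy` picks one of
    len(mus) servers, and service succeeds with probability mus[k].
    Returns the queue-length sample path."""
    rng = random.Random(seed)
    q = 0
    lengths = []
    for t in range(T):
        q += rng.random() < lam          # Bernoulli arrival
        if q > 0:
            k = policy.choose(t)
            success = rng.random() < mus[k]
            policy.update(k, success)
            q -= success                 # job departs on success
        lengths.append(q)
    return lengths

class UCB:
    """Standard UCB1 index policy over the servers' success rates."""
    def __init__(self, n_arms):
        self.n = [0] * n_arms            # pulls per server
        self.s = [0] * n_arms            # observed successes per server
    def choose(self, t):
        for k in range(len(self.n)):     # try each server once first
            if self.n[k] == 0:
                return k
        return max(range(len(self.n)),
                   key=lambda k: self.s[k] / self.n[k]
                   + math.sqrt(2 * math.log(t + 1) / self.n[k]))
    def update(self, k, success):
        self.n[k] += 1
        self.s[k] += success

class Genie:
    """Benchmark policy that knows the service probabilities."""
    def __init__(self, mus):
        self.best = max(range(len(mus)), key=lambda k: mus[k])
    def choose(self, t):
        return self.best
    def update(self, k, success):
        pass
```

Averaging `simulate(T, lam, mus, UCB(len(mus)), seed=r)` minus `simulate(T, lam, mus, Genie(mus), seed=r2)` over many seeds traces the queue-regret curve; with a stable queue (lam below the best service probability) the late-stage decay discussed in the abstract becomes visible for large T.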
ISSN: 0030-364X, 1526-5463
DOI: 10.1287/opre.2020.1995