Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Bibliographic Details
Published in: arXiv.org, 2024-04
Main Authors: Qin, Hao; Jun, Kwang-Sung; Zhang, Chicheng
Format: Article
Language: English
Description
Summary: We study \(K\)-armed bandit problems where the reward distributions of the arms are all supported on the \([0,1]\) interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling (Maillard, 2013), an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting (Bian and Jun, 2022) while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form \(O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)\), where \(\mu^*\) is the expected reward of the optimal arm and \(T\) is the time horizon length.
ISSN: 2331-8422
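
The summary states the guarantees but not the sampling rule itself. As a rough illustration only, the following minimal Python sketch simulates a KL-Maillard-sampling-style policy on Bernoulli arms, under the assumption that at each round arm \(a\) is drawn with probability proportional to \(\exp(-N_a \, \mathrm{kl}(\hat\mu_a, \hat\mu_{\max}))\), where \(\mathrm{kl}\) is the Bernoulli KL divergence, \(N_a\) the pull count, and \(\hat\mu_a\) the empirical mean of arm \(a\). The names bernoulli_kl and kl_ms are illustrative and not taken from the paper.

    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        """Binary KL divergence kl(p, q), clipped away from {0, 1} for safety."""
        p = np.clip(p, eps, 1 - eps)
        q = np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def kl_ms(arm_means, T, seed=0):
        """Simulate a KL-Maillard-sampling-style policy on Bernoulli arms.

        Assumed rule: draw arm a with probability proportional to
        exp(-N_a * kl(mu_hat_a, mu_hat_max)), so the action probabilities
        are available in closed form at every round.
        """
        rng = np.random.default_rng(seed)
        K = len(arm_means)
        pulls = np.zeros(K, dtype=int)   # N_a: number of pulls of arm a
        means = np.zeros(K)              # mu_hat_a: empirical mean of arm a

        # Initialization: pull each arm once.
        for a in range(K):
            means[a] = rng.binomial(1, arm_means[a])
            pulls[a] = 1

        for t in range(K, T):
            mu_max = means.max()
            # Closed-form (unnormalized) action probabilities; the best
            # empirical arm has kl = 0, so its weight is exactly 1.
            weights = np.exp([-pulls[a] * bernoulli_kl(means[a], mu_max)
                              for a in range(K)])
            probs = weights / weights.sum()
            a = rng.choice(K, p=probs)
            r = rng.binomial(1, arm_means[a])      # Bernoulli reward in {0, 1}
            pulls[a] += 1
            means[a] += (r - means[a]) / pulls[a]  # incremental mean update
        return pulls, means

    if __name__ == "__main__":
        pulls, means = kl_ms([0.3, 0.5, 0.7], T=10_000)
        print("pulls:", pulls)                 # the 0.7 arm should dominate
        print("estimates:", np.round(means, 3))

Because the action probabilities are available in closed form at every round, they can be logged alongside the chosen actions, which is what makes this family of policies convenient for offline policy evaluation as noted in the summary.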