Loading…

Delaytron: Efficient Learning of Multiclass Classifiers with Delayed Bandit Feedbacks

This paper presents an online algorithm Delaytron for learning multiclass classifiers using delayed bandit feedback. At the t-th round, the algorithm observes an example \mathbf{x}_{t} and predicts a label \tilde{y}_{t} and receives the bandit feedback II [\tilde{y}_{t}=y_{t}] only d_{t} rounds late...

Full description

Saved in:

Bibliographic Details
Main Authors:	Manwani, Naresh, Agarwal, Mudit
Format:	Conference Proceeding
Language:	English
Subjects:	bandit feedback classification Classification algorithms Delays feedback delay Neural networks Online learning Prediction algorithms regret analysis
Online Access:	Request full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	This paper presents an online algorithm Delaytron for learning multiclass classifiers using delayed bandit feedback. At the t-th round, the algorithm observes an example \mathbf{x}_{t} and predicts a label \tilde{y}_{t} and receives the bandit feedback II [\tilde{y}_{t}=y_{t}] only d_{t} rounds later. When t+d_{t} > T , we consider the feedback for the t-th round missing. The sequence of feedback delays \{d_{t}\}_{t=1}^{T} is unknown to the algorithm. For the K -class classification problem, we show that the proposed algorithm achieves regret of \mathcal{O}\left(\sqrt{\frac{2K}{\gamma}\left[\frac{T}{2}+(2+\frac{L^{2}}{R^{2}\Vert \mathbf{W}\Vert_{F}^{2}})\sum_{t-1}^{T}d_{t}\right]}\right) when the loss for each missing sample is upper bounded by L . In the case when the loss for missing samples is not upper-bounded, the regret achieved by Delaytron is \mathcal{O}\left(\sqrt{\frac{2K}{\gamma}\left[\frac{T}{2}+2\Sigma_{t=1}^{T}d_{t}+\vert \mathcal{M} \vert T\right]}\right) where \mathcal{M} is the set of missing samples in T rounds. These bounds were achieved with a constant step size which requires the knowledge of T and \sum_{t=1}^{T}d_{t} . When T and \sum_{t=1}^{T}d_{t} are unknown, we use a doubling trick for online learning and propose Adaptive Delaytron. We show that Adaptive Delaytron achieves a regret bound of \mathcal{O}\left(\sqrt{T+\sum_{t=1}^{T}d_{t}}\right) . We experimentally show that the proposed approach can learn efficient classifiers even with delayed bandit feedback, and the accuracy does not degrade much due to delays in feedback.
ISSN:	2161-4407
DOI:	10.1109/IJCNN54540.2023.10191245