[Paper Review] Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit
M-UCB combines uniform exploration, UCB1, and a simple sliding-window change-point detector to handle piecewise-stationary bandits, achieving a regret of O(sqrt(MKT log T)) which is nearly optimal up to log factors.
Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with the state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset to demonstrate its superior performance.
Motivation & Objective
- Motivate the study of bandits with piecewise-stationary reward distributions in real-world applications.
- Propose a practical algorithm (M-UCB) that integrates change-point detection with UCB to adapt to changes.
- Establish a near-optimal regret bound for M-UCB under mild assumptions.
- Demonstrate empirical advantages of M-UCB on synthetic data and a public Yahoo! dataset benchmark.
Proposed method
- Introduce a simple change-point detector based on comparing running window means (Algorithm 1).
- Embed the detector into UCB-style learning to create Monitored-UCB (M-UCB, Algorithm 2).
- Mix uniform sampling with UCB-based selection so that every arm is sampled often enough for changes to be detected on any of them.
- Provide theoretical regret analysis showing R(T) = O(sqrt(MKT log T)) under Assumption 1.
- Relate regret to four components: exploration costs, uniform sampling cost, detection delay, and false alarms (Theorem 1).
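The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the names `change_detected`, `m_ucb`, `pull`, and `period` are hypothetical, the detector compares the sums of the two halves of the last `w` rewards (equivalent to comparing window means with a rescaled threshold `b`), and the exploration schedule and UCB1-style index are one plausible reading of "uniform sampling plus UCB" with a restart after each detection.

```python
import math
from collections import deque


def change_detected(window, b):
    """Return True if the sums of the two halves of `window` differ by more than b."""
    half = len(window) // 2
    return abs(sum(window[half:]) - sum(window[:half])) > b


def m_ucb(pull, K, T, w, b, gamma):
    """Run a monitored-UCB loop (sketch).

    pull(arm, t) -> reward in [0, 1]  (hypothetical environment callback)
    K: number of arms, T: horizon, w: window length (even),
    b: detection threshold, gamma: fraction of steps spent on uniform sampling.
    Returns (chosen arms, detection times).
    """
    # Assumed schedule: the first K slots of every `period` steps are forced pulls.
    period = max(K, math.floor(K / gamma))
    tau = 0                                  # time of the last restart
    counts = [0] * K                         # pulls per arm since the last restart
    sums = [0.0] * K                         # reward totals since the last restart
    windows = [deque(maxlen=w) for _ in range(K)]
    choices, detections = [], []

    def ucb_index(k, n):
        # UCB1-style index computed only from data gathered since the last restart.
        return sums[k] / counts[k] + math.sqrt(2 * math.log(n) / counts[k])

    for t in range(1, T + 1):
        slot = (t - tau - 1) % period
        if slot < K:
            arm = slot                       # uniform exploration: cycle through arms
        else:
            arm = max(range(K), key=lambda k: ucb_index(k, t - tau))

        reward = pull(arm, t)
        counts[arm] += 1
        sums[arm] += reward
        windows[arm].append(reward)
        choices.append(arm)

        # Monitor the pulled arm once it has a full window of observations.
        if counts[arm] >= w and change_detected(list(windows[arm]), b):
            tau = t                          # restart: discard all statistics
            for k in range(K):
                counts[k] = 0
                sums[k] = 0.0
                windows[k].clear()
            detections.append(t)

    return choices, detections
```

Because the forced slots come first after every restart, each arm has at least one observation before the UCB index is evaluated. Larger `w` and `b` make the detector less prone to false alarms but slower to react, while larger `gamma` speeds up detection on rarely played arms at the cost of more exploratory pulls.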
Experimental results
Research questions
- RQ1: Can a simple change-point detector integrated with a UCB approach yield strong regret guarantees in piecewise-stationary bandits?
- RQ2: What is the regret scaling of such a method in terms of time horizon T, number of arms K, and number of stationary segments M?
- RQ3: How do the proposed parameters (window w, threshold b, uniform-sampling fraction gamma) influence detection and regret?
- RQ4: How does M-UCB perform empirically against state-of-the-art non-stationary bandit algorithms on real-world data?
- RQ5: Are the theoretical bounds robust to deviations from assumptions (e.g., non-Bernoulli rewards, small changes)?
Key findings
- M-UCB achieves a regret upper bound of O(sqrt(MKT log T)) under mild technical assumptions, nearly matching the known lower bound up to log factors.
- Empirically, regret scales approximately as sqrt(M) in the number of segments and sqrt(K) in the number of arms.
- The simple sliding-window change-detection approach suffices to guide learning and restarts after detected changes.
- On the Yahoo! data, M-UCB reduces cumulative regret by at least 50-60% relative to state-of-the-art baselines (e.g., EXP3, EXP3.S, SW-UCB, D-UCB, SHIFTBAND).
- Experiments on Yahoo! and synthetic data indicate robustness to changes without requiring strong parametric assumptions.
This review was created by AI and reviewed by human editors.