QUICK REVIEW

[Paper Review] Regret Analysis of the Anytime Optimally Confident UCB Algorithm

Tor Lattimore|arXiv (Cornell University)|Mar 29, 2016

Advanced Bandit Algorithms Research | 17 references | 23 citations

TL;DR

This paper introduces OCUCB-$n$, an anytime variant of the Optimally Confident UCB algorithm for stochastic multi-armed bandits with subgaussian noise. It achieves near-optimal finite-time regret bounds without requiring prior knowledge of the horizon, matching asymptotic lower bounds up to a factor of $\eta$ and $\sqrt{\log\log n}$, with a novel confidence bound that adapts to effective arm counts via parameter $\rho$. The algorithm uses a dynamic $B_i(t)$ term to refine exploration, improving on standard UCB and MOSS in finite-time performance while maintaining theoretical optimality in the asymptotic regime.

ABSTRACT

I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.

Motivation & Objective

To develop an anytime version of the Optimally Confident UCB (OCUCB) algorithm that does not require prior knowledge of the horizon $n$.
To achieve finite-time regret bounds that are nearly optimal, matching known lower bounds up to $\sqrt{\log\log n}$ terms.
To refine the notion of problem difficulty by introducing $k_{i,\rho}$, a count of 'effective' arms relative to arm $i$: arms whose gap is at most $\Delta_i$ count fully, while arms with larger gaps are discounted.
To provide a rigorous regret analysis for the new algorithm, showing asymptotic optimality up to a factor $\eta > 1$.
To improve upon existing horizon-free algorithms like UCB and MOSS by incorporating adaptive confidence bounds based on arm similarity and sampling counts.

Proposed method

The algorithm selects arms using an upper confidence bound $\gamma_i(t) = \hat{\mu}_i(t-1) + \sqrt{\frac{2\eta \log(B_i(t-1))}{T_i(t-1)}}$, where $B_i(t-1)$ adapts based on sampling counts and arm similarities.
The confidence term is $B_i(t-1) = \max\left\{e,\ \log t,\ \frac{t\log t}{\sum_{j=1}^K \min\{T_i(t-1),\ T_j(t-1)^\rho T_i(t-1)^{1-\rho}\}}\right\}$, so arms that have been sampled less often than arm $i$ contribute less to the denominator.
The parameter $\rho \in [1/2,1]$ controls the sensitivity to arm similarity, with $\rho = 1/2$ being the canonical choice that balances robustness and performance.
The algorithm initializes by pulling each arm once for the first $K$ rounds, then proceeds with index-based selection using the confidence bound.
The regret analysis relies on concentration inequalities and a novel confidence level selection that depends on $\tau_{i,n}$, the time when arm $i$ is expected to be sampled sufficiently.
A lower bound is derived in Appendix A that nearly matches the upper bound, validating the tightness of the regret guarantee up to $\log\log n$ terms.
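The index rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation. It assumes the confidence term reads $B_i(t-1) = \max\{e, \log t, t\log t / \sum_j \min\{T_i(t-1), T_j(t-1)^\rho T_i(t-1)^{1-\rho}\}\}$ with the sum over all $K$ arms, unit-variance Gaussian rewards as one instance of subgaussian noise, and default values `eta=2.0` and `rho=0.5`. The function names (`confidence_term`, `ocucb_index`, `run_ocucb`) are hypothetical.

```python
import math
import random


def confidence_term(t, counts, i, rho):
    """Assumed form of B_i(t-1); counts[j] plays the role of T_j(t-1)."""
    t_i = counts[i]
    # Sum over all arms j of min{T_i, T_j^rho * T_i^(1-rho)}.
    denom = sum(min(t_i, (t_j ** rho) * (t_i ** (1.0 - rho))) for t_j in counts)
    return max(math.e, math.log(t), t * math.log(t) / denom)


def ocucb_index(t, means, counts, i, eta, rho):
    """Empirical mean plus the bonus sqrt(2 * eta * log(B_i) / T_i)."""
    b = confidence_term(t, counts, i, rho)
    return means[i] + math.sqrt(2.0 * eta * math.log(b) / counts[i])


def run_ocucb(arm_means, horizon, eta=2.0, rho=0.5, seed=0):
    """Simulate the index policy on Gaussian arms (an assumption of this sketch)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # sum of observed rewards per arm
    best = max(arm_means)
    regret = 0.0          # pseudo-regret: sum of gaps of the arms pulled
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once during the first K rounds
        else:
            means = [sums[j] / counts[j] for j in range(k)]
            arm = max(
                range(k),
                key=lambda j: ocucb_index(t, means, counts, j, eta, rho),
            )
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return counts, regret


if __name__ == "__main__":
    pulls, total_regret = run_ocucb([0.0, 0.5, 1.0], horizon=2000)
    print("pulls per arm:", pulls)
    print("pseudo-regret:", round(total_regret, 2))
```

Note that the loop never uses `horizon` inside the index computation; only the current round `t` appears, which is what makes the rule anytime.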

Experimental results

Research questions

RQ1: Can an anytime version of OCUCB be designed that achieves near-optimal regret without requiring knowledge of the horizon $n$?
RQ2: How does the choice of $\rho$ affect the finite-time and asymptotic regret performance of the algorithm?
RQ3: Can the confidence bound in UCB be refined using a dynamic term $B_i(t)$ that accounts for effective arm counts and sampling balance?
RQ4: What is the tightest possible finite-time regret bound for a horizon-free UCB variant in subgaussian bandits?
RQ5: To what extent can the confidence level be shrunk without sacrificing theoretical guarantees, and how does this affect empirical performance?

Key findings

The algorithm OCUCB-$n$ achieves a finite-time regret bound of $R^{\text{OCUCB-}n}_{\mu}(n) \leq C_{\eta} \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{1}{\Delta_i} \log \max\left\{ \frac{n\Delta_i^2 \log n}{k_{i,\rho}}, \log n \right\} \right)$, which is nearly optimal.
The asymptotic regret satisfies $\limsup_{n\to\infty} R^{\text{OCUCB-}n}_{\mu}(n)/\log n \leq \sum_{i:\Delta_i>0} \frac{2\eta}{\Delta_i}$, matching the Lai-Robbins lower bound up to the factor $\eta > 1$.
The term $k_{i,\rho} = \sum_{j=1}^K \min\{1, \Delta_i^{2\rho}/\Delta_j^{2\rho}\}$ quantifies the number of effective arms influencing regret, and is non-increasing in $\rho$, with $\rho=1/2$ being optimal for theoretical tightness.
Empirically, the algorithm shows little sensitivity to $\rho \in [1/2,1]$, and the performance remains stable across different configurations.
The analysis does not remove the $\log\log n$ term from the regret bound with current techniques, and the bound is nearly tight: the lower bound in Appendix A nearly matches it.
The algorithm remains robust even when the logarithmic terms in $B_i(t-1)$ are simplified, suggesting potential for empirical improvements without theoretical loss.
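To make the role of $k_{i,\rho}$ concrete, the sketch below evaluates it, together with the per-arm term of the finite-time bound (with the constant $C_\eta$ omitted), on a hypothetical gap vector. The gap values and the function names `effective_arms` and `bound_term` are illustrative assumptions, not taken from the paper. Arms with $\Delta_j = 0$ are treated as contributing 1, which is the limit of $\min\{1, \Delta_i^{2\rho}/\Delta_j^{2\rho}\}$.

```python
import math


def effective_arms(gaps, i, rho):
    """k_{i,rho} = sum_j min{1, (gap_i / gap_j)^(2*rho)}."""
    g_i = gaps[i]
    total = 0.0
    for g_j in gaps:
        if g_j <= g_i:
            total += 1.0  # min{...} equals 1 (covers gap_j == 0 too)
        else:
            total += (g_i / g_j) ** (2.0 * rho)
    return total


def bound_term(gaps, i, rho, n):
    """Per-arm term: gap + (1/gap) * log max{n gap^2 log(n) / k, log(n)}."""
    g = gaps[i]
    k = effective_arms(gaps, i, rho)
    inner = max(n * g * g * math.log(n) / k, math.log(n))
    return g + math.log(inner) / g


if __name__ == "__main__":
    gaps = [0.0, 0.1, 0.5, 0.5]  # hypothetical gaps; arm 0 is optimal
    for rho in (0.5, 0.75, 1.0):
        k = effective_arms(gaps, 1, rho)
        term = bound_term(gaps, 1, rho, n=10_000)
        print(f"rho={rho}: k={k:.3f}, bound term={term:.2f}")
```

For the arm with gap 0.1 in this example, $k_{i,\rho}$ is about 2.4 at $\rho = 1/2$ and about 2.08 at $\rho = 1$, which illustrates that $k_{i,\rho}$ is non-increasing in $\rho$ and that the bound term is smallest at $\rho = 1/2$.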

This review was created by AI and reviewed by human editors.