QUICK REVIEW

[Paper Review] Regret Analysis of the Anytime Optimally Confident UCB Algorithm

Tor Lattimore|arXiv (Cornell University)|Mar 29, 2016
Advanced Bandit Algorithms Research|17 references|23 citations
TL;DR

This paper introduces OCUCB-$n$, an anytime variant of the Optimally Confident UCB algorithm for stochastic multi-armed bandits with subgaussian noise. It achieves near-optimal finite-time regret bounds without requiring prior knowledge of the horizon, matching asymptotic lower bounds up to a factor of $\eta$ and $\sqrt{\log\log n}$, with a novel confidence bound that adapts to effective arm counts via parameter $\rho$. The algorithm uses a dynamic $B_i(t)$ term to refine exploration, improving on standard UCB and MOSS in finite-time performance while maintaining theoretical optimality in the asymptotic regime.

ABSTRACT

I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.

Motivation & Objective

  • To develop an anytime version of the Optimally Confident UCB (OCUCB) algorithm that does not require prior knowledge of the horizon $n$.
  • To achieve finite-time regret bounds that are nearly optimal, matching known lower bounds up to $\sqrt{\log\log n}$ terms.
  • To refine the notion of problem difficulty by introducing $k_{i,\rho}$, representing the number of 'effective' arms with larger mean gaps.
  • To provide a rigorous regret analysis for the new algorithm, showing asymptotic optimality up to a factor $\eta > 1$.
  • To improve upon existing horizon-free algorithms like UCB and MOSS by incorporating adaptive confidence bounds based on arm similarity and sampling counts.

Proposed method

  • The algorithm selects arms using an upper confidence bound $\gamma_i(t) = \hat{\mu}_i(t-1) + \sqrt{\frac{2\eta \log(B_i(t-1))}{T_i(t-1)}}$, where $B_i(t-1)$ adapts based on sampling counts and arm similarities.
  • The confidence term is $B_i(t-1) = \max\left\{ e,\ \log t,\ \frac{t \log t}{\sum_{j=1}^{K} \min\left\{ T_i(t-1),\ T_j(t-1)^{\rho}\, T_i(t-1)^{1-\rho} \right\}} \right\}$. The sum of minima in the denominator acts as a count of effective arms, weighted by how often each arm $j$ has been sampled relative to arm $i$.
  • The parameter $\rho \in [1/2,1]$ controls the sensitivity to arm similarity, with $\rho = 1/2$ being the canonical choice that balances robustness and performance.
  • The algorithm initializes by pulling each arm once for the first $K$ rounds, then proceeds with index-based selection using the confidence bound.
  • The regret analysis relies on concentration inequalities and a novel confidence level selection that depends on $\tau_{i,n}$, the time when arm $i$ is expected to be sampled sufficiently.
  • A lower bound is derived in Appendix A that nearly matches the upper bound, validating the tightness of the regret guarantee up to $\log\log n$ terms.
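The index rule and initialization described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's reference implementation: the function names `ocucb_index` and `run_ocucb`, the unit-variance Gaussian rewards, the default `eta=2.0`, and the first-index tie-breaking are all assumptions made for the example. The source only requires $\eta > 1$ and subgaussian noise.

```python
import math
import random


def ocucb_index(i, t, means, counts, eta=2.0, rho=0.5):
    """Index gamma_i(t); counts[j] plays the role of T_j(t-1).

    eta=2.0 is an arbitrary illustrative value (the source only asks eta > 1).
    """
    t_i = counts[i]
    # Sum over arms j of min{T_i, T_j^rho * T_i^(1-rho)}.
    denom = sum(min(t_i, (t_j ** rho) * (t_i ** (1.0 - rho))) for t_j in counts)
    # B_i(t-1) = max{e, log t, t log t / denom}.
    b_i = max(math.e, math.log(t), t * math.log(t) / denom)
    return means[i] + math.sqrt(2.0 * eta * math.log(b_i) / t_i)


def run_ocucb(true_means, horizon, eta=2.0, rho=0.5, seed=0):
    """Simulate the rule on Gaussian arms (an assumed reward model).

    Returns the pull counts and the cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # first K rounds: pull each arm once
        else:
            means = [sums[j] / counts[j] for j in range(k)]
            # Ties go to the lowest arm index (an assumption of this sketch).
            arm = max(
                range(k),
                key=lambda j: ocucb_index(j, t, means, counts, eta, rho),
            )
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return counts, regret
```

Note that the loop never reads `horizon` when choosing an arm, so the rule itself is anytime: the horizon only decides when the simulation stops.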

Experimental results

Research questions

  • RQ1: Can an anytime version of OCUCB be designed that achieves near-optimal regret without requiring knowledge of the horizon $n$?
  • RQ2: How does the choice of $\rho$ affect the finite-time and asymptotic regret performance of the algorithm?
  • RQ3: Can the confidence bound in UCB be refined using a dynamic term $B_i(t)$ that accounts for effective arm counts and sampling balance?
  • RQ4: What is the tightest possible finite-time regret bound for a horizon-free UCB variant in subgaussian bandits?
  • RQ5: To what extent can the confidence level be shrunk without sacrificing theoretical guarantees, and how does this affect empirical performance?

Key findings

  • The algorithm OCUCB-$n$ achieves a finite-time regret bound of $R^{\text{OCUCB-}n}_{\mu}(n) \leq C_{\eta} \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{1}{\Delta_i} \log \max\left\{ \frac{n\Delta_i^2 \log n}{k_{i,\rho}}, \log n \right\} \right)$, which is nearly optimal.
  • The asymptotic regret satisfies $\limsup_{n\to\infty} R^{\text{OCUCB-}n}_{\mu}(n)/\log n \leq \sum_{i:\Delta_i>0} \frac{2\eta}{\Delta_i}$, matching the Lai-Robbins lower bound up to the factor $\eta > 1$.
  • The term $k_{i,\rho} = \sum_{j=1}^K \min\{1, \Delta_i^{2\rho}/\Delta_j^{2\rho}\}$ quantifies the number of effective arms influencing regret. It is non-increasing in $\rho$, so $\rho = 1/2$ gives the largest $k_{i,\rho}$ and hence the smallest value of the finite-time bound above.
  • Empirically, the algorithm shows little sensitivity to $\rho \in [1/2,1]$, and the performance remains stable across different configurations.
  • The analysis shows that the $\log\log n$ term in the regret bound is unavoidable under current techniques, and the bound is nearly tight, as confirmed by a nearly matching lower bound in Appendix A.
  • The algorithm remains robust even when the logarithmic terms in $B_i(t-1)$ are simplified, suggesting potential for empirical improvements without theoretical loss.
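To make the role of $k_{i,\rho}$ in the finite-time bound concrete, the following sketch evaluates it for a small set of gaps. The function names `effective_arms` and `bound_term` and the example gaps are hypothetical. The constant $C_\eta$ is omitted because the review does not give its value, and the example assumes $n \geq 3$ so that $\log\log n$ is positive. Terms with $\Delta_j = 0$ are taken to contribute 1 to the sum, since the ratio inside the minimum is then unbounded.

```python
import math


def effective_arms(gaps, i, rho):
    """k_{i,rho} = sum_j min{1, (gap_i / gap_j)^(2*rho)}.

    Arms with gap_j <= gap_i (including gap_j == 0) contribute exactly 1.
    """
    total = 0.0
    for gap_j in gaps:
        if gap_j <= gaps[i]:
            total += 1.0
        else:
            total += (gaps[i] / gap_j) ** (2.0 * rho)
    return total


def bound_term(gaps, i, n, rho):
    """Summand of the finite-time bound for arm i, without the constant C_eta."""
    d = gaps[i]
    k = effective_arms(gaps, i, rho)
    inner = max(n * d * d * math.log(n) / k, math.log(n))
    return d + math.log(inner) / d


# Hypothetical gaps: arm 0 is optimal, the others are increasingly suboptimal.
gaps = [0.0, 0.1, 0.2, 0.4]
k_half = effective_arms(gaps, 1, 0.5)  # 1 + 1 + 0.5 + 0.25 = 2.75
k_one = effective_arms(gaps, 1, 1.0)   # 1 + 1 + 0.25 + 0.0625 = 2.3125
```

For these gaps, moving from $\rho = 1$ to $\rho = 1/2$ raises $k_{1,\rho}$ from 2.3125 to 2.75, which lowers the corresponding term of the bound. The change is small, which is in line with the reported low empirical sensitivity to $\rho$.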

This review was created by AI and reviewed by human editors.