[Paper Review] Regret Analysis of the Anytime Optimally Confident UCB Algorithm
This paper introduces OCUCB-$n$, an anytime variant of the Optimally Confident UCB algorithm for stochastic multi-armed bandits with subgaussian noise. It achieves near-optimal finite-time regret bounds without requiring prior knowledge of the horizon, matching asymptotic lower bounds up to a factor of $\eta$ and $\sqrt{\log\log n}$, with a novel confidence bound that adapts to effective arm counts via parameter $\rho$. The algorithm uses a dynamic $B_i(t)$ term to refine exploration, improving on standard UCB and MOSS in finite-time performance while maintaining theoretical optimality in the asymptotic regime.
I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.
Motivation & Objective
- To develop an anytime version of the Optimally Confident UCB (OCUCB) algorithm that does not require prior knowledge of the horizon $n$.
- To achieve finite-time regret bounds that are nearly optimal, matching known lower bounds up to $\sqrt{\log\log n}$ terms.
- To refine the notion of problem difficulty by introducing $k_{i,\rho}$, representing the number of 'effective' arms with larger mean gaps.
- To provide a rigorous regret analysis for the new algorithm, showing asymptotic optimality up to a factor $\eta > 1$.
- To improve upon existing horizon-free algorithms like UCB and MOSS by incorporating adaptive confidence bounds based on arm similarity and sampling counts.
Proposed method
- The algorithm selects arms using an upper confidence bound $\gamma_i(t) = \hat{\mu}_i(t-1) + \sqrt{\frac{2\eta \log(B_i(t-1))}{T_i(t-1)}}$, where $B_i(t-1)$ adapts based on sampling counts and arm similarities.
- The confidence term is $B_i(t-1) = \max\left\{ e,\ \log t,\ \frac{t \log t}{\sum_{j=1}^K \min\{ T_i(t-1),\ T_j(t-1)^{\rho}\, T_i(t-1)^{1-\rho} \}} \right\}$. The sum in the denominator compares the sample count of arm $i$ with that of every arm $j$, which is how the index captures the number of effective arms.
- The parameter $\rho \in [1/2,1]$ controls the sensitivity to arm similarity, with $\rho = 1/2$ being the canonical choice that balances robustness and performance.
- The algorithm initializes by pulling each arm once for the first $K$ rounds, then proceeds with index-based selection using the confidence bound.
- The regret analysis relies on concentration inequalities and a novel confidence level selection that depends on $\tau_{i,n}$, the time when arm $i$ is expected to be sampled sufficiently.
- A lower bound is derived in Appendix A that nearly matches the upper bound, validating the tightness of the regret guarantee up to $\log\log n$ terms.
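The index and selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function names `ocucb_n_index` and `run_ocucb_n`, the defaults `eta=2.0` and `rho=0.5`, and the unit-variance Gaussian rewards (one example of 1-subgaussian noise) are all assumptions made for the example, and the sum inside $B_i(t-1)$ is taken over all $K$ arms.

```python
import math
import random


def ocucb_n_index(i, t, means, counts, eta=2.0, rho=0.5):
    """Index of arm i at round t, from empirical means and pull counts T_j(t-1).

    Hypothetical helper; eta > 1 and rho are the tuning parameters from the text.
    """
    t_i = counts[i]
    # Denominator of B_i(t-1): sum over arms j of min{T_i, T_j^rho * T_i^(1 - rho)}.
    denom = sum(min(t_i, (t_j ** rho) * (t_i ** (1.0 - rho))) for t_j in counts)
    b_i = max(math.e, math.log(t), t * math.log(t) / denom)
    return means[i] + math.sqrt(2.0 * eta * math.log(b_i) / t_i)


def run_ocucb_n(arm_means, horizon, eta=2.0, rho=0.5, seed=0):
    """Simulate the index policy on unit-variance Gaussian arms (an assumption).

    Returns the pull counts and the pseudo-regret sum_i Delta_i * T_i(horizon).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            # Initialisation: pull each arm once during the first K rounds.
            arm = t - 1
        else:
            means = [sums[j] / counts[j] for j in range(k)]
            arm = max(
                range(k),
                key=lambda j: ocucb_n_index(j, t, means, counts, eta, rho),
            )
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    best = max(arm_means)
    regret = sum((best - m) * c for m, c in zip(arm_means, counts))
    return counts, regret
```

Calling `run_ocucb_n([0.0, 1.0], horizon=2000)` returns the per-arm pull counts together with the pseudo-regret. The horizon is passed only to stop the simulation; the index itself depends on the current round $t$ alone, which is what makes the rule anytime.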
Experimental results
Research questions
- RQ1: Can an anytime version of OCUCB be designed that achieves near-optimal regret without requiring knowledge of the horizon $n$?
- RQ2: How does the choice of $\rho$ affect the finite-time and asymptotic regret performance of the algorithm?
- RQ3: Can the confidence bound in UCB be refined using a dynamic term $B_i(t)$ that accounts for effective arm counts and sampling balance?
- RQ4: What is the tightest possible finite-time regret bound for a horizon-free UCB variant in subgaussian bandits?
- RQ5: To what extent can the confidence level be shrunk without sacrificing theoretical guarantees, and how does this affect empirical performance?
Key findings
- The algorithm OCUCB-$n$ achieves a finite-time regret bound of $R^{\text{OCUCB-}n}_{\mu}(n) \leq C_{\eta} \sum_{i:\Delta_i>0} \left( \Delta_i + \frac{1}{\Delta_i} \log \max\left\{ \frac{n\Delta_i^2 \log n}{k_{i,\rho}}, \log n \right\} \right)$, which is nearly optimal.
- The asymptotic regret satisfies $\limsup_{n\to\infty} R^{\text{OCUCB-}n}_{\mu}(n)/\log n \leq \sum_{i:\Delta_i>0} \frac{2\eta}{\Delta_i}$, matching the Lai-Robbins lower bound up to the factor $\eta > 1$.
- The term $k_{i,\rho} = \sum_{j=1}^K \min\{1, \Delta_i^{2\rho}/\Delta_j^{2\rho}\}$ quantifies the number of effective arms influencing regret, and is non-increasing in $\rho$, with $\rho=1/2$ being optimal for theoretical tightness.
- Empirically, the algorithm shows little sensitivity to $\rho \in [1/2,1]$, and the performance remains stable across different configurations.
- The analysis shows that the $\log\log n$ term in the regret bound is unavoidable under current techniques, and a nearly matching lower bound in Appendix A confirms that the bound is close to tight.
- The algorithm remains robust even when the logarithmic terms in $B_i(t-1)$ are simplified, suggesting potential for empirical improvements without theoretical loss.
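To make the role of $k_{i,\rho}$ concrete, the sketch below evaluates it, and the per-arm logarithmic factor from the finite-time bound, for a small set of gaps. The helper names `effective_arms` and `log_term` and the example gaps are hypothetical, chosen only for illustration. An arm with $\Delta_j = 0$ is treated as contributing 1 to the sum, since the ratio is then unbounded and the minimum with 1 applies.

```python
import math


def effective_arms(gaps, i, rho):
    """k_{i,rho} = sum_j min{1, (gap_i / gap_j)^(2 rho)} for a list of gaps."""
    total = 0.0
    for gap_j in gaps:
        if gap_j <= gaps[i]:
            # Ratio is at least 1 (or gap_j is 0), so the minimum is 1.
            total += 1.0
        else:
            total += (gaps[i] / gap_j) ** (2.0 * rho)
    return total


def log_term(gaps, i, n, rho):
    """Per-arm factor log max{n * gap_i^2 * log(n) / k_{i,rho}, log n}."""
    k = effective_arms(gaps, i, rho)
    return math.log(max(n * gaps[i] ** 2 * math.log(n) / k, math.log(n)))


# Hypothetical gaps: one optimal arm and three suboptimal arms.
gaps = [0.0, 0.1, 0.5, 1.0]
for rho in (0.5, 0.75, 1.0):
    print(rho, effective_arms(gaps, 1, rho), log_term(gaps, 1, 10_000, rho))
```

For the arm with gap 0.1, only arms with a gap no larger than its own count fully; arms with larger gaps are discounted, and more heavily so as $\rho$ grows. That is why $k_{i,\rho}$ is non-increasing in $\rho$ and why the logarithmic factor is smallest at $\rho = 1/2$.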
This review was created by AI and reviewed by human editors.