QUICK REVIEW

[Paper Review] Adapting to Misspecification in Contextual Bandits

Dylan J. Foster, Claudio Gentile | arXiv (Cornell University) | Jul 12, 2021
Advanced Bandit Algorithms Research | 21 citations
TL;DR

This paper introduces a new family of oracle-efficient algorithms for contextual bandits that adapt to unknown model misspecification in both finite and infinite action settings. By reinterpreting the action selection rule of SquareCB as a log-barrier regularized optimization problem, the method achieves the optimal regret bound of $\tilde{\mathcal{O}}(d\sqrt{T} + \varepsilon\sqrt{d}T)$ for linear contextual bandits without prior knowledge of the misspecification level $\varepsilon$. It also supports adversarially chosen contexts by relying on online regression oracles.

ABSTRACT

A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, but typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexible, yet degrade gracefully in the face of model misspecification? We introduce a new family of oracle-efficient algorithms for $\varepsilon$-misspecified contextual bandits that adapt to unknown model misspecification -- both for finite and infinite action settings. Given access to an online oracle for square loss regression, our algorithm attains optimal regret and -- in particular -- optimal dependence on the misspecification level, with no prior knowledge. Specializing to linear contextual bandits with infinite actions in $d$ dimensions, we obtain the first algorithm that achieves the optimal $O(d\sqrt{T} + \varepsilon\sqrt{d}T)$ regret bound for unknown misspecification level $\varepsilon$. On a conceptual level, our results are enabled by a new optimization-based perspective on the regression oracle reduction framework of Foster and Rakhlin, which we anticipate will find broader use.

Motivation & Objective

  • To develop computationally efficient contextual bandit algorithms that remain effective under model misspecification.
  • To extend the SquareCB reduction framework to infinite action sets while preserving optimality and adaptivity.
  • To resolve the open problem of adapting to unknown misspecification levels in linear contextual bandits.
  • To provide a general-purpose, flexible approach that degrades gracefully under misspecification without prior knowledge of the misspecification level.

Proposed method

  • Reinterprets the action selection in SquareCB as an approximation to a log-barrier regularized optimization problem, enabling extension to infinite action spaces.
  • Uses an online regression oracle for square loss to maintain computational efficiency and adaptivity.
  • Combines the algorithm with a bandit model selection procedure akin to CORRAL to adapt to unknown misspecification levels.
  • Employs a rounding-based iterative scheme to maintain distributional support and suboptimality gap control, with complexity bounded by $\mathcal{O}(d^4|\mathcal{A}|)$ operations.
  • Introduces a novel optimization-based perspective on the regression oracle reduction framework, enabling generalization beyond realizability.
  • Supports adversarially chosen contexts by relying on online regression oracles, which are a stronger requirement than offline oracles but allow for more efficient updates.
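
To make the log-barrier reinterpretation concrete for a finite action set, here is a minimal sketch, not the paper's implementation. It assumes a reward-maximization convention, a vector `y_hat` of oracle predictions for the current context, and a learning-rate parameter `gamma`; the function names are hypothetical. The first function is the closed-form inverse gap weighting associated with SquareCB. The second solves the regularized problem $\max_p \langle p, \hat{y}\rangle + \frac{1}{\gamma}\sum_a \log p(a)$ over the simplex, whose stationarity condition gives $p(a) = 1/(\lambda + \gamma(\hat{y}_{\max} - \hat{y}(a)))$ with a normalizing constant $\lambda \in [1, K]$. The infinite-action construction and the rounding scheme are not shown.

```python
def inverse_gap_weighting(y_hat, gamma):
    """SquareCB-style closed form: p(a) = 1 / (K + gamma * gap(a)) for a != best."""
    k = len(y_hat)
    best = max(range(k), key=lambda a: y_hat[a])
    p = [1.0 / (k + gamma * (y_hat[best] - y)) for y in y_hat]
    p[best] = 0.0
    p[best] = 1.0 - sum(p)  # the greedy action takes the remaining mass
    return p


def log_barrier_distribution(y_hat, gamma, iters=100):
    """Solve max_p <p, y_hat> + (1/gamma) * sum_a log p(a) over the simplex.

    Stationarity gives p(a) = 1 / (lam + gamma * gap(a)); the constant
    lam in [1, K] that makes p sum to one is found by bisection.
    """
    k = len(y_hat)
    top = max(y_hat)
    gaps = [top - y for y in y_hat]

    def total_mass(lam):
        return sum(1.0 / (lam + gamma * g) for g in gaps)

    lo, hi = 1.0, float(k)  # total_mass(1) >= 1 >= total_mass(K)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total_mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    p = [1.0 / (hi + gamma * g) for g in gaps]
    z = sum(p)  # remove the residual bisection error
    return [x / z for x in p]
```

Fixing `lam = K` instead of solving for it recovers the inverse gap weighting probabilities for every non-greedy action, which is the sense in which the closed form approximates the optimization problem in this sketch.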

Experimental results

Research questions

  • RQ1: Can we design a contextual bandit algorithm that adapts to unknown model misspecification while maintaining optimal regret and computational efficiency?
  • RQ2: How can we extend the SquareCB reduction to infinite action sets without sacrificing optimality or adaptivity?
  • RQ3: Can we achieve optimal regret dependence on the misspecification level $\varepsilon$ without prior knowledge of $\varepsilon$?
  • RQ4: Is it possible to generalize the CORRAL-style aggregation framework to infinite action settings with improved logarithmic factors?

Key findings

  • The proposed algorithm achieves the optimal regret bound of $\tilde{\mathcal{O}}(d\sqrt{T} + \varepsilon\sqrt{d}T)$ for linear contextual bandits with infinite actions and unknown misspecification level $\varepsilon$.
  • The algorithm is oracle-efficient, requiring only access to an online oracle for square loss regression, and maintains optimal dependence on the misspecification level $\varepsilon$.
  • The method generalizes the SquareCB framework to infinite action sets by framing action selection as a log-barrier regularized optimization problem.
  • The algorithm degrades gracefully under misspecification and adapts to unknown $\varepsilon$ without prior knowledge, resolving an open problem posed by Lattimore et al. (2020).
  • A new variant of the CORRAL algorithm is developed, which is simpler, more flexible, and features improved logarithmic factors in regret bounds.
  • The total computational complexity is bounded by $\tilde{\mathcal{O}}(d^4|\mathcal{A}|)$ operations, with sparse support representations ensuring memory efficiency.
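
As a rough illustration of how the two terms of the regret bound trade off, the sketch below evaluates $d\sqrt{T}$ and $\varepsilon\sqrt{d}T$ with constants and logarithmic factors ignored (an assumption made only for this example; the function names are hypothetical). Setting the terms equal gives the crossover $\varepsilon = \sqrt{d/T}$: below it the estimation term dominates, above it the misspecification term does.

```python
import math


def regret_bound_terms(d, T, eps):
    """Return (estimation term, misspecification term), constants and logs dropped."""
    estimation = d * math.sqrt(T)
    misspecification = eps * math.sqrt(d) * T
    return estimation, misspecification


def crossover_epsilon(d, T):
    """Misspecification level at which d*sqrt(T) equals eps*sqrt(d)*T."""
    return math.sqrt(d / T)


if __name__ == "__main__":
    d, T = 16, 10_000
    eps_star = crossover_epsilon(d, T)
    print(eps_star, regret_bound_terms(d, T, eps_star))
```

For example, with $d = 16$ and $T = 10{,}000$, the crossover sits at $\varepsilon = \sqrt{16/10000} = 0.04$, where both terms equal $1600$ up to floating-point rounding.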


This review was created by AI and reviewed by human editors.