QUICK REVIEW

[Paper Review] More Adaptive Algorithms for Adversarial Bandits

Chen-Yu Wei, Haipeng Luo | arXiv (Cornell University) | Jan 10, 2018

Advanced Bandit Algorithms Research | Cited by 40

One-Sentence Summary

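Introduces Broad-OMD, a flexible Online Mirror Descent algorithm with a barrier regularizer for adversarial multi-armed bandits and combinatorial semi-bandits, which yields several data-dependent regret bounds and, under different instantiations, a number of parameter-free variants.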

ABSTRACT

We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or more generally the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss and logarithmic regret when there exists an arm whose expected loss is always smaller than those of others by a fixed gap (e.g. the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply the optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on using a sophisticated increasing learning rate schedule.
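The main idea in the abstract can be written down in standard optimistic mirror descent notation. The block below is a sketch under assumptions, not a transcription of the paper's algorithm: the symbols `w'_t` (auxiliary iterate), `a_t` (correction term), `\Omega` (feasible set), `K` (number of arms) and the per-coordinate learning rates `\eta_{t,i}` are notation introduced here for illustration, and the specific choices of `m_t`, `a_t` and `\eta_{t,i}` that produce each bound are left unspecified.

```latex
\begin{align*}
% log-barrier regularizer
\psi_t(w) &= \sum_{i=1}^{K} \frac{1}{\eta_{t,i}} \ln \frac{1}{w_i} \\
% Bregman divergence induced by \psi_t
D_{\psi_t}(w, u) &= \psi_t(w) - \psi_t(u) - \langle \nabla \psi_t(u),\, w - u \rangle \\
% optimistic step: play w_t using the prediction m_t
w_t &= \operatorname*{argmin}_{w \in \Omega} \; \langle w, m_t \rangle + D_{\psi_t}(w, w'_t) \\
% update step: use the loss estimate plus a correction term
w'_{t+1} &= \operatorname*{argmin}_{w \in \Omega} \; \langle w, \hat{\ell}_t + a_t \rangle + D_{\psi_t}(w, w'_t)
\end{align*}
```

With `m_t = 0` and `a_t = 0` this reduces to plain Online Mirror Descent with a log-barrier regularizer.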

Motivation and Objectives

Develop a novel, generic algorithm for adversarial bandits and semi-bandits that adapts to data properties.
Derive multiple data-dependent regret bounds that can improve upon prior work in various environments.
Show how optimism, adaptivity, log-barrier regularization, and increasing learning rates enable these bounds.
Provide parameter-free variants in several results and analyze practical implementation in MAB and semi-bandit settings.

Proposed Method

Propose Broad-OMD, an Online Mirror Descent algorithm using a log-barrier regularizer on the convex hull of action sets.
Incorporate optimistic predictions and adaptive correction terms into the loss vectors to achieve data-dependent bounds.
Use a time-varying regularizer and an increasing learning rate schedule to obtain path-length based guarantees.
Derive regret bounds for different configurations (Option I and II) and different choices of the optimistic prediction m_t, the loss estimator \hat{\ell}_t, and the learning rate \eta_t.
Employ reservoir sampling and uniform exploration to estimate unknown quantities when needed (for parameter-free variants).
Specialize the generic framework to the MAB and semi-bandit settings to obtain concrete adaptive bounds.
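To make the steps above concrete, here is a minimal Python sketch of the simplest special case: plain log-barrier Online Mirror Descent for the multi-armed bandit, with no optimistic prediction, no correction term, and a fixed learning rate. It is not Broad-OMD itself. The function names, the bisection solver, the importance-weighted estimator, and the default `eta=0.1` are all assumptions made for illustration.

```python
import numpy as np

def log_barrier_omd_step(w, loss_est, eta, iters=60):
    """One OMD step on the simplex with regularizer (1/eta) * sum_i ln(1/w_i).

    The optimality condition is 1/w_new_i = 1/w_i + eta * (loss_est_i - lam),
    where the scalar lam is chosen so that w_new sums to one.
    """
    inv_w = 1.0 / w
    lo = loss_est.min()                    # at lam = lo the weights sum to <= 1
    hi = (loss_est + inv_w / eta).min()    # as lam -> hi the sum diverges
    for _ in range(iters):                 # bisection on lam
        mid = 0.5 * (lo + hi)
        total = np.sum(1.0 / (inv_w + eta * (loss_est - mid)))
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    w_new = 1.0 / (inv_w + eta * (loss_est - lo))
    return w_new / w_new.sum()             # remove residual numerical error

def run_bandit(losses, eta=0.1, seed=0):
    """Play a T x K loss matrix with bandit feedback; losses assumed in [0, 1]."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    w = np.full(K, 1.0 / K)
    total_loss = 0.0
    for t in range(T):
        arm = rng.choice(K, p=w)           # sample an arm from the current weights
        total_loss += losses[t, arm]
        loss_est = np.zeros(K)             # importance-weighted loss estimate
        loss_est[arm] = losses[t, arm] / w[arm]
        w = log_barrier_omd_step(w, loss_est, eta)
    return w, total_loss
```

For example, `run_bandit(np.tile([0.0, 1.0, 1.0], (500, 1)))` shifts weight toward arm 0. The optimistic predictions, correction terms, and increasing learning rates described above would be layered on top of this basic update.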

Results

Research Questions

RQ1: Can a single, generic algorithm (Broad-OMD) yield multiple data-dependent regret bounds in adversarial bandits and semi-bandits?
RQ2: How do optimism, adaptivity, log-barrier regularization, and increasing learning rates contribute to improved or parameter-free regret guarantees?
RQ3: What are the concrete data-dependent quantities (e.g., variance of the best arm, path-lengths) that drive these bounds in MAB/semi-bandit settings?
RQ4: Can these bounds translate into practical improvements in convergence for game-theoretic scenarios under bandit feedback?
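The data-dependent quantities named in RQ3 can be computed in hindsight from a full loss matrix. The sketch below assumes common definitions: the variance term is the sum of squared deviations of the best arm's losses from their own average, and the first-order path-length is the sum of absolute consecutive differences, with the loss before round 1 taken to be 0. The function name, the returned keys, and these exact conventions are assumptions for illustration and may differ from the paper's in constants or indexing.

```python
import numpy as np

def best_arm_statistics(losses):
    """Hindsight statistics of the best arm in a T x K loss matrix."""
    losses = np.asarray(losses, dtype=float)
    cumulative = losses.sum(axis=0)
    best = int(np.argmin(cumulative))      # arm with the smallest total loss
    col = losses[:, best]
    # Sum of squared deviations from the arm's own average loss.
    variance = float(np.sum((col - col.mean()) ** 2))
    # Sum of |l_t - l_{t-1}|, assuming l_0 = 0.
    path_length = float(np.sum(np.abs(np.diff(col, prepend=0.0))))
    return {
        "best_arm": best,
        "cumulative_loss": float(cumulative[best]),
        "variance": variance,
        "path_length": path_length,
    }
```

A bound that depends only on these best-arm quantities can be much smaller than a worst-case bound when the best arm's losses are nearly constant or change slowly.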

Key Findings

Different instantiations of the generic Broad-OMD framework yield various data-dependent regret bounds, including bounds that depend on the variance of the best arm and on its first-order path-length.
Using a log-barrier regularizer within Online Mirror Descent with optimistic predictions and adaptive correction terms yields regret guarantees in adversarial bandits and semi-bandits.
The framework accommodates parameter-free variants via doubling tricks and reservoir sampling when needed to estimate hidden quantities.
Path-length based bounds and small-loss type bounds are obtained, along with a negative term that enables faster convergence in some game-playing settings with bandit feedback.
The approach unifies and extends adaptive online learning techniques to the semi-bandit setting with a relatively simple, modular analysis.
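Reservoir sampling, mentioned among the findings, keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below is the textbook version (Algorithm R), shown only to illustrate the primitive. How the paper combines it with uniform exploration to estimate unknown quantities is not reproduced here, and the function name and signature are assumptions.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly without replacement from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)         # fill the reservoir first
        else:
            j = rng.randrange(n)           # uniform index in {0, ..., n-1}
            if j < k:
                reservoir[j] = item        # keep the new item with probability k/n
    return reservoir
```

After `n` items have been seen, each one is in the reservoir with probability `k/n`, so an average over the reservoir is an unbiased estimate of the stream average using only `k` stored values.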

This review was generated by AI and reviewed by human editors.