QUICK REVIEW

[Paper Review] Preference-based Online Learning with Dueling Bandits: A Survey

Viktor Bengs, Róbert Busa-Fekete|Open access LMU (Ludwig-Maximilians-Universität München)|Jul 30, 2018

Advanced Bandit Algorithms Research|References: 152|Cited by: 24

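One-sentence summary

This survey gives a comprehensive overview of dueling bandit methods for preference-based online learning, focusing on algorithms that learn from pairwise comparisons rather than numerical rewards. It categorizes methods by their assumptions about the preference structure, analyzes sample complexity and regret bounds, and identifies open challenges in adaptivity, ranking models, and hybrid feedback settings.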

ABSTRACT

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available -- instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.

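Research motivation and objectives

Address the limitation that standard multi-armed bandit methods depend on numerical rewards, which are often unavailable in real-world applications.
Survey recent progress on preference-based multi-armed bandits (PB-MAB), where feedback takes the form of pairwise comparisons.
Categorize existing PB-MAB methods by their assumptions about the underlying preference-generating process and the properties of the feedback.
Analyze theoretical performance measures such as cumulative regret and sample complexity, particularly in the stochastic PB-MAB setting.
Identify open research problems, including adaptivity, learning from full rankings, and hybrid feedback (pairwise comparisons plus numerical rewards).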

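Proposed approach

Categorize PB-MAB methods by their assumptions about the preference structure, such as stochastic transitivity, strong stochastic transitivity, or the existence of a Condorcet winner.
Review algorithms for top-k selection, ranking, and the exploration-exploitation trade-off under pairwise feedback.
Analyze learning performance using theoretical measures such as cumulative regret and sample complexity, particularly in stochastic settings where the preference distribution is stationary.
Examine the role of parametric models (such as the Mallows model and the Plackett-Luce distribution) in ranking and their effect on learning efficiency.
Study the role of adaptivity in online learning, where the learner can actively choose which pairwise comparisons to observe.
Explore hybrid settings that combine dueling feedback with real-valued rewards, as in Xu et al. (2020), to reduce reliance on numerical feedback.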
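
To make the terms above concrete, here is a minimal Python sketch of a stochastic dueling bandit environment. It is an illustration under stated assumptions, not code from the survey: the function names (`pl_preference_matrix`, `condorcet_winner`, `duel`, `pair_regret`) are hypothetical, the pairwise probabilities are derived from Plackett-Luce style skill weights, and the regret uses one common "average" definition measured relative to the Condorcet winner.

```python
import random


def pl_preference_matrix(weights):
    """Pairwise win probabilities induced by Plackett-Luce style weights.

    P[i][j] = w_i / (w_i + w_j) is the probability that arm i beats arm j;
    the diagonal is set to 0.5 by convention.
    """
    n = len(weights)
    return [
        [0.5 if i == j else weights[i] / (weights[i] + weights[j])
         for j in range(n)]
        for i in range(n)
    ]


def condorcet_winner(P):
    """Return the arm that beats every other arm with probability > 1/2.

    Returns None when no such arm exists (for example, cyclic preferences).
    """
    n = len(P)
    for i in range(n):
        if all(P[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def duel(P, i, j, rng):
    """Simulate one noisy comparison; True means arm i won against arm j."""
    return rng.random() < P[i][j]


def pair_regret(P, winner, i, j):
    """Regret of dueling (i, j), averaged over both arms.

    Each arm contributes its gap to the Condorcet winner,
    P[winner][arm] - 1/2, so dueling the winner against itself costs 0.
    """
    return ((P[winner][i] - 0.5) + (P[winner][j] - 0.5)) / 2


# Illustrative weights (an assumption, not data from the survey).
P = pl_preference_matrix([4.0, 2.0, 1.0])
best = condorcet_winner(P)
rng = random.Random(0)
outcome = duel(P, 1, 2, rng)          # the learner only sees this bit
cost = pair_regret(P, best, 1, 2)     # hidden from the learner
```

The learner observes only the binary outcome of each duel, never the matrix `P` or the regret, which is what separates this setting from the standard bandit with numerical rewards.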

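Experimental results

Research questions

RQ1: In preference-based bandits, to what extent does letting the learner actively choose which pairwise comparisons to observe improve learning performance?
RQ2: Under different parametric models (e.g., Mallows, Plackett-Luce), what is the sample complexity of learning the optimal ranking when full or partial preference data are available?
RQ3: Can preference-based bandit algorithms achieve low cumulative regret under the weakest assumptions (e.g., weak stochastic transitivity)?
RQ4: How well do existing methods identify the Condorcet winner or the Kemeny consensus ranking when preferences are noisy or inconsistent?
RQ5: What are the theoretical and practical advantages of combining dueling feedback with real-valued reward feedback in hybrid bandit settings?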

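Key findings

The sample complexity of optimal learning under the Mallows model has been characterized, and Busa-Fekete et al. (2019) proposed a sample-optimal algorithm.
For general parametric ranking models (such as Plackett-Luce or log-linear models), no sample-optimal learning algorithm is currently known.
Computing a Kemeny consensus ranking (the ranking that minimizes pairwise disagreement) is NP-hard, although constant-factor approximations and a PTAS exist.
Adaptive sampling (where the learner chooses which pairs to compare) may improve learning efficiency, but its theoretical impact remains largely unexplored.
Hybrid bandit settings that allow both dueling feedback and real-valued reward feedback can reduce the number of pulls and duels required, as shown by Xu et al. (2020).
Despite growing interest, there is not yet a comprehensive code library of PB-MAB algorithms; duelpy is a recent effort to provide Python implementations.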
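
The Kemeny finding above can be illustrated with a brute-force sketch. This is not an algorithm from the survey: it enumerates all n! candidate rankings and keeps the one with the smallest total Kendall tau distance to the observed rankings, so it is only usable for a handful of items. The function names and the sample rankings are hypothetical.

```python
from itertools import permutations


def kendall_tau_distance(r1, r2):
    """Count the item pairs that rankings r1 and r2 order differently."""
    pos1 = {item: k for k, item in enumerate(r1)}
    pos2 = {item: k for k, item in enumerate(r2)}
    items = list(r1)
    disagreements = 0
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            x, y = items[a], items[b]
            # Opposite signs mean the two rankings disagree on (x, y).
            if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) < 0:
                disagreements += 1
    return disagreements


def kemeny_consensus(rankings):
    """Exhaustive search for a ranking minimizing total pairwise disagreement.

    Assumes every ranking contains the same items. The cost grows as n!,
    so this is a teaching aid only.
    """
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )


# Illustrative input (an assumption, not data from the survey).
votes = [("a", "b", "c"), ("a", "b", "c"), ("b", "a", "c")]
consensus = kemeny_consensus(votes)
```

With 10 items the search already covers more than 3.6 million permutations, which is why the approximation schemes mentioned above matter in practice.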

This review was generated by AI and reviewed by a human editor.