QUICK REVIEW

[Paper Review] Achieving Fairness in the Stochastic Multi-armed Bandit Problem

Vishakha Patil, Ganesh Ghalme|arXiv (Cornell University)|Jul 23, 2019

Advanced Bandit Algorithms Research | References: 43 | Cited by: 19

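One-sentence summary

This paper introduces the Fair-SMAB problem, a variant of the stochastic multi-armed bandit in which each arm must be pulled for at least a pre-specified fraction of the rounds. It proposes Fair-Learn, a meta-algorithm that guarantees fairness uniformly over time and, when combined with UCB1, achieves $O(\ln T)$ r-Regret (of constant order for sufficiently long horizons), giving a strong trade-off between fairness and regret performance.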

ABSTRACT

We study an interesting variant of the stochastic multi-armed bandit problem, called the Fair-SMAB problem, where each arm is required to be pulled for at least a given fraction of the total available rounds. We investigate the interplay between learning and fairness in terms of a pre-specified vector denoting the fractions of guaranteed pulls. We define a fairness-aware regret, called $r$-Regret, that takes into account the above fairness constraints and naturally extends the conventional notion of regret. Our primary contribution is characterizing a class of Fair-SMAB algorithms by two parameters: the unfairness tolerance and the learning algorithm used as a black-box. We provide a fairness guarantee for this class that holds uniformly over time irrespective of the choice of the learning algorithm. In particular, when the learning algorithm is UCB1, we show that our algorithm achieves $O(\ln T)$ $r$-Regret. Finally, we evaluate the cost of fairness in terms of the conventional notion of regret.

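Motivation and goals

Address fairness in sequential decision-making by ensuring that each arm is pulled for at least a specified minimum fraction of the rounds.
Formalize a new fairness-aware regret measure, r-Regret, that accounts for both reward maximization and the fairness constraints.
Develop a meta-algorithm, Fair-Learn, whose fairness guarantee holds uniformly over time regardless of the underlying learning algorithm.
Quantify the cost of fairness, measured in conventional regret, through the unfairness tolerance parameter.
Validate the theoretical fairness and regret guarantees experimentally.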

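Proposed method

The Fair-SMAB problem is formalized through a fairness vector $ r \in \mathbb{R}^k $, where each component $ r_i $ specifies the minimum fraction of pulls that arm $ i $ must have received by each time step $ t $.
r-Regret is defined as the expected regret relative to the optimal policy that satisfies the fairness constraints, extending standard regret to account for fairness.
Fair-Learn is proposed as a meta-algorithm that takes any black-box learning algorithm (such as UCB1) and enforces fairness by reserving a share of the pulls, as set by the fairness vector $ r $, for arms that the learner would otherwise under-pull.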
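The algorithm ensures that at every time step $ t $ each arm $ i $ has been pulled at least $ \lfloor r_i \cdot t \rfloor $ times, up to the unfairness tolerance $ \alpha $, which gives a deterministic, anytime fairness guarantee.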
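The fairness guarantee is shown to be independent of the choice of underlying learning algorithm, which makes the approach robust and modular.
Theoretical analysis shows that, with UCB1 as the black-box algorithm, Fair-Learn achieves $ O(\ln T) $ r-Regret, which is of constant order once the horizon is sufficiently long.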
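
The steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fair_learn` and `ucb1_choice`, the Bernoulli reward model, and the rule "force the arm with the largest deficit $ r_i (t-1) - N_i $ whenever that deficit exceeds $ \alpha $" are all choices made for this example.

```python
import math
import random


def ucb1_choice(counts, sums, t):
    """Standard UCB1 rule: try every arm once, then maximize the UCB index."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )


def fair_learn(means, r, alpha, horizon, seed=0):
    """Fair-Learn-style loop on Bernoulli arms (illustrative sketch only).

    means   : hypothetical true mean reward of each arm
    r       : fairness vector, r[i] is the guaranteed pull fraction of arm i
    alpha   : unfairness tolerance
    Returns the pull counts and the largest value of
    floor(r[i] * t) - counts[i] observed after any round t.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    worst_deficit = float("-inf")
    for t in range(1, horizon + 1):
        # How far each arm lags behind its guaranteed share before this round.
        deficits = [r[i] * (t - 1) - counts[i] for i in range(k)]
        lagging = [i for i in range(k) if deficits[i] > alpha]
        if lagging:
            # Fairness step: serve the arm that is furthest behind.
            arm = max(lagging, key=lambda i: deficits[i])
        else:
            # Learning step: defer to the black-box learner (UCB1 here).
            arm = ucb1_choice(counts, sums, t)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Track the fairness deficit after the round, for inspection.
        worst = max(math.floor(r[i] * t) - counts[i] for i in range(k))
        worst_deficit = max(worst_deficit, worst)
    return counts, worst_deficit
```

For instance, `fair_learn([0.9, 0.5, 0.2], [0.1, 0.2, 0.2], alpha=0, horizon=2000)` returns the pull counts together with the worst value of $ \lfloor r_i t \rfloor - N_{i,t} $ seen during the run; in this sketch that value should stay at or below $ \alpha $. Swapping `ucb1_choice` for another learner leaves the fairness step untouched, which is the modularity the method description refers to.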

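Experimental results

Research questions

RQ1: Can a multi-armed bandit algorithm maximize cumulative reward while enforcing a minimum pull fraction for every arm at each time step?
RQ2: How can fairness be formally integrated into the regret framework without compromising learning efficiency?
RQ3: What trade-off exists between fairness (expressed through the fairness vector $ r $) and the regret of the learning algorithm?
RQ4: Can the fairness guarantee hold uniformly over time even when the horizon $ T $ is unknown?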
RQ5: How does the cost of fairness, measured in conventional regret, change as the fairness constraints change?

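Key findings

Fair-Learn provides a fairness guarantee for all arms that holds uniformly over time, regardless of the choice of underlying learning algorithm.
When combined with UCB1, Fair-Learn achieves $ O(\ln T) $ r-Regret, of constant order for sufficiently long horizons, which indicates strong learning performance under the fairness constraints.
The fairness guarantee is uniform over time, unlike earlier approaches that guarantee fairness only asymptotically or in expectation.
The algorithm adds only $ O(1) $ computational overhead per round, making it more efficient than methods that require repeated optimization.
Through the unfairness tolerance parameter $ \alpha $, the paper makes the trade-off between fairness and regret explicit and quantifies the cost of fairness.
Experiments confirm the theoretical results, showing that Fair-Learn maintains fairness in practice while achieving low r-Regret.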
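
The relationship between r-Regret and conventional regret in the findings above can be made concrete with a short derivation. It is a sketch under assumptions not stated in this review: $ \mu_i $ is the mean reward of arm $ i $, $ \Delta_i = \max_j \mu_j - \mu_i $ is its gap, $ N_{i,T} $ is the number of times it is pulled in $ T $ rounds, and r-Regret is taken to charge only for pulls beyond the guaranteed $ \lfloor r_i T \rfloor $.

$$
\begin{aligned}
R(T) &= \sum_{i=1}^{k} \Delta_i \, \mathbb{E}[N_{i,T}] \\
R^{r}(T) &= \sum_{i=1}^{k} \Delta_i \left( \mathbb{E}[N_{i,T}] - \lfloor r_i T \rfloor \right) \\
R(T) - R^{r}(T) &= \sum_{i=1}^{k} \Delta_i \, \lfloor r_i T \rfloor
\end{aligned}
$$

Under these assumptions, the gap between the two notions grows linearly in $ T $ for every suboptimal arm with $ r_i > 0 $, which is one way to read "the cost of fairness". The paper's own statement also involves the tolerance $ \alpha $ and is not reproduced here.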

This review was generated by AI and reviewed by a human editor.