QUICK REVIEW

[Paper Review] Multi-objective Contextual Bandit Problem with Similarity Information

Eralp Turğay, Doruk Öner|arXiv (Cornell University)|Mar 11, 2018

Advanced Bandit Algorithms Research | Cited by 9

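One-Sentence Summary

This paper introduces the multi-objective contextual bandit problem with similarity information, in which several objectives may conflict and the expected reward function is Lipschitz continuous with respect to the similarity of context-arm pairs. The authors propose Pareto Contextual Zooming (PCZ), an online algorithm that adaptively partitions the context-arm space according to past rewards and the locations of past selections, and that achieves a nearly optimal $\tilde O(T^{(1+d_p)/(2+d_p)})$ Pareto regret, where $d_p$ is the Pareto zooming dimension, a measure of how complex the set of near-optimal context-arm pairs is.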

ABSTRACT

In this paper we propose the multi-objective contextual bandit problem with similarity information. This problem extends the classical contextual bandit problem with similarity information by introducing multiple and possibly conflicting objectives. Since the best arm in each objective can be different given the context, learning the best arm based on a single objective can jeopardize the rewards obtained from the other objectives. In order to evaluate the performance of the learner in this setup, we use a performance metric called the contextual Pareto regret. Essentially, the contextual Pareto regret is the sum of the distances of the arms chosen by the learner to the context dependent Pareto front. For this problem, we develop a new online learning algorithm called Pareto Contextual Zooming (PCZ), which exploits the idea of contextual zooming to learn the arms that are close to the Pareto front for each observed context by adaptively partitioning the joint context-arm set according to the observed rewards and locations of the context-arm pairs selected in the past. Then, we prove that PCZ achieves $\tilde O (T^{(1+d_p)/(2+d_p)})$ Pareto regret where $d_p$ is the Pareto zooming dimension that depends on the size of the set of near-optimal context-arm pairs. Moreover, we show that this regret bound is nearly optimal by providing an almost matching $\Omega (T^{(1+d_p)/(2+d_p)})$ lower bound.
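
The abstract describes the contextual Pareto regret only in words. One common way to formalize it is sketched below; the symbols $m$ (number of objectives), $\mu_a(x)$ (expected reward vector of arm $a$ in context $x$), $\mathbf{1}$ (the all-ones vector), and $\Delta_a(x)$ are notation assumed here for illustration and may differ from the paper's own.

```latex
% Dominance between expected reward vectors in R^m (assumed notation):
\mu_a(x) \prec \mu_{a'}(x)
  \iff
  \mu_a^i(x) \le \mu_{a'}^i(x) \ \text{for all } i \in \{1,\dots,m\}
  \ \text{and} \
  \mu_a^j(x) < \mu_{a'}^j(x) \ \text{for some } j .

% Distance of arm a to the Pareto front in context x:
% the smallest uniform boost that leaves a undominated.
\Delta_a(x) = \inf \bigl\{ \epsilon \ge 0 :
  \mu_a(x) + \epsilon \mathbf{1} \not\prec \mu_{a'}(x)
  \ \text{for every arm } a' \bigr\} .

% Contextual Pareto regret over T rounds, with context x_t and chosen arm a_t:
\mathrm{Reg}(T) = \sum_{t=1}^{T} \Delta_{a_t}(x_t) .
```

Under this reading, an arm on the Pareto front for the observed context contributes zero regret in that round, whichever objective it favors.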

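Motivation and Objectives

Address sequential decision-making with multiple, possibly conflicting objectives whose rewards depend on the context.
Model real-world applications such as recommender systems and medical diagnosis, where fairness across objectives is essential.
Develop a learning algorithm that achieves sublinear regret without fully characterizing the context-dependent Pareto front.
Incorporate similarity information between context-arm pairs to improve learning efficiency.
Establish tight regret bounds that reflect the intrinsic complexity of the Pareto front, as captured by the Pareto zooming dimension $d_p$.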

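Proposed Method

Uses a performance metric called the contextual Pareto regret, defined as the sum of the distances from the chosen arms to the context-specific Pareto front.
Introduces the Pareto Contextual Zooming (PCZ) algorithm, which adaptively partitions the joint context-arm similarity space according to the observed rewards and the selection history.
Uses confidence intervals and UCB-style exploration to balance exploitation and exploration in the multi-objective setting.
Adopts a ball-based partitioning scheme in which each ball represents a region of the similarity space, and only non-dominated balls are considered for selection.
Exploits the Lipschitz continuity of the expected reward function so that nearby context-arm pairs have similar rewards.
Organizes the balls hierarchically and dynamically refines regions with high uncertainty or high potential for Pareto improvement.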
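
The selection step in the list above (keep only non-dominated balls, then pick among them without favoring an objective) can be illustrated with a minimal sketch. It assumes each candidate ball already carries a vector of UCB-style indices, one per objective; the names `dominates`, `pareto_front`, and `select_ball` are hypothetical and not taken from the paper, and the uniform random choice among front members is an assumption based on the fairness claim in this review.

```python
import random
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(indices: List[Sequence[float]]) -> List[int]:
    """Return the positions of the index vectors that no other vector dominates."""
    return [
        i
        for i, u in enumerate(indices)
        if not any(dominates(v, u) for j, v in enumerate(indices) if j != i)
    ]


def select_ball(indices: List[Sequence[float]], rng: random.Random) -> int:
    """Pick one non-dominated ball uniformly at random (assumed fairness rule)."""
    return rng.choice(pareto_front(indices))


# Hypothetical two-objective index vectors for four candidate balls.
ucb = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.8)]
front = pareto_front(ucb)  # ball 2 is dominated by ball 1 and drops out
chosen = select_ball(ucb, random.Random(0))
print(front, chosen)
```

The sketch omits how the indices are computed and how balls are created or refined; it only shows the dominance filter that makes the choice independent of any single objective.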

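Results

Research Questions

RQ1: In the multi-objective contextual bandit setting with similarity information, is there an online learning algorithm that achieves sublinear Pareto regret?
RQ2: How does the complexity of the Pareto front, measured by the Pareto zooming dimension $d_p$, affect the regret bound?
RQ3: Can the algorithm achieve optimal regret without fully characterizing the Pareto front?
RQ4: Is the proposed $\tilde O(T^{(1+d_p)/(2+d_p)})$ regret bound nearly optimal, and can it be matched by a lower bound?
RQ5: How does the algorithm ensure fairness across objectives by sampling fairly from the estimated Pareto front?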

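Main Findings

PCZ achieves a Pareto regret bound of $\tilde O(T^{(1+d_p)/(2+d_p)})$ with high probability, where $d_p$ is the Pareto zooming dimension.
This regret bound is nearly optimal: the paper establishes an $\Omega(T^{(1+d_p)/(2+d_p)})$ lower bound that matches it up to logarithmic factors.
The algorithm does not need full knowledge of the Pareto front; its adaptive partitioning concentrates on regions near the front, which is what makes learning efficient.
The performance guarantees hold under the assumption that the expected reward function is Lipschitz continuous over the similarity space.
The algorithm selects arms fairly among those in the estimated Pareto front, avoiding bias toward any single objective.
The theoretical analysis shows that the regret depends on the intrinsic complexity of the problem, as captured by $d_p$, rather than on the full dimension of the context-arm space.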
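
To make the dependence on $d_p$ concrete, the short sketch below evaluates the exponent $(1+d_p)/(2+d_p)$ from the regret bound for a few values. The function name `regret_exponent` and the sample values of $d_p$ are illustrative assumptions, and the calculation ignores the constants and logarithmic factors hidden in $\tilde O$.

```python
def regret_exponent(d_p: float) -> float:
    """Exponent of T in the bound T^((1 + d_p) / (2 + d_p)), ignoring log factors."""
    if d_p < 0:
        raise ValueError("the Pareto zooming dimension is assumed non-negative")
    return (1.0 + d_p) / (2.0 + d_p)


# Illustrative values only; a smaller d_p gives a smaller exponent.
for d_p in (0, 1, 2, 4):
    print(f"d_p = {d_p}: regret grows like T^{regret_exponent(d_p):.3f}")
```

At $d_p = 0$ the exponent is $1/2$, and it increases toward 1 as $d_p$ grows, so the bound stays sublinear in $T$ for every finite $d_p$ but weakens as the set of near-optimal context-arm pairs becomes more complex.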

This review was generated by AI and reviewed by a human editor.