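[Paper Digest] Adaptive Sequential Experiments with Unknown Information Flows
This paper proposes a generalized multi-armed bandit (MAB) framework in which auxiliary information on each arm may arrive arbitrarily between decision epochs. It introduces an adaptive exploration approach, based on dynamically customized virtual time indexes, that adjusts baseline MAB policies so that they attain the best achievable regret rate without prior knowledge of the information arrival process. It also establishes the robustness of Thompson sampling in this setting.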
Systems that make sequential decisions in the presence of partial feedback on actions often need to strike a balance between maximizing immediate payoffs based on available information, and acquiring new information that may be essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied for designing sequential experiments when at each time epoch a single observation is collected on the action that was selected at that epoch. However, in many practical settings additional information may become available between decision epochs. We introduce a generalized MAB formulation in which auxiliary information on each arm may appear arbitrarily over time. By obtaining matching lower and upper bounds, we characterize the minimax complexity of this family of MAB problems as a function of the information arrival process, and study how salient characteristics of this process impact policy design and achievable performance. We establish the robustness of a Thompson sampling policy in the presence of additional information, but observe that other policies that are of practical importance do not exhibit such robustness. We therefore introduce a broad adaptive exploration approach for designing policies that, without any prior knowledge on the information arrival process, attain the best performance (in terms of regret rate) that is achievable when the information arrival process is a priori known. Our approach is based on adjusting MAB policies designed to perform well in the absence of auxiliary information by using dynamically customized virtual time indexes to endogenously control the exploration rate of the policy. We demonstrate our approach through appropriately adjusting known MAB policies and establishing improved performance bounds for these policies in the presence of auxiliary information.
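Research Motivation and Goals
- Address sequential decision-making when auxiliary information arrives unpredictably between decision epochs.
- Characterize the minimax complexity of MAB problems under arbitrary information arrival processes.
- Design adaptive policies that attain the best achievable regret even when the information arrival process is not known in advance.
- Show that Thompson sampling is robust to auxiliary information, and that other policies are limited in how they handle it.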
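Proposed Method
- Introduce a generalized MAB formulation in which auxiliary information on each arm may arrive at arbitrary times between decision epochs.
- Establish matching lower and upper bounds on regret, characterizing the minimax complexity as a function of the information arrival process.
- Propose an adaptive exploration framework that uses dynamically customized virtual time indexes to endogenously control the exploration rate of a baseline MAB policy.
- Adjust known MAB policies (such as UCB and Thompson sampling) by introducing virtual time indexes that track how information availability evolves.
- Prove that the resulting policies attain the best regret rate achievable when the information arrival process is known a priori.
- Prove that Thompson sampling remains robust in the presence of auxiliary information, whereas other standard policies do not.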
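The virtual time idea in the list above can be sketched in code. The following is a minimal illustration, not the paper's algorithm: it assumes an epsilon-greedy baseline, Gaussian rewards, and a hypothetical update in which each arm's virtual clock ticks once per epoch and then jumps multiplicatively when auxiliary observations arrive. The function name `adaptive_eps_greedy`, the input format of `aux_counts`, and the constants `c`, `gap`, and `sigma` are all assumptions made for this example.

```python
import math
import random


def adaptive_eps_greedy(means, aux_counts, c=1.0, gap=0.2, sigma=0.5, seed=0):
    """Epsilon-greedy whose exploration rate follows per-arm virtual time.

    means:      true arm means, used only to simulate Gaussian rewards
    aux_counts: aux_counts[t][k] is the number of free auxiliary observations
                on arm k arriving before decision epoch t (hypothetical format)
    c, gap, sigma: tuning constant, assumed minimal gap, reward std (assumptions)
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # observations per arm (own pulls plus auxiliary)
    sums = [0.0] * n_arms   # running reward sums per arm
    tau = [0.0] * n_arms    # virtual time index per arm
    pulls = [0] * n_arms    # decisions actually spent on each arm
    explore_steps = 0

    for aux in aux_counts:  # one entry per decision epoch
        for k in range(n_arms):
            # Fold auxiliary observations into the estimates at no decision cost.
            for _ in range(aux[k]):
                sums[k] += rng.gauss(means[k], sigma)
                counts[k] += 1
            # Assumed update rule: one tick per epoch, then a multiplicative
            # jump per auxiliary observation (exponent capped to avoid overflow).
            boost = min(aux[k] * gap ** 2 / (c * sigma ** 2), 50.0)
            tau[k] = (tau[k] + 1.0) * math.exp(boost)

        # Per-arm exploration probability decays with that arm's virtual time.
        eps = [min(1.0 / n_arms, c * sigma ** 2 / (gap ** 2 * tau[k]))
               for k in range(n_arms)]

        u = rng.random()
        arm, acc = None, 0.0
        for k in range(n_arms):
            acc += eps[k]
            if u < acc:          # explore arm k
                arm = k
                explore_steps += 1
                break
        if arm is None:          # exploit the best empirical mean
            arm = max(range(n_arms),
                      key=lambda k: sums[k] / counts[k] if counts[k] else float("inf"))

        sums[arm] += rng.gauss(means[arm], sigma)
        counts[arm] += 1
        pulls[arm] += 1

    return pulls, explore_steps, tau
```

Calling `adaptive_eps_greedy([0.5, 0.7], [[0, 0]] * 2000)` reduces to ordinary epsilon-greedy with a 1/t exploration schedule, since the virtual clock then equals the real clock. Passing `[[1, 1]] * 2000` instead makes the virtual clocks run ahead, so the policy spends fewer epochs exploring. The design point is that the policy never needs the arrival schedule in advance: it reacts only to what has arrived so far.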
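Experimental Results
Research Questions
- RQ1: How does the arrival process of auxiliary information affect the minimax regret of sequential decision-making under partial feedback?
- RQ2: Can a single policy, with no prior knowledge of the information arrival process, attain optimal regret across all possible information arrival processes?
- RQ3: Why does Thompson sampling remain robust in the presence of auxiliary information when other MAB policies do not?
- RQ4: How do the structural characteristics of information arrival timing affect the design of effective exploration strategies in MAB problems?
- RQ5: How can virtual time indexes be used to adapt existing MAB policies to dynamically evolving information availability?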
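For reference, the regret quantities these questions refer to can be written as follows. The notation is an assumption made for this digest and may differ from the paper's: $K$ arms with mean rewards $\mu_1, \dots, \mu_K$, a policy $\pi$ that selects arm $\pi_t$ at epoch $t$, a horizon $T$, and $H$ standing for the information arrival process.

```latex
\mu^{*} = \max_{1 \le k \le K} \mu_k, \qquad
\mathcal{R}^{\pi}(\mu, H, T)
  = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \mu^{*} - \mu_{\pi_t} \right) \right], \qquad
\mathcal{R}^{*}(H, T)
  = \inf_{\pi} \, \sup_{\mu} \, \mathcal{R}^{\pi}(\mu, H, T)
```

Here the supremum ranges over the admissible class of mean vectors. RQ1 then asks how $\mathcal{R}^{*}(H, T)$ depends on $H$, and RQ2 asks whether one policy can match that rate without being given $H$.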
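Key Findings
- Within the proposed generalized MAB framework, the minimax regret is characterized as a function of the information arrival process, with explicit matching lower and upper bounds.
- Thompson sampling is robust to auxiliary information, maintaining optimal performance even when the information arrival process is unknown.
- Policies other than Thompson sampling (such as UCB) lose optimality in the presence of auxiliary information unless they are specifically adjusted.
- The proposed adaptive exploration framework, built on virtual time indexes, lets adjusted baseline MAB policies attain the best regret rate achievable when the information arrival process is known a priori.
- The virtual time index mechanism controls exploration dynamically by reflecting the rate and timing of information arrivals, which yields improved performance bounds.
- The approach is general: it can be applied to adjust known MAB policies and obtain improved regret guarantees under arbitrary information flows.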
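One intuition behind the Thompson sampling finding is that a posterior-sampling policy absorbs extra observations through its ordinary posterior update, so its exploration shrinks without any change to the policy. The sketch below illustrates that intuition only; it is not the paper's analysis. It assumes Bernoulli rewards with Beta(1, 1) priors, and the function name `thompson_with_aux` and the `aux_counts` input format are hypothetical.

```python
import random


def thompson_with_aux(probs, aux_counts, seed=0):
    """Beta-Bernoulli Thompson sampling that also ingests auxiliary observations.

    probs:      true success probabilities, used only to simulate rewards
    aux_counts: aux_counts[t][k] is the number of free auxiliary observations
                on arm k arriving before decision epoch t (hypothetical format)
    """
    rng = random.Random(seed)
    n_arms = len(probs)
    alpha = [1.0] * n_arms  # Beta posterior: 1 + observed successes
    beta = [1.0] * n_arms   # Beta posterior: 1 + observed failures
    pulls = [0] * n_arms

    for aux in aux_counts:  # one entry per decision epoch
        # Auxiliary observations update the posterior exactly like own pulls.
        for k in range(n_arms):
            for _ in range(aux[k]):
                if rng.random() < probs[k]:
                    alpha[k] += 1.0
                else:
                    beta[k] += 1.0

        # Sample one mean per arm from its posterior and play the largest sample.
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)

        if rng.random() < probs[arm]:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        pulls[arm] += 1

    return pulls, alpha, beta
```

For example, `thompson_with_aux([0.3, 0.7], [[1, 0]] * 1000)` feeds one free observation per epoch on the weaker arm. Its posterior concentrates without the policy spending decisions on it, and the selection rule itself stays unchanged.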
This digest was generated by AI and reviewed by human editors.