QUICK REVIEW

[Paper Digest] A Statistically Reliable Optimization Framework for Bandit Experiments in Scientific Discovery

Tong Li, Travis Mandel|arXiv (Cornell University)|Mar 11, 2026

Advanced Bandit Algorithms Research|Cited by 0

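ONE-SENTENCE SUMMARY

The paper develops a general algorithm-induced test (AIT) correction that enables valid hypothesis testing under adaptive bandit sampling, introduces an objective function for trading off reward against statistical power, and presents an optimization framework for choosing bandit parameters under a user-specified cost constraint.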

ABSTRACT

Scientific experimentation is largely driven by statistical hypothesis testing to determine significant differences in interventions. Traditionally, experimenters allocate samples uniformly between each intervention. However, such an approach may lead to suboptimal outcomes; multi-armed bandits (MABs) address this problem by allocating samples adaptively to maximize outcomes. Yet, two challenges have hindered the use of MABs in scientific domains. First, common hypothesis tests (e.g., $t$-tests) become invalid under adaptive sampling without correction, leading to inflated type I and type II errors. This is an understudied problem, and prior solutions suffer from issues such as low statistical power which prevent adoption in many practical settings. Second, practitioners must explicitly balance cumulative reward with statistical efficiency, yet no general methodology exists to quantify this trade-off across algorithms. In this paper, we study assumption modification and critical region correction approaches for hypothesis testing that enable common tests to be applied to adaptively collected data. We provide heuristic justification for its power efficiency and show in simulation that it achieves higher power than existing approaches. Further, we derive a theoretically and practically motivated objective function for adaptive experiment evaluation, which we integrate into a unified experimental framework. Our framework asks experimenters to specify an experiment extension cost for their problem, and based on that enables our proposed optimization procedure to select the bandit algorithm that best balances reward and power in their setting. We show that our approach enables practitioners to improve outcomes with only slightly more steps than uniform randomization, while retaining statistical validity.
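
The first challenge named in the abstract can be explored with a small simulation. The sketch below is an illustration only and does not reproduce the paper's experiments: it assumes two identical arms with N(0, 1) rewards (so the null hypothesis is true), an epsilon-greedy sampler, and a naive two-sided z-test with known unit variance that ignores how the data were collected. All function names are hypothetical. Setting `epsilon=1.0` makes the sampler allocate arms purely at random, which gives a baseline to compare against smaller values of `epsilon`.

```python
import numpy as np


def epsilon_greedy_run(horizon, epsilon, rng):
    """Collect N(0, 1) rewards from two identical arms (the null is true)."""
    sums = np.zeros(2)
    counts = np.zeros(2)
    for t in range(horizon):
        if t < 2:
            arm = t  # pull each arm once so both sample means are defined
        elif rng.random() < epsilon:
            arm = int(rng.integers(2))  # explore: pick an arm at random
        else:
            arm = int(np.argmax(sums / counts))  # exploit the larger sample mean
        sums[arm] += rng.normal()
        counts[arm] += 1
    return sums, counts


def naive_z_rejects(sums, counts, z_crit=1.96):
    """Two-sided z-test at the 5% level that ignores the adaptive sampling."""
    diff = sums[0] / counts[0] - sums[1] / counts[1]
    z = diff / np.sqrt(np.sum(1.0 / counts))
    return bool(abs(z) > z_crit)


def empirical_fpr(horizon, epsilon, n_sims=2000, seed=0):
    """Fraction of null experiments in which the naive test rejects."""
    rng = np.random.default_rng(seed)
    rejections = [
        naive_z_rejects(*epsilon_greedy_run(horizon, epsilon, rng))
        for _ in range(n_sims)
    ]
    return float(np.mean(rejections))
```

Calling `empirical_fpr(100, 1.0)` and `empirical_fpr(100, 0.1)` lets a reader compare the rejection rate of the uncorrected test under random and adaptive allocation against the nominal 0.05 level.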

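MOTIVATION AND OBJECTIVES

Motivate the use of adaptive (bandit) sampling to improve experimental outcomes while preserving valid statistical inference.
Provide a general test correction that gives valid type I error control for any bandit algorithm and for common hypothesis tests.
Introduce an objective function that balances reward and statistical power under a user-defined horizon and cost.
Develop an optimization framework that, given cost and power constraints, recommends a bandit algorithm and an experiment length.
Evaluate the proposed methods through simulation experiments on common bandit algorithms and hypothesis tests.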

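PROPOSED METHOD

Propose the algorithm-induced test (AIT) correction, which estimates the null distribution of the test statistic by simulating data collection under the same adaptive algorithm.
Prove that, for simple hypotheses, the log-likelihood ratio test (LRT) with the AIT correction is the most powerful test under adaptive data collection.
Define and justify an experiment extension cost parameter w, and derive the objective function F(T,R,w) = R/T - w*log(T) to quantify the trade-off between reward and horizon.
Formalize a PDE-based equivalence condition to justify the chosen objective and its desired properties (monotonicity, scale/shift consistency).
Develop an optimization procedure that selects the bandit algorithm parameters and horizon that maximize the proposed objective subject to a power constraint.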
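
The AIT idea summarized above can be sketched as a Monte Carlo procedure: rerun the same bandit algorithm many times on data generated under the null hypothesis, record the test statistic from each run, and take an empirical quantile of those statistics as the critical value. The code below is a minimal sketch under stated assumptions, not the authors' implementation. It assumes two Bernoulli arms, Thompson Sampling with Beta(1, 1) priors, a Welch-style t statistic, and a null success probability `p_null` that is known or plugged in. Every function name is hypothetical.

```python
import numpy as np


def run_thompson(p_arms, horizon, rng):
    """Run Bernoulli Thompson Sampling; return per-arm pull and success counts."""
    k = len(p_arms)
    pulls = np.zeros(k)
    wins = np.zeros(k)
    for _ in range(horizon):
        # Draw from each arm's Beta posterior (Beta(1, 1) prior), pull the argmax.
        draws = rng.beta(1.0 + wins, 1.0 + pulls - wins)
        arm = int(np.argmax(draws))
        pulls[arm] += 1
        wins[arm] += rng.random() < p_arms[arm]
    return pulls, wins


def abs_t_statistic(pulls, wins):
    """Absolute Welch-style t statistic for two Bernoulli arms (0 if undefined)."""
    if np.any(pulls < 2):
        return 0.0
    means = wins / pulls
    variances = means * (1.0 - means) * pulls / (pulls - 1.0)  # sample variances
    se = np.sqrt(np.sum(variances / pulls))
    return 0.0 if se == 0 else float(abs(means[0] - means[1]) / se)


def ait_critical_value(p_null, horizon, alpha=0.05, n_sims=2000, seed=0):
    """Simulate the statistic under the null with the SAME sampler; return its (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    null_stats = [
        abs_t_statistic(*run_thompson([p_null, p_null], horizon, rng))
        for _ in range(n_sims)
    ]
    return float(np.quantile(null_stats, 1.0 - alpha))


def ait_test(pulls, wins, critical_value):
    """Reject the null when the observed statistic exceeds the simulated critical value."""
    return abs_t_statistic(pulls, wins) > critical_value
```

In use, one would compute `ait_critical_value` once for the planned sampler and horizon, then pass the observed `pulls` and `wins` to `ait_test`. The design point is that the critical value comes from the sampler's own null behavior rather than from a textbook t distribution.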

Figure 1. Screenshot of our optimization framework web application, showing the relative ECP-reward performance for the empirical-study-inspired simulation. Note that the best setting for $\epsilon$-TS outperforms TS and UR near $w=0.01$.

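EXPERIMENTAL RESULTS

Research Questions

RQ1: Under adaptive bandit data collection, how can hypothesis tests be corrected so that they remain valid for arbitrary algorithms and tests?
RQ2: How can the trade-off between cumulative reward and statistical power in adaptive experiments be quantified and optimized?
RQ3: Given a user-specified experiment extension cost, which algorithmic framework achieves the best balance between reward and horizon?
RQ4: In common bandit settings, how does the proposed correction compare with existing methods (such as ART) in power and false positive rate?
RQ5: Can the framework give actionable guidance for choosing bandit parameters and experiment length in real scientific settings?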

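Key Findings

The AIT correction achieves higher power than existing methods (such as ART) across several algorithms (TS, ε-greedy, UCB), while the empirical FPR stays close to the target level (≈0.05).
For simple hypotheses, the LRT with the AIT correction is the most powerful test under adaptive data collection.
The proposed ECP-Reward objective F(T,R,w) = R/T - w*log(T) encodes the trade-off between average reward and experiment extension cost, and has useful monotonicity and scale-shift properties.
The framework provides an optimization toolkit that, given w, recommends bandit parameters and a horizon that balance reward and statistical efficiency.
Simulations show that the method delivers valid inference and improved practical performance with only slightly more steps than uniform randomization.
The method is demonstrated with common bandit algorithms (TS, ε-TS, UCB) and standard tests (t-test, ANOVA, Tukey's test).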
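
To make the ECP-Reward objective concrete, the sketch below evaluates F(T,R,w) = R/T - w*log(T) for a few candidate settings and picks the maximizer among those that meet a power requirement. It is an illustration under stated assumptions rather than the paper's optimization procedure: the natural logarithm is assumed, the `Candidate` type and `select_best` helper are hypothetical, and the rewards and power values in `EXAMPLE_CANDIDATES` are made-up placeholders, not results from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str      # label for an algorithm/parameter setting (hypothetical)
    horizon: int   # number of steps T
    reward: float  # cumulative reward R over the T steps
    power: float   # estimated statistical power at this horizon


def ecp_reward(horizon: int, reward: float, w: float) -> float:
    """F(T, R, w) = R/T - w*log(T), using the natural log (an assumption)."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return reward / horizon - w * math.log(horizon)


def select_best(
    candidates: Sequence[Candidate], w: float, min_power: float = 0.8
) -> Optional[Candidate]:
    """Return the feasible candidate with the largest objective, or None."""
    feasible = [c for c in candidates if c.power >= min_power]
    if not feasible:
        return None
    return max(feasible, key=lambda c: ecp_reward(c.horizon, c.reward, w))


# Placeholder numbers for illustration only; they are not taken from the paper.
EXAMPLE_CANDIDATES = [
    Candidate("uniform", horizon=200, reward=100.0, power=0.85),
    Candidate("ts", horizon=260, reward=156.0, power=0.82),
    Candidate("ts-short", horizon=150, reward=93.0, power=0.60),
]
```

With these placeholder numbers, `select_best(EXAMPLE_CANDIDATES, w=0.01)` picks the longer, higher-reward `ts` setting, while a large extension cost such as `w=1.0` tips the choice to the shorter `uniform` setting. The `ts-short` entry is never chosen at the default threshold because its power falls below 0.8.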

Figure 2. Screenshot of our optimization framework web application user input page.

This digest was generated by AI and reviewed by a human editor.