QUICK REVIEW

[Paper Review] Online learning in repeated auctions

Jonathan Weed, Vianney Perchet|arXiv (Cornell University)|Nov 18, 2015

Advanced Bandit Algorithms Research | References: 48 | Citations: 38

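One-Sentence Summary

This paper designs online learning strategies for bidders in repeated Vickrey auctions who receive only partial feedback, available after winning. Over $T$ rounds it achieves logarithmic regret in the stochastic setting and sublinear $\tilde{O}(\sqrt{T})$ regret in the adversarial setting, and it establishes matching minimax lower bounds, giving the first complete set of strategies for bidders in this setting.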

ABSTRACT

Motivated by online advertising auctions, we consider repeated Vickrey auctions where goods of unknown value are sold sequentially and bidders only learn (potentially noisy) information about a good's value once it is purchased. We adopt an online learning approach with bandit feedback to model this problem and derive bidding strategies for two models: stochastic and adversarial. In the stochastic model, the observed values of the goods are random variables centered around the true value of the good. In this case, logarithmic regret is achievable when competing against well behaved adversaries. In the adversarial model, the goods need not be identical and we simply compare our performance against that of the best fixed bid in hindsight. We show that sublinear regret is also achievable in this case and prove matching minimax lower bounds. To our knowledge, this is the first complete set of strategies for bidders participating in auctions of this type.

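Motivation and Objectives

Design bidding strategies for bidders in repeated second-price (Vickrey) auctions who receive only partial feedback, available after winning.
Model the learning problem as an online bandit setting with limited feedback, reflecting the dynamics of real-world advertising auctions.
Derive regret upper bounds in the stochastic and adversarial models, measuring performance against the best fixed bid in hindsight.
Establish minimax lower bounds to prove that the proposed strategies are optimal.
Address open problems for online learning in auctions, in particular those concerning covariates, richer benchmarks, and the gap between upper and lower bounds.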
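The benchmark named above, the best fixed bid in hindsight, can be written out as follows. This is a sketch under assumed notation that does not appear in the summary: $v_t$ is the value of the good in round $t$, $m_t$ is the highest competing bid, $b_t$ is the learner's bid, and ties are assumed to go against the learner.

```latex
% Utility of bidding b in round t under the second-price rule:
% the bidder wins when b exceeds the highest competing bid m_t and then pays m_t.
u_t(b) = (v_t - m_t)\,\mathbf{1}\{b > m_t\}

% Regret after T rounds against the best fixed bid in hindsight:
R_T = \max_{b \in [0,1]} \sum_{t=1}^{T} u_t(b) \;-\; \sum_{t=1}^{T} u_t(b_t)
```

Sublinear regret means $R_T / T \to 0$, so the learner's average utility approaches that of the best fixed bid.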

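Proposed Method

Models repeated Vickrey auctions with bounded values and bids (in $[0,1]$), where the bidder observes outcomes only after winning.
Applies online learning with bandit feedback: the bidder observes only whether it won and the amount it paid, not the other bids.
Proposes strategies for two models: a stochastic model (observed values are noisy around the true value) and an adversarial model (goods and values may change arbitrarily from round to round).
Derives regret lower bounds using KL divergence and information-theoretic arguments, in particular a two-adversary construction.
Proves tight minimax lower bounds through a phase-based analysis and an adaptive adversary strategy, with gaps scaled on a logarithmic scale.
Applies Fubini's theorem and averages over the strategy's internal randomness to extend the bounds from deterministic strategies to general randomized strategies.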
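The feedback structure of the stochastic model can be illustrated with a short simulation. This is a minimal sketch and not the paper's algorithm as stated in this summary: the UCB-style bid rule, the confidence bonus, the uniform noise, and the names `simulate_ucb_bidder`, `true_value`, `rival_bids` and `noise` are all assumptions made for illustration.

```python
import math
import random


def simulate_ucb_bidder(true_value, rival_bids, noise=0.1, seed=0):
    """Run a UCB-style bidder (an assumed rule) through repeated second-price auctions.

    Feedback is bandit-style: a noisy value sample is seen only after a win.
    Returns (bids, total_expected_utility).
    """
    rng = random.Random(seed)
    wins = 0           # number of auctions won so far
    value_sum = 0.0    # sum of the noisy value observations
    bids = []
    total_utility = 0.0
    for t, rival in enumerate(rival_bids, start=1):
        if wins == 0:
            bid = 1.0  # optimistic start: no data yet, so bid the maximum
        else:
            mean = value_sum / wins
            bonus = math.sqrt(2.0 * math.log(t) / wins)  # assumed confidence width
            bid = min(1.0, mean + bonus)
        bids.append(bid)
        if bid > rival:
            # Win: pay the rival's bid and observe a noisy value clipped to [0, 1].
            observed = min(1.0, max(0.0, true_value + rng.uniform(-noise, noise)))
            wins += 1
            value_sum += observed
            total_utility += true_value - rival
        # On a loss nothing is observed, so the estimate stays unchanged.
    return bids, total_utility
```

For example, `simulate_ucb_bidder(0.5, [0.3] * 200)` returns the 200 bids placed and the accumulated expected utility. Bidding optimistically keeps the bidder winning often enough to keep learning, which is the usual reason for choosing an upper confidence bound under this kind of one-sided feedback.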

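Results

Research Questions

RQ1: Can effective bidding strategies be designed for bidders in repeated Vickrey auctions with only partial (bandit) feedback?
RQ2: In the stochastic model, where observations are noisy estimates of the true value, what regret upper bound is achievable?
RQ3: In the adversarial model, where goods and values may change arbitrarily from round to round, is sublinear regret achievable?
RQ4: What is the minimax regret lower bound in this setting, and does it match the upper bound of the proposed strategies?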
RQ5: How does the gap between the upper and lower bounds arise (for example, the $\tilde{O}(\sqrt{T})$ upper bound versus the order-$\sqrt{T}$ lower bound), and is that gap tight?

Key Findings

In the stochastic model, logarithmic regret $O(\log T)$ is achievable against well-behaved adversaries.
In the adversarial model, sublinear regret $\tilde{O}(\sqrt{T})$ is achievable, and it is matched by a minimax lower bound of the same order up to logarithmic factors.
A minimax lower bound of order $\frac{1}{32}\sqrt{T}$ is established, which shows that the proposed strategy is optimal.
The lower-bound proof uses a recursive adversary construction with adaptive bid levels, which guarantees a minimum gap of $2^{-i-1}$ in phase $i$.
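The gap between the upper and lower bounds is shown to be at most a logarithmic factor, so the upper bound may still be improvable.
The analysis confirms that, in the adversarial setting, the proposed strategy is optimal up to logarithmic factors.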
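The adversarial benchmark behind these findings can be computed directly on a recorded sequence of rounds. The following is a minimal sketch, assuming that ties go against the bidder and using hypothetical names (`best_fixed_bid_regret`, `values`, `rival_bids`, `bids`) that do not come from the paper.

```python
def best_fixed_bid_regret(values, rival_bids, bids):
    """Regret of `bids` against the best single bid chosen in hindsight."""

    def utility(bid, value, rival):
        # Second-price rule: win only if the bid exceeds the rival's bid, then pay it.
        return value - rival if bid > rival else 0.0

    earned = sum(utility(b, v, m) for b, v, m in zip(bids, values, rival_bids))
    # A fixed bid's total utility changes only when it crosses a rival bid,
    # so it suffices to try 0 and a bid just above each rival bid.
    candidates = [0.0] + [min(1.0, m + 1e-9) for m in rival_bids]
    best = max(
        sum(utility(c, v, m) for v, m in zip(values, rival_bids))
        for c in candidates
    )
    return best - earned
```

With `values = [0.8, 0.2, 0.9]` and `rival_bids = [0.5, 0.6, 0.3]`, always bidding 1.0 wins the unprofitable second round as well, whereas a fixed bid slightly above 0.5 wins only the two profitable rounds, so the always-1.0 bidder has positive regret.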

This review was generated by AI and checked by a human editor.