QUICK REVIEW

[Paper Review] Bayesian Learning in Episodic Zero-Sum Games

Chang-Wei Yueh, Andy Zhao|arXiv (Cornell University)|Mar 21, 2026

Reinforcement Learning in Robotics | Cited by 0

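One-Sentence Summary

This paper analyzes Bayesian posterior sampling (Thompson sampling) in two-player, finite-horizon zero-sum Markov games with unknown transitions and rewards, proves a sublinear regret bound for the learning agent, and validates it with grid-world experiments.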

ABSTRACT

We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) Both players use the posterior sampling algorithm, and (ii) Only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$ where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the action space sizes of the two players. Experiments in a grid-world predator--prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.

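Motivation and Objectives

Motivate the learning problem in two-player zero-sum Markov games with unknown dynamics and rewards.
Develop a posterior sampling algorithm in which players sample a game model from a Bayesian posterior and play an equilibrium policy for it.
Provide theoretical regret guarantees for posterior sampling in both the two-sided setting (both players sample) and the one-sided setting (only one player samples).
Compare posterior sampling with alternative learning strategies and assess whether regret grows sublinearly.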

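Proposed Method

Model the problem as a finite-horizon, two-player zero-sum Markov game with unknown transition and reward models.
Adopt a Bayesian framework in which rewards are drawn from a parametric family and the transition and reward parameters have a joint prior.
Propose a posterior sampling algorithm: at the start of each episode, sample a game model from the current posterior and solve for its equilibrium policy (DP(M)).
Derive regret bounds for the posterior sampling agent in two settings: both players sample, and one player samples while the opponent follows an arbitrary learning rule.
Establish intermediate lemmas that connect posterior sampling to equilibrium computation and to the convergence of empirical estimates.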
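
The steps listed above can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes known mean rewards (the paper also treats rewards as unknown), a Dirichlet prior with unit pseudo-counts over transitions, and a fixed initial state 0. Each stage game is solved approximately by fictitious play, used here only as a simple stand-in for an exact linear-programming solver. All function names are hypothetical.

```python
import random

def sample_dirichlet(counts, rng):
    """Draw a probability vector from Dirichlet(counts) via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def solve_matrix_game(payoff, iters=2000):
    """Approximate the value and the row (maximizing) player's strategy of a
    zero-sum matrix game by fictitious play. Assumption: a stand-in for an LP."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    row_gain = [0.0] * n_rows  # cumulative payoff of each row vs. past column plays
    col_loss = [0.0] * n_cols  # cumulative payoff conceded by each column vs. past row plays
    i, j = 0, 0
    for _ in range(iters):
        row_counts[i] += 1
        for a in range(n_rows):
            row_gain[a] += payoff[a][j]
        for b in range(n_cols):
            col_loss[b] += payoff[i][b]
        i = max(range(n_rows), key=row_gain.__getitem__)  # best response of row player
        j = min(range(n_cols), key=col_loss.__getitem__)  # best response of column player
    lower = min(col_loss) / iters  # payoff guaranteed by the row player's empirical mix
    upper = max(row_gain) / iters  # payoff cap enforced by the column player's empirical mix
    return 0.5 * (lower + upper), [c / iters for c in row_counts]

def solve_markov_game(P, R, H, iters=2000):
    """Backward induction over H steps. P[s][a][b] is a next-state distribution,
    R[s][a][b] a mean reward to player 1. Returns step-0 values and policy[h][s]."""
    S = len(P)
    V = [0.0] * S
    policy = []
    for _ in range(H):
        new_V, step_policy = [], []
        for s in range(S):
            Q = [[R[s][a][b] + sum(p * v for p, v in zip(P[s][a][b], V))
                  for b in range(len(P[s][a]))] for a in range(len(P[s]))]
            value, strategy = solve_matrix_game(Q, iters)
            new_V.append(value)
            step_policy.append(strategy)
        V = new_V
        policy.insert(0, step_policy)  # earlier time steps go to the front
    return V, policy

def run_posterior_sampling(true_P, R, H, K, opponent, seed=0):
    """Player 1 runs posterior sampling for K episodes against `opponent`,
    a hypothetical callable (state, step, rng) -> action of player 2."""
    rng = random.Random(seed)
    S, A, B = len(true_P), len(true_P[0]), len(true_P[0][0])
    # Dirichlet parameters: unit prior pseudo-counts plus observed transitions.
    counts = [[[[1.0] * S for _ in range(B)] for _ in range(A)] for _ in range(S)]
    returns = []
    for _ in range(K):
        sampled_P = [[[sample_dirichlet(c, rng) for c in row] for row in st]
                     for st in counts]
        _, policy = solve_markov_game(sampled_P, R, H, iters=200)
        s, total = 0, 0.0  # assumption: every episode starts in state 0
        for h in range(H):
            a = rng.choices(range(A), weights=policy[h][s])[0]
            b = opponent(s, h, rng)
            s_next = rng.choices(range(S), weights=true_P[s][a][b])[0]
            counts[s][a][b][s_next] += 1.0  # conjugate posterior update
            total += R[s][a][b]
            s = s_next
        returns.append(total)
    return returns, counts
```

Passing a different `opponent` callable corresponds to the one-sided setting, where the other player follows an arbitrary rule. The two-sided setting would require a second, independent posterior sample for player 2, which this sketch omits.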

Figure 1 : The transition model used in experiments. The red arrows and numbers show the transition probabilities when player 1 at (2,2) chooses to move upward. The blue arrows and numbers show the transition probabilities when player 2 at (3,3) chooses to move right.

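Experimental Results

Research Questions

RQ1: Does posterior sampling guarantee sublinear regret for the agent in episodic zero-sum Markov games with unknown dynamics?
RQ2: What are the regret bounds when both players use posterior sampling versus when only one does?
RQ3: How do Bayesian posterior updates interact with equilibrium computation in the finite-horizon setting?
RQ4: How does learning performance compare with a fictitious-play baseline and with the true equilibrium policy?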

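Main Findings

The main theoretical result is an expected regret bound of order $O(HS\sqrt{ABHK\log(SABHK)})$.
Sublinear regret is established both when the two players use posterior sampling and when only the maximizing player uses posterior sampling against an arbitrary opponent.
Experiments in a grid-world predator-prey domain show sublinear regret scaling and competitive performance against a fictitious-play baseline.
As the number of episodes grows, posterior sampling approaches equilibrium performance and the average regret tends to zero.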
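
To see why the stated bound implies vanishing average regret, divide it by the number of episodes $K$. The symbol $\mathrm{Reg}(K)$ below is notation introduced here for the cumulative regret over $K$ episodes; it is an assumption and may differ from the paper's own notation.

```latex
\frac{\mathbb{E}[\mathrm{Reg}(K)]}{K}
  = O\!\left(\frac{HS\sqrt{ABHK\log(SABHK)}}{K}\right)
  = O\!\left(HS\sqrt{\frac{ABH\log(SABHK)}{K}}\right)
  \longrightarrow 0 \quad \text{as } K \to \infty .
```

With $H$, $S$, $A$, and $B$ held fixed, the numerator under the square root grows only logarithmically in $K$, so the per-episode regret decays at roughly a $\sqrt{\log K / K}$ rate.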

Figure 2 : Player 1’s Regret when player 1 uses posterior sampling and player 2 uses (i) true equilibrium strategy ( – eq ), (ii) player 2 uses fictitious play ( – fp ), and (iii) player 2 uses posterior sampling ( – ps ). The solid line shows the average of 50 runs and the bar is 95% confidence int

This review was generated by AI and reviewed by human editors.