QUICK REVIEW

[Paper Digest] Stochastic Stability of Reinforcement Learning in Positive-Utility Games

Georgios C. Chasparis|arXiv (Cornell University)|Sep 18, 2017

Economic theories and models | Cited by 1

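One-sentence summary

This paper proposes a stochastic-stability analysis of reinforcement learning in positive-utility, finite strategic-form games, built on an invariant-probability-measure framework that avoids reliance on Lyapunov or potential functions. It establishes a methodology for computing the invariant measure in such games and illustrates convergence to stochastically stable states in coordination games.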

ABSTRACT

This paper considers a class of reinforcement-based learning (namely, perturbed learning automata) and provides a stochastic-stability analysis in repeatedly-played, positive-utility, finite strategic-form games. Prior work in this class of learning dynamics primarily analyzes asymptotic convergence through stochastic approximations, where convergence can be associated with the limit points of an ordinary-differential equation (ODE). However, analyzing global convergence through an ODE-approximation requires the existence of a Lyapunov or a potential function, which naturally restricts the analysis to a fine class of games. To overcome these limitations, this paper introduces an alternative framework for analyzing asymptotic convergence that is based upon an explicit characterization of the invariant probability measure of the induced Markov chain. We further provide a methodology for computing the invariant probability measure in positive-utility games, together with an illustration in the context of coordination games.

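Research motivation and objectives

Overcome the limitations of ODE-based convergence analysis in reinforcement learning, which depends on the existence of a Lyapunov or potential function.
Develop a stochastic-stability analysis framework that does not rely on such functions.
Explicitly characterize the invariant probability measure of the Markov chain induced by perturbed learning automata.
Provide a methodology for computing the invariant measure in positive-utility games.
Illustrate the approach in coordination games, showing convergence to stochastically stable outcomes.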

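Proposed method

Use perturbed learning automata as the learning mechanism in finite strategic-form games.
Analyze the induced Markov chain and characterize its invariant probability measure.
Apply the invariant measure to assess the stochastic stability of learning outcomes.
Derive a procedure for computing the invariant measure in positive-utility games.
Compute long-run behavior with explicit algebraic and probabilistic techniques, without relying on ODE approximations.
Illustrate the framework through an application to coordination games, showing convergence to stochastically stable equilibria.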
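
The learning mechanism listed above can be sketched in a few lines. This is a minimal sketch, not the paper's code: the summary does not spell out the update rule, so the block assumes a commonly used form of perturbed learning automata. Each player plays a uniformly random action with small probability lam and otherwise samples from its mixed strategy x_i, then reinforces the chosen action in proportion to the positive utility it received: x_i <- x_i + eps * u_i * (e_a - x_i). The names pla_step and simulate, the payoff values, and the parameters eps and lam are all illustrative assumptions.

```python
import random


def pla_step(x, payoffs, eps, lam, rng):
    """One step of (assumed) perturbed learning automata for a 2-player game.

    x        -- list of two mixed strategies (lists of probabilities)
    payoffs  -- payoffs[i][a0][a1] is player i's utility, assumed in (0, 1]
    eps      -- step size, assumed in (0, 1)
    lam      -- perturbation probability (uniform random action)
    """
    actions = []
    for i in range(2):
        n = len(x[i])
        if rng.random() < lam:
            # Perturbation: ignore the strategy and act uniformly at random.
            a = rng.randrange(n)
        else:
            # Otherwise sample an action from the current mixed strategy.
            a = rng.choices(range(n), weights=x[i])[0]
        actions.append(a)

    new_x = []
    for i in range(2):
        u = payoffs[i][actions[0]][actions[1]]
        step = eps * u
        # x_i + step * (e_a - x_i) == (1 - step) * x_i + step * e_a
        xi = [(1.0 - step) * p for p in x[i]]
        xi[actions[i]] += step
        new_x.append(xi)
    return new_x, actions


def simulate(steps=5000, eps=0.05, lam=0.01, seed=0):
    """Run the dynamics on a hypothetical 2x2 coordination game."""
    rng = random.Random(seed)
    # Identical-interest payoffs, all strictly positive (illustrative values).
    u = [[1.0, 0.1],
         [0.1, 0.8]]
    payoffs = [u, u]
    x = [[0.5, 0.5], [0.5, 0.5]]
    for _ in range(steps):
        x, _ = pla_step(x, payoffs, eps, lam, rng)
    return x


if __name__ == "__main__":
    print(simulate())
```

Because the utilities lie in (0, 1] and eps is below 1, each update is a convex combination of the old strategy and a pure action, so the strategies stay on the probability simplex.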

Experimental results

Research questions

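RQ1: How can stochastic stability in reinforcement learning be analyzed without relying on Lyapunov or potential functions?
RQ2: What role does the invariant probability measure play in characterizing long-run learning behavior in positive-utility games?
RQ3: Can the invariant measure in positive-utility games be computed explicitly, and if so, how?
RQ4: How does this approach differ from ODE-based convergence analysis in generality and applicability?
RQ5: What are the stochastic-stability properties of the learning dynamics in coordination games under this framework?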

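Main findings

The invariant probability measure gives a direct characterization of the stochastically stable states of the reinforcement-learning dynamics.
The approach enables stochastic-stability analysis without requiring a Lyapunov or potential function to exist.
A computable invariant-measure framework is established for positive-utility games.
In coordination games, the approach identifies stochastically stable equilibria as the long-run learning outcomes.
The approach extends convergence analysis to classes of games that do not admit a potential function.
The results indicate that perturbed learning automata converge to stochastically stable states in positive-utility games even where ODE-based methods fail.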
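
To make the notion of an invariant probability measure concrete, the sketch below computes the stationary distribution of a small finite-state Markov chain by power iteration. It is a toy illustration only: the chain induced by the learning dynamics lives on a continuous strategy space, and the two-state transition matrix used here is hypothetical, not taken from the paper. The function name stationary_distribution is likewise an assumption.

```python
def stationary_distribution(P, iters=10000, tol=1e-12):
    """Approximate pi with pi = pi P for a row-stochastic matrix P.

    Uses plain power iteration from the uniform distribution; this assumes
    the chain is irreducible and aperiodic so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One left-multiplication: nxt[j] = sum_i pi[i] * P[i][j]
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


if __name__ == "__main__":
    # Hypothetical two-state chain: leave state 0 w.p. 0.01, leave state 1 w.p. 0.04.
    P = [[0.99, 0.01],
         [0.04, 0.96]]
    print(stationary_distribution(P))  # approximately [0.8, 0.2]
```

For a two-state chain with exit probabilities a and b, the invariant distribution is (b / (a + b), a / (a + b)), which gives (0.8, 0.2) here. In a stochastic-stability analysis, the quantity of interest is where this mass concentrates as the perturbation shrinks.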

This digest was generated by AI and reviewed by a human editor.