QUICK REVIEW

[Paper Review] Efficient Exploration via State Marginal Matching

Lisa Lee, Benjamin Eysenbach | arXiv (Cornell University) | Jun 12, 2019

Reinforcement Learning in Robotics | References: 66 | Cited by: 95

One-Sentence Summary

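The paper reframes exploration in reinforcement learning as State Marginal Matching (SMM), a distribution-matching objective in which the policy's state visitation distribution is matched to a target state distribution. It casts the objective as a two-player, zero-sum game between a state density model and the policy, optimizes it with fictitious play, and reports faster, broader exploration and quicker adaptation to new tasks, including with a mixture-of-policies extension (SM4).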

ABSTRACT

Exploration is critical to a reinforcement learning agent's performance in its given environment. Prior exploration methods are often based on using heuristic auxiliary predictions to guide policy behavior, lacking a mathematically-grounded objective with clear properties. In contrast, we recast exploration as a problem of State Marginal Matching (SMM), where we aim to learn a policy for which the state marginal distribution matches a given target state distribution. The target distribution is a uniform distribution in most cases, but can incorporate prior knowledge if available. In effect, SMM amortizes the cost of learning to explore in a given environment. The SMM objective can be viewed as a two-player, zero-sum game between a state density model and a parametric policy, an idea that we use to build an algorithm for optimizing the SMM objective. Using this formalism, we further demonstrate that prior work approximately maximizes the SMM objective, offering an explanation for the success of these methods. On both simulated and real-world tasks, we demonstrate that agents that directly optimize the SMM objective explore faster and adapt more quickly to new tasks as compared to prior exploration methods.

Motivation and Objectives

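Define State Marginal Matching (SMM) as a principled objective for exploration and show how it yields a task-agnostic exploration policy.
Propose a practical optimization framework that treats SMM as a two-player, zero-sum game between a state density model and the policy, solved with fictitious play.
Extend SMM to a mixture of policies to handle multimodal target distributions and speed up exploration.
Connect SMM to prior exploration methods, explaining how they approximately optimize the SMM objective and why historical averaging matters.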

Proposed Method

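Define the state marginal distribution rho_pi(s) visited by the policy and the target distribution p*(s).
Set the SMM objective to minimize KL(rho_pi(s) || p*(s)), or equivalently to maximize E[r(s)] over states s ~ rho_pi with r(s) = log p*(s) - log rho_pi(s); this expectation equals the expected target log-density E[log p*(s)] plus the state entropy H[rho_pi].
Develop a practical algorithm that uses fictitious play to alternate between fitting a density model q(s) to the states visited by historical policies and updating the policy to maximize the pseudo-reward r(s), with q(s) standing in for rho_pi(s).
Average over historical policies and density models to ensure convergence and prevent oscillation.
Extend to a mixture of policies (SM4), with a discriminator over the latent components and a mixture state marginal, enabling multimodal distribution matching.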
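The definitions listed above can be written out as a short derivation. The block below is a sketch under stated assumptions, not a transcription of the paper's equations: the symbols q (density model), z (latent mixture component), p(z), p(z | s), and the per-component reward name r_z are notation chosen here for illustration, and the mixture step assumes the mixture marginal is the p(z)-weighted sum of the component marginals.

```latex
% Objective: the KL divergence splits into a target log-density term and a state entropy term.
\begin{align*}
D_{\mathrm{KL}}\big(\rho_\pi(s)\,\|\,p^*(s)\big)
  &= -\,\mathbb{E}_{s\sim\rho_\pi}\big[\log p^*(s) - \log \rho_\pi(s)\big] \\
  &= -\Big(\mathbb{E}_{s\sim\rho_\pi}\big[\log p^*(s)\big] + \mathcal{H}\big[\rho_\pi\big]\Big).
\end{align*}

% Game form: cross-entropy is minimized when q = rho_pi, so the inner
% minimum recovers the negative KL divergence above.
\begin{equation*}
\max_{\pi}\,\min_{q}\;\mathbb{E}_{s\sim\rho_\pi}\big[\log p^*(s) - \log q(s)\big]
\end{equation*}

% Mixture case (assumed form): rho_pi(s) = sum_z p(z) rho_{pi_z}(s).
% Bayes' rule gives p(z | s) = p(z) rho_{pi_z}(s) / rho_pi(s), hence
\begin{align*}
\log \rho_\pi(s) &= \log \rho_{\pi_z}(s) + \log p(z) - \log p(z \mid s), \\
r_z(s) &= \log p^*(s) - \log \rho_{\pi_z}(s) + \log p(z \mid s) - \log p(z).
\end{align*}
```

Read this way, the discriminator mentioned in the last item is the model that estimates p(z | s), and each component is rewarded for visiting states that the target favors, that it has rarely visited itself, and that identify it among the other components.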

Experimental Results

Research Questions

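RQ1: Can exploration be formulated as a distribution-matching problem over state marginals?
RQ2: Does maximizing state entropy via SMM yield a single exploration policy that is robust across tasks?
RQ3: Can a mixture of policies improve exploration under multimodal target state distributions?
RQ4: How does SMM relate to, and unify, prior exploration methods based on prediction error?
RQ5: Does the proposed fictitious-play optimization converge, and does it outperform existing exploration strategies on complex tasks?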

Key Findings

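On both simulated and real-world tasks, SMM explores faster and adapts better than prior exploration methods.
Prediction-error-based methods approximately optimize the SMM objective once averaged over time, but can exhibit oscillatory dynamics without historical averaging.
The historical averaging (fictitious play) mechanism is essential for convergence and effective exploration.
The mixture of policies (SM4) further speeds up test-time exploration and improves downstream task performance.
In the Fetch and D'Claw experiments, SMM achieves broader state coverage and explores a wider range of object angles and knob rotations than the baselines.
SMM provides a task-agnostic exploration prior that solves downstream tasks faster than the baselines.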
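The role of historical averaging in these findings can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a small discrete state set in which the "policy" simply picks one state per iteration, so each policy's state marginal is one-hot. The names fictitious_play_smm, kl, target, and alpha are hypothetical and chosen for illustration only. The density model is the smoothed average of all past visits, and the policy is the greedy best response to the pseudo-reward log p*(s) - log q(s).

```python
import math


def fictitious_play_smm(target, steps=5000, alpha=1e-3):
    """Toy fictitious play for state marginal matching on discrete states.

    Assumption (not from the paper): the policy picks a state directly,
    so each iteration's state marginal is a one-hot vector.
    """
    n = len(target)
    counts = [0.0] * n  # visits accumulated over all historical policies
    last_state = None
    for t in range(steps):
        total = t + alpha * n
        # Density model q(s): smoothed average of historical state marginals.
        q = [(c + alpha) / total for c in counts]
        # Pseudo-reward r(s) = log p*(s) - log q(s).
        reward = [math.log(p) - math.log(qs) for p, qs in zip(target, q)]
        # Best response: the greedy policy visits the highest-reward state.
        last_state = max(range(n), key=lambda s: reward[s])
        counts[last_state] += 1.0
    averaged = [c / steps for c in counts]  # historical average of marginals
    last = [1.0 if s == last_state else 0.0 for s in range(n)]  # final policy only
    return averaged, last


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


target = [0.4, 0.3, 0.2, 0.1]  # hypothetical target state distribution p*
averaged, last = fictitious_play_smm(target)
print("KL(averaged || target):", kl(averaged, target))
print("KL(last || target):", kl(last, target))
```

In this toy setting the final policy alone keeps jumping between states and stays far from the target, while the historical average of the policies' state marginals tracks the target closely. The result holds only under the simplifying assumptions above; the paper's experiments use learned policies and density models.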

This review was generated by AI and reviewed by a human editor.