QUICK REVIEW

[Paper Review] Maximum Entropy Exploration Without the Rollouts

Jacob Adamczyk, Adam Kamoski|arXiv (Cornell University)|Mar 12, 2026
Reinforcement Learning in Robotics | Cited by: 0
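One-Sentence Summary

This paper proposes EVE, an eigenvector-based maximum entropy exploration method that computes entropy-maximizing policies from the environment's transition dynamics without relying on rollouts, and connects the result to the unregularized average-reward objective through posterior policy iteration (PPI).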

ABSTRACT

Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.

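Motivation and Objectives

  • Frame exploration as maximizing the entropy of the steady-state state-action visitation distribution induced by a policy.
  • Establish an entropy-regularized average-reward framework and relate it to a tilted transition operator.
  • Derive an eigenvector fixed-point update that computes entropy-maximizing policies without rollouts.
  • Map the regularized solution to the unregularized maximum entropy solution through posterior policy iteration (PPI).
  • Demonstrate convergence and empirical effectiveness in deterministic grid-world environments.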
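
For a small, fully known MDP, the quantity in the first objective can be evaluated directly. The sketch below is illustrative only: the four-state "reset chain", the function names, and the uniform policy are assumptions made for this example and do not come from the paper. It solves for the stationary state distribution under a fixed policy, forms d(s,a) = d(s) π(a|s), and computes H(d) = -Σ d(s,a) log d(s,a), which can never exceed log(|S||A|).

```python
import numpy as np

def stationary_state_action(P, policy):
    """Steady-state visitation d(s, a) for a known MDP and a fixed policy.

    P:      array (S, A, S), P[s, a, t] = p(t | s, a)
    policy: array (S, A),    policy[s, a] = pi(a | s)
    Assumes the induced Markov chain has a single recurrent class.
    """
    S = P.shape[0]
    # State-to-state matrix under the policy: sum_a pi(a|s) p(t|s,a).
    P_pi = np.einsum("sa,sat->st", policy, P)
    # Solve d P_pi = d together with sum(d) = 1 as one least-squares system.
    lhs = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    rhs = np.zeros(S + 1)
    rhs[-1] = 1.0
    d_state, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return d_state[:, None] * policy

def entropy(d):
    """Shannon entropy (in nats) of a distribution, ignoring zero entries."""
    d = d[d > 0]
    return float(-(d * np.log(d)).sum())

# Hypothetical toy MDP: action 0 steps right (sticking at the end),
# action 1 resets to state 0.
S, A = 4, 2
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, min(s + 1, S - 1)] = 1.0
    P[s, 1, 0] = 1.0

uniform = np.full((S, A), 1.0 / A)
d = stationary_state_action(P, uniform)
H = entropy(d)            # about 1.906 nats for the uniform policy
H_max = np.log(S * A)     # about 2.079 nats, the upper bound log(|S||A|)
print(H, H_max)
```

In this toy problem the uniform policy falls short of the bound because resets concentrate visits near state 0. The bound itself may not be attainable when the dynamics rule out a uniform stationary distribution.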

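Proposed Method

  • Define the average-reward maximum entropy objective and its entropy-regularized surrogate problem, with a prior policy and an inverse temperature parameter beta.
  • Characterize the optimal policy using a tilted matrix P̃ built from the transitions, the prior, and the reward, together with its left eigenvector u and right eigenvector v.
  • Derive the self-consistent reward r(s,a) = -log(u(s,a) v(s,a)) to achieve the target entropy rate.
  • Obtain a fixed-point update for u, u ← T(u), which balances forward and backward probability flow and converges in a projective metric.
  • Apply posterior policy iteration to solve the unregularized objective by iteratively updating the prior policy to the current optimal policy.
  • Show convergence of the EVE update and discuss computing the right eigenvector off-policy to estimate the entropy without rollouts.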
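
The spectral steps above can be sketched for a fixed reward. This is a minimal illustration under stated assumptions, not the paper's EVE update T(u): the tilted-matrix convention (reward attached to the current state-action pair), the function names, the toy MDP, and the placeholder reward are all hypothetical, and the eigenvectors come from plain power iteration. For deterministic dynamics, the regularized optimal policy is proportional to prior × v and its steady-state visitation is proportional to u × v. The PPI loop uses the same fixed reward only to show the mechanics of replacing the prior; the paper's reward is intrinsic and derived from the visitation distribution.

```python
import numpy as np

def tilted_matrix(P, prior, r, beta):
    """Assumed convention: M[(s,a),(t,b)] = exp(beta*r[s,a]) * p(t|s,a) * prior(b|t)."""
    S, A, _ = P.shape
    pair = P[:, :, :, None] * prior[None, None, :, :]        # shape (S, A, S, A)
    return (np.exp(beta * r)[:, :, None, None] * pair).reshape(S * A, S * A)

def dominant_eigenpair(M, iters=20000, tol=1e-13):
    """Power iteration for the top eigenvalue and its left (u) and right (v) eigenvectors.

    Assumes M is nonnegative and primitive, so both iterations converge.
    """
    n = M.shape[0]
    u = np.full(n, 1.0 / n)
    v = np.full(n, 1.0 / n)
    rho = 1.0
    for _ in range(iters):
        u_new, v_new = u @ M, M @ v
        rho = v_new.sum()                 # v sums to 1, so this estimates the eigenvalue
        u_new /= u_new.sum()
        v_new /= v_new.sum()
        change = max(np.abs(u_new - u).max(), np.abs(v_new - v).max())
        u, v = u_new, v_new
        if change < tol:
            break
    return rho, u, v

def solve_regularized(P, prior, r, beta):
    """Regularized solution for a fixed reward; assumes deterministic dynamics."""
    S, A, _ = P.shape
    rho, u, v = dominant_eigenpair(tilted_matrix(P, prior, r, beta))
    d = (u * v / (u @ v)).reshape(S, A)   # steady-state visitation of the tilted policy
    policy = prior * v.reshape(S, A)      # posterior policy is proportional to prior * v
    policy /= policy.sum(axis=1, keepdims=True)
    return rho, d, policy

def posterior_policy_iteration(P, r, beta, steps=10):
    """Repeatedly replace the prior by the current regularized optimum."""
    S, A, _ = P.shape
    prior = np.full((S, A), 1.0 / A)
    thetas = []
    for _ in range(steps):
        rho, _, policy = solve_regularized(P, prior, r, beta)
        thetas.append(np.log(rho) / beta)  # regularized average-reward value
        prior = policy
    return prior, thetas

# Hypothetical toy MDP: action 0 steps right (sticking at the end),
# action 1 resets to state 0.
S, A, beta = 4, 2, 1.0
P = np.zeros((S, A, S))
for s in range(S):
    P[s, 0, min(s + 1, S - 1)] = 1.0
    P[s, 1, 0] = 1.0
r = np.repeat(np.arange(S)[:, None] / (S - 1), A, axis=1)   # arbitrary placeholder reward
prior = np.full((S, A), 1.0 / A)

rho, d, policy = solve_regularized(P, prior, r, beta)
final_policy, thetas = posterior_policy_iteration(P, r, beta)
print(np.round(d, 3))
print(np.round(thetas, 4))
```

No trajectory is sampled anywhere: the visitation distribution `d` comes entirely from the two eigenvectors. With a fixed reward, the values in `thetas` should be non-decreasing across PPI steps, since each new prior removes the divergence penalty paid by the previous optimum.
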
Figure 1 : EVE converges to an exploration policy that achieves maximum entropy. Compared to the baselines, the optimal policy found by EVE produces a higher entropy and converges much faster. (Inset) “CliffWorld” environment used. The green circle denotes the initial state; stepping into the cliff

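Experimental Results

Research Questions

  • RQ1: Can the spectral properties of the tilted transition operator be used to solve the maximum entropy exploration problem without on-policy rollouts?
  • RQ2: How can the left and right eigenvectors of the tilted matrix be used to construct a self-consistent intrinsic reward that maximizes steady-state entropy?
  • RQ3: Does the entropy-regularized average-reward formulation yield a fixed-point, contraction-mapping method for obtaining entropy-maximizing policies?
  • RQ4: Can the unregularized maximum entropy objective be approximated through posterior policy iteration, steadily reducing the entropy cost?
  • RQ5: Do empirical results in deterministic grid worlds show exploration performance competitive with rollout-based baselines?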

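Key Findings

  • EVE computes entropy-maximizing policies from the dominant eigenvectors of the tilted transition matrix, with no rollouts or visitation-count estimation.
  • For beta ≥ 1, the fixed-point update u ← T(u) is a contraction mapping in the projective metric, which guarantees convergence to a unique solution.
  • For the unregularized problem, posterior policy iteration (PPI) converges to the maximum entropy solution by updating the prior policy to the resulting optimal policy.
  • Experiments in deterministic grid worlds show that EVE reaches higher steady-state state-action entropy and converges faster than rollout-based baselines.
  • In the environments studied, EVE nearly attains the maximum entropy, approaching log(|S||A|), and remains stable without discounting.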
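
The contraction claim above is stated in a projective metric. Assuming this refers to Hilbert's projective metric (the summary does not name it), the sketch below shows what such a distance measures and how a contraction behaves. The random strictly positive matrix `M` is only a stand-in for illustration: the paper's operator T is nonlinear and is not reproduced here.

```python
import numpy as np

def hilbert_metric(x, y):
    """Hilbert projective distance between two strictly positive vectors."""
    log_ratio = np.log(x) - np.log(y)
    return float(log_ratio.max() - log_ratio.min())

rng = np.random.default_rng(0)
M = rng.uniform(0.1, 1.0, size=(6, 6))   # strictly positive stand-in operator
x = rng.uniform(0.1, 1.0, size=6)
y = rng.uniform(0.1, 1.0, size=6)

before = hilbert_metric(x, y)
after = hilbert_metric(M @ x, M @ y)     # smaller: positive matrices contract this metric
scaled = hilbert_metric(3.0 * x, y)      # same as `before`: rescaling x changes nothing
print(before, after, scaled)
```

Because the distance ignores overall scale, it suits eigenvector iterations, where u is only defined up to normalization. A map that shrinks this distance by a fixed factor at every step has a unique fixed direction, which is the usual route to a uniqueness and convergence guarantee of the kind reported for beta ≥ 1.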


This review was generated by AI and reviewed by human editors.