QUICK REVIEW

[Paper Review] Maximum Entropy Exploration Without the Rollouts

Jacob Adamczyk, Adam Kamoski | arXiv (Cornell University) | 2026. 03. 12.

Reinforcement Learning in Robotics | Citations: 0

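One-Line Summary

This paper presents EVE, an eigenvector-based maximum entropy exploration method that computes the entropy-maximizing policy directly from the environment's transition dynamics without relying on rollouts, and connects it to the unregularized average-reward objective through PPI.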

ABSTRACT

Efficient exploration remains a central challenge in reinforcement learning, serving as a useful pretraining objective for data collection, particularly when an external reward function is unavailable. A principled formulation of the exploration problem is to find policies that maximize the entropy of their induced steady-state visitation distribution, thereby encouraging uniform long-run coverage of the state space. Many existing exploration approaches require estimating state visitation frequencies through repeated on-policy rollouts, which can be computationally expensive. In this work, we instead consider an intrinsic average-reward formulation in which the reward is derived from the visitation distribution itself, so that the optimal policy maximizes steady-state entropy. An entropy-regularized version of this objective admits a spectral characterization: the relevant stationary distributions can be computed from the dominant eigenvectors of a problem-dependent transition matrix. This insight leads to a novel algorithm for solving the maximum entropy exploration problem, EVE (EigenVector-based Exploration), which avoids explicit rollouts and distribution estimation, instead computing the solution through iterative updates, similar to a value-based approach. To address the original unregularized objective, we employ a posterior-policy iteration (PPI) approach, which monotonically improves the entropy and converges in value. We prove convergence of EVE under standard assumptions and demonstrate empirically that it efficiently produces policies with high steady-state entropy, achieving competitive exploration performance relative to rollout-based baselines in deterministic grid-world environments.

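Motivation and Goals

Frame exploration as maximizing the entropy of the steady-state visitation distribution induced by a policy.
Develop an entropy-regularized average-reward framework and relate it to a tilted transition operator.
Derive a fixed-point update for the dominant eigenvector that computes the entropy maximizer without rollouts.
Provide a path from the regularized solution to the unregularized maximum-entropy solution via posterior policy iteration (PPI).
Show convergence and empirical effectiveness in deterministic grid-world environments.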
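
To make the objective above concrete, the following minimal sketch computes the steady-state state-action entropy of a fixed policy and compares it with the upper bound log|S||A|. The five-state ring world, the helper names, and the use of a lazy power iteration are all illustrative assumptions; none of them come from the paper.

```python
import math

def state_action_chain(P, pi):
    """Chain over (s, a) pairs: T[(s,a)][(s',a')] = P[s][a][s'] * pi[s'][a']."""
    S, A = len(P), len(P[0])
    return [[P[s][a][s2] * pi[s2][a2] for s2 in range(S) for a2 in range(A)]
            for s in range(S) for a in range(A)]

def stationary_distribution(T, iters=2000):
    """Power iteration on the lazy chain (I + T) / 2, which has the same
    stationary distribution as T but is aperiodic. Assumes T is irreducible."""
    n = len(T)
    mu = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(mu[i] * T[i][j] for i in range(n)) for j in range(n)]
        mu = [0.5 * m + 0.5 * x for m, x in zip(mu, step)]
    return mu

def entropy(mu):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in mu if p > 0)

def ring_world(S):
    """Hypothetical deterministic ring: action 0 steps left, action 1 steps right."""
    P = [[[0.0] * S for _ in range(2)] for _ in range(S)]
    for s in range(S):
        P[s][0][(s - 1) % S] = 1.0
        P[s][1][(s + 1) % S] = 1.0
    return P

S = 5
P = ring_world(S)
uniform = [[0.5, 0.5] for _ in range(S)]  # explores both directions equally
biased = [[0.1, 0.9] for _ in range(S)]   # mostly walks right

H_uniform = entropy(stationary_distribution(state_action_chain(P, uniform)))
H_biased = entropy(stationary_distribution(state_action_chain(P, biased)))
H_max = math.log(S * 2)                   # log |S||A|
```

In this toy case the uniform policy attains the bound `H_max`, while the biased policy falls short of it. Note that this brute-force evaluation needs the full transition table; it only illustrates the quantity being maximized, not how EVE computes it.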

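Proposed Method

Define the average-reward maximum-entropy objective and its entropy-regularized surrogate in terms of a prior policy and an inverse temperature β.
Using a tilted matrix P̃ that combines the transitions, the prior policy, and the reward, express the optimal policy through its left eigenvector u and right eigenvector v.
Derive a self-consistent reward r(s,a) = -log u(s,a)v(s,a) that produces the target entropy rate.
Obtain a fixed-point update u ← T(u) that balances forward and backward probability flows and converges in the projective metric.
For the unregularized objective, apply PPI, which repeatedly replaces the prior policy with the current optimal policy.
Present an off-policy method that computes the right eigenvector to estimate the entropy without rollouts, and show that the EVE update converges.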
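
The role of the two eigenvectors can be sketched on a small random MDP. The review does not give the exact form of P̃, so the block below assumes a common construction, P̃[(s,a),(s',a')] = exp(β r(s,a)) · p(s'|s,a) · π₀(a'|s'), and uses plain power iteration rather than the paper's update T. Under that assumption, the right eigenvector v defines a "twisted" Markov chain, and the elementwise product u·v (normalized) is its stationary distribution. All names here are hypothetical.

```python
import math
import random

def tilted_matrix(P, prior, r, beta):
    """Assumed form: M[(s,a)][(s',a')] = exp(beta*r[s][a]) * P[s][a][s'] * prior[s'][a']."""
    S, A = len(P), len(P[0])
    return [[math.exp(beta * r[s][a]) * P[s][a][s2] * prior[s2][a2]
             for s2 in range(S) for a2 in range(A)]
            for s in range(S) for a in range(A)]

def power_iteration(apply, n, iters=2000):
    """Returns (dominant eigenvalue, eigenvector normalized to sum to 1).
    Converges for entrywise-positive matrices (Perron-Frobenius)."""
    x = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        y = apply(x)
        rho = sum(y)  # valid eigenvalue estimate because sum(x) == 1
        x = [yi / rho for yi in y]
    return rho, x

random.seed(0)
S, A, beta = 4, 2, 1.0
P = []
for s in range(S):
    rows = []
    for a in range(A):
        w = [random.random() + 0.1 for _ in range(S)]  # strictly positive dynamics
        rows.append([wi / sum(w) for wi in w])
    P.append(rows)
prior = [[1.0 / A] * A for _ in range(S)]               # uniform prior policy
r = [[-random.random() for _ in range(A)] for _ in range(S)]

M = tilted_matrix(P, prior, r, beta)
n = S * A
# Right eigenvector: M v = rho v.  Left eigenvector: u^T M = rho u^T.
rho, v = power_iteration(
    lambda x: [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)], n)
rho_left, u = power_iteration(
    lambda x: [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)], n)

# Twisted chain: a proper stochastic matrix built from M and v.
Q = [[M[i][j] * v[j] / (rho * v[i]) for j in range(n)] for i in range(n)]
# Its stationary distribution is proportional to u * v.
Z = sum(ui * vi for ui, vi in zip(u, v))
mu = [ui * vi / Z for ui, vi in zip(u, v)]
```

Each row of `Q` sums to one because M v = ρ v, and `mu` is stationary for `Q` because uᵀM = ρ uᵀ. This is why a reward written in terms of u and v can refer to the visitation distribution without sampling any trajectories.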

Figure 1 : EVE converges to an exploration policy that achieves maximum entropy. Compared to the baselines, the optimal policy found by EVE produces a higher entropy and converges much faster. (Inset) “CliffWorld” environment used. The green circle denotes the initial state; stepping into the cliff

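Experimental Results

Research Questions

RQ1: Can maximum entropy exploration be solved without on-policy rollouts by exploiting the spectral properties of the tilted transition operator?
RQ2: Can a self-consistent intrinsic reward that maximizes steady-state entropy be constructed from the left and right eigenvectors of the tilted matrix?
RQ3: Does the entropy-regularized average-reward formulation yield a fixed-point, contraction-mapping approach to the entropy-maximizing policy?
RQ4: Can the unregularized MaxEnt objective be approached through PPI so that the entropy cost decreases steadily?
RQ5: Do experiments in deterministic grid-world environments show exploration performance competitive with rollout-based baselines?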

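Key Results

EVE computes the entropy-maximizing policy from the dominant eigenvectors of the tilted transition matrix, without rollouts or visitation estimation.
The fixed-point update u ← T(u) is a contraction in the projective metric when β ≥ 1, which guarantees convergence to a unique solution.
For the unregularized problem, the PPI approach, which progressively replaces the prior policy with the optimal policy obtained so far, converges to the maximum-entropy solution.
In deterministic grid-world experiments, EVE achieves higher steady-state state-action entropy and faster convergence than rollout-based baselines.
In the environments studied, EVE attains near-maximal entropy, close to log|S||A|, while remaining stable without discounting.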
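
The contraction claim above is stated for the paper's update T, whose definition is not reproduced in this review. As a stand-in, the sketch below illustrates what contraction in the (Hilbert) projective metric means for a simpler map: multiplication by an entrywise-positive matrix, for which the Birkhoff-Hopf theorem gives the contraction factor tanh(Δ/4). The random matrix and all names are illustrative assumptions.

```python
import math
import random

def hilbert_metric(x, y):
    """Hilbert projective distance between positive vectors; zero iff x is a multiple of y."""
    ratios = [xi / yi for xi, yi in zip(x, y)]
    return math.log(max(ratios) / min(ratios))

def apply(M, x):
    """Matrix-vector product M x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def birkhoff_coefficient(M):
    """Contraction factor tanh(Delta / 4), where Delta is the projective diameter of M."""
    n = len(M)
    delta = max(math.log(M[i][k] * M[j][l] / (M[j][k] * M[i][l]))
                for i in range(n) for j in range(n)
                for k in range(n) for l in range(n))
    return math.tanh(delta / 4.0)

random.seed(1)
n = 6
M = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
x = [random.uniform(0.1, 1.0) for _ in range(n)]
y = [random.uniform(0.1, 1.0) for _ in range(n)]

kappa = birkhoff_coefficient(M)
distances = []
for _ in range(5):
    distances.append(hilbert_metric(x, y))
    x, y = apply(M, x), apply(M, y)
```

Because the metric ignores overall scale, two iterates that shrink toward distance zero are converging to the same direction, which is exactly the sense in which a dominant eigenvector is unique. Whether and when the paper's nonlinear update inherits this property is what the β ≥ 1 condition addresses.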


This review was generated by AI and reviewed by a human editor.