QUICK REVIEW

[Paper Review] Provably Efficient Maximum Entropy Exploration

Elad Hazan, Sham M. Kakade | arXiv (Cornell University) | 2018-12-06

Image and Signal Denoising Methods | Citations: 94

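One-Line Summary

The paper presents an algorithm that provably and efficiently optimizes intrinsic state-visitation objectives, such as the entropy of the state distribution, in an unknown MDP, using a Frank-Wolfe-style method built on planning and density-estimation oracles.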

ABSTRACT

Suppose an agent is in a (possibly unknown) Markov Decision Process in the absence of a reward signal, what might we hope that an agent can efficiently learn to do? This work studies a broad class of objectives that are defined solely as functions of the state-visitation frequencies that are induced by how the agent behaves. For example, one natural, intrinsically defined, objective problem is for the agent to learn a policy which induces a distribution over state space that is as uniform as possible, which can be measured in an entropic sense. We provide an efficient algorithm to optimize such intrinsically defined objectives, when given access to a black box planning oracle (which is robust to function approximation). Furthermore, when restricted to the tabular setting where we have sample based access to the MDP, our proposed algorithm is provably efficient, both in terms of its sample and computational complexities. Key to our algorithmic methodology is utilizing the conditional gradient method (a.k.a. the Frank-Wolfe algorithm) which utilizes an approximate MDP solver.

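Motivation and Objectives

Motivate exploration in unknown MDPs where rewards are absent or sparse.
Define and optimize intrinsic objectives that depend on the state-visitation distribution (e.g., entropy).
Show how provable efficiency can be achieved using an approximate planning oracle and a state-distribution (density estimation) oracle.
Provide results with sample and computational guarantees for the tabular setting, together with results for the unknown-MDP setting.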

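Proposed Method

Formulate exploration as maximizing a concave function R(d_π) of the induced state distribution d_π.
Represent the search space as the convex set K of feasible state distributions and reduce the optimization to this space.
Use a Frank-Wolfe (conditional gradient) style algorithm that sequentially adds policies to a policy mixture and updates the mixture weights.
At each iteration, construct a reward r_t from the gradient of R at the estimated distribution, then use the ApproxPlan oracle to obtain a near-optimal policy for r_t.
Estimate the current state distribution with the DensityEst oracle, accounting for its approximation error.
Under a smoothness assumption on R, guarantee that the number of oracle calls is O((1/ε) log(1/ε)), independent of the size of the state space.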
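The steps above can be sketched on a small tabular MDP. This is a minimal illustration under stated assumptions, not the paper's implementation: the MDP is known, exact value iteration stands in for ApproxPlan, an exact linear solve stands in for DensityEst, and R is taken to be a smoothed entropy of the form -Σ d(s) log(d(s) + σ). The chain MDP, the function names, and the values of σ, η, T, and γ are all hypothetical choices made for the example.

```python
import numpy as np

def chain_mdp(n_states=10):
    """Hypothetical toy MDP: a chain with actions 0 = left, 1 = right."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, n_states - 1)] = 1.0
    return P

def state_distribution(P, policy, d0, gamma):
    """Stand-in for DensityEst: exact discounted state distribution."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", policy, P)  # P_pi[s, s'] under the policy
    # Solve d = (1 - gamma) * d0 + gamma * P_pi^T d.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, (1 - gamma) * d0)

def plan(P, reward, gamma, n_iters=300):
    """Stand-in for ApproxPlan: value iteration on a state-only reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = reward[:, None] + gamma * (P @ V)  # shape (S, A)
        V = Q.max(axis=1)
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), Q.argmax(axis=1)] = 1.0  # greedy policy
    return policy

def smoothed_entropy_grad(d, sigma):
    """Gradient of the assumed objective -sum_s d(s) * log(d(s) + sigma)."""
    return -(np.log(d + sigma) + d / (d + sigma))

def entropy(d):
    """Plain Shannon entropy, used only for reporting."""
    d = d[d > 0]
    return float(-np.sum(d * np.log(d)))

def max_entropy_explore(P, d0, gamma, sigma=1e-3, eta=0.05, T=200):
    """Frank-Wolfe-style loop: grow a policy mixture one policy at a time."""
    n_states, n_actions, _ = P.shape
    uniform = np.full((n_states, n_actions), 1.0 / n_actions)
    policies, weights = [uniform], np.array([1.0])
    d_mix = state_distribution(P, uniform, d0, gamma)
    for _ in range(T):
        reward = smoothed_entropy_grad(d_mix, sigma)   # r_t from the gradient
        new_policy = plan(P, reward, gamma)            # near-optimal for r_t
        d_new = state_distribution(P, new_policy, d0, gamma)
        policies.append(new_policy)
        weights = np.append((1 - eta) * weights, eta)  # shrink old, add new
        # A mixture that samples one policy per episode has a state
        # distribution equal to the weighted average of the components.
        d_mix = (1 - eta) * d_mix + eta * d_new
    return policies, weights, d_mix

P = chain_mdp()
d0 = np.zeros(P.shape[0])
d0[0] = 1.0  # always start at the left end of the chain
policies, weights, d_mix = max_entropy_explore(P, d0, gamma=0.95)
print("entropy of the mixture's state distribution:", entropy(d_mix))
```

The fixed step size η mirrors the mixture update described above; in the unknown-MDP setting both stand-ins would be replaced by sample-based estimates, which this sketch does not attempt.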

Experimental Results

Research Questions

RQ1: Can we efficiently optimize intrinsic objectives defined on state visitation distributions in unknown MDPs?
RQ2: Does a Frank-Wolfe style method with planning and density oracles yield polynomial-time guarantees for entropy-based objectives?
RQ3: How many calls to planning and density estimation oracles are needed to achieve ε-suboptimality for a given R?
RQ4: What are the sample and computational complexities in the tabular vs unknown MDP settings for max-entropy exploration?

Key Results

An efficient algorithm (Algorithm 1) achieves R(d_{π_mix,T}) within ε of the optimum after O((1/ε) log(1/ε)) calls to ApproxPlan and DensityEst.
Maximizing entropy is framed as maximizing a concave functional over the induced state distribution with a convex reformulation in distribution space.
Stationary policies suffice for the optimization over distributions (via π′(a|s) = dπ(s,a)/dπ(s)).
In the tabular known-MDP setting, the method runs in polynomial time with standard planning methods; in the unknown-MDP setting, a sample-based construction (Algorithms 2 and 3) yields polynomial-time guarantees with specified episode complexity.
The paper provides a smoothed entropy proxy H_σ to enable smooth optimization and relates its optimization guarantees back to the true entropy.
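Two of the objects in the list above can be written out explicitly. The policy formula restates the one given in the list, with d_π(s) read as the marginal of d_π(s,a). The form of H_σ is an assumption here, since the review does not state it; the gradient then follows by direct differentiation.

```latex
% Stationary policy recovered from a state-action distribution
\pi'(a \mid s) = \frac{d_\pi(s,a)}{d_\pi(s)},
\qquad d_\pi(s) = \sum_{a} d_\pi(s,a)

% Smoothed entropy proxy (assumed form), with smoothing parameter \sigma > 0
H_\sigma(d_\pi) = -\sum_{s} d_\pi(s)\,\log\bigl(d_\pi(s) + \sigma\bigr)

% Its gradient with respect to the mass on state s
\frac{\partial H_\sigma}{\partial d_\pi(s)}
  = -\Bigl(\log\bigl(d_\pi(s) + \sigma\bigr)
  + \frac{d_\pi(s)}{d_\pi(s) + \sigma}\Bigr)
```

Under this form the gradient stays finite as d_π(s) approaches 0 (it tends to -log σ), whereas the gradient of the unsmoothed entropy, -log d_π(s) - 1, diverges. That boundedness is what makes a smoothness-based analysis applicable.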

This review was generated by AI and reviewed by a human editor.