QUICK REVIEW

[Paper Review] CEM-RL: Combining evolutionary and gradient-based methods for policy search

Aloïs Pourchot, Olivier Sigaud | arXiv (Cornell University) | October 2, 2018

Reinforcement Learning in Robotics | References: 30 | Citations: 95

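One-Line Summary

CEM-RL combines the Cross-Entropy Method (CEM) with TD3, pairing evolutionary search with gradient-based policy improvement to achieve competitive or superior performance and stability across continuous-control benchmarks.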

ABSTRACT

Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are two popular approaches to policy search. The former is widely applicable and rather stable, but suffers from low sample efficiency. By contrast, the latter is more sample efficient, but the most sample efficient variants are also rather unstable and highly sensitive to hyper-parameter setting. So far, these families of methods have mostly been compared as competing tools. However, an emerging approach consists in combining them so as to get the best of both worlds. Two previously existing combinations use either an ad hoc evolutionary algorithm or a goal exploration process together with the Deep Deterministic Policy Gradient (DDPG) algorithm, a sample efficient off-policy deep RL algorithm. In this paper, we propose a different combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (TD3), another off-policy deep RL algorithm which improves over DDPG. We evaluate the resulting method, CEM-RL, on a set of benchmarks classically used in deep RL. We show that CEM-RL benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.

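Research Motivation and Objectives

Combine evolutionary methods with deep reinforcement learning to balance exploration, stability, and sample efficiency in policy search.
Propose a concrete method, CEM-RL, that combines CEM with TD3's critic-based gradient updates.
Evaluate CEM-RL on standard MuJoCo benchmarks against baselines (CEM, TD3, multi-actor TD3) and an existing hybrid (ERL).
Analyze how much the evolutionary component and the gradient-based improvement each contribute to overall performance and stability.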

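Proposed Method

A population of policies is sampled from a Gaussian centered on the current mean policy, with covariance Sigma.
Half of the population is evaluated directly; the other half is first improved by gradient steps guided by the TD3 critic and then evaluated.
The mean and covariance of the population are updated from the top-performing half (the CEM update).
Experience from the evaluations is stored in a shared replay buffer, the critic is trained on the new experience, and gradient steps are applied to actors drawn from the population.
Importance mixing is examined as a way to reuse samples, and the allocation of resources between environment steps and learning updates is discussed explicitly.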
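
The loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: policies are plain parameter vectors, `toy_return` is a hypothetical stand-in for an episode return, `toy_gradient_step` is a hypothetical stand-in for the TD3 critic-guided actor update, the covariance is assumed diagonal, and elites are weighted uniformly. The replay buffer and critic training are omitted.

```python
import numpy as np

def toy_return(theta):
    """Hypothetical stand-in for an episode return; best at theta = 1."""
    return -np.sum((theta - 1.0) ** 2)

def toy_gradient_step(theta, lr=0.1):
    """Hypothetical stand-in for a critic-guided actor update:
    one gradient-ascent step on the toy objective."""
    grad = -2.0 * (theta - 1.0)
    return theta + lr * grad

def cem_rl_sketch(dim=5, pop_size=10, iterations=50, noise=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    var = np.ones(dim)            # diagonal covariance (assumption)
    half = pop_size // 2
    for _ in range(iterations):
        # Sample a population around the current mean policy.
        population = mean + np.sqrt(var) * rng.standard_normal((pop_size, dim))
        # Improve half of the population with gradient steps.
        population[:half] = np.array(
            [toy_gradient_step(p) for p in population[:half]]
        )
        # Evaluate every individual (the real method would also
        # store the resulting transitions in the replay buffer).
        returns = np.array([toy_return(p) for p in population])
        # CEM update: refit mean and variance to the top half,
        # with a small noise floor to avoid premature collapse.
        elite = population[np.argsort(returns)[::-1][:half]]
        mean = elite.mean(axis=0)
        var = elite.var(axis=0) + noise
    return mean, var

mean, var = cem_rl_sketch()
print(mean)
```

In this sketch the gradient-improved individuals compete with the purely sampled ones in the same ranking, so a gradient step only influences the next distribution if it actually yields a better return.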

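Experimental Results

Research Questions

RQ1: Does CEM-RL outperform its components (CEM and TD3) and a multi-actor variant of TD3 on standard continuous-control benchmarks?
RQ2: How does CEM-RL compare with ERL in terms of final performance, convergence speed, and learning stability?
RQ3: Does the combination improve sample efficiency and/or robustness to hyperparameters in the experiments?
RQ4: To what extent does the evolutionary component contribute beyond simply providing population-based exploration?
RQ5: Which limiting factors or environment characteristics can degrade CEM-RL's performance?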

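Key Results

CEM-TD3 generally outperforms CEM, TD3, and multi-actor TD3 on several MuJoCo benchmarks, with lower variance across training runs.
The CEM-RL variants (CEM-DDPG and CEM-TD3) outperform ERL in several environments under the tested settings; CEM-TD3 often reaches the best final performance and converges faster.
An ablation that drops the evolutionary update and instead trains several actors against a shared critic (multi-actor TD3) performs worse, which suggests that the combined evolutionary and gradient scheme is what provides the benefit.
Compared with ERL, CEM-TD3 frequently delivers better stability and final performance, especially in harder environments such as Walker2d-v2 and Ant-v2.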


This review was generated by AI and reviewed by a human editor.