QUICK REVIEW

[Paper Review] SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards

Siddharth Reddy, Anca D. Dragan|arXiv (Cornell University)|2019. 05. 27.

Reinforcement Learning in Robotics | References: 35 | Citations: 53

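One-Line Summary

SQIL is a simple imitation learning method that enables long-horizon imitation by using constant rewards in off-policy RL instead of learning a reward function. Across a variety of tasks it outperforms behavioral cloning and achieves results competitive with GAIL.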

ABSTRACT

Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo.

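Research Motivation and Goals

Motivate the need for imitation learning that avoids distribution shift in environments with high-dimensional observations and unknown dynamics.
Provide a simple RL-based imitation learning method that does not learn a reward function.
Show that constant rewards encourage the agent to match demonstrated states and to return to them after reaching out-of-distribution states, which enables long-horizon imitation.
Show that SQIL can be implemented with a few minor modifications to standard Q-learning or off-policy actor-critic algorithms.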

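Proposed Method

Initialize the replay buffer with expert demonstrations and assign a constant reward of r = +1 to the demonstration transitions.
Add new agent interaction data to the same replay buffer with the reward set to r = 0.
Sample each training batch as a 50/50 mix of demonstration data and new experience to keep the effective reward stable.
Optimize a soft Q-learning objective: the squared soft Bellman error on both the demonstrations and the new experience.
Show that this objective is equivalent to a regularized behavioral cloning objective that imposes a sparsity prior on the implicit rewards.
Extend SQIL to continuous actions by applying it on top of off-policy actor-critic methods such as SAC.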
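
The first four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a toy tabular Q-function, a soft value of the form V(s) = log sum_a exp Q(s, a), an equal weighting of the two loss terms, and no terminal-state handling. All names (`soft_value`, `squared_soft_bellman_error`, `sample_half`, `sqil_loss`), the discount value, and the toy transitions are hypothetical.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value, not taken from the review)

def soft_value(q_rows):
    """V(s) = log sum_a exp Q(s, a), computed stably for a batch of Q rows."""
    m = q_rows.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(q_rows - m).sum(axis=1, keepdims=True)))[:, 0]

def squared_soft_bellman_error(q, s, a, s_next, reward):
    """Mean squared soft Bellman error for transitions sharing one constant reward."""
    target = reward + GAMMA * soft_value(q[s_next])
    return float(np.mean((q[s, a] - target) ** 2))

def sample_half(transitions, n, rng):
    """Draw n rows (with replacement) from an (N, 3) int array of (s, a, s_next)."""
    rows = transitions[rng.integers(0, len(transitions), size=n)]
    return rows[:, 0], rows[:, 1], rows[:, 2]

def sqil_loss(q, demo, agent, batch_size, rng):
    """50/50 batch: demonstrations carry r = +1, agent experience carries r = 0."""
    half = batch_size // 2
    loss_demo = squared_soft_bellman_error(q, *sample_half(demo, half, rng), reward=1.0)
    loss_agent = squared_soft_bellman_error(q, *sample_half(agent, half, rng), reward=0.0)
    # Equal weighting of the two terms is an assumption of this sketch.
    return loss_demo + loss_agent

rng = np.random.default_rng(0)
q = np.zeros((4, 2))                      # toy tabular Q: 4 states, 2 actions
demo = np.array([[0, 1, 1], [1, 1, 2]])   # hypothetical expert (s, a, s_next) rows
agent = np.array([[0, 0, 3], [3, 0, 3]])  # hypothetical agent (s, a, s_next) rows
loss = sqil_loss(q, demo, agent, batch_size=8, rng=rng)
```

In a real agent the table `q` would be a neural network and the loss would be minimized by gradient descent, but the two ingredients stay the same: constant rewards attached by data source, and a batch split evenly between demonstrations and new experience.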

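Experimental Results

Research Questions

RQ1: Can a generic RL approach with constant rewards reproduce long-horizon imitation without learning a reward function?
RQ2: Can SQIL mitigate the distribution shift inherent in BC without adversarial training?
RQ3: Is SQIL competitive with GAIL on image-based and low-dimensional tasks while remaining simple to implement?
RQ4: How does combining demonstration data with environment interaction affect the policy over time?
RQ5: Can SQIL be applied to off-policy algorithms and continuous control settings?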

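Key Results

SQIL outperforms behavioral cloning on the tested tasks, particularly under state distribution shift.
It achieves results competitive with GAIL in both image-based and low-dimensional environments.
SQIL can be implemented with a small number of modifications to standard off-policy RL algorithms and does not need to learn a reward function.
It sustains long-horizon imitation by replaying demonstrations with a fixed reward, which steers the agent to stay close to demonstrated states.
In continuous control, SQIL implemented on top of SAC performs strongly and works even with few demonstrations.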


This review was generated by AI and reviewed by a human editor.