QUICK REVIEW

[Paper Review] Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Daniel S. Brown, Wonjoon Goo|arXiv (Cornell University)|2019. 04. 12.

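Machine Learning and Data Classification | Citations: 120

One-Line Summary

T-REX learns a reward function from ranked suboptimal demonstrations and extrapolates beyond the best demonstration, so the learner can outperform the demonstrator without ground-truth rewards or action labels.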

ABSTRACT

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often more than twice the performance of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.

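Motivation and Objectives

Learn from suboptimal demonstrations by inferring the demonstrator's intent rather than imitating the demonstrated behavior.
Develop a reward-learning-from-observation method that uses ranked trajectories to extrapolate beyond the best demonstration.
Enable a downstream RL agent to outperform the demonstrator by optimizing the inferred reward.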

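Proposed Method

Introduces Trajectory-ranked Reward EXtrapolation (T-REX), which uses ranked demonstrations to learn a state-based reward function represented by a neural network.
Trains the reward network by minimizing a ranking loss, a softmax cross-entropy objective in the Bradley–Terry/Luce–Shephard style, so that higher-ranked trajectories receive higher predicted return.
Augments the training data by sampling partial trajectories and forming many pairwise comparisons from the ranked demonstrations.
Combines the learned reward with deep RL (PPO) to obtain a policy that outperforms the demonstrations.
Uses an ensemble of five neural networks to regularize reward learning, and normalizes the reward outputs before RL optimization.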
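The ranking loss and the partial-trajectory augmentation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it swaps the neural network for a linear reward r(s) = w · s so that it runs with the standard library alone, and the names `predicted_return`, `pairwise_ranking_loss`, and `sample_snippet_pair` are hypothetical. It also assumes demonstrations are ordered worst to best and that each one is at least `snippet_len` states long.

```python
import math
import random


def predicted_return(weights, snippet):
    """Sum of per-state linear rewards r(s) = w . s over a (partial) trajectory.

    Assumption: a linear reward stands in for the reward network.
    """
    return sum(sum(w * x for w, x in zip(weights, state)) for state in snippet)


def pairwise_ranking_loss(weights, worse, better):
    """Softmax cross-entropy over two predicted returns, labeled with `better`."""
    r_worse = predicted_return(weights, worse)
    r_better = predicted_return(weights, better)
    m = max(r_worse, r_better)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(r_worse - m) + math.exp(r_better - m))
    return log_z - r_better  # equals -log P(better is preferred over worse)


def sample_snippet_pair(ranked_demos, snippet_len, rng):
    """Pick two demos of different rank and cut a random snippet from each.

    `ranked_demos` is ordered worst to best, so the higher index is preferred.
    Returns (worse_snippet, better_snippet).
    """
    i, j = sorted(rng.sample(range(len(ranked_demos)), 2))

    def cut(traj):
        start = rng.randrange(len(traj) - snippet_len + 1)
        return traj[start:start + snippet_len]

    return cut(ranked_demos[i]), cut(ranked_demos[j])
```

When both snippets have the same predicted return, the loss is log 2, and it shrinks as the higher-ranked snippet's predicted return pulls ahead. Repeatedly calling `sample_snippet_pair` yields many training pairs from only a handful of ranked demonstrations, which is the point of the augmentation step.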

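Experimental Results

Research Questions

RQ1: Can a reward function that extrapolates beyond the best observed trajectory be learned from ranked, potentially suboptimal demonstrations?
RQ2: Does learning from ranked observations enable policies that outperform the demonstrator on high-dimensional tasks?
RQ3: How robust is T-REX to ranking noise, and to learning from time-based or human-provided rankings?
RQ4: Can T-REX work without demonstrator actions or ground-truth reward signals and still outperform imitation-based baselines?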

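Key Results

When combined with PPO, T-REX often achieves more than twice the performance of the best demonstration on MuJoCo tasks.
T-REX outperforms state-of-the-art imitation learning and IRL methods (BCO, GAIL) on most MuJoCo and Atari tasks.
T-REX is robust to moderate ranking noise and can learn from time-ordered (noisy) rankings or from noisy human-provided labels.
On Atari, T-REX outperformed BCO and GAIL on 7 of 8 games, and on several titles it scored more than twice the best demonstration.
The extrapolated reward correlates strongly with the ground-truth reward in several games, which enables effective policy improvement beyond the observed trajectories.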
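Two of the evaluation ideas in the results above, injecting ranking noise and checking how well predicted returns track ground-truth returns, can be illustrated with a short sketch. The helpers `flip_preferences` and `pearson` are hypothetical names, and flipping each pairwise label independently with a fixed probability is an assumed noise model chosen for illustration, not a description of the paper's exact protocol.

```python
import math
import random


def flip_preferences(pairs, noise_rate, rng):
    """Swap each (worse, better) pair to (better, worse) with probability noise_rate."""
    return [(b, w) if rng.random() < noise_rate else (w, b) for w, b in pairs]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumes neither sequence is constant; otherwise the denominator is zero.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A typical use would be to train the reward model on `flip_preferences(pairs, noise_rate, rng)` for increasing values of `noise_rate`, then compare the predicted returns of held-out trajectories against their ground-truth returns with `pearson`.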

This review was generated by AI and reviewed by a human editor.