QUICK REVIEW

[Paper Review] DART: Noise Injection for Robust Imitation Learning

Michael Laskey, Jonathan Lee | arXiv (Cornell University) | 2017-03-27

Reinforcement Learning in Robotics | Citations: 78

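One-Line Summary

DART injects optimized noise into the supervisor's demonstrations to mitigate covariate shift in imitation learning, matching DAgger's performance while being more efficient and safer for human supervisors.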

ABSTRACT

One approach to Imitation Learning is Behavior Cloning, in which a robot observes a supervisor and infers a control policy. A known problem with this "off-policy" approach is that the robot's errors compound when drifting away from the supervisor's demonstrations. On-policy techniques alleviate this by iteratively collecting corrective actions for the current robot policy. However, these techniques can be tedious for human supervisors, add significant computation burden, and may visit dangerous states during training. We propose an off-policy approach that injects noise into the supervisor's policy while demonstrating. This forces the supervisor to demonstrate how to recover from errors. We propose a new algorithm, DART (Disturbances for Augmenting Robot Trajectories), that collects demonstrations with injected noise, and optimizes the noise level to approximate the error of the robot's trained policy during data collection. We compare DART with DAgger and Behavior Cloning in two domains: in simulation with an algorithmic supervisor on the MuJoCo tasks (Walker, Humanoid, Hopper, Half-Cheetah) and in physical experiments with human supervisors training a Toyota HSR robot to perform grasping in clutter. For high dimensional tasks like Humanoid, DART can be up to $3\times$ faster in computation time and only decreases the supervisor's cumulative reward by $5\%$ during training, whereas DAgger executes policies that have $80\%$ less cumulative reward than the supervisor. On the grasping in clutter task, DART obtains on average a $62\%$ performance increase over Behavior Cloning.
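The noise-injection idea in the abstract can be illustrated with a minimal sketch. It assumes additive Gaussian noise on the executed action and a toy system; the names `collect_noisy_demo`, `supervisor`, and `step` are hypothetical and do not come from the paper. The executed action is perturbed, while the stored label is the supervisor's intended, noise-free action, so the dataset covers states the supervisor had to recover from.

```python
import numpy as np

def collect_noisy_demo(supervisor, step, x0, cov, horizon, rng):
    """Roll out a supervisor while perturbing the executed actions.

    Assumption: zero-mean Gaussian noise with covariance `cov` is added to
    each executed action. Labels are the supervisor's noise-free actions.
    """
    states, labels = [], []
    x = np.asarray(x0, dtype=float)
    for _ in range(horizon):
        u_star = supervisor(x)  # intended (noise-free) action
        noise = rng.multivariate_normal(np.zeros(len(u_star)), cov)
        states.append(x.copy())
        labels.append(u_star)          # record the clean label
        x = step(x, u_star + noise)    # execute the perturbed action
    return np.array(states), np.array(labels)

# Hypothetical toy system: x' = x + u, with a supervisor that drives x to zero.
supervisor = lambda x: -0.5 * x
step = lambda x, u: x + u
rng = np.random.default_rng(0)
states, labels = collect_noisy_demo(
    supervisor, step, [1.0, -1.0], 0.01 * np.eye(2), 20, rng
)
```

A Behavior Cloning learner would then be fit on `(states, labels)` as usual; only the data collection step changes.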

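Motivation and Objectives

Address covariate shift in off-policy imitation learning (Behavior Cloning).
Present an off-policy noise-injection method that exposes the learner to demonstrations of how to recover from errors.
Reduce supervisor burden and computational cost relative to on-policy methods such as DAgger.
Demonstrate DART's effectiveness on MuJoCo locomotion tasks and on real-world grasping in clutter.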

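Proposed Method

Introduce DART (Disturbances for Augmenting Robot Trajectories), which injects noise into the supervisor's policy during demonstrations.
Formulate noise optimization so that the supervisor's noise-injected demonstrations match the behavior of the robot's final trained policy.
Derive an iterative procedure (Algorithm 1) that updates the noise statistics to minimize the negative log-likelihood of the robot's controls under the noise-injected supervisor.
Present a theoretical upper bound, expressed through the KL divergence between trajectory distributions, that shows reduced covariate shift.
Show a closed-form update for the Gaussian noise covariance within the iterative scheme.
Evaluate on MuJoCo locomotion tasks with an algorithmic supervisor and on a Toyota HSR grasping-in-clutter task with human supervisors.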
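The noise-statistics update described above can be sketched for the Gaussian case. This is an illustration under stated assumptions, not the paper's Algorithm 1: it assumes zero-mean Gaussian noise centered on the supervisor's action, in which case minimizing the negative log-likelihood of the robot's actions gives the average outer product of the action differences. The function name `estimate_noise_covariance` is hypothetical.

```python
import numpy as np

def estimate_noise_covariance(robot_actions, supervisor_actions):
    """Maximum-likelihood covariance of zero-mean Gaussian noise.

    Assumption: robot actions are modeled as supervisor actions plus
    zero-mean Gaussian noise. Inputs are (N, d) arrays evaluated on the
    same N states; the result is a (d, d) covariance matrix.
    """
    diff = np.asarray(robot_actions, float) - np.asarray(supervisor_actions, float)
    return diff.T @ diff / diff.shape[0]

# Toy data: the robot deviates from the supervisor by known errors.
supervisor_actions = np.zeros((4, 2))
robot_actions = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
cov = estimate_noise_covariance(robot_actions, supervisor_actions)
# cov is [[0.5, 0.0], [0.0, 2.0]]
```

In an iterative scheme of this kind, the estimated covariance would be used to collect the next batch of noisy demonstrations, the robot policy would be retrained on all data, and the estimate would be refreshed.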

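Experimental Results

Research Questions

RQ1: Does DART reduce covariate shift as effectively as on-policy methods?
RQ2: How does DART affect the supervisor's reward and computation time during data collection?
RQ3: Can human supervisors provide better demonstrations under DART?
RQ4: How does DART compare with Behavior Cloning and DAgger on high-dimensional robot tasks?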

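Key Results

DART matches DAgger's performance on the MuJoCo locomotion domains while requiring much less computation time (for example, up to 3x faster on Humanoid).
During training, DART reduces the supervisor's cumulative reward by only about 5%, whereas DAgger executes policies with 80% less cumulative reward than the supervisor.
On grasping in clutter with human supervisors, DART with an appropriate noise level achieves an average 62% performance increase over Behavior Cloning.
Isotropic Gaussian noise without optimization can perform poorly and lead to unsafe policies, which underscores the need for optimized noise.
DART shows strong improvements on high-dimensional tasks and reduces covariate shift by matching the trajectory distribution of the robot's final policy more closely than Behavior Cloning.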


This review was generated by AI and reviewed by a human editor.