QUICK REVIEW

[Paper Review] Understanding Domain Randomization for Sim-to-real Transfer

Xiaoyu Chen, Jiachen Hu | arXiv (Cornell University) | 2021-10-07

Reinforcement Learning in Robotics | References: 54 | Citations: 33

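One-Line Summary

This paper presents a theoretical framework for sim-to-real transfer via domain randomization: it models the randomized simulator as a latent MDP and derives bounds on the sim-to-real gap in several settings.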

ABSTRACT

Reinforcement learning encounters many challenges when applied directly in the real world. Sim-to-real transfer is widely used to transfer the knowledge learned from simulation to the real world. Domain randomization -- one of the most popular algorithms for sim-to-real transfer -- has been demonstrated to be effective in various tasks in robotics and autonomous driving. Despite its empirical successes, theoretical understanding on why this simple algorithm works is limited. In this paper, we propose a theoretical framework for sim-to-real transfers, in which the simulator is modeled as a set of MDPs with tunable parameters (corresponding to unknown physical parameters such as friction). We provide sharp bounds on the sim-to-real gap -- the difference between the value of policy returned by domain randomization and the value of an optimal policy for the real world. We prove that sim-to-real transfer can succeed under mild conditions without any real-world training samples. Our theory also highlights the importance of using memory (i.e., history-dependent policies) in domain randomization. Our proof is based on novel techniques that reduce the problem of bounding the sim-to-real gap to the problem of designing efficient learning algorithms for infinite-horizon MDPs, which we believe are of independent interest.
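
The sim-to-real gap described in words in the abstract can be written compactly. The block below is a sketch of one plausible formalization; the symbols $\nu$ (the sampling distribution over simulator MDPs), $\mathcal{M}^{*}$ (the real-world MDP), and $\pi_{\mathrm{DR}}$ (the policy returned by domain randomization) are notation assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: \nu = distribution over simulator MDPs,
% \mathcal{M}^{*} = real-world MDP, V^{\pi}_{\mathcal{M}} = value of \pi in \mathcal{M}.
\begin{align*}
  \pi_{\mathrm{DR}} &\in \arg\max_{\pi} \; \mathbb{E}_{\mathcal{M} \sim \nu}\!\left[ V^{\pi}_{\mathcal{M}} \right], \\
  \mathrm{Gap}(\pi_{\mathrm{DR}}) &= V^{*}_{\mathcal{M}^{*}} - V^{\pi_{\mathrm{DR}}}_{\mathcal{M}^{*}}.
\end{align*}
% Transfer "succeeds" when the gap is sublinear in the horizon: Gap = o(H).
```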

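Motivation and Objectives

Formalize sim-to-real transfer as a latent MDP problem with tunable simulator parameters.
Analyze the sim-to-real gap of domain randomization for finite and infinite simulator classes.
Show that memory (history-dependent policies) is crucial for effective sim-to-real transfer.
Provide a new proof framework that connects to learning with function approximation in infinite-horizon MDPs.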

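Proposed Method

Model the simulator as a set of MDPs with latent parameters that represent real-world factors (e.g., friction).
Define domain randomization as sampling from a distribution over MDPs, which forms a latent MDP that requires history-dependent policies.
Introduce a Domain Randomization Oracle that returns the optimal history-dependent policy for the latent MDP.
Derive upper bounds on the sim-to-real gap in three settings: finite simulator class with separation, finite simulator class without separation, and infinite simulator class.
Connect the construction of a base policy to regret bounds for infinite-horizon average-reward MDPs with function approximation.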
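
To make the role of memory in the steps above concrete, here is a minimal sketch on a hypothetical toy simulator class. It is not the paper's construction: the two single-state MDPs, the `reward` function, and both policies are assumptions made up for illustration. The hidden parameter decides which of two actions is rewarded, so a policy that ignores history cannot adapt, while one that reads its own history identifies the MDP after a single probe.

```python
import random

# Hypothetical toy simulator class (an assumption, not from the paper):
# M = 2 single-state MDPs with two actions. In MDP k, only action k pays 1.
def reward(mdp_id: int, action: int) -> float:
    return 1.0 if action == mdp_id else 0.0

def memoryless_policy(history):
    """Ignores the history entirely and always plays action 0."""
    return 0

def history_policy(history):
    """Probes action 0 once, then commits to whichever action was rewarded."""
    if not history:
        return 0
    first_action, first_reward = history[0]
    return first_action if first_reward > 0 else 1 - first_action

def rollout(policy, mdp_id: int, horizon: int) -> float:
    """Total reward of `policy` over `horizon` steps in one fixed MDP."""
    history, total = [], 0.0
    for _ in range(horizon):
        action = policy(history)
        r = reward(mdp_id, action)
        history.append((action, r))
        total += r
    return total

def domain_randomized_value(policy, horizon: int, episodes: int = 1000, seed: int = 0) -> float:
    """Average return when a fresh MDP is sampled uniformly for each episode."""
    rng = random.Random(seed)
    returns = [rollout(policy, rng.randrange(2), horizon) for _ in range(episodes)]
    return sum(returns) / episodes

def worst_case_gap(policy, horizon: int) -> float:
    """Largest gap to the optimal return (= horizon) over both candidate real MDPs."""
    return max(horizon - rollout(policy, k, horizon) for k in range(2))

if __name__ == "__main__":
    H = 50
    print("memoryless gap:", worst_case_gap(memoryless_policy, H))
    print("history-dependent gap:", worst_case_gap(history_policy, H))
```

In this toy setting the memoryless policy's worst-case gap grows linearly with the horizon, whereas the history-dependent policy loses at most the one probing step.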

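Results

Research Questions

RQ1: When does domain randomization guarantee a sim-to-real gap that is sublinear in the real-world horizon H?
RQ2: How do finite versus infinite simulator classes affect the sim-to-real gap under domain randomization?
RQ3: What role does memory (history dependence) play in achieving the desired sim-to-real guarantees?
RQ4: Can an efficient model-based algorithm be derived for learning infinite-horizon average-reward MDPs with general function approximation, as it relates to domain randomization?
RQ5: Which conditions on the simulator class keep domain randomization effective without any real-world training data?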

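Key Findings

For a finite simulator class with the separation condition, the sim-to-real gap is O(D M^3 log(MH) log^2(SMH/δ) / δ^4).
For a finite simulator class without separation, the sim-to-real gap is still O(D sqrt(M^3 H log(MH))).
For an infinite simulator class with smoothness near the real MDP, the gap is bounded by a term that depends on D, the eluder dimension d_e, the horizon H, and the covering number of the function class, plus a term involving a Lipschitz term in ε.
A lower bound shows that, without suitable conditions, any policy can incur a gap of Ω(sqrt(D M H)) in the worst case.
Memory (history dependence) is essential for achieving a sublinear gap and cannot be dropped in domain randomization.
The paper presents the first provably efficient model-based algorithm for learning infinite-horizon average-reward MDPs with general function approximation, achieving a near-optimal regret bound.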
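
The finite-class bounds above are easier to compare when evaluated numerically. The sketch below plugs values into the stated expressions with every hidden big-O constant set to 1, which is an assumption made purely for illustration, so only the growth in H is meaningful and not the absolute numbers. The readings of the symbols (M as the number of MDPs in the class, S as the number of states, D as a diameter-like constant, δ as the separation parameter) are also assumptions, since the review does not define them.

```python
import math

def gap_finite_separated(D: float, M: int, H: int, S: int, delta: float) -> float:
    """D M^3 log(MH) log^2(SMH/delta) / delta^4, with the big-O constant assumed to be 1."""
    return D * M**3 * math.log(M * H) * math.log(S * M * H / delta) ** 2 / delta**4

def gap_finite_general(D: float, M: int, H: int) -> float:
    """D sqrt(M^3 H log(MH)), with the big-O constant assumed to be 1."""
    return D * math.sqrt(M**3 * H * math.log(M * H))

def gap_lower_bound(D: float, M: int, H: int) -> float:
    """sqrt(D M H), with the big-Omega constant assumed to be 1."""
    return math.sqrt(D * M * H)

if __name__ == "__main__":
    # Hypothetical parameter values chosen only to show the trend in H.
    D, M, S, delta = 2.0, 3, 10, 0.5
    for H in (10**4, 10**6, 10**8):
        per_step_sep = gap_finite_separated(D, M, H, S, delta) / H
        per_step_gen = gap_finite_general(D, M, H) / H
        print(f"H={H:>9}: separated/H={per_step_sep:.3e}, general/H={per_step_gen:.3e}")
```

Both per-step gaps shrink as H grows, which is what "sublinear in H" means in RQ1: the separated bound is only polylogarithmic in H, while the general finite bound grows like sqrt(H) up to logarithmic factors.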


This review was generated by AI and reviewed by a human editor.