QUICK REVIEW

[Paper Review] Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair | arXiv (Cornell University) | October 12, 2021

Reinforcement Learning in Robotics | References: 23 | Citations: 129

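One-Line Summary

Implicit Q-learning (IQL) avoids evaluating unseen actions during offline training by using state-conditional expectiles to approximate the value of the best in-distribution action, which enables multi-step dynamic programming and yields strong performance on the D4RL benchmark.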

ABSTRACT

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.

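Motivation and Objectives

Motivate offline RL for settings where learning must rely on a fixed dataset because online exploration is costly or dangerous.
Introduce a method that never queries unseen actions during value learning.
Use expectile regression to perform policy improvement implicitly, within the support of the dataset actions.
Enable multi-step dynamic programming without an explicit policy during training, followed by a simple policy extraction step.
Demonstrate strong empirical performance on the D4RL benchmark and effective online fine-tuning after offline initialization.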
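
For reference, the expectile regression mentioned in this list can be written as below. This is the standard definition of a τ-expectile rather than a quotation from the paper; the symbols m_τ, x, and u are notation chosen here for illustration.

```latex
% tau-expectile of a random variable X, for tau in (0, 1)
m_\tau = \arg\min_{m} \; \mathbb{E}_{x \sim X}\!\left[ L_2^\tau(x - m) \right],
\qquad
L_2^\tau(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2 .
```

With τ = 0.5 the loss is symmetric and m_τ is the mean. For τ > 0.5, positive residuals are penalized more heavily than negative ones, so m_τ moves toward the upper end of the distribution.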

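Proposed Method

Defines an asymmetric expectile regression objective for estimating state-action values, with targets restricted to actions that appear in the dataset.
Uses a separate value function V that approximates an upper expectile of Q with respect to the dataset action distribution, then backs up Q with the target r(s,a)+γV(s′).
Learns Q and V by alternating between the expectile loss and a SARSA-style TD objective, which avoids out-of-distribution actions.
Extracts the policy with advantage-weighted regression (AWR), a form of advantage-weighted behavioral cloning that uses Q and V without querying unseen actions.
Uses clipped double Q-learning, taking the minimum of two Q-functions, when computing the targets for the V update and the policy update.
Requires only a small modification to a standard SARSA-style update and runs efficiently on modern hardware.
Discusses online fine-tuning, in which training continues while online data is collected.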
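
The value-learning and policy-extraction steps in the list above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `done` mask, the weight clipping, and the default values of `tau`, `gamma`, `beta`, and `max_weight` are all hypothetical choices made here, and the inputs are assumed to be per-sample arrays already produced by some Q, V, and policy networks.

```python
import numpy as np

def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff^2, averaged over the batch."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def value_loss(q1, q2, v, tau=0.7):
    """Fit V(s) toward an upper expectile of min(Q1, Q2)(s, a) at dataset actions."""
    q_min = np.minimum(q1, q2)  # clipped double Q: take the smaller estimate
    return expectile_loss(q_min - v, tau)

def q_loss(q, reward, next_v, done, gamma=0.99):
    """SARSA-style TD loss with target r + gamma * V(s'); no next action is queried."""
    target = reward + gamma * (1.0 - done) * next_v  # `done` mask is an assumption
    return float(np.mean((target - q) ** 2))

def awr_policy_loss(log_prob, q1, q2, v, beta=3.0, max_weight=100.0):
    """Advantage-weighted behavioral cloning on dataset actions only."""
    advantage = np.minimum(q1, q2) - v
    weight = np.minimum(np.exp(beta * advantage), max_weight)  # clipping is an assumption
    return float(np.mean(-weight * log_prob))

# Toy batch of two transitions (made-up numbers, for illustration only).
q1 = np.array([1.0, 2.0])
q2 = np.array([1.5, 1.0])
v = np.array([1.0, 1.5])
toy_value_loss = value_loss(q1, q2, v, tau=0.7)
```

Every quantity above is evaluated at a state-action pair taken from the dataset, which is the property the method relies on. In a real training loop these losses would be minimized with an automatic-differentiation framework, and the Q target would typically use a slowly updated copy of the networks.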

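Experimental Results

Research Questions

RQ1: Can offline RL achieve substantial policy improvement over the behavior policy without ever querying out-of-distribution actions?
RQ2: Does expectile-based learning of in-support action values enable effective multi-step dynamic programming in offline RL?
RQ3: How does IQL compare with multi-step and single-step offline RL methods on the D4RL benchmark, particularly on the Ant Maze tasks?
RQ4: Is a simple policy extraction method (advantage-weighted regression) sufficient when the value function is learned without out-of-distribution queries?
RQ5: Can IQL be fine-tuned effectively online after offline initialization?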

Key Results

Dataset	BC	10%BC	DT	AWAC	Onestep RL	TD3+BC	CQL	IQL (proposed)
halfcheetah-medium-v2	42.6	42.5	42.6	43.5	48.4	48.3	44.0	47.4
hopper-medium-v2	52.9	56.9	67.6	57.0	59.6	59.3	58.5	66.3
walker2d-medium-v2	75.3	75.0	74.0	72.4	81.8	83.7	72.5	78.3
halfcheetah-medium-replay-v2	36.6	40.6	36.6	40.5	38.1	44.6	45.5	44.2
hopper-medium-replay-v2	18.1	75.9	82.7	37.2	97.5	60.9	95.0	94.7
walker2d-medium-replay-v2	26.0	62.5	66.6	27.0	49.5	81.8	77.2	73.9
halfcheetah-medium-expert-v2	55.2	92.9	86.8	42.8	93.4	90.7	91.6	86.7
hopper-medium-expert-v2	52.5	110.9	107.6	55.8	103.3	98.0	105.4	91.5
walker2d-medium-expert-v2	107.5	109.0	108.1	74.5	113.0	110.1	108.8	109.6
locomotion-v2 total	466.7	666.2	672.6	450.7	684.6	677.4	698.5	692.4
antmaze-umaze-v0	54.6	62.8	59.2	56.7	64.3	78.6	74.0	87.5
antmaze-umaze-diverse-v0	45.6	50.2	53.0	49.3	60.7	71.4	84.0	62.2
antmaze-medium-play-v0	0.0	5.4	0.0	0.0	0.3	10.6	61.2	71.2
antmaze-medium-diverse-v0	0.0	9.8	0.0	0.7	0.0	3.0	53.7	70.0
antmaze-large-play-v0	0.0	0.0	0.0	0.0	0.0	0.2	15.8	39.6
antmaze-large-diverse-v0	0.0	6.0	0.0	1.0	0.0	0.0	14.9	47.5
antmaze-v0 total	100.2	134.2	112.2	107.7	125.3	163.8	303.6	378.0
total	566.9	800.4	784.8	558.4	809.9	841.2	1002.1	1070.4
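
The "total" rows are sums of the per-task scores in each column. As a small arithmetic check, the sketch below recomputes the AntMaze totals for the CQL and IQL columns from the values in the table; the variable names are chosen here for illustration.

```python
# AntMaze scores copied from the CQL and IQL columns of the table
# (umaze, umaze-diverse, medium-play, medium-diverse, large-play, large-diverse).
antmaze = {
    "CQL": [74.0, 84.0, 61.2, 53.7, 15.8, 14.9],
    "IQL": [87.5, 62.2, 71.2, 70.0, 39.6, 47.5],
}

# Sum each column and round to one decimal, matching the table's precision.
totals = {name: round(sum(scores), 1) for name, scores in antmaze.items()}

# Difference between the two column totals.
gap = round(totals["IQL"] - totals["CQL"], 1)
```

The recomputed totals match the "antmaze-v0 total" row (303.6 for CQL and 378.0 for IQL), a gap of 74.4 points in favor of IQL.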

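IQL achieves state-of-the-art performance on the Ant Maze tasks, a domain that requires multi-step dynamic programming.
On the MuJoCo locomotion tasks, IQL is competitive with the strongest prior methods, such as CQL.
IQL is computationally efficient: for example, 1M updates complete in under 20 minutes on a GTX 1080, faster than the reimplemented baselines.
A larger expectile τ is crucial for tasks that require stitching; on Ant Maze, a larger τ gives a closer approximation to Q-learning.
The offline results are complemented by online fine-tuning: after IQL initialization, online interaction yields final performance that is competitive with or better than AWAC and CQL in the reported settings.
IQL finds an effective policy through simple advantage-weighted behavioral cloning and avoids explicitly querying out-of-distribution actions during value learning.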
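
The role of τ noted above can be illustrated numerically. The sketch below is a toy example, not an experiment from the paper: it computes the τ-expectile of four made-up values, standing in for the Q-values of the dataset actions at one state, using a simple fixed-point iteration. The function name `expectile`, the sample values, and the iteration count are assumptions made here.

```python
import numpy as np

def expectile(samples, tau, iters=200):
    """Return the tau-expectile of `samples` by fixed-point iteration.

    The expectile m satisfies sum_i w_i * (x_i - m) = 0, where
    w_i = tau if x_i > m and w_i = 1 - tau otherwise.
    """
    samples = np.asarray(samples, dtype=float)
    m = samples.mean()  # the 0.5-expectile is the mean, a natural starting point
    for _ in range(iters):
        w = np.where(samples > m, tau, 1.0 - tau)
        m = np.sum(w * samples) / np.sum(w)
    return float(m)

# Hypothetical Q-values of four dataset actions observed in a single state.
q_values = [0.0, 1.0, 2.0, 3.0]
by_tau = {tau: expectile(q_values, tau) for tau in (0.5, 0.7, 0.9, 0.99)}
```

In this toy case the expectile starts at the mean for τ = 0.5 and moves toward the largest value as τ approaches 1, which is the sense in which a larger τ brings the V target closer to a maximum over in-dataset actions.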


This review was generated by AI and reviewed by a human editor.