QUICK REVIEW

[Paper Review] Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble

Gaon An, Seungyong Moon|arXiv (Cornell University)|2021. 10. 04.

Reinforcement Learning in Robotics|References 27|Citations 48

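One-Line Summary

This paper introduces EDAC, an offline RL method that combines an uncertainty-based penalty from clipped Q-learning with ensemble gradient diversification, reaching state-of-the-art (SOTA) performance with far fewer Q-networks than a naive ensemble.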

ABSTRACT

Offline reinforcement learning (offline RL), which aims to find an optimal policy from a previously collected static dataset, bears algorithmic difficulties due to function approximation errors from out-of-distribution (OOD) data points. To this end, offline RL algorithms adopt either a constraint or a penalty term that explicitly guides the policy to stay close to the given dataset. However, prior methods typically require accurate estimation of the behavior policy or sampling from OOD data points, which themselves can be a non-trivial problem. Moreover, these methods under-utilize the generalization ability of deep neural networks and often fall into suboptimal solutions too close to the given dataset. In this work, we propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution. We show that the clipped Q-learning, a technique widely used in online RL, can be leveraged to successfully penalize OOD data points with high prediction uncertainties. Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning. Based on this observation, we propose an ensemble-diversified actor-critic algorithm that reduces the number of required ensemble networks down to a tenth compared to the naive ensemble while achieving state-of-the-art performance on most of the D4RL benchmarks considered.

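Motivation and Objectives

Motivate robust offline RL that requires neither explicit behavior policy estimation nor sampling from the data distribution.
Use Q-value prediction uncertainty, estimated through an ensemble of Q-functions, to penalize OOD actions.
Show that increasing the size of the Q-ensemble together with clipped Q-learning yields strong offline RL performance.
Reduce the required ensemble size through an ensemble gradient diversification regularizer.
Demonstrate state-of-the-art results on the D4RL MuJoCo and Adroit benchmarks.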

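Proposed Method

Adopt an ensemble of N Q-networks and compute the clipped Q-learning target from the minimum over the whole ensemble.
Apply an uncertainty-based penalty by taking a lower-confidence bound (LCB) of the ensemble predictions.
Introduce an ensemble gradient diversification (ES) objective that maximizes gradient diversity across the Q-networks by minimizing their pairwise gradient alignment.
Formulate EDAC (Ensemble-Diversified Actor Critic) by combining the clipped Q-learning target, per-network Q-function updates, and the ES regularizer.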
Provide an algorithmic description in which the target is y = r + γ (min_j Q_{φ'_j}(s', a') − β log π_θ(a'|s')), each Q_{φ_i} and θ are updated as in SAC, and the ES regularizer is added to the Q-function loss.
Show that EDAC achieves competitive or superior performance with far fewer ensemble members than the naive SAC-N.
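
The clipped Q-learning target described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `done` flag, and the default values of `gamma` and `beta` are assumptions made for the example, and the Q-values are hand-picked numbers rather than network outputs.

```python
import math
from typing import Sequence


def clipped_q_target(
    reward: float,
    next_q_values: Sequence[float],
    next_log_prob: float,
    gamma: float = 0.99,
    beta: float = 0.2,
    done: bool = False,
) -> float:
    """Return y = r + gamma * (min_j Q'_j(s', a') - beta * log pi(a'|s')).

    next_q_values holds the N target-network estimates Q'_j(s', a').
    The done flag and the default gamma/beta are illustrative assumptions.
    """
    if not next_q_values:
        raise ValueError("need at least one Q estimate")
    if done:
        return reward  # no bootstrapping past a terminal transition
    pessimistic_q = min(next_q_values)  # clip to the most pessimistic member
    return reward + gamma * (pessimistic_q - beta * next_log_prob)


def ensemble_spread(q_values: Sequence[float]) -> float:
    """Population standard deviation of the ensemble, a simple uncertainty proxy."""
    mean = sum(q_values) / len(q_values)
    return math.sqrt(sum((q - mean) ** 2 for q in q_values) / len(q_values))


# Two hypothetical ensembles with the same mean (10.0) but different agreement.
agree = [10.0, 10.1, 9.9, 10.0]      # members agree: in-distribution-like
disagree = [14.0, 6.0, 12.0, 8.0]    # members disagree: OOD-like

y_agree = clipped_q_target(1.0, agree, next_log_prob=-1.0)
y_disagree = clipped_q_target(1.0, disagree, next_log_prob=-1.0)
```

Both ensembles have the same mean, yet the target for the disagreeing ensemble is much lower, because the minimum falls further below the mean as the spread grows. This is the sense in which the ensemble minimum behaves like a lower-confidence bound.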

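Experimental Results

Research Questions

RQ1: Can uncertainty in Q-value predictions be used to constrain offline RL effectively, without sampling from the data distribution or estimating the behavior policy?
RQ2: Does a large Q-ensemble combined with clipped Q-learning improve offline RL performance, and can ensemble gradient diversification reduce the required ensemble size?
RQ3: How does gradient diversification affect the stability and performance of offline RL compared with a standard ensemble approach?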

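Key Findings

Increasing the number of Q-networks together with clipped Q-learning improves offline RL performance and surpasses the previous state of the art on several tasks.
Clipped Q-learning acts as a pessimistic penalty: it uses ensemble uncertainty to effectively reduce overestimation on OOD data.
The ensemble gradient diversification (ES) objective increases gradient diversity, which removes the need for very large ensembles (e.g., from several hundred networks to fewer than 50 on Hopper) while maintaining strong performance.
EDAC combines ensemble pessimism with gradient diversification to achieve state-of-the-art performance on most D4RL benchmarks, often at lower computational cost than CQL.
On D4RL MuJoCo Gym, EDAC and SAC-N match or outperform the baselines across the random, medium, and expert datasets, and EDAC achieves strong average performance with fewer Q-networks than SAC-N.
On the Adroit tasks, EDAC and SAC-N perform robustly; on the pen tasks in particular, EDAC often matches or exceeds prior results.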
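
To make the idea of "pairwise gradient alignment" concrete, the sketch below scores an ensemble by the mean cosine similarity between its members' gradients. It rests on assumptions this review does not spell out: the gradients are taken to be those of each Q-network with respect to the action input, cosine similarity is used as the alignment measure, and the gradient vectors are hand-written stand-ins rather than values computed from real networks. The names `cosine_similarity` and `gradient_alignment_penalty` are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector, eps: float = 1e-12) -> float:
    """Cosine of the angle between u and v; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def gradient_alignment_penalty(action_grads: Sequence[Vector]) -> float:
    """Mean pairwise cosine similarity over all pairs of ensemble members.

    action_grads[i] stands in for the gradient of the i-th Q-network with
    respect to the action (an assumption of this sketch). Lower values mean
    more diverse gradients, so a regularizer would push this quantity down.
    """
    pairs = list(combinations(action_grads, 2))
    if not pairs:
        raise ValueError("need at least two ensemble members")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-D action gradients for three ensemble members.
aligned = [[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]]    # all point the same way
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # spread across directions

aligned_penalty = gradient_alignment_penalty(aligned)
diverse_penalty = gradient_alignment_penalty(diverse)
```

The aligned ensemble scores close to 1 and the diverse one scores below zero, so adding such a term to the Q-function loss would favor ensembles whose members disagree about how the Q-value changes as the action moves away from the data.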


This review was generated by AI and reviewed by a human editor.