QUICK REVIEW

[Paper Review] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Nathan S. de Lara, Florian Shkurti | arXiv (Cornell University) | 2026. 02. 19.

Reinforcement Learning in Robotics | Citations: 0

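One-Line Summary

SMAC regularizes the offline critic by aligning its action-gradient with the dataset's action score and trains with the Muon optimizer, enabling smooth offline-to-online transfer to SAC and TD3 on all six D4RL tasks tested and to TD3+BC on four of them.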

ABSTRACT

Modern offline Reinforcement Learning (RL) methods find performant actor-critics, however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
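The "first-order derivative equality" mentioned in the abstract can be motivated with a standard maximum-entropy argument. The following is a sketch under the assumption that the policy is a Boltzmann distribution over Q with temperature α; the paper's exact statement and scaling term may differ.

```latex
\begin{align}
\pi(a \mid s) &= \frac{\exp\!\big(Q(s,a)/\alpha\big)}{Z(s)},
  \qquad Z(s) = \int \exp\!\big(Q(s,a')/\alpha\big)\, da' \\
\log \pi(a \mid s) &= \frac{Q(s,a)}{\alpha} - \log Z(s) \\
\nabla_a \log \pi(a \mid s) &= \frac{1}{\alpha}\, \nabla_a Q(s,a)
  \quad\Longleftrightarrow\quad
  \nabla_a Q(s,a) = \alpha\, \nabla_a \log \pi(a \mid s)
\end{align}
```

The normalizer Z(s) does not depend on a, so it vanishes under the action-gradient, which is why the relation constrains only first derivatives.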

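Motivation and Objectives

Explains why actor-critics pre-trained with offline RL often suffer a performance drop when fine-tuned online.
Proposes a method for producing offline actor-critics that remain compatible with online value-based RL without that drop.
Demonstrates that the proposed method achieves smooth transfer to SAC, TD3, and TD3+BC across multiple tasks.
Quantifies the connectivity between offline and online maxima and shows how SMAC improves it.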

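Proposed Method

Adds a theoretically motivated regularization term that matches the critic's action-gradient ∇a Q(s,a) to the dataset's action score ∇a log πD(a|s).
Estimates the dataset score with a diffusion-based RvS (reinforcement learning via supervised learning) model, which yields ∇a log p(a|s,w).
Introduces the SMAC critic loss L_SMAC(θ,ψ) = κ L_SM(θ,ψ) + L_AC(θ), where L_SM drives ∇a Q to match αψ(s) εω(s,a,w,1).
Trains the policy with the SAC objective: Lπ(φ) = E[ -Qθ(s,a) + log πφ(a|s) ].
Adopts Muon instead of Adam as the optimizer to encourage flatter, more transfer-friendly solutions.
Uses target Q-networks and an ensemble of Q-functions, following standard SAC practice.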
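To make the score-matching term concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: it assumes a quadratic critic and a Gaussian dataset policy so that both gradients have closed forms, and it replaces the diffusion-based score estimate with the exact Gaussian score. All function names and the scalar `alpha` are hypothetical, chosen for illustration only.

```python
# Toy sketch of a score-matching critic regularizer (illustrative, not the paper's code).
# Assumptions: quadratic critic Q(s, a) = -0.5 * c * ||a - mu||^2 and a Gaussian
# dataset policy N(mu, sigma^2 I), so both gradients are available analytically.


def q_action_grad(action, mu, c):
    """Action-gradient of the toy critic: grad_a Q = -c * (a - mu)."""
    return [-c * (a - m) for a, m in zip(action, mu)]


def dataset_score(action, mu, sigma):
    """Score of the Gaussian dataset policy: grad_a log pi_D = -(a - mu) / sigma^2."""
    return [-(a - m) / sigma**2 for a, m in zip(action, mu)]


def score_matching_loss(actions, mu, c, sigma, alpha=1.0):
    """Mean squared gap between grad_a Q and alpha * score over a batch of actions."""
    total = 0.0
    for action in actions:
        grad_q = q_action_grad(action, mu, c)
        target = [alpha * s for s in dataset_score(action, mu, sigma)]
        total += sum((g - t) ** 2 for g, t in zip(grad_q, target))
    return total / len(actions)


def smac_critic_loss(l_sm, l_ac, kappa):
    """Weighted sum as written in the review: L_SMAC = kappa * L_SM + L_AC."""
    return kappa * l_sm + l_ac


batch = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
mu, sigma = [0.0, 0.0], 0.5

# With c = 1 / sigma^2 the critic's action-gradient equals the dataset score.
aligned = score_matching_loss(batch, mu, c=1.0 / sigma**2, sigma=sigma)
# With a mismatched curvature the regularizer becomes strictly positive.
misaligned = score_matching_loss(batch, mu, c=1.0, sigma=sigma)
```

In this toy setting the regularizer vanishes exactly when the critic's curvature matches the dataset policy's, and grows otherwise; `smac_critic_loss(misaligned, l_ac, kappa)` then shows how `kappa` trades that penalty off against an ordinary actor-critic loss value.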

Figure 1: Past offline RL methods converge to maxima separated from online optima by low-reward valleys. Top: reward landscapes on the Kitchen task for CalQL (left) and SMAC (right). Blue and checkered flags being the real locations of the pre-trained and fine-tuned checkpoints on the landscape res

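Experimental Results

Research Questions

RQ1: Can an actor-critic pre-trained with offline RL be fine-tuned online without an initial loss in performance?
RQ2: Does regularizing the Q-function toward the dataset's action score improve the connectivity between offline and online maxima?
RQ3: Does SMAC transfer smoothly to online SAC, TD3, and TD3+BC across a range of tasks?
RQ4: How does Muon affect offline-to-online transfer compared with Adam?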

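Key Findings

SMAC achieved smooth offline-to-online transfer to SAC in all tested environments (6/6).
In 4 of 6 environments, SMAC reduced online regret by 34–58% relative to the best baseline.
SMAC also transfers smoothly to TD3 in 6/6 environments and to TD3+BC in 4/6 environments.
In the reward-landscape analysis, the baselines' offline maxima are not linearly connected to the online SAC maxima, whereas SMAC's maxima are.
Using a diffusion-based estimate of the dataset score for regularization improves the connectivity between offline and online optima.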
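The notion of two checkpoints being "linearly connected" can be illustrated with a small plain-Python sketch: interpolate parameters along a straight line and measure how far the reward dips below the lower endpoint. The two reward functions here are hypothetical one-parameter toys, not the paper's landscapes, and `barrier` is one assumed way to score a valley rather than the paper's metric.

```python
import math

# Toy sketch of a linear-connectivity check between two parameter vectors.
# Both reward functions are hypothetical stand-ins, not real RL returns.


def interpolate(theta_a, theta_b, t):
    """Point on the straight line from theta_a (t=0) to theta_b (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def path_rewards(reward_fn, theta_a, theta_b, steps=11):
    """Evaluate the reward at evenly spaced points along the linear path."""
    return [reward_fn(interpolate(theta_a, theta_b, i / (steps - 1)))
            for i in range(steps)]


def barrier(rewards):
    """Largest dip below the lower endpoint reward; 0.0 means no valley on the path."""
    floor = min(rewards[0], rewards[-1])
    return max(0.0, floor - min(rewards))


def two_peak_reward(theta):
    """Two isolated maxima at x = -1 and x = +1 with a low-reward valley between."""
    x = theta[0]
    return max(math.exp(-4 * (x + 1) ** 2), math.exp(-4 * (x - 1) ** 2))


def ridge_reward(theta):
    """Reward that rises monotonically from x = -1 to the maximum at x = +1."""
    x = theta[0]
    return 1.0 - 0.1 * (x - 1) ** 2


offline, online = [-1.0], [1.0]
valley_path = path_rewards(two_peak_reward, offline, online)
smooth_path = path_rewards(ridge_reward, offline, online)

print("barrier with valley:", barrier(valley_path))
print("barrier on smooth path:", barrier(smooth_path))
```

A large barrier corresponds to the separated-maxima picture described for the baselines, while a zero barrier with non-decreasing rewards corresponds to the connected case described for SMAC.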

Figure 2: Increasing dataset size and coverage does not bridge offline-to-online gap. We generate rollouts in two environments with a policy that has a 0.7 success rate and plot the offline-to-online performance as we increase the dataset size. We observe that even when the dataset is so large that


This review was generated by AI and reviewed by a human editor.