QUICK REVIEW

[Paper Review] SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Nathan S. de Lara, Florian Shkurti|arXiv (Cornell University)|Feb 19, 2026

Reinforcement Learning in Robotics | Cited by 0

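One-Sentence Summary

SMAC regularizes the offline critic by aligning its action gradient with the score of the dataset's action distribution and trains with the Muon optimizer, yielding smooth offline-to-online transfer to SAC, TD3, and TD3+BC, evaluated across six D4RL tasks.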

ABSTRACT

Modern offline Reinforcement Learning (RL) methods find performant actor-critics, however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
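
One way to read the first-order equality mentioned in the abstract, assuming the standard maximum-entropy (SAC-style) setting with temperature α (an assumption made here for illustration; the paper's exact statement may differ): the entropy-regularized optimal policy for a fixed Q is a Boltzmann distribution whose normalizer Z(s) does not depend on the action, so the policy score equals the scaled action-gradient of Q.

```latex
\pi^{*}(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)},
\qquad
Z(s) = \int \exp\big(Q(s,a')/\alpha\big)\,da'
\;\Longrightarrow\;
\nabla_a \log \pi^{*}(a \mid s)
  = \frac{1}{\alpha}\,\nabla_a Q(s,a) - \nabla_a \log Z(s)
  = \frac{1}{\alpha}\,\nabla_a Q(s,a).
```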

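Motivation and Objectives

Explain why actor-critics pre-trained with offline RL often lose performance when fine-tuned online.
Propose a method that makes offline actor-critics compatible with online value-based RL without degradation.
Show that the proposed method transfers smoothly to SAC, TD3, and TD3+BC across multiple tasks.
Quantify the connectivity between offline and online maxima and show how SMAC improves it.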

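Proposed Method

Introduce a theoretically motivated regularizer that aligns the critic's action gradient ∇_a Q(s,a) with the dataset action score ∇_a log π_D(a|s).
Estimate the dataset score with diffusion-based Reinforcement via Supervision (RvS), which yields ∇_a log p(a|s,w).
Define the SMAC critic loss L_SMAC(θ,ψ) = κ L_SM(θ,ψ) + L_AC(θ), where L_SM aligns ∇_a Q with α_ψ(s) ε_ω(s,a,w,1).
Train the policy with the SAC objective L_π(φ) = E[ -Q_θ(s,a) + log π_φ(a|s) ].
Use Muon instead of Adam as the optimizer to encourage flatter solutions that transfer better.
As in standard SAC, use target Q-networks and an ensemble of Q-functions.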
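
The critic loss in the list above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the toy quadratic critic, the squared-error form of the alignment term, and the names `q_value`, `action_gradient`, `score_matching_loss`, and `smac_critic_loss` are all assumptions. The arrays `alpha` and `eps` stand in for the outputs of α_ψ(s) and ε_ω(s,a,w,1), which in the paper come from learned networks.

```python
import numpy as np

def q_value(theta, s, a):
    """Toy critic (hypothetical): Q(s, a) = -0.5 * ||a - theta @ s||^2."""
    diff = a - s @ theta.T
    return -0.5 * np.sum(diff ** 2, axis=-1)

def action_gradient(theta, s, a):
    """Analytic grad_a Q for the toy critic above: -(a - theta @ s)."""
    return -(a - s @ theta.T)

def score_matching_loss(grad_q, alpha, eps):
    """Assumed form of L_SM: mean squared gap between grad_a Q and alpha(s) * eps."""
    target = alpha[:, None] * eps
    return float(np.mean(np.sum((grad_q - target) ** 2, axis=-1)))

def smac_critic_loss(l_sm, l_ac, kappa):
    """L_SMAC = kappa * L_SM + L_AC, as stated in the summary."""
    return kappa * l_sm + l_ac

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))   # maps 3-dim states to 2-dim actions
s = rng.normal(size=(4, 3))       # batch of 4 states
a = rng.normal(size=(4, 2))       # batch of 4 actions
grad_q = action_gradient(theta, s, a)
alpha = np.ones(4)                # stand-in for alpha_psi(s)

# Sanity check of the analytic gradient with a central finite difference.
h = 1e-6
e0 = np.array([1.0, 0.0])
fd_grad0 = (q_value(theta, s, a + h * e0) - q_value(theta, s, a - h * e0)) / (2 * h)

# A target equal to grad_a Q gives zero loss; a shifted target does not.
loss_aligned = score_matching_loss(grad_q, alpha, grad_q)
loss_perturbed = score_matching_loss(grad_q, alpha, grad_q + 0.5)
```

In this toy setup the aligned target gives zero loss, and shifting every target component by 0.5 gives a loss of 0.5 (two action dimensions times 0.25). κ trades the alignment term off against the usual actor-critic loss L_AC.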

Figure 1: Past offline RL methods converge to maxima separated from online optima by low-reward valleys. Top: reward landscapes on the Kitchen task for CalQL (left) and SMAC (right). Blue and checkered flags being the real locations of the pre-trained and fine-tuned checkpoints on the landscape res

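Experimental Results

Research Questions

RQ1: Can an actor-critic pre-trained with offline RL be fine-tuned online without an initial drop in performance?
RQ2: Does regularizing the Q-function toward the dataset action score improve connectivity between offline and online maxima?
RQ3: With SMAC, is transfer to online SAC, TD3, and TD3+BC smooth across tasks?
RQ4: How does Muon affect offline-to-online transfer compared with Adam?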
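
For context on RQ4, the core idea of Muon as publicly described is to apply momentum to the gradient of each 2D weight matrix and then approximately orthogonalize the update with a few Newton-Schulz iterations. The sketch below is a simplified illustration under stated assumptions: the quintic coefficients are the ones commonly cited for the public reference implementation, while the plain (non-Nesterov) momentum, the learning rate, and the function names are choices made here, not the paper's configuration.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately map g to the nearest semi-orthogonal matrix.

    Uses a quintic Newton-Schulz iteration; coefficients are assumed from the
    commonly cited Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)   # Frobenius norm keeps singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style update (plain momentum, no extra scaling)."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return weight - lr * update, momentum_buf

rng = np.random.default_rng(0)
grad = rng.normal(size=(16, 32))
ortho = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(ortho, compute_uv=False)
```

After five iterations the singular values of `ortho` all sit in a band around 1 rather than exactly at 1. Adam instead rescales each coordinate of the update independently, which is the contrast RQ4 probes.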

Key Findings

Online Algorithm	Offline Algorithm	AWR	SAC	TD3
IQL	0.508	0.471	0.653	0.494
SMAC	0.380	0.031	0.090	0.226
TD3+BC	0.654	0.962	0.545	0.562
CalQL/CQL	0.482	0.448	0.442	0.614

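SMAC achieves smooth offline-to-online transfer to SAC in all tested environments (6/6).
In 4/6 environments, SMAC reduces online regret by 34-58% relative to the best baseline.
SMAC also transfers smoothly to TD3 in 6/6 environments and to TD3+BC in 4/6 environments.
Reward-landscape analysis shows that the baselines' offline maxima are not linearly connected to the online SAC maxima, whereas SMAC's maxima are linearly connected to the online maxima.
Regularizing with the diffusion-estimated dataset score yields better connectivity between offline and online optima.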
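
The linear-connectivity finding can be illustrated with a toy check: evaluate a reward function along the straight line between two parameter vectors and look for a valley. This is a sketch on a hypothetical two-peak landscape. `toy_reward`, `path_rewards`, `barrier`, and the chosen parameter points are illustrative assumptions; the paper's analysis uses actual environment returns of pre-trained and fine-tuned checkpoints.

```python
import numpy as np

def toy_reward(theta):
    """Hypothetical landscape: a lower peak at (-1, 0) and a higher one at (1, 0)."""
    def bump(center, height):
        return height * np.exp(-np.sum((theta - center) ** 2) / 0.2)
    return bump(np.array([-1.0, 0.0]), 1.0) + bump(np.array([1.0, 0.0]), 1.5)

def path_rewards(reward_fn, theta_start, theta_end, num_points=11):
    """Rewards at evenly spaced points on the segment theta_start -> theta_end."""
    ts = np.linspace(0.0, 1.0, num_points)
    rewards = np.array(
        [reward_fn((1.0 - t) * theta_start + t * theta_end) for t in ts]
    )
    return ts, rewards

def barrier(rewards):
    """Depth of the deepest dip below the lower endpoint; 0.0 means no valley."""
    floor = min(rewards[0], rewards[-1])
    return max(0.0, float(floor - rewards.min()))

def is_monotone_increasing(rewards, tol=1e-9):
    """True if reward never decreases along the path (up to tol)."""
    return bool(np.all(np.diff(rewards) >= -tol))

theta_online = np.array([1.0, 0.0])      # stand-in for an online maximum
theta_isolated = np.array([-1.0, 0.0])   # offline maximum behind a valley
theta_connected = np.array([0.6, 0.0])   # offline solution on the same hill

_, valley_path = path_rewards(toy_reward, theta_isolated, theta_online)
_, ramp_path = path_rewards(toy_reward, theta_connected, theta_online)
```

Here `valley_path` dips almost to zero midway, so its barrier is large, while `ramp_path` rises monotonically and has a barrier of 0.0. The first case mirrors the behavior reported for the baselines, the second the behavior reported for SMAC.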

Figure 2: Increasing dataset size and coverage does not bridge offline-to-online gap. We generate rollouts in two environments with a policy that has a 0.7 success rate and plot the offline-to-online performance as we increase the dataset size. We observe that even when the dataset is so large that

This review was generated by AI and reviewed by a human editor.