QUICK REVIEW

[Paper Digest] Unified Policy Value Decomposition for Rapid Adaptation

Cristiano Capone, Luca Falorsi|arXiv (Cornell University)|Mar 18, 2026

Reinforcement Learning in Robotics | Cited by 0

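One-sentence summary

Introduces a bilinear actor-critic framework in which the policy and the value function share a low-dimensional gating vector G, so that adjusting G alone gives zero-shot adaptation to new tasks and fast online updates.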

ABSTRACT

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.

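Motivation and Objectives

Motivate rapid adaptation in continuous control, where monolithic networks hinder transfer and interpretability.
Propose a co-decomposed bilinear actor-critic architecture with a shared low-dimensional gating vector G.
Show that sharing G between actor and critic improves efficiency and supports zero-shot generalization.
Demonstrate online adaptation by updating G while the basis functions stay frozen, giving fast task modulation.
Discuss biological plausibility through the analogy to gain modulation, along with the interpretability of G-space.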

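Proposed Method

Express Q(s,a,g) and the policy mu(s,g) as bilinear decompositions that use a shared gating vector G(s,g): Q(s,a,g)=sum_k G_k(s,g) phi_k(s,a) and mu(s,g)=sum_k G_k(s,g) Y_k(s).
Train within the Soft Actor-Critic framework, using the shared gating to keep gradients consistent between actor and critic.
Implement a zero-shot adaptation protocol: a single forward pass conditioned on the new goal descriptor g*, with the basis functions frozen.
Develop an online adaptation rule in G-space that updates only G via a TD-error rule while the basis functions stay fixed.
Analyze the gating dynamics with PCA, showing that the G components are interpretably monosemantic and how they affect behavior.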
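
The decomposition and the zero-shot protocol above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the sizes `K`, `STATE_DIM`, `GOAL_DIM`, and `ACTION_DIM`, the random linear maps standing in for pretrained networks, the softmax on the gate, and all helper names are assumptions made for the example.

```python
import math
import random

K = 4           # number of bases / primitive policies (assumed)
STATE_DIM = 3   # toy state size (assumed)
GOAL_DIM = 2    # goal g as a 2-D direction vector (assumed)
ACTION_DIM = 2  # toy action size (assumed)

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random weights standing in for a pretrained, frozen network."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

W_gate = random_matrix(K, STATE_DIM + GOAL_DIM)     # gating map (s, g) -> G
W_phi = random_matrix(K, STATE_DIM + ACTION_DIM)    # value bases phi_k(s, a)
W_Y = [random_matrix(ACTION_DIM, STATE_DIM) for _ in range(K)]  # policy bases Y_k(s)

def gate(s, g):
    """Shared gating vector G(s, g); the softmax is an assumption."""
    return softmax(matvec(W_gate, s + g))

def q_from_gate(G, s, a):
    """Critic: Q(s,a,g) = sum_k G_k * phi_k(s,a)."""
    phi = matvec(W_phi, s + a)
    return sum(Gk * pk for Gk, pk in zip(G, phi))

def mu_from_gate(G, s):
    """Actor: mu(s,g) = sum_k G_k * Y_k(s)."""
    Y = [matvec(Wk, s) for Wk in W_Y]
    return [sum(G[k] * Y[k][i] for k in range(K)) for i in range(ACTION_DIM)]

# Zero-shot use: a new goal g* needs one forward pass and no gradient step.
s = [0.1, -0.3, 0.7]
theta = math.radians(30.0)           # a direction assumed absent from training
g_star = [math.cos(theta), math.sin(theta)]
G = gate(s, g_star)                  # the same G feeds both actor and critic
action = mu_from_gate(G, s)
value = q_from_gate(G, s, action)
```

Passing `G` explicitly to both `q_from_gate` and `mu_from_gate` makes the sharing visible: overwriting a single component of `G` changes the action and its value estimate together, which is the kind of manipulation the PCA analysis of the gate relies on.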

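Experimental Results

Research Questions

RQ1: Can a shared low-dimensional gating vector G coherently couple the actor and critic representations while preserving performance?
RQ2: Does the bilinear co-decomposition improve learning efficiency and support fast zero-shot adaptation to unseen directions/tasks?
RQ3: Can online adaptation be achieved by updating G alone, without retraining, or computing gradients for, the actor/critic basis functions?
RQ4: Is the gating space G interpretable, allowing controllable modulation of direction and speed in high-dimensional control?
RQ5: How well does zero-shot generalization perform when MuJoCo Ant is switched to new task directions?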

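Key Findings

The bilinear decomposition with a shared G improves learning efficiency and stays competitive in performance with a simpler network.
Zero-shot adaptation to unseen directions, achieved by conditioning on g* with no parameter updates, remains competitive.
Manipulating individual G_k components produces semantically meaningful changes in movement direction and speed.
Online G-space updates give fast behavioral adaptation with the basis functions held fixed, and without policy-gradient updates in G-space.
The actor and critic develop consistent, correlated G encodings, supporting a unified control interface and an interpretable latent space.
The framework suggests a biologically plausible mechanism for fast transfer through gain-like modulation and structured representations.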
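
The online G-space finding can be illustrated with a semi-gradient TD(0) step applied to G alone. The sketch below is a toy under stated assumptions, not the paper's exact rule: G is treated as a free vector rather than the output of a gating network, every transition is treated as terminal so the TD target reduces to the reward, and `td_update_gate`, `G_true`, the learning rate, and the random features are all hypothetical.

```python
import random

def td_update_gate(G, phi_sa, reward, phi_next, gamma=0.99, lr=0.1):
    """One semi-gradient TD(0) step on G; the basis outputs are frozen inputs."""
    q = sum(g * p for g, p in zip(G, phi_sa))
    q_next = sum(g * p for g, p in zip(G, phi_next))
    delta = reward + gamma * q_next - q          # TD error
    # For Q = sum_k G_k * phi_k, the gradient of Q with respect to G is phi(s, a).
    new_G = [g + lr * delta * p for g, p in zip(G, phi_sa)]
    return new_G, delta

rng = random.Random(0)
K = 4
G_true = [1.0, -0.5, 0.0, 2.0]   # hypothetical gate that explains the new task
G = [0.0] * K                    # gate before adaptation

for _ in range(2000):
    phi_sa = [rng.uniform(-1.0, 1.0) for _ in range(K)]   # frozen basis outputs
    phi_next = [0.0] * K          # terminal transition (toy assumption)
    reward = sum(g * p for g, p in zip(G_true, phi_sa))
    G, delta = td_update_gate(G, phi_sa, reward, phi_next)

gap = max(abs(a - b) for a, b in zip(G, G_true))   # distance to the target gate
```

Only the K entries of `G` change in this loop, which is why adaptation in G-space can be much cheaper than retraining the bases: the search happens in a K-dimensional space regardless of how large the frozen networks are.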

This digest was generated by AI and reviewed by a human editor.