QUICK REVIEW

[Paper Review] Unified Policy Value Decomposition for Rapid Adaptation

Cristiano Capone, Luca Falorsi|arXiv (Cornell University)|Mar 18, 2026

Reinforcement Learning in Robotics|Citations: 0

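One-line summary

The paper proposes a bilinear actor-critic framework in which the policy and the value function share a low-dimensional gating vector G. This enables zero-shot adaptation to new tasks and fast online updates that adjust only G.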

ABSTRACT

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.

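Motivation and objectives

Motivate rapid adaptation in continuous control, pointing out that a single monolithic network hinders both transfer and interpretability.
Propose a co-decomposed bilinear actor-critic architecture built around a shared low-dimensional gating vector G.
Show that sharing G between actor and critic improves efficiency and supports zero-shot generalization.
Demonstrate online adaptation that updates only G while the basis functions stay frozen, which speeds up task modulation.
Discuss biological plausibility through the analogy with gain modulation, and examine the interpretability of G-space.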

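Proposed method

Express Q(s,a,g) and the policy mu(s,g) as bilinear decompositions over a shared gating vector G(s,g): Q(s,a,g) = sum_k G_k(s,g) phi_k(s,a) and mu(s,g) = sum_k G_k(s,g) Y_k(s).
Use the shared gating within the Soft Actor-Critic framework to align gradients between actor and critic.
Apply a zero-shot adaptation protocol in which a new goal descriptor g* is passed through the network in a single forward pass, with the basis functions frozen.
Develop an online adaptation rule in G-space that updates only G with a TD-error-based rule while the basis functions stay fixed.
Analyze the gating dynamics with PCA, and give an interpretive account of the monosemanticity of individual G components and their effect on behavior.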
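The decomposition and the zero-shot protocol described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`gating`, `bilinear_q`, `bilinear_policy`), the linear gating map `W`, and all numeric values are hypothetical, and the gating here depends on the goal only (as in the abstract's G_k(g)) rather than on the state as well. In the paper the bases are learned networks; here they are fixed numbers so the arithmetic is easy to follow.

```python
# Toy bilinear actor-critic read-out with K = 2 bases (all values are illustrative).

def gating(goal, W):
    """Hypothetical gating map: G = W g, one coefficient per basis."""
    return [sum(w * x for w, x in zip(row, goal)) for row in W]

def bilinear_q(G, phi):
    """Q(s, a, g) = sum_k G_k * phi_k(s, a); phi holds the K basis values at (s, a)."""
    return sum(g_k * p_k for g_k, p_k in zip(G, phi))

def bilinear_policy(G, Y):
    """mu(s, g) = sum_k G_k * Y_k(s); Y holds K primitive actions for state s."""
    action_dim = len(Y[0])
    return [sum(G[k] * Y[k][d] for k in range(len(G))) for d in range(action_dim)]

# Frozen "pretrained" pieces (assumed values, standing in for learned networks).
W = [[1.0, 0.0], [0.0, 1.0]]      # gating weights
Y = [[1.0, 0.0], [0.0, 1.0]]      # primitive k pushes along axis k
phi = [2.0, 4.0]                  # value bases evaluated at some (s, a)

# Zero-shot step: a new goal direction g* only changes G; W, Y and phi are untouched.
g_star = [0.6, 0.8]
G = gating(g_star, W)
q_value = bilinear_q(G, phi)
action = bilinear_policy(G, Y)
```

Because the same `G` weights both the value bases and the primitive policies, changing the goal descriptor shifts the critic and the actor together with no gradient step.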

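Experimental results

Research questions

RQ1: Can a shared low-dimensional gating vector G couple the actor and critic representations consistently while preserving performance?
RQ2: Does co-decomposition through the bilinear factorization improve learning efficiency and support rapid zero-shot adaptation to unseen directions or tasks?
RQ3: Can online adaptation be achieved by updating only G, without retraining the basis functions or propagating actor/critic gradients?
RQ4: Is the gating space G interpretable, and does it allow controllable modulation of direction and speed in high-dimensional control?
RQ5: How well does zero-shot generalization work when MuJoCo Ant is switched to novel task directions?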

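Key findings

The bilinear decomposition with a shared G improves learning efficiency while maintaining performance competitive with a plain monolithic network.
Zero-shot adaptation to unseen directions remains competitive with no parameter updates, simply by conditioning on g*.
Manipulating individual G_k components produces semantically meaningful changes in movement direction and speed.
Online G-space updates enable rapid behavioral adaptation with the bases held fixed, and require no policy-gradient updates in G-space.
The actor and critic develop consistent, correlated G encodings, supporting a unified control interface and an interpretable latent space.
The framework suggests a biologically plausible mechanism for rapid transfer through gain-like modulation and structured representations.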
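The online G-space adaptation in the findings above can be pictured with a small sketch. The review only says the rule is TD-error based, so the code below substitutes a standard semi-gradient TD(0) step on the coefficients of Q = sum_k G_k phi_k; the function name `td_update_gating`, the learning rate, and the numbers are assumptions and are not taken from the paper.

```python
# Semi-gradient TD(0) step on the gating coefficients only (assumed rule, not the paper's).

def td_update_gating(G, phi, phi_next, reward, lr=0.1, gamma=0.99):
    """Return updated G and the TD error; the basis values phi are never modified."""
    q = sum(g * p for g, p in zip(G, phi))             # Q(s, a) under current G
    q_next = sum(g * p for g, p in zip(G, phi_next))   # Q(s', a') under current G
    td_error = reward + gamma * q_next - q
    # dQ/dG_k = phi_k(s, a), so each coefficient moves along its own basis value.
    new_G = [g + lr * td_error * p for g, p in zip(G, phi)]
    return new_G, td_error

# One illustrative transition: only the first basis is active, next state is terminal-like.
G0 = [0.5, 0.5]
phi_sa = [1.0, 0.0]
phi_next = [0.0, 0.0]
G1, delta0 = td_update_gating(G0, phi_sa, phi_next, reward=1.0)
G2, delta1 = td_update_gating(G1, phi_sa, phi_next, reward=1.0)
```

Only K numbers change per step, which is why this kind of update can react quickly; the bases that carry most of the parameters stay frozen.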


This review was generated by AI and checked by a human editor.