QUICK REVIEW

[Paper Review] Q-learning with Adjoint Matching

Qiyang Li, Sergey Levine|arXiv (Cornell University)|Jan 20, 2026

Reinforcement Learning in Robotics | Citations: 0

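One-line summary

Q-learning with Adjoint Matching (QAM) uses adjoint matching to train expressive flow/diffusion policies with the critic's action gradient while avoiding unstable backpropagation through the denoising process, which stabilizes TD-based learning and yields strong performance on sparse-reward, long-horizon tasks in offline and offline-to-online RL.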

ABSTRACT

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

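Motivation and objectives

Motivate the need to optimize an expressive flow or diffusion policy against a learned critic in TD-based RL without unstable backpropagation through the multi-step denoising process.
Propose adjoint matching as a way to turn the critic's action gradient into a stable, step-wise objective for policy optimization.
Ensure that the learned policy converges to the optimal behavior-constrained policy while retaining the expressivity of a multi-step flow model.
Allow straightforward integration with TD backups for critic learning in both offline and offline-to-online settings.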

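Proposed method

Formulate the optimal policy under a KL-like behavior constraint as pi* ∝ pi_beta exp(tau(s) Q(s,a)).
Represent the behavior policy as a flow-matching policy f_beta, and use adjoint matching to learn a fine-tuned policy f_theta from the critic's gradient without backpropagating through the denoising process.
Apply the lean adjoint state to compute L_AM(theta), an unbiased and stable adjoint matching objective that aligns f_theta with the critic-informed optimal policy.
Combine adjoint matching with TD-based critic updates, using an ensemble of critics and pessimistic target backups.
Provide two practical variants (QAM-FQL and QAM-EDIT) that relax the Wasserstein-based proximity to the learned policy.
Implement practical training that computes the adjoint states using a memoryless SDE over action trajectories and VJP-based backward propagation.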
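
Two of the ingredients listed above can be illustrated in isolation: the behavior-regularized target pi* ∝ pi_beta exp(tau(s) Q(s,a)) and a pessimistic TD target over a critic ensemble. The following is a minimal sketch, not the authors' implementation. It assumes a finite action set for a single state and a scalar tau (QAM itself targets continuous actions with a flow policy), and it assumes that pessimism means taking the ensemble minimum, which this review does not specify. The function names tilted_policy and pessimistic_td_target are hypothetical.

```python
import math


def tilted_policy(behavior_probs, q_values, tau):
    """Return pi*(a) proportional to pi_beta(a) * exp(tau * Q(a)).

    Illustrative only: assumes a finite action set for a single state and a
    scalar tau, whereas the method in the review targets continuous actions.
    """
    # Work in log space; actions outside the behavior support keep zero mass.
    logits = [
        math.log(p) + tau * q if p > 0.0 else -math.inf
        for p, q in zip(behavior_probs, q_values)
    ]
    top = max(logits)  # subtract the max so exp() cannot overflow
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def pessimistic_td_target(reward, discount, next_q_ensemble, done):
    """Return r + discount * min_i Q_i(s', a'), with no bootstrap at terminal steps.

    Assumption: pessimism is taken as the ensemble minimum; the review does
    not state which aggregation rule QAM uses.
    """
    bootstrap = 0.0 if done else min(next_q_ensemble)
    return reward + discount * bootstrap


if __name__ == "__main__":
    # Hypothetical numbers: two actions under a uniform behavior policy.
    print(tilted_policy([0.5, 0.5], [1.0, 0.0], tau=2.0))
    print(pessimistic_td_target(1.0, 0.99, [2.0, 1.5, 3.0], done=False))
```

With tau = 0 the tilted policy reduces to the behavior policy, and larger tau shifts probability mass toward high-Q actions while never leaving the behavior policy's support, which is the trade-off the KL-like constraint expresses.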

Experimental results

Research questions

RQ1 Can adjoint matching stably exploit critic gradients to optimize expressive flow/diffusion policies in TD-based RL?
RQ2 Can QAM recover the behavior-regularized optimal policy while retaining expressivity in offline and offline-to-online settings?
RQ3 Can the QAM variants (QAM-FQL and QAM-EDIT) balance behavior priors and value guidance under a Wasserstein constraint to the learned policy?
RQ4 Does combining TD backups with adjoint matching outperform, on hard sparse-reward benchmarks, prior methods that discard gradient information or rely on unstable backpropagation?
RQ5 How large are the empirical gains in the offline-to-online regime when pre-training on offline data and then fine-tuning online with QAM?

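Key findings

QAM consistently outperforms prior approaches on hard, sparse-reward tasks in offline and offline-to-online RL.
It retains the expressivity of multi-step flow policies while avoiding the instability of backpropagating through the denoising process.
Adjoint matching is effective at using the critic's action gradient to guide the policy's velocity field directly and without bias.
The two practical QAM variants (QAM-FQL and QAM-EDIT) effectively keep the policy close to the learned policy under a Wasserstein constraint.
The approach integrates TD-based critic learning with the adjoint matching policy objective and achieves strong empirical performance across offline RL benchmarks.
The empirical study shows robust performance on 10 OGBench domains with long horizons and sparse rewards.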

This review was generated by AI and checked by a human editor.