QUICK REVIEW

[Paper Review] Q-learning with Adjoint Matching

Qiyang Li, Sergey Levine|arXiv (Cornell University)|Jan 20, 2026

Reinforcement Learning in Robotics | Cited by 0

ONE-SENTENCE SUMMARY

Q-learning with Adjoint Matching (QAM) introduces adjoint matching to leverage critic gradients for training expressive flow/diffusion policies, enabling stable TD-based learning and superior performance on sparse-reward, long-horizon tasks in offline and offline-to-online RL.

ABSTRACT

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

RESEARCH MOTIVATION AND OBJECTIVES

Motivate the need to jointly optimize expressive flow/diffusion policies with a critic in TD-based RL without unstable backpropagation.
Propose adjoint matching to convert the critic’s action gradient into a stable, step-wise objective for policy optimization.
Ensure the learned policy converges to the optimal behavior-constrained policy while preserving expressivity of multi-step flow models.
Enable straightforward integration with TD backups for critic learning in offline and offline-to-online settings.
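
To make the "optimal behavior-constrained policy" in the list above concrete, here is a minimal sketch on a finite action set for a single fixed state, where the form stated under the proposed method, pi* ∝ pi_beta exp(tau Q), can be computed exactly. The function name `behavior_constrained_policy` and all numbers are hypothetical and chosen only for illustration; the paper itself works with continuous actions and flow policies, where this normalization is not available in closed form.

```python
import math

def behavior_constrained_policy(pi_beta, q_values, tau):
    """Return pi*(a) proportional to pi_beta(a) * exp(tau * Q(a)).

    Assumes a finite action set, a single fixed state, and pi_beta(a) > 0.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + tau * q for p, q in zip(pi_beta, q_values)]
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

# Hypothetical numbers, for illustration only.
pi_beta = [0.5, 0.3, 0.2]    # behavior policy over three actions
q_values = [0.0, 1.0, 2.0]   # critic values for the same actions

pi_small_tau = behavior_constrained_policy(pi_beta, q_values, tau=0.0)
pi_large_tau = behavior_constrained_policy(pi_beta, q_values, tau=5.0)
```

With `tau=0.0` the result is the behavior policy itself, and as `tau` grows the mass moves toward the highest-value action that the behavior policy still supports. This is the trade-off between staying close to the data and following the critic that the behavior constraint controls.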

PROPOSED METHOD

Formulate the optimal policy under a KL-like behavior constraint as pi*(a|s) ∝ pi_beta(a|s) exp(tau(s) Q(s,a)).
Represent the behavior policy with a flow-matching policy f_beta and learn a fine-tuned policy f_theta via adjoint matching that uses the critic’s gradient without backpropagating through the denoising process.
Apply the lean adjoint state to compute an unbiased, stable adjoint matching objective L_AM(theta) that aligns f_theta with the critic-informed optimal policy.
Combine adjoint matching with TD-based critic updates using an ensemble of critics and pessimistic target backups.
Provide two practical variants (QAM-FQL and QAM-EDIT) that relax the constraint via Wasserstein-based proximity to the learned policy.
Implement practical training with a memoryless SDE for sampling action trajectories and vector-Jacobian-product (VJP) reverse passes to compute adjoint states.
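
The reverse pass described in the steps above can be sketched on a one-dimensional toy problem. This is not the paper's implementation: the velocity field `base_velocity`, its hand-written `base_velocity_vjp`, the critic gradient `critic_action_grad`, and the constants `K`, `TAU`, and `TARGET` are all hypothetical. The sketch also assumes the general lean-adjoint form from the adjoint-matching literature, dg/dt = -(∂f_beta/∂a)ᵀ g with a terminal value proportional to the negative critic action gradient; the sign and scaling conventions in QAM may differ. For brevity the forward pass uses a deterministic Euler rollout rather than the memoryless SDE mentioned above.

```python
import math

K = 1.5        # hypothetical contraction rate of the toy base velocity field
TAU = 2.0      # hypothetical inverse temperature scaling the critic
TARGET = 0.7   # hypothetical action preferred by the toy critic

def base_velocity(a, t):
    # Toy stand-in for f_beta(a, t); a real policy would be a neural network.
    return -K * a

def base_velocity_vjp(a, t, g):
    # Vector-Jacobian product g * (d f_beta / d a), written by hand for the toy field.
    return g * (-K)

def critic_action_grad(a):
    # dQ/da for the toy critic Q(a) = -0.5 * (a - TARGET) ** 2.
    return -(a - TARGET)

def lean_adjoint_trajectory(a0, num_steps):
    h = 1.0 / num_steps
    # Forward pass: Euler integration of the base flow from t = 0 to t = 1.
    actions = [a0]
    for i in range(num_steps):
        a = actions[-1]
        actions.append(a + h * base_velocity(a, i * h))
    # Terminal condition (assumed convention): g(1) = -TAU * dQ/da at the final action.
    g = -TAU * critic_action_grad(actions[-1])
    adjoints = [g]
    # Reverse pass: step dg/dt = -VJP backward in time, one VJP per step.
    # No gradient is propagated through the forward rollout itself.
    for i in range(num_steps, 0, -1):
        g = g + h * base_velocity_vjp(actions[i], i * h, g)
        adjoints.append(g)
    adjoints.reverse()  # adjoints[i] now corresponds to time i * h
    return actions, adjoints

actions, adjoints = lean_adjoint_trajectory(a0=1.0, num_steps=1000)
```

For this linear toy field the adjoint has the closed form g(t) = g(1) exp(-K (1 - t)), which the Euler reverse pass approximates. In the method summarized above, adjoint states of this kind supply the per-step regression targets for the fine-tuned velocity field f_theta; the exact form of that objective is not reproduced here.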

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can adjoint matching enable stable utilization of critic gradients for optimizing expressive flow/diffusion policies in TD-based RL?
RQ2: Does QAM recover the behavior-regularized optimal policy while maintaining flow policy expressivity in offline and offline-to-online settings?
RQ3: How do QAM variants (QAM-FQL and QAM-EDIT) perform under Wasserstein-proximity constraints to balance behavior priors and value guidance?
RQ4: Can TD backups combined with adjoint matching achieve superior performance on hard, sparse-reward benchmarks compared to prior methods that discard gradient information or rely on unstable backpropagation?
RQ5: What empirical gains do offline-to-online regimes show when pre-trained with offline data and fine-tuned online using QAM?

Key Findings

QAM consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
The method preserves the expressivity of multi-step flow policies while avoiding instability from backpropagation through denoising.
Adjoint matching enables direct, unbiased use of the critic’s action gradient to guide policy velocity fields.
Two practical QAM variants (QAM-FQL and QAM-EDIT) offer effective proximal control to the learned policy under Wasserstein constraints.
The approach integrates TD-based critic learning with the adjoint-matching policy objective to achieve strong empirical performance across offline RL benchmarks.
The empirical study demonstrates robust performance across 10 OGBench domains with long horizons and sparse rewards.

This review was generated by AI and checked by a human editor.