QUICK REVIEW

[Paper Review] CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

Xiaofeng Xiao, Xiao Hu|arXiv (Cornell University)|Feb 9, 2026
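Reinforcement Learning in Robotics | Cited by 0
One-Sentence Summary

CausalGDP integrates real-time causal reasoning into diffusion-based reinforcement learning so that policy generation focuses on the action components that causally influence future states and rewards, combining offline causal discovery with online adaptation.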

ABSTRACT

Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns. In this paper, we propose Causality-guided Diffusion Policy (CausalGDP), a unified framework that integrates causal reasoning into diffusion-based RL. CausalGDP first learns a base diffusion policy and an initial causal dynamical model from offline data, capturing causal dependencies among states, actions, and rewards. During real-time interaction, the causal information is continuously updated and incorporated as a guidance signal to steer the diffusion process toward actions that causally influence future states and rewards. By explicitly considering causality beyond association, CausalGDP focuses policy optimization on action components that genuinely drive performance improvements. Experimental results demonstrate that CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.

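Motivation and Objectives

  • Bring causality into diffusion-based RL to separate causal action components from merely associated ones.
  • Develop a two-stage framework (offline causal modeling, then real-time causal guidance) that steers the diffusion policy through interventions.
  • Provide a model-agnostic causal guidance mechanism that applies to a variety of diffusion policy architectures.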

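Proposed Method

  • Learn a base diffusion policy and an initial causal dynamics model from offline data.
  • Build a continuous causal mask through causal discovery (e.g., NOTEARS) to encode the dependencies among states, actions, and rewards.
  • Use the mask to define a Gaussian-parameterized causal dynamics model for s_{t+1} and r_t.
  • During the real-time phase, update the causal mask and incorporate causal guidance into the diffusion denoising process through do(a_t) interventions.
  • Correct the diffusion score with a causal gradient term to produce the causally guided noise prediction epsilon_theta^cg.
  • Train the policy by combining the diffusion objective with a Q-network-based actor objective (double Q-learning).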
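
The mask, guidance, and score-correction steps can be pictured with a small NumPy sketch. This review does not give the paper's exact guidance equation, so the sketch assumes a classifier-guidance-style correction, eps_cg = eps - scale * sqrt(1 - alpha_bar) * grad_a, and a linear reward mean gated by a soft causal mask. The names `causal_grad_wrt_action`, `causally_guided_noise`, `scale`, and `alpha_bar` are illustrative and not taken from the paper.

```python
import numpy as np

def causal_grad_wrt_action(weights, mask, n_state):
    """Gradient of an assumed linear reward mean with respect to the action.

    Assumed model (not from the paper): E[r_t] = (weights * mask) @ [s_t, a_t].
    The gradient w.r.t. a_t is the masked weight on each action dimension.
    """
    return (weights * mask)[n_state:]

def causally_guided_noise(eps, grad_a, alpha_bar, scale=1.0):
    """Shift the predicted noise along the causal gradient.

    Classifier-guidance-style form, used here as an assumption.
    """
    return eps - scale * np.sqrt(1.0 - alpha_bar) * grad_a

# Toy setup: 2 state dims followed by 2 action dims.
weights = np.array([0.5, -0.2, 1.0, 2.0])
# Soft causal mask: the second action dim has no causal edge to the reward.
mask = np.array([1.0, 1.0, 1.0, 0.0])

eps = np.zeros(2)  # stand-in for the base policy's noise prediction
grad_a = causal_grad_wrt_action(weights, mask, n_state=2)
eps_cg = causally_guided_noise(eps, grad_a, alpha_bar=0.75)
print(eps_cg)
```

In this toy case only the first action dimension is shifted, because the mask zeroes the gradient for the second one. That is the sense in which guidance concentrates on causally relevant action components.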
Figure 1 : Causality and Association illustration
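
The distinction the figure caption names, causality versus association, can be made concrete with a toy simulation. It is a minimal sketch under assumed dynamics that are not from the paper: a hidden factor `z` drives both the logged action and the reward, while the action itself has no effect on the reward. The helper `slope` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Observational data: a hidden factor z drives both the action and the reward.
z = rng.normal(size=n)
a_obs = z + rng.normal(size=n)
r_obs = 2.0 * z + rng.normal(size=n)  # the action has no effect on r

# Interventional data: do(a) sets the action independently of z.
z_do = rng.normal(size=n)
a_do = rng.normal(size=n)
r_do = 2.0 * z_do + rng.normal(size=n)

slope_obs = slope(a_obs, r_obs)  # association: close to 1 in expectation
slope_do = slope(a_do, r_do)     # causal effect: close to 0 in expectation
```

A policy that trusts the observational slope would keep adjusting this action even though intervening on it changes nothing, which is the failure mode that causal guidance is meant to avoid.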

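Experimental Results

Research Questions

  • RQ1: How can causal relationships within the MDP be identified from data to inform action selection?
  • RQ2: Does real-time causal guidance improve diffusion-based RL policies compared with association-based guidance?
  • RQ3: Is the proposed causal guidance framework model-agnostic and scalable across different diffusion policy architectures?
  • RQ4: Do do(a_t) interventions on actions yield faster convergence and higher rewards in high-dimensional tasks?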

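Key Findings

  • On complex tasks, CausalGDP consistently achieves competitive or superior performance compared with state-of-the-art diffusion-based and offline RL methods.
  • The framework integrates real-time causal updates as a guidance signal without requiring architecture-specific changes to the diffusion policy.
  • The causal mask extracted through causal discovery encodes interpretable dependencies and biases action generation toward causally effective directions.
  • The method remains compatible with Gaussian diffusion models and standard TD Q-learning objectives.
  • Offline causal modeling provides a prior that is refined online to speed up policy training.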
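
For the causal mask to encode interpretable dependencies, the graph recovered by causal discovery has to be acyclic. NOTEARS, the example method named in this review, enforces that with a smooth penalty, tr(exp(W∘W)) - d. The sketch below uses a polynomial variant of that penalty so it runs with NumPy alone. The function name `acyclicity`, the three-variable graph, and its weights are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def acyclicity(W):
    """Polynomial acyclicity score: tr((I + W*W / d)^d) - d.

    W*W is the elementwise square. The score is zero when the weighted
    graph W has no directed cycles and positive otherwise.
    """
    d = W.shape[0]
    M = np.eye(d) + (W * W) / d
    return float(np.trace(np.linalg.matrix_power(M, d)) - d)

# Variables ordered as (s_t, a_t, r_t); W[i, j] is the weight of edge i -> j.
W_dag = np.array([[0.0, 0.8, 0.5],
                  [0.0, 0.0, 1.2],
                  [0.0, 0.0, 0.0]])

W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.7  # adds r_t -> s_t, which closes a directed cycle

h_dag = acyclicity(W_dag)  # no cycles, so the score is zero
h_cyc = acyclicity(W_cyc)  # cycle present, so the score is positive
```

Because the score is differentiable in W, it can serve as a constraint or penalty while the edge weights are fit by gradient-based optimization.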
(a) Halfcheetah

This review was generated by AI and reviewed by a human editor.