QUICK REVIEW

[Paper Review] Training-Free Adaptation of Diffusion Models via Doob's $h$-Transform

Qijie Zhu, Zeqi Ye|arXiv (Cornell University)|Feb 18, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

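One-Sentence Summary

This paper proposes DOIT, a training-free, inference-time method that uses Doob's $h$-transform to steer a pre-trained diffusion model toward high-reward samples under non-differentiable rewards, and it comes with a convergence guarantee.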

ABSTRACT

Adaptation methods have been a workhorse for unlocking the transformative power of pre-trained diffusion models in diverse applications. Existing approaches often abstract adaptation objectives as a reward function and steer diffusion models to generate high-reward samples. However, these approaches can incur high computational overhead due to additional training, or rely on stringent assumptions on the reward such as differentiability. Moreover, despite their empirical success, theoretical justification and guarantees are seldom established. In this paper, we propose DOIT (Doob-Oriented Inference-time Transformation), a training-free and computationally efficient adaptation method that applies to generic, non-differentiable rewards. The key framework underlying our method is a measure transport formulation that seeks to transport the pre-trained generative distribution to a high-reward target distribution. We leverage Doob's $h$-transform to realize this transport, which induces a dynamic correction to the diffusion sampling process and enables efficient simulation-based computation without modifying the pre-trained model. Theoretically, we establish a high probability convergence guarantee to the target high-reward distribution via characterizing the approximation error in the dynamic Doob's correction. Empirically, on D4RL offline RL benchmarks, our method consistently outperforms state-of-the-art baselines while preserving sampling efficiency.

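Research Motivation and Goals

Motivate the need for efficient, training-free adaptation of pre-trained diffusion models to downstream task rewards.
Propose a measure transport framework that targets a high-reward distribution without retraining the model.
Introduce an inference-time correction based on Doob's $h$-transform that steers sampling toward the desired outcomes.
Provide theoretical guarantees of convergence to the high-reward distribution under practical approximations.
Demonstrate empirical effectiveness on offline RL benchmarks while preserving sampling efficiency.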

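Proposed Method

Cast model adaptation as transporting P_theta to the high-reward conditional distribution P_theta(·|E_barX0).
Use Doob's h-transform to derive a tilted sampling process with an additive correction term ∇log h that steers diffusion sampling toward E_barX0.
Provide a tractable, simulation-based approximation of ∇log h via Monte Carlo rollouts of backward trajectories and a plug-in gradient estimate.
Present the DOIT algorithms (prototype Algorithm 1 and practical Algorithm 2), which are training-free and handle non-differentiable rewards.
Introduce a practical h-function h(x,0) ∝ exp(r(x)/τ) that tilts sampling toward high-reward regions.
Establish a convergence guarantee: a bound on the total variation distance between the DOIT output and the target distribution that accounts for Monte Carlo and discretization errors.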
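
For orientation, the textbook form of Doob's $h$-transform behind the ∇log h correction can be written as below. This is a generic statement rather than the paper's own equations: the drift $b$, noise scale $\sigma$, terminal weight $g$, and the forward-running time index $s \in [0, T]$ (with the generated sample at $s = T$ and a fixed starting point $x_0$) are notational assumptions made here, whereas the paper indexes the sample at time $0$, as in h(x,0) ∝ exp(r(x)/τ).

$$
dX_s = b(X_s, s)\,ds + \sigma(s)\,dW_s, \qquad h(x, s) = \mathbb{E}\big[\, g(X_T) \mid X_s = x \,\big], \qquad g(x) = \exp\big(r(x)/\tau\big)
$$

$$
dX^h_s = \Big[\, b(X^h_s, s) + \sigma(s)^2\, \nabla_x \log h(X^h_s, s) \,\Big]\,ds + \sigma(s)\,dW_s, \qquad p^h_T(x) \;\propto\; g(x)\, p_T(x)
$$

Here $p_T$ and $p^h_T$ denote the terminal densities of the base and tilted processes. The reward enters only through the expectation that defines $h$, so it never needs to be differentiated, and a smaller $\tau$ concentrates $p^h_T$ more sharply on high-reward regions.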

Figure 1: DOIT. At each $t_l$, we simulate $M$ trajectories (here, $M=3$) starting from $x_{t_l}$ to approximate $\nabla\log h(x_{t_l},t_l)$ via (9), then utilize it to modify the sampling dynamics.
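
The loop described in the caption (simulate $M$ rollouts from the current state, estimate $\nabla\log h$, then correct the sampling step) can be sketched on a toy problem. The following is a minimal illustration under strong assumptions, not the paper's Algorithm 1 or 2: the base process is a one-dimensional Brownian motion instead of a pre-trained diffusion model, time runs forward from 0 to `T`, and the gradient uses a Gaussian likelihood-ratio estimator that may differ from the paper's equation (9). The names `grad_log_h` and `tilted_sample` are hypothetical.

```python
import numpy as np

def grad_log_h(x, t, reward, tau, T, M, rng):
    """Monte Carlo estimate of d/dx log h(x, t) for a driftless 1D Brownian base process.

    h(x, t) = E[exp(reward(X_T) / tau) | X_t = x], where X_T = x + sqrt(T - t) * xi.
    The Gaussian likelihood-ratio identity gives
        grad log h = E[w * (X_T - x)] / ((T - t) * E[w]),  w = exp(reward(X_T) / tau),
    so the reward is only evaluated, never differentiated.
    """
    s = T - t
    xi = rng.standard_normal((x.shape[0], M))      # M rollouts per current state
    x_T = x[:, None] + np.sqrt(s) * xi             # exact terminal states (toy assumption)
    log_w = reward(x_T) / tau
    log_w -= log_w.max(axis=1, keepdims=True)      # stabilise exponentials; the ratio is unchanged
    w = np.exp(log_w)
    return (w * (x_T - x[:, None])).sum(axis=1) / (s * w.sum(axis=1))

def tilted_sample(n, reward, tau, T=1.0, L=20, M=256, seed=0):
    """Euler-Maruyama sampling of the tilted process dX = grad log h dt + dW, X_0 = 0."""
    rng = np.random.default_rng(seed)
    dt = T / L
    x = np.zeros(n)
    for l in range(L):
        t = l * dt
        g = grad_log_h(x, t, reward, tau, T, M, rng)           # estimated correction term
        x = x + g * dt + np.sqrt(dt) * rng.standard_normal(n)  # corrected sampling step
    return x

# Usage: a non-differentiable step reward that pays 1 for positive samples.
step_reward = lambda x: (x > 0).astype(float)
samples = tilted_sample(2000, step_reward, tau=0.5)
print("fraction of positive samples:", (samples > 0).mean())
```

In this toy setting the untilted process ends in a standard Gaussian, so about half of its samples are positive; the correction term shifts mass toward the rewarded region without ever calling a gradient of `step_reward`. The nested simulation costs `L * M` reward evaluations per sample, which is the kind of overhead a cheaper practical variant would aim to reduce.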

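Experimental Results

Research Questions

RQ1: Can we design an inference-time, training-free adaptation algorithm for diffusion models that handles non-differentiable rewards?
RQ2: Is there a theoretical convergence guarantee showing that DOIT's output distribution approaches the reward-induced target distribution?
RQ3: How can the Doob correction term ∇log h be approximated without retraining the model?
RQ4: How do Monte Carlo approximation and discretization affect performance and stability?
RQ5: Does DOIT improve reward-aligned performance on offline RL benchmarks while preserving sampling efficiency?

Key Findings

DOIT uses Doob's h-transform to tilt the diffusion sampling process toward high-reward samples, providing training-free adaptation.
The simulation-based approach comes with a high-probability convergence guarantee to the target high-reward distribution and quantifies the approximation error.
Empirically, DOIT improves reward alignment on offline RL benchmarks while preserving sampling efficiency relative to baselines.
The practical version reduces computational cost by using a surrogate terminal-state estimate and a limited number of backward rollouts.
Experiments on Stable Diffusion show that, in a non-differentiable reward setting, DOIT shifts the reward distribution toward higher aesthetic scores.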

Figure 2: Violin plots of aesthetic scores for the samples generated by Stable Diffusion v1.5, comparing the vanilla generation result and applying DOIT across different $(\tau,\gamma)$ settings. The blue bars indicate the minimum and maximum scores, the orange bars represent the first & third qua

This review was generated by AI and reviewed by human editors.