QUICK REVIEW

[Paper Review] Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser

Zijing Ou, Jacob Si | arXiv (Cornell University) | Feb 12, 2026


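One-Sentence Summary

VMPO reframes diffusion alignment as minimising the variance of log importance weights, which connects it to KL-based methods while opening new design directions; it yields empirical improvements in reward-based alignment on Stable Diffusion.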

ABSTRACT

Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
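The two claims in the abstract can be sketched in a few lines. The notation below is an assumption made for illustration and is not taken from the paper's Eq. 4: $\tau$ is a full denoising trajectory, $p_{\mathrm{ref}}$ the pretrained model, $p_\theta$ the finetuned model, $r$ the reward, $\beta > 0$ a temperature, $p^\star \propto p_{\mathrm{ref}}\exp(r/\beta)$ the normalised reward-tilted target, and $q$ a sampling distribution with full support that is held fixed when differentiating.

```latex
\begin{align*}
w_\theta(\tau) &= \frac{p_{\mathrm{ref}}(\tau)\,\exp\!\big(r(\tau)/\beta\big)}{p_\theta(\tau)},
\qquad
\mathcal{L}_{\mathrm{Var}}(\theta) = \tfrac{1}{2}\,\mathrm{Var}_{\tau \sim q}\big[\log w_\theta(\tau)\big], \\[4pt]
\mathcal{L}_{\mathrm{Var}}(\theta) = 0
&\;\Longleftrightarrow\; \log w_\theta(\tau) = c \ \text{for $q$-almost every } \tau
\;\Longleftrightarrow\; p_\theta(\tau) = p^\star(\tau), \\[4pt]
\nabla_\theta \mathcal{L}_{\mathrm{Var}}(\theta)
&= \mathbb{E}_{q}\Big[\big(\log w_\theta - \mathbb{E}_{q}[\log w_\theta]\big)\,\nabla_\theta \log w_\theta\Big]
 = -\,\mathbb{E}_{q}\Big[\big(\log w_\theta - \mathbb{E}_{q}[\log w_\theta]\big)\,\nabla_\theta \log p_\theta\Big], \\[4pt]
\nabla_\theta \mathcal{L}_{\mathrm{Var}}(\theta)\Big|_{q = p_\theta}
&= -\,\mathbb{E}_{p_\theta}\big[\log w_\theta\,\nabla_\theta \log p_\theta\big]
 = \nabla_\theta\,\mathrm{KL}\big(p_\theta \,\|\, p^\star\big).
\end{align*}
```

The last line uses $\mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta] = 0$, which removes the mean term on the variance side and the normalising constant of $p^\star$ on the KL side. The factor $\tfrac{1}{2}$ is a convenience chosen here so that the two gradients match exactly; the paper may scale its objective differently.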

Research Motivation and Objectives

Motivate diffusion alignment as a way to steer pretrained diffusion models toward high-reward samples.
Introduce Variance Minimisation Policy Optimisation (VMPO) as an alternative to KL-based objectives.
Show that variance minimisation yields the same gradient as KL under on-policy sampling.
Demonstrate that VMPO recovers existing methods under certain choices and enables new design directions.
Empirically validate VMPO by finetuning Stable Diffusion 1.5 and 3.5 across diverse rewards.

Proposed Method

Treat the denoising process as a sequential proposal in a Sequential Monte Carlo view.
Define the VMPO objective as minimising the variance of log importance weights along the trajectory (Eq. 4).
Prove that the optimum yields the reward-tilted target and that on-policy gradients coincide with KL-based alignment (Proposition 1).
Estimate the VMPO loss via Monte Carlo samples and introduce a neural estimator M_phi to amortise the log-weight expectation (Eq. 8–9).
Derive a training procedure and instantiate two variants, VMPO-R2G and VMPO-Diff, through different reward potentials (Appendix C).
Show that VMPO connects to GRPO and other diffusion-alignment methods as special cases under specific variance strategies (Appendix C).
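The two properties stated in Proposition 1 can be checked numerically on a toy problem. The sketch below is not the paper's implementation: it swaps the diffusion trajectory for a four-outcome categorical distribution, computes expectations exactly instead of by Monte Carlo, and uses finite differences instead of a neural estimator. All names (`p_ref`, `reward`, `beta`, `half_log_weight_variance`, and so on) are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Turn unnormalised logits into a probability vector."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def log_weights(theta, p_ref, reward, beta):
    """log w(x) = log p_ref(x) + r(x)/beta - log p_theta(x) for every outcome x."""
    p = softmax(theta)
    return [math.log(p_ref[i]) + reward[i] / beta - math.log(p[i])
            for i in range(len(theta))]


def half_log_weight_variance(theta, q, p_ref, reward, beta):
    """0.5 * Var_q[log w]; q is the sampling distribution and is held fixed."""
    lw = log_weights(theta, p_ref, reward, beta)
    mean = sum(qi * l for qi, l in zip(q, lw))
    return 0.5 * sum(qi * (l - mean) ** 2 for qi, l in zip(q, lw))


def kl_to_target(theta, p_ref, reward, beta):
    """KL(p_theta || p_star), with p_star proportional to p_ref * exp(r / beta)."""
    p = softmax(theta)
    tilt = [p_ref[i] * math.exp(reward[i] / beta) for i in range(len(p))]
    z = sum(tilt)
    return sum(p[i] * (math.log(p[i]) - math.log(tilt[i] / z))
               for i in range(len(p)))


def finite_diff_grad(fn, theta, eps=1e-5):
    """Central finite-difference gradient of fn at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


# Toy problem (assumed values, not from the paper).
p_ref = [0.1, 0.2, 0.3, 0.4]
reward = [1.0, 0.0, 2.0, -1.0]
beta = 0.5
theta = [0.3, -0.2, 0.1, 0.5]

# On-policy: sample from the current policy, but do not differentiate through q.
q = softmax(theta)
grad_var = finite_diff_grad(
    lambda t: half_log_weight_variance(t, q, p_ref, reward, beta), theta)
grad_kl = finite_diff_grad(
    lambda t: kl_to_target(t, p_ref, reward, beta), theta)
max_gap = max(abs(a - b) for a, b in zip(grad_var, grad_kl))

# Logits of the reward-tilted target: the variance objective should vanish there.
theta_star = [math.log(p_ref[i]) + reward[i] / beta for i in range(len(p_ref))]
loss_at_target = half_log_weight_variance(
    theta_star, softmax(theta_star), p_ref, reward, beta)

print("max gradient gap:", max_gap)
print("loss at tilted target:", loss_at_target)
```

Holding `q` fixed while differentiating mirrors the usual practice of sampling from the current policy without backpropagating through the sampler. If `q` is instead taken off-policy (any other full-support distribution), the two gradients generally differ, which is where a variance objective departs from KL.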

Experimental Results

Research Questions

RQ1: How can diffusion alignment be formulated beyond KL minimisation?
RQ2: Does variance minimisation yield equivalent gradients to KL under on-policy sampling, and what are the practical benefits?
RQ3: How do different potential functions and variance strategies relate to existing diffusion-alignment methods?
RQ4: Can VMPO improve reward-driven alignment when finetuning real diffusion models on practical reward signals?

Key Findings

Method	HPSv2	CLIPScore	ImageReward	DreamSim
SD1.5 (Base)	0.2368 ± 0.0029	0.2717 ± 0.0032	0.0331 ± 0.0779	0.4389 ± 0.0116
GRPO	0.2684 ± 0.0035	0.2653 ± 0.0034	0.3449 ± 0.0758	0.3220 ± 0.0098
VMPO-R2G	0.2723 ± 0.0032	0.2713 ± 0.0030	0.3427 ± 0.0762	0.3673 ± 0.0115
VMPO-Diff	0.2822 ± 0.0040	0.2622 ± 0.0028	0.4973 ± 0.0780	0.2916 ± 0.0104

VMPO optimises diffusion alignment by minimising the variance of log importance weights along the denoising trajectory.
Under on-policy sampling, the VMPO gradient matches the gradient of KL-based alignment.
VMPO with different variance strategies recovers existing methods and suggests new design directions beyond KL.
Empirically, on Stable Diffusion 1.5 both VMPO variants score higher on HPSv2 than the base model and GRPO, and VMPO-Diff achieves the highest HPSv2 (0.2822) and ImageReward (0.4973); VMPO-R2G's ImageReward (0.3427) is on par with GRPO's (0.3449).
VMPO-Diff's reward gains come with the lowest CLIPScore (0.2622) and DreamSim (0.2916) in the table, indicating reward hacking tendencies similar to other methods.
The paper provides a unified probabilistic lens (SMC) for understanding diffusion alignment and its variants.
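The trade-off described in the findings can be read directly off the results table by subtracting the SD1.5 base scores. The snippet below is a small illustrative calculation, not part of the paper: it uses only the reported means, ignores the ± terms, and makes no assumption about whether a higher or lower DreamSim value is preferable. The helper names are hypothetical.

```python
# Mean scores copied from the results table (the ± terms are ignored here).
results = {
    "SD1.5 (Base)": {"HPSv2": 0.2368, "CLIPScore": 0.2717,
                     "ImageReward": 0.0331, "DreamSim": 0.4389},
    "GRPO":         {"HPSv2": 0.2684, "CLIPScore": 0.2653,
                     "ImageReward": 0.3449, "DreamSim": 0.3220},
    "VMPO-R2G":     {"HPSv2": 0.2723, "CLIPScore": 0.2713,
                     "ImageReward": 0.3427, "DreamSim": 0.3673},
    "VMPO-Diff":    {"HPSv2": 0.2822, "CLIPScore": 0.2622,
                     "ImageReward": 0.4973, "DreamSim": 0.2916},
}
BASE = "SD1.5 (Base)"


def delta_vs_base(results, base=BASE):
    """Absolute change of each method's mean score relative to the base model."""
    base_scores = results[base]
    return {
        method: {metric: round(value - base_scores[metric], 4)
                 for metric, value in scores.items()}
        for method, scores in results.items()
        if method != base
    }


def highest_per_metric(results):
    """Name of the method with the numerically highest mean for each metric."""
    metrics = next(iter(results.values())).keys()
    return {metric: max(results, key=lambda m: results[m][metric])
            for metric in metrics}


deltas = delta_vs_base(results)
leaders = highest_per_metric(results)

for method, row in deltas.items():
    print(method, row)
print(leaders)
```

Absolute rather than relative differences are used because the base ImageReward is close to zero, which would make a percentage change misleadingly large.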


This review was generated by AI and reviewed by human editors.