QUICK REVIEW

[Paper Review] Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

Ziwei Luo, Ziqi Jin|arXiv (Cornell University)|Feb 2, 2026

Topic Modeling | Cited by 0

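One-Sentence Summary

Introduces Self-Rewarding Sequential Monte Carlo (SR-SMC) for inference-time scaling of masked diffusion language models: multiple particles run in parallel and are weighted by trajectory-level confidence, which improves sample quality and diversity without additional training.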

ABSTRACT

This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvement without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at https://github.com/Algolzw/self-rewarding-smc.

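Research Motivation and Goals

Motivate and address the lack of diversity in greedy MDLM sampling.
Propose a general SR-SMC framework that uses trajectory-level confidence as a self-rewarding signal.
Show that SR-SMC improves sample quality across MDLMs and diffusion LLMs (dLLMs) without an additional reward model.
Demonstrate scalability through extensive experiments on multiple models and benchmark datasets.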

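Proposed Method

Maintain N interacting diffusion processes (particles) that explore multiple trajectories in parallel.
Define trajectory-level confidence as the product of the confidences of the updated tokens (Eq. 13) and use it to weight the particles.
Use a reverse diffusion kernel with adaptive remasking, together with a policy for choosing which tokens to unmask (Eqs. 9–10, with the policy from Eqs. 7–8).
Apply the standard SMC steps (resample, propagate, reweight), with adaptive resampling triggered by the effective sample size (ESS) (Eq. 14).
Sample discrete tokens with the Gumbel-Max trick combined with temperature control (Eq. 15).
Provide a theoretical argument that trajectory-level confidence is a natural self-reward in a guided bootstrap SMC setting (Proposition 3.1).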
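The weighting, resampling, and token-sampling steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`gumbel_max_sample`, `effective_sample_size`, `smc_step`) and the `ess_threshold` parameter are hypothetical, multinomial resampling is an assumed choice, and the per-step log-confidence is assumed to be the sum of log-probabilities of the tokens a particle unmasked in that step (the log of the product described above).

```python
import numpy as np


def gumbel_max_sample(logits, temperature, rng):
    """Draw token ids from softmax(logits / temperature) via the Gumbel-Max trick."""
    gumbel = rng.gumbel(size=logits.shape)
    return np.argmax(logits / temperature + gumbel, axis=-1)


def normalize(log_w):
    """Turn unnormalized log-weights into probabilities (numerically stable)."""
    w = np.exp(log_w - np.max(log_w))
    return w / w.sum()


def effective_sample_size(log_w):
    """ESS = 1 / sum_i w_i^2 for normalized weights w_i."""
    w = normalize(log_w)
    return 1.0 / np.sum(w ** 2)


def smc_step(log_w, step_log_conf, ess_threshold, rng):
    """Reweight particles by their step confidence, resample if ESS is low.

    log_w:         current log-weights, shape (N,)
    step_log_conf: assumed per-particle sum of log-confidences of the
                   tokens unmasked in this step, shape (N,)
    ess_threshold: hypothetical fraction of N below which we resample
    Returns (ancestor indices, new log-weights).
    """
    n = len(log_w)
    log_w = log_w + step_log_conf
    if effective_sample_size(log_w) < ess_threshold * n:
        # Multinomial resampling (an assumption); weights reset to uniform.
        ancestors = rng.choice(n, size=n, p=normalize(log_w))
        return ancestors, np.zeros(n)
    # Weights are balanced enough: every particle keeps its own trajectory.
    return np.arange(n), log_w


rng = np.random.default_rng(0)
# Four particles; the first one was far more confident in this step.
ancestors, log_w = smc_step(
    np.zeros(4), np.array([0.0, -1000.0, -1000.0, -1000.0]), 0.5, rng
)
print(ancestors, log_w)
```

In a full sampler, each particle's token state would be copied according to `ancestors` before the next denoising step. Resampling only when the ESS drops below a threshold avoids discarding particles while the weights are still balanced.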

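Experimental Results

Research Questions

RQ1: Can SR-SMC improve the sampling quality of MDLMs and dLLMs without additional training or external rewards?
RQ2: Does trajectory-level weighting yield better exploration and diversity in MDLMs than token-level confidence?
RQ3: How does SR-SMC perform on standard benchmarks across different MDLMs and diffusion-based LLMs?
RQ4: How do the number of particles and the temperature affect the performance and stability of SR-SMC?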

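Key Findings

SR-SMC consistently improves generative perplexity (Gen. PPL) and sample quality on MDLMs (MDLM, BD3-LMs) and dLLMs (LLaDA-1.5, Dream-7B).
Increasing the number of particles (N) yields progressive gains, most notably at N=3 or 4.
Block-wise decoding variants with SR-SMC reach a Gen. PPL below 20 in some BD3-LMs configurations, narrowing the gap to autoregressive baselines.
SR-SMC improves performance on the GSM8K, MATH, HumanEval, and MBPP benchmarks, with average gains of roughly 2–4 points depending on the model and text length.
SR-SMC is robust to the sampling temperature and reduces repetition relative to greedy decoding at low temperatures.
Ablations show that SR-SMC delivers notable gains even with a moderate number of particles and in the zero-shot setting.
Analysis shows that in a substantial fraction of blocks one particle overtakes another, confirming that SR-SMC explores non-greedy trajectories.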

This review was generated by AI and reviewed by human editors.