QUICK REVIEW

[Paper Review] Training Diffusion Language Models for Black-Box Optimization

Zipeng Sun, Can Chen|arXiv (Cornell University)|Mar 18, 2026

Machine Learning in Materials Science|Cited by 0

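One-Sentence Summary

The paper proposes DiBO, a diffusion language model framework for offline black-box optimization that uses domain adaptation with delimiter tokens and two-stage post-training (SFT followed by RL) to generate high-label designs from heterogeneous prompts, designs, and labels, reaching state-of-the-art results in Design-Bench small-data settings.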

ABSTRACT

We study offline black-box optimization (BBO), aiming to discover improved designs from an offline dataset of designs and labels, a problem common in robotics, DNA, and materials science with limited labeled samples. While recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, their left-to-right design generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt-response corpus and introduce delimiter tokens to explicitly mark field boundaries for domain adaptation. We further propose a two-stage post-training framework to align the diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage adopts reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench small-data settings.

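Research Motivation and Goals

Bridge the gap between natural-language pre-training and offline design data to enable bidirectional modeling in BBO.
Use diffusion LLMs to capture bidirectional dependencies in the design space.
Develop a domain-adaptation and post-training pipeline that aligns diffusion generation with high-label designs.
Demonstrate superior performance in Design-Bench small-data settings across both discrete and continuous tasks.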

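Proposed Method

Build a unified prompt-response corpus that marks designs and labels with explicit delimiter tokens.
Perform domain adaptation by jointly predicting masked tokens in both the prompt and the response (joint DA loss).
Apply two-stage post-training: supervised fine-tuning on the unified corpus (masked-response prediction), followed by reinforcement learning with rewards based on label improvement.
For efficiency, use a one-step log-probability approximation in RL, with reward r(q,o)=y(o)−y(q) normalized by the reward standard deviation.
Evaluate on Design-Bench tasks (TF Bind 8, TF Bind 10, Ant Morphology, D’Kitty Morphology) with 128 candidates per task, and assess robustness through ablation studies.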
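
The reward in the RL step can be illustrated with a minimal sketch. It assumes that y(q) is the label attached to the prompt and y(o) the label of each sampled design, that the standard deviation is taken over the rewards of one group of samples, and that a small `eps` guards against division by zero. The function name, the grouping, and `eps` are illustrative assumptions, not details confirmed by the paper.

```python
import statistics


def normalized_rewards(query_label, output_labels, eps=1e-8):
    """Compute r(q, o) = y(o) - y(q) per sample, scaled by the reward std.

    query_label   -- y(q), the label associated with the prompt (assumed)
    output_labels -- y(o) for each design sampled for that prompt
    eps           -- hypothetical stabilizer, not taken from the paper
    """
    raw = [y_o - query_label for y_o in output_labels]
    std = statistics.pstdev(raw)  # population std over this group of samples
    return [r / (std + eps) for r in raw]


# Four sampled designs for a prompt whose label is 0.5.
rewards = normalized_rewards(0.5, [0.7, 0.4, 0.9, 0.5])
```

Designs that improve on the prompt's label get a positive reward and worse designs a negative one, while dividing by the standard deviation keeps the reward scale comparable across tasks whose labels have different ranges.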

Figure 1 : Overview of the DiBO framework. (a) Unified Prompt–Response Corpus: Heterogeneous BBO signals (natural-language prompts, offline designs and their associated labels) are unified using explicit delimiter tokens. (b) Domain Adaptation (DA): The diffusion LLM is domain-adapted via joint mask
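
Panel (a) can be made concrete with a small sketch of how one prompt-response record might be assembled. The delimiter strings (`<design>`, `<label>`), the prompt wording, and the choice to put only the target design in the response are all hypothetical; the summary does not state which tokens or templates DiBO uses.

```python
# Hypothetical delimiter tokens; the actual tokens used by DiBO are not given here.
DESIGN_OPEN, DESIGN_CLOSE = "<design>", "</design>"
LABEL_OPEN, LABEL_CLOSE = "<label>", "</label>"


def format_design(design):
    """Wrap a design (a sequence of numbers or symbols) in design delimiters."""
    body = " ".join(str(x) for x in design)
    return f"{DESIGN_OPEN}{body}{DESIGN_CLOSE}"


def format_example(design, label):
    """Wrap one offline (design, label) pair so field boundaries are explicit."""
    return f"{format_design(design)}{LABEL_OPEN}{label:.3f}{LABEL_CLOSE}"


def build_record(task_description, context, target_design):
    """Return a (prompt, response) pair for the unified corpus.

    context is a list of (design, label) few-shot examples; the response
    here holds only the target design, which is an assumption.
    """
    shots = "\n".join(format_example(d, y) for d, y in context)
    prompt = f"{task_description}\n{shots}\nPropose a design with a higher label."
    return prompt, format_design(target_design)


prompt, response = build_record(
    "Task: maximize binding affinity.",
    [([1, 2, 0], 0.41), ([3, 0, 2], 0.57)],
    [3, 1, 2],
)
```

With explicit delimiters, the model does not have to infer from plain text where a design ends and its label begins.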

Experimental Results

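Research Questions

RQ1: Do diffusion LLMs capture bidirectional dependencies in offline BBO better than autoregressive LLMs?
RQ2: Does domain adaptation with delimiter-augmented prompts improve learning from heterogeneous offline data?
RQ3: Does the three-stage training pipeline (DA, SFT, RL) achieve better alignment with high-label designs in small-data regimes?
RQ4: How does DiBO perform against a diverse set of baselines on discrete and continuous design tasks?
RQ5: How do prompt similarity, delimiter tokens, and the individual training stages affect performance?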

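Key Findings

DiBO achieves state-of-the-art performance on multiple tasks in Design-Bench small-data settings.
Compared with plain-text boundaries, delimiter tokens significantly improve performance at every training stage.
The three-stage pipeline (DA + SFT + RL) outperforms two-stage variants, with RL providing finer-grained reward optimization.
Similarity-conditioned context construction, which selects prompt examples by design similarity, performs significantly better than random context.
DiBO generally outperforms forward-surrogate guided methods and many diffusion baselines, especially on the Ant Morphology and TF Bind tasks.
The method remains robust to hyperparameters such as the RL learning rate and to variations in the prompt template.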
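
The similarity-conditioned context construction mentioned above can be sketched as a nearest-neighbor lookup over the offline dataset. The summary does not say which similarity measure is used, so Euclidean distance is an assumption here (a Hamming distance might suit discrete tasks better), and `select_context` is a hypothetical helper name.

```python
import math


def select_context(query_design, dataset, k):
    """Return the k (design, label) pairs whose designs are closest to the query.

    Closeness is measured by Euclidean distance, which is an assumed choice.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(dataset, key=lambda pair: distance(query_design, pair[0]))
    return ranked[:k]


offline_data = [([0, 0], 0.1), ([1, 1], 0.5), ([5, 5], 0.9)]
context = select_context([1, 0], offline_data, k=2)
```

A random-context baseline would instead draw the k examples uniformly from the dataset, which is the comparison the finding refers to.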

Figure 2 : Hyperparameter sensitivity on TF Bind 8 and Ant Morphology at the RL stage: (a) the number of few-shot examples in the prompt (context length), (b) size of the offline dataset, (c) learning rate used in RL stage, and (d) variations of prompt templates. Results are reported as relative per


This review was generated by AI and reviewed by human editors.