QUICK REVIEW

[Paper Review] Training Diffusion Language Models for Black-Box Optimization

Zipeng Sun, Can Chen | arXiv (Cornell University) | Mar 18, 2026

Machine Learning in Materials Science | Citations: 0

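One-Line Summary

This paper proposes DiBO, a diffusion-LLM framework for offline black-box optimization. It combines delimiter-token domain adaptation with two-stage post-training (SFT, then RL) to generate high-label designs from heterogeneous prompts, designs, and labels, and achieves state-of-the-art results in the Design-Bench small-data settings.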

ABSTRACT

We study offline black-box optimization (BBO), aiming to discover improved designs from an offline dataset of designs and labels, a problem common in robotics, DNA, and materials science with limited labeled samples. While recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, their left-to-right design generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt-response corpus and introduce delimiter tokens to explicitly mark field boundaries for domain adaptation. We further propose a two-stage post-training framework to align the diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage adopts reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench small-data settings.

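Motivation and Objectives

Bridge the gap between natural-language pre-training and offline design data so that bidirectional modeling can be applied to BBO.
Use diffusion LLMs to capture bidirectional dependencies in the design space.
Develop domain adaptation with delimiter-augmented prompts and a post-training pipeline that aligns generation with high-label designs.
Demonstrate strong performance in the Design-Bench small-data settings across both discrete and continuous tasks.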

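Proposed Method

Build a unified prompt-response corpus in which explicit delimiter tokens separate designs and labels.
Perform domain adaptation by jointly predicting masked tokens in both prompts and responses (joint DA loss).
Apply two-stage post-training: supervised fine-tuning on the unified corpus (masked-response prediction), followed by reinforcement learning with label improvement as the reward.
For efficiency, use a one-step approximation of the log-probability in RL; the reward is r(q,o) = y(o) − y(q), normalized by the standard deviation of the rewards.
Evaluate on Design-Bench tasks (TF Bind 8, TF Bind 10, Ant Morphology, D’Kitty Morphology) with 128 candidates per task, and analyze robustness through ablations.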
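
The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the choice to normalize over the candidates sampled for one prompt, and the small `eps` term are all assumptions, and labels are passed in as plain numbers because the review does not say how y(o) is obtained.

```python
import statistics

def label_improvement_rewards(query_label, candidate_labels, eps=1e-8):
    """Compute r(q, o) = y(o) - y(q) for each candidate o, then divide by
    the standard deviation of those rewards.

    Assumptions (not stated in the review): the std is taken over the
    candidates generated for a single prompt q, and eps guards against a
    zero std when all candidates share the same label.
    """
    raw = [y_o - query_label for y_o in candidate_labels]
    std = statistics.pstdev(raw)
    return [r / (std + eps) for r in raw]

# Three hypothetical candidates for a prompt whose label is 0.5.
rewards = label_improvement_rewards(0.5, [0.4, 0.5, 0.6])
```

Candidates that improve on the prompt's label receive a positive reward and worse ones a negative reward; the sign is preserved because this sketch only rescales and does not subtract a mean.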

Figure 1 : Overview of the DiBO framework. (a) Unified Prompt–Response Corpus: Heterogeneous BBO signals (natural-language prompts, offline designs and their associated labels) are unified using explicit delimiter tokens. (b) Domain Adaptation (DA): The diffusion LLM is domain-adapted via joint mask
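
To make the unified corpus and masked-response prediction more concrete, the sketch below formats few-shot (design, label) pairs with delimiter strings and then masks only the response tokens. Everything named here is hypothetical: the delimiter strings, the `[MASK]` symbol, whitespace tokenization, and the choice to use a delimited design as the response are illustrative assumptions, since this review does not give the paper's actual tokens or templates.

```python
import random

# Hypothetical delimiter strings; the paper's actual tokens are not given here.
DESIGN_OPEN, DESIGN_CLOSE = "<design>", "</design>"
LABEL_OPEN, LABEL_CLOSE = "<label>", "</label>"
MASK = "[MASK]"

def format_design(design):
    """Wrap a design (a sequence of values) in design delimiters."""
    body = " ".join(str(x) for x in design)
    return f"{DESIGN_OPEN} {body} {DESIGN_CLOSE}"

def format_label(label):
    """Wrap a scalar label in label delimiters."""
    return f"{LABEL_OPEN} {label:.3f} {LABEL_CLOSE}"

def build_pair(task_description, context, target_design):
    """Build one prompt-response pair.

    The prompt holds a task description plus few-shot (design, label)
    examples; the response is the delimited target design (an assumption).
    """
    shots = "\n".join(f"{format_design(d)} {format_label(y)}" for d, y in context)
    prompt = f"{task_description}\n{shots}"
    return prompt, format_design(target_design)

def mask_response(prompt_tokens, response_tokens, mask_ratio, rng):
    """Replace a random subset of response tokens with MASK and leave the
    prompt intact, as in masked-response prediction."""
    masked = [MASK if rng.random() < mask_ratio else t for t in response_tokens]
    return prompt_tokens + masked

prompt, response = build_pair("Maximize the label.", [([1, 2], 0.5)], [3, 4])
model_input = mask_response(prompt.split(), response.split(), 0.5, random.Random(0))
```

For the joint domain-adaptation loss mentioned in the method list, the same masking would be applied to prompt tokens as well; that variant is omitted to keep the sketch short.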

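Experimental Results

Research Questions

RQ1: Do diffusion LLMs capture bidirectional dependencies better than autoregressive LLMs in offline BBO?
RQ2: Does domain adaptation with delimiter-augmented prompts improve learning from heterogeneous offline data?
RQ3: Does the three-stage training pipeline (DA, SFT, RL) improve alignment with high-label designs in the small-data regime?
RQ4: How much does DiBO outperform diverse baselines on discrete and continuous design tasks?
RQ5: How much do prompt similarity, delimiter tokens, and training stages affect performance?

Key Findings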

Method	Ant Morphology	D’Kitty Morphology	TF Bind 8	TF Bind 10	Mean Score ↑	Rank Mean ↓	Rank Median ↓
DiBO (ours, full)	0.944±0.016	0.923±0.002	0.965±0.038	0.755±0.012	0.897±0.017	2.5	1.0
OPRO	0.517±0.039	0.856±0.046	0.758±0.017	0.500±0.013	0.657±0.028	13.5	14.5
GTG	0.603±0.039	0.917±0.023	0.762±0.016	0.730±0.026	0.753±0.026	7.25	9.0
DDOM	0.590±0.026	0.929±0.037	0.739±0.016	0.497±0.002	0.689±0.020	10.75	11.5
MIN	0.570±0.003	0.886±0.017	0.764±0.008	0.517±0.030	0.684±0.015	11.75	12.5
ExPT	0.929±0.049	0.950±0.041	0.810±0.044	0.703±0.022	0.848±0.039	3.0	3.0
BONET	0.632±0.042	0.920±0.040	0.776±0.007	0.492±0.043	0.705±0.033	9.5	7.5
CMA-ES	0.592±0.010	0.711±0.045	0.784±0.029	0.658±0.031	0.686±0.029	9.75	8.5
UniSO-T	0.636±0.045	0.939±0.007	0.836±0.027	0.522±0.017	0.733±0.024	6.0	5.0
Grad-mean	0.644±0.039	0.907±0.016	0.666±0.011	0.695±0.027	0.728±0.023	9.5	8.5
Grad-EI	0.626±0.002	0.901±0.045	0.673±0.012	0.689±0.013	0.722±0.018	10.5	10.5
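
As a sanity check on the table, the Mean Score column appears to be the plain average of the four per-task scores, and the rank columns appear to summarize per-task ranks. The sketch below recomputes these quantities for three rows copied from the table. The `summarize` helper is hypothetical, and because ranks here are computed only among these three methods, they will not match the table's rank columns, which evidently cover a larger set of baselines.

```python
from statistics import mean, median

TASKS = ["Ant Morphology", "D’Kitty Morphology", "TF Bind 8", "TF Bind 10"]

# Per-task mean scores copied from three rows of the table above.
SCORES = {
    "DiBO": [0.944, 0.923, 0.965, 0.755],
    "ExPT": [0.929, 0.950, 0.810, 0.703],
    "GTG": [0.603, 0.917, 0.762, 0.730],
}

def summarize(scores):
    """Return mean score, mean rank, and median rank for each method.

    Ranks are computed per task (1 = highest score) among the methods
    passed in, so they depend on which baselines are included.
    """
    methods = list(scores)
    ranks = {m: [] for m in methods}
    for t in range(len(TASKS)):
        ordered = sorted(methods, key=lambda m: scores[m][t], reverse=True)
        for position, m in enumerate(ordered, start=1):
            ranks[m].append(position)
    return {
        m: {
            "mean_score": mean(scores[m]),
            "rank_mean": mean(ranks[m]),
            "rank_median": median(ranks[m]),
        }
        for m in methods
    }

summary = summarize(SCORES)
```

With these inputs, the recomputed mean scores agree with the table's Mean Score column to three decimals.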

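DiBO achieves state-of-the-art performance on Design-Bench in the small-data settings, with the best mean score (0.897) and the top scores on Ant Morphology, TF Bind 8, and TF Bind 10 among the methods in the table; on D’Kitty Morphology, ExPT, UniSO-T, and DDOM score higher.
Compared with plain-text boundaries, delimiter tokens provide clear semantic separation at every training stage and make domain adaptation more effective.
The three-stage pipeline (DA + SFT + RL) outperforms its two-stage variants, with RL providing fine-grained optimization toward the reward.
Similarity-conditioned context construction (selecting prompt examples by design similarity) substantially improves performance over random contexts.
DiBO consistently outperforms forward surrogate-guided methods and many diffusion-based baselines, most notably on Ant Morphology and the TF Bind tasks.
The method is also robust to hyperparameter variations such as the RL learning rate and the prompt template.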
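
The similarity-conditioned context construction mentioned above can be pictured as a nearest-neighbor lookup over the offline dataset. The sketch below is one plausible reading, not the paper's procedure: the function name, the use of a query design, the default Euclidean distance, and the value of `k` are assumptions. For discrete sequences such as the TF Bind tasks, a different distance (for example Hamming) could be passed in instead.

```python
import math

def select_context(query_design, dataset, k, distance=math.dist):
    """Pick the k (design, label) pairs whose designs are closest to
    query_design under the given distance function.

    dataset is a list of (design, label) pairs; designs are equal-length
    numeric sequences when the default Euclidean distance is used.
    """
    ranked = sorted(dataset, key=lambda pair: distance(pair[0], query_design))
    return ranked[:k]

# Hypothetical offline data: three 2-D designs with scalar labels.
offline = [([0.0, 0.0], 0.1), ([1.0, 1.0], 0.5), ([5.0, 5.0], 0.9)]
context = select_context([0.9, 0.9], offline, k=2)
```

The random-context baseline from the ablation would correspond to replacing the sorted lookup with a uniform sample of k pairs.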

Figure 2 : Hyperparameter sensitivity on TF Bind 8 and Ant Morphology at the RL stage: (a) the number of few-shot examples in the prompt (context length), (b) size of the offline dataset, (c) learning rate used in RL stage, and (d) variations of prompt templates. Results are reported as relative per


This review was generated by AI and checked by a human editor.