QUICK REVIEW

[Paper Review] Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch|arXiv (Cornell University)|Feb 16, 2026

Adversarial Robustness in Machine Learning|Citations: 0

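One-Line Summary

DAT uses diffusion LLMs as a generative surrogate for sampling data-specific adversarial prompts, then applies continuous adversarial training to those samples, improving robustness against a range of attacks while preserving utility.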

ABSTRACT

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.

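Motivation and Objectives

Formalize the gap between the empirical robust risk and the population robust risk in adversarial training (AT) for LLMs.
Propose Distributional Adversarial Training (DAT), which combines a generative diffusion surrogate with continuous adversarial optimization.
Sample from a diffusion-based surrogate over harmful prompts to improve robustness against both model-specific and data-specific attacks.
Provide a theoretical argument that higher fidelity to the data distribution tightens the robustness bound.
Demonstrate an improved robustness-utility trade-off across model sizes and architectures.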
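
The fidelity argument can be made concrete with a standard total-variation calculation. The sketch below is not taken from the paper; it assumes that the per-sample robust loss $\ell_\theta(x,y)$ is bounded, $|\ell_\theta(x,y)| \le M$, that $p$ denotes the true joint distribution of prompts and responses, that $p^{(\mathrm{diff})}$ denotes the diffusion surrogate (subscript dropped to avoid a clash with the model parameters $\theta$), and that $\mathrm{TV}(p, q) = \sup_A |p(A) - q(A)| = \tfrac{1}{2}\int |p - q|$ with $\mathrm{TV}(p, p^{(\mathrm{diff})}) \le \epsilon$.

```latex
\begin{aligned}
R_{\text{pop}}(\theta) &= \mathbb{E}_{(x,y)\sim p}\big[\ell_\theta(x,y)\big],
\qquad
R_{\text{diff}}(\theta) = \mathbb{E}_{(x,y)\sim p^{(\mathrm{diff})}}\big[\ell_\theta(x,y)\big], \\
\big|R_{\text{pop}}(\theta) - R_{\text{diff}}(\theta)\big|
&= \Big|\int \ell_\theta(x,y)\,\big(p(x,y) - p^{(\mathrm{diff})}(x,y)\big)\,dx\,dy\Big| \\
&\le M \int \big|p(x,y) - p^{(\mathrm{diff})}(x,y)\big|\,dx\,dy \\
&= 2M\,\mathrm{TV}\big(p, p^{(\mathrm{diff})}\big) \;\le\; 2M\epsilon .
\end{aligned}
```

Under these assumptions, a surrogate that is closer to the data distribution (smaller $\epsilon$) makes the surrogate risk a tighter proxy for the population robust risk.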

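Proposed Method

Define the gap between the empirical robust risk and the population robust risk in LLM AT.
Introduce a generative surrogate $p_\theta^{(\mathrm{diff})}(x, y)$ that uses diffusion LLMs to sample a prompt $x$ conditioned on a harmful response $y$.
Use Monte Carlo sampling conditioned on a harmful $y$ to generate diverse, data-specific prompts $x$.
Apply continuous adversarial training (CAT) in the inner loop, maximizing the loss $L_\delta$ to promote robustness.
Regularize the outer loop with a KL term ($L_{\mathrm{KL}}$) to preserve utility and stability.
Provide a fidelity-based surrogate bound, $|R_{\text{pop}}(\theta) - R_{\text{diff}}(\theta)| \le 2M\epsilon$, under a total-variation (TV) fidelity assumption.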
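
The inner and outer loops listed above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the paper's implementation: the "model" is a two-dimensional logistic scorer rather than an LLM, `sample_prompts` is a hypothetical stand-in for drawing prompts from the diffusion surrogate given a harmful behavior, and every name, step size, and data value is invented for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def refusal_loss(w, x):
    # -log sigmoid(w.x): small when the toy model "refuses" the harmful prompt x.
    return math.log1p(math.exp(-dot(w, x)))

def sample_prompts(center, k, rng, spread=0.3):
    # Hypothetical stand-in for sampling x ~ p_diff(x | y): k noisy
    # embeddings around one harmful behavior (NOT a real diffusion LLM).
    return [[c + rng.gauss(0.0, spread) for c in center] for _ in range(k)]

def inner_attack(w, x, eps, steps=4):
    # Inner loop: projected gradient ascent on the refusal loss over a
    # continuous perturbation delta with ||delta||_2 <= eps.
    delta = [0.0] * len(x)
    for _ in range(steps):
        z = dot(w, [xi + di for xi, di in zip(x, delta)])
        grad = [-(1.0 - sigmoid(z)) * wi for wi in w]  # d loss / d delta
        gnorm = math.sqrt(dot(grad, grad))
        if gnorm == 0.0:
            break
        delta = [di + 0.5 * eps * gi / gnorm for di, gi in zip(delta, grad)]
        dnorm = math.sqrt(dot(delta, delta))
        if dnorm > eps:  # project back onto the eps-ball
            delta = [di * eps / dnorm for di in delta]
    return delta

def robust_loss(w, xs, eps):
    # Mean refusal loss under the inner attack.
    total = 0.0
    for x in xs:
        delta = inner_attack(w, x, eps)
        total += refusal_loss(w, [xi + di for xi, di in zip(x, delta)])
    return total / len(xs)

def train(behaviors, benign, w_ref, eps=0.5, k=8, lam=1.0,
          epochs=50, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = list(w_ref)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        # Fresh surrogate samples for every harmful behavior in each epoch.
        batch = [x for c in behaviors for x in sample_prompts(c, k, rng)]
        for x in batch:
            delta = inner_attack(w, x, eps)
            x_adv = [xi + di for xi, di in zip(x, delta)]
            coeff = -(1.0 - sigmoid(dot(w, x_adv))) / len(batch)
            grad = [gi + coeff * xi for gi, xi in zip(grad, x_adv)]
        # KL(p_ref || p_w) on benign inputs (Bernoulli outputs); its gradient
        # with respect to w is (sigmoid(w.x) - sigmoid(w_ref.x)) * x.
        for x in benign:
            diff = sigmoid(dot(w, x)) - sigmoid(dot(w_ref, x))
            grad = [gi + lam * diff * xi / len(benign)
                    for gi, xi in zip(grad, x)]
        # Outer loop: one gradient-descent step on adversarial loss + lam * KL.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

behaviors = [[2.0, 0.0], [1.5, 0.5]]   # toy "harmful behavior" centers
benign = [[-1.0, 1.0], [-2.0, 0.5]]    # toy benign inputs
w_ref = [1.0, 0.0]                     # frozen reference model
w_dat = train(behaviors, benign, w_ref)

eval_rng = random.Random(1)
eval_set = [x for c in behaviors for x in sample_prompts(c, 50, eval_rng)]
print("robust loss before:", robust_loss(w_ref, eval_set, 0.5))
print("robust loss after: ", robust_loss(w_dat, eval_set, 0.5))
```

In this toy setting the KL term pulls the trained weights back toward the reference model on benign inputs, which is the role the review attributes to the outer-loop regularizer; the weight `lam` is an assumed knob for that trade-off.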

Figure 1 : Standard AT minimizes the empirical robust risk over a fixed dataset $\mathcal{D}$ (brown), which provides a poor approximation of the population robust risk. This results in a distribution gap where the model remains vulnerable to the manifold of natural language $q$ (blue). Specifically

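Experimental Results

Research Questions

RQ1: Can a diffusion-based generative surrogate effectively approximate the true joint distribution of prompts and harmful responses for LLMs, and thereby reduce the data approximation error in AT?
RQ2: Does sampling prompts from high-likelihood regions of the harmful-prompt distribution (x|y) improve robustness against model-specific and data-specific attacks compared with conventional AT?
RQ3: Does increasing the number of diffusion-generated samples per behavior shrink the empirical robustness gap and improve worst-case performance without hurting utility?
RQ4: Does the method depend on data-specific prompts, or is model-agnostic sampling sufficient?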

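Key Findings

DAT substantially improves worst-case robustness compared with baselines such as CAT, LAT, and CB.
Diffusion-generated prompts yield higher attack transferability and better coverage of the harmful-prompt distribution than model-specific or heuristic perturbations.
Increasing the number of diffusion-generated prompts per behavior reduces the attack success rate (ASR) of inpainting and other attacks, supporting the surrogate fidelity bound.
A diffusion-only surrogate improves robustness but falls short of full DAT, indicating that both data-distribution approximation and continuous adversarial optimization are needed.
DAT achieves a Pareto-optimal robustness-utility trade-off across hyperparameters, outperforming baselines in robustness at comparable utility levels.
Data specificity (sampling from high-likelihood harmful prompts) is essential for robustness gains; low-quality samples limit robustness.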
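
The surrogate fidelity bound mentioned in the findings can be checked numerically on a small discrete example. The sketch below is illustrative only: the two distributions and the loss values are made up, total variation is taken to be half the L1 distance, and `M` is taken to be the largest absolute loss value.

```python
def total_variation(p, q):
    # TV distance between two discrete distributions: half the L1 distance.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def expected(f, p):
    # Expectation of f under the discrete distribution p.
    return sum(fi * pi for fi, pi in zip(f, p))

# Made-up values over four prompt/response "cells" (assumptions, not data).
p_pop = [0.40, 0.30, 0.20, 0.10]    # stand-in for the true distribution
p_diff = [0.35, 0.30, 0.25, 0.10]   # stand-in for the diffusion surrogate
loss = [0.2, 1.5, 3.0, 0.7]         # bounded per-cell robust loss

M = max(abs(v) for v in loss)
eps = total_variation(p_pop, p_diff)
gap = abs(expected(loss, p_pop) - expected(loss, p_diff))
bound = 2 * M * eps

print(f"TV = {eps:.3f}, risk gap = {gap:.3f}, bound 2*M*eps = {bound:.3f}")
```

Here the risk gap (about 0.14) sits below the bound (about 0.30); moving `p_diff` closer to `p_pop` shrinks both quantities.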

Figure 2 : Cumulative transfer ASR across five target models (Gemma3-12B (Gemma Team et al., 2025), Qwen2.5-7B (Qwen et al., 2025), Zephyr-7B (Tunstall et al., 2023), Llama3-8B-LAT (Sheshadri et al., 2024), Llama3-8B-CB (Zou et al., 2024a)) from attacks on Llama3-8B. Diffusion-based I

This review was generated by AI and checked by a human editor.