QUICK REVIEW

[Paper Review] Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch | arXiv (Cornell University) | 2026-02-16

Adversarial Robustness in Machine Learning | Citations: 0

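One-Line Summary

DAT uses diffusion LLMs as a generative surrogate to sample diverse, data-specific adversarial prompts, then applies continuous adversarial training to them, improving robustness against a range of attacks while preserving utility.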

ABSTRACT

Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.

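Research Motivation and Goals

Formalize the gap between the empirical robust risk and the population robust risk in adversarial training (AT) for LLMs.
Propose Distributional Adversarial Training (DAT), which combines a generative diffusion surrogate with continuous adversarial optimization.
Introduce a diffusion-based surrogate p_theta^(diff)(x,y) for sampling prompts x conditioned on a harmful response y.
Generate diverse, data-specific prompts x by Monte Carlo sampling conditioned on harmful responses y.
Regularize the outer loop with a KL term (L_KL) to preserve utility and stability.
Fidelity-based surrogate bound: |R_pop(theta) - R_diff(theta)| <= 2M*epsilon under a TV (total variation) fidelity assumption.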
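
The bound in the last item follows from a standard total-variation argument. The sketch below rests on assumptions this summary does not spell out: $q$ denotes the true joint distribution of prompts and harmful responses, $\ell_{\mathrm{rob}}(\theta; x, y)$ is a per-example robust loss (already maximized over perturbations) with $|\ell_{\mathrm{rob}}| \le M$, and the fidelity assumption reads $\mathrm{TV}(q, p_\theta^{(\mathrm{diff})}) = \sup_A |q(A) - p_\theta^{(\mathrm{diff})}(A)| \le \epsilon$. The paper's exact definitions may differ.

$$
R_{\mathrm{pop}}(\theta) = \mathbb{E}_{(x,y)\sim q}\big[\ell_{\mathrm{rob}}(\theta; x, y)\big],
\qquad
R_{\mathrm{diff}}(\theta) = \mathbb{E}_{(x,y)\sim p_\theta^{(\mathrm{diff})}}\big[\ell_{\mathrm{rob}}(\theta; x, y)\big]
$$

$$
\big|R_{\mathrm{pop}}(\theta) - R_{\mathrm{diff}}(\theta)\big|
= \Big|\sum_{(x,y)} \big(q(x,y) - p_\theta^{(\mathrm{diff})}(x,y)\big)\,\ell_{\mathrm{rob}}(\theta; x, y)\Big|
\le M \sum_{(x,y)} \big|q(x,y) - p_\theta^{(\mathrm{diff})}(x,y)\big|
= 2M\,\mathrm{TV}\big(q, p_\theta^{(\mathrm{diff})}\big)
\le 2M\epsilon
$$

The factor 2 appears because, for discrete distributions, the total variation distance equals half the $L_1$ distance between the two probability mass functions.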

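Proposed Method

Define the gap between the empirical and population robust risks in LLM AT.
Introduce a generative surrogate p_theta^(diff)(x,y), implemented with a diffusion LLM, that samples prompts conditioned on a harmful response y.
Generate diverse, data-specific prompts x by Monte Carlo sampling conditioned on harmful responses y.
Apply continuous adversarial training (CAT) in the inner loop, maximizing the L_delta loss to promote robustness.
Regularize the outer loop with a KL term (L_KL) to preserve utility and stability.
Fidelity-based surrogate bound: |R_pop(theta) - R_diff(theta)| <= 2M*epsilon under a TV (total variation) fidelity assumption.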
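
The steps above can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the "model" is a linear refusal classifier over 2-D prompt embeddings, `sample_prompts` is a hypothetical Gaussian stand-in for the diffusion LLM surrogate, and the KL term is assumed to be computed against a frozen copy of the initial model on benign prompts. All function names, the L-infinity perturbation ball, and the hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def refusal_logit(w, b, e):
    """Toy model: linear score whose sigmoid is the probability of refusing."""
    return sum(wi * ei for wi, ei in zip(w, e)) + b


def harmful_loss(w, b, e):
    """-log p(refuse | e), i.e. softplus(-logit), computed stably."""
    z = refusal_logit(w, b, e)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def sample_prompts(center, n, rng, spread=0.5):
    """Hypothetical stand-in for the diffusion surrogate: draws n prompt
    embeddings around a behavior-specific center (plays the role of x | y)."""
    return [[c + rng.gauss(0.0, spread) for c in center] for _ in range(n)]


def continuous_attack(w, b, e, eps, steps=5):
    """Inner loop: signed gradient ascent on the loss with respect to an
    embedding perturbation, clipped to the L-infinity ball of radius eps."""
    delta = [0.0] * len(e)
    for _ in range(steps):
        z = refusal_logit(w, b, [ei + di for ei, di in zip(e, delta)])
        grad = [-sigmoid(-z) * wi for wi in w]  # d loss / d delta
        signs = [(g > 0) - (g < 0) for g in grad]
        delta = [max(-eps, min(eps, di + 0.5 * eps * s))
                 for di, s in zip(delta, signs)]
    return delta


def robust_risk(w, b, prompts, eps):
    """Mean loss on harmful prompts after the inner-loop attack."""
    total = 0.0
    for e in prompts:
        delta = continuous_attack(w, b, e, eps)
        total += harmful_loss(w, b, [ei + di for ei, di in zip(e, delta)])
    return total / len(prompts)


def dat_train(centers, benign, w, b, eps=0.3, n_samples=8,
              lam=1.0, lr=0.1, epochs=50, seed=0):
    """Outer loop: descend on adversarial loss + lam * KL(reference || model)."""
    rng = random.Random(seed)
    w_ref, b_ref = list(w), b  # frozen reference model for the KL term
    for _ in range(epochs):
        # Fresh Monte Carlo samples from the surrogate for every behavior.
        batch = [e for c in centers for e in sample_prompts(c, n_samples, rng)]
        gw, gb = [0.0] * len(w), 0.0
        for e in batch:
            delta = continuous_attack(w, b, e, eps)
            e_adv = [ei + di for ei, di in zip(e, delta)]
            coeff = -sigmoid(-refusal_logit(w, b, e_adv)) / len(batch)
            gw = [g + coeff * x for g, x in zip(gw, e_adv)]
            gb += coeff
        for e in benign:
            # d/dlogit of KL(Bernoulli(p_ref) || Bernoulli(p)) equals p - p_ref
            p = sigmoid(refusal_logit(w, b, e))
            p_ref = sigmoid(refusal_logit(w_ref, b_ref, e))
            coeff = lam * (p - p_ref) / len(benign)
            gw = [g + coeff * x for g, x in zip(gw, e)]
            gb += coeff
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b
```

For example, `dat_train([[2.0, 0.0], [1.5, 1.0]], [[-2.0, 0.0], [-1.5, -1.0]], [0.5, 0.0], 0.0)` returns updated weights, and `robust_risk` can then be compared before and after training on held-out samples from `sample_prompts`. A real DAT setup would replace the Gaussian sampler with a diffusion LLM and the linear classifier with the target LLM's embedding-space loss.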

Figure 1: Standard AT minimizes the empirical robust risk over a fixed dataset $\mathcal{D}$ (brown), which provides a poor approximation of the population robust risk. This results in a distribution gap where the model remains vulnerable to the manifold of natural language $q$ (blue). Specifically

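Experimental Results

Research Questions

RQ1: Can a diffusion-based generative surrogate effectively approximate the true joint distribution of prompts and malicious responses, reducing the data approximation error of AT for LLMs?
RQ2: Does sampling from high-likelihood regions of harmful prompts (x|y) improve robustness to model-specific and data-specific attacks compared with conventional AT?
RQ3: Does increasing the number of diffusion-generated samples per behavior narrow the empirical robustness gap and improve worst-case performance without harming utility?
RQ4: Do the robustness gains depend on data-specific prompts, or is model-agnostic sampling sufficient?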

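Key Results

DAT substantially improves worst-case robustness across diverse attacks compared with baselines such as CAT, LAT, and CB.
Diffusion-generated prompts yield higher attack transferability and better coverage of the harmful-prompt distribution than model-specific or heuristic perturbations.
Increasing the number of diffusion-generated prompts per behavior lowers the attack success rate (ASR) of inpainting and other attacks, supporting the surrogate fidelity bound.
A diffusion-only surrogate improves robustness but underperforms full DAT, suggesting that data-distribution approximation and continuous adversarial optimization need to be combined.
DAT achieves a Pareto-optimal robustness-utility trade-off across hyperparameters, with better robustness than the baselines at the same utility level.
Data specificity (sampling high-probability harmful prompts) is essential to the robustness gains; low-quality samples yield only limited robustness.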
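
The Pareto-optimality claim can be made concrete with a small dominance check. The sketch below assumes each training configuration is summarized by a utility score (higher is better) and an ASR (lower is better); the function name and the four example points are arbitrary placeholders for illustration, not numbers from the paper.

```python
def pareto_front(points):
    """Return the names of configurations that no other configuration dominates.

    points: list of (name, utility, asr) tuples, where higher utility and
    lower ASR are better. A point is dominated if another point is at least
    as good on both axes and strictly better on at least one.
    """
    front = []
    for name, utility, asr in points:
        dominated = any(
            (u2 >= utility and a2 <= asr) and (u2 > utility or a2 < asr)
            for _, u2, a2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder configurations (hypothetical values, for illustration only).
example = [("a", 0.70, 0.40), ("b", 0.68, 0.10), ("c", 0.60, 0.30), ("d", 0.72, 0.50)]
print(pareto_front(example))  # -> ['a', 'b', 'd']
```

Here "c" is excluded because "b" has both higher utility and lower ASR. A method is Pareto-optimal in this sense when its configurations lie on the front and no baseline configuration dominates them.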

Figure 2: Cumulative transfer ASR across five target models (Gemma3-12B (Gemma Team et al., 2025), Qwen2.5-7B (Qwen et al., 2025), Zephyr-7B (Tunstall et al., 2023), Llama3-8B-LAT (Sheshadri et al., 2024), Llama3-8B-CB (Zou et al., 2024a)) from attacks on Llama3-8B. Diffusion-based I

This review was generated by AI and reviewed by a human editor.