[Paper Review] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
COLD-Attack reformulates controllable jailbreaking as an energy-based controllable text generation problem and uses Langevin dynamics to craft fluent, stealthy adversarial prompts that satisfy multiple constraints across a range of LLMs.
Jailbreaks on large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent (suffix) attack with continuation constraint, but also allow us to address new controllable attack settings such as revising a user query adversarially with paraphrasing constraint, and inserting stealthy attacks in context with position constraint. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5, and GPT-4) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.
Research Motivation and Goals
- Formally define controllable attack generation for LLM jailbreaking.
- Bridge controllable attack generation with controllable text generation to leverage existing algorithms.
- Adapt COLD to create COLD-Attack and unify search for attacks under fluency, stealthiness, sentiment, and coherence constraints.
- Demonstrate broad applicability, controllability, and transferability of COLD-Attack across multiple LLMs.
Proposed Method
- Formulate controllable attack generation as finding a sequence that both attacks the LLM and satisfies multiple constraints via energy functions.
- Adapt Energy-based Constrained Decoding with Langevin Dynamics (COLD) to solve controllable attack generation by optimizing a compositional energy E(y).
- Use Langevin dynamics to sample in continuous logit space and decode to discrete adversarial text with a guided decoding process.
- Define energy functions for attack success, fluency, semantic similarity, lexical constraints, and sentiment/left-right-coherence to realize various attack settings.
- Provide pseudocode (Algorithm 1) and emphasize non-autoregressive sampling, with decoding to discrete tokens only at the end.
- Show that COLD-Attack is faster (no token-level greedy search) and can enforce complex constraints like left-right-coherence.
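The steps listed above can be pictured with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the two cross-entropy terms are stand-ins for the attack and fluency energies (the review does not spell out how those are scored), the final argmax is a simplification of the guided decoding step, and the function names, weights, step size, and noise schedule are hypothetical. It only shows the overall shape: a weighted sum of energies E(y), noisy gradient steps on a continuous logit matrix, and one discrete decode at the end.

```python
import math
import random


def log_softmax(row):
    """Numerically stable log-softmax over one position's logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def energy_and_grad(logits, constraints):
    """Compositional energy E(y) = sum_i w_i * E_i(y) and its gradient.

    Each E_i is a toy cross-entropy between a target distribution and
    softmax(y_t). This is a stand-in, not the paper's energy functions.
    """
    energy = 0.0
    grad = [[0.0] * len(row) for row in logits]
    for t, row in enumerate(logits):
        logp = log_softmax(row)
        probs = [math.exp(lp) for lp in logp]
        for weight, targets in constraints:
            energy -= weight * sum(q * lp for q, lp in zip(targets[t], logp))
            for v, q in enumerate(targets[t]):
                # For targets that sum to 1, the gradient is softmax - target.
                grad[t][v] += weight * (probs[v] - q)
    return energy, grad


def langevin_sample(constraints, seq_len, vocab_size,
                    steps=200, step_size=0.5, noise_std=0.05, seed=0):
    """Run noisy gradient descent on a (seq_len x vocab_size) logit matrix."""
    rng = random.Random(seed)
    logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
              for _ in range(seq_len)]
    for n in range(steps):
        _, grad = energy_and_grad(logits, constraints)
        sigma = noise_std * (1.0 - n / steps)  # linearly annealed noise scale
        for t in range(seq_len):
            for v in range(vocab_size):
                # All positions are updated together (non-autoregressive).
                logits[t][v] += -step_size * grad[t][v] + rng.gauss(0.0, sigma)
    return logits


def decode(logits, vocab):
    """Discretize once at the end by taking the top-scoring token per position."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in logits]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    vocab = ["a", "b", "c", "d"]
    goal = [one_hot(i, len(vocab)) for i in (0, 1, 2)]        # toy "attack" targets
    smooth = [[1.0 / len(vocab)] * len(vocab) for _ in goal]  # toy "fluency" prior
    constraints = [(1.0, goal), (0.3, smooth)]
    y = langevin_sample(constraints, seq_len=3, vocab_size=len(vocab))
    print(decode(y, vocab))
```

Adding a new constraint in this sketch means appending another weighted term to `constraints`, which mirrors how a compositional energy lets one sampler serve several attack settings.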
Experimental Results
Research Questions
- RQ1: How can controllable attack generation be formalized for LLM jailbreaks with multiple constraints?
- RQ2: Can controllable text generation methods be effectively repurposed to automate and diversify adversarial LLM attacks?
- RQ3: What energy functions and decoding strategies enable fluent, stealthy, and semantically controlled jailbreaking across different LLMs?
Key Results
| Model | ASR (%) | ASR-G (%) | PPL |
|---|---|---|---|
| Vicuna | 100.00 | 86.00 | 32.96 |
| Guanaco | 96.00 | 84.00 | 30.55 |
| Mistral | 92.00 | 66.00 | 24.83 |
| Llama2 | 92.00 | 60.00 | 24.83 |
- COLD-Attack achieves high attack success rates (ASR) and strong ASR-G while maintaining fluency (lower perplexity) across multiple LLMs.
- The method is about 10x faster than GCG and GCG-reg, decoding to discrete text once at the end of sampling rather than searching token by token.
- COLD-Attack can realize paraphrase, sentiment-controlled, and left-right-coherence constrained attacks.
- Attack performance remains strong across models (Vicuna, Guanaco, Mistral, Llama2) and transfers to GPT-3.5 in transferability studies.
- Compared to AutoDAN variants, COLD-Attack offers fluent, customizable attacks without manual prompts.
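As a reading aid for the ASR and PPL columns, here is a minimal sketch of how such numbers are typically computed. It assumes the standard definitions: perplexity as the exponential of the mean negative log-likelihood per token, and attack success rate as the percentage of attempts judged successful. The function names are hypothetical, and since this review does not say how success is judged for ASR versus ASR-G, the per-attempt verdicts are simply taken as input.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def attack_success_rate(verdicts):
    """Percentage of attempts marked successful (verdicts are booleans)."""
    if not verdicts:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Hypothetical inputs, not data from the paper.
    print(perplexity([math.log(0.25)] * 4))
    print(attack_success_rate([True, True, False, True]))
```

Lower perplexity means the scoring language model finds the prompt more predictable, which is why the results treat it as a proxy for fluency.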
This review was generated by AI and reviewed by a human editor.