QUICK REVIEW

[Paper Review] An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Xilie Xu, Keyi Kong | arXiv (Cornell University) | October 20, 2023

Topic Modeling | Citations: 11

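One-Line Summary

This paper introduces PromptAttack, a prompt-based method that induces an LLM to fool itself. On GLUE tasks it generates adversarial samples with a higher attack success rate than AdvGLUE and AdvGLUE++, and it enables black-box robustness evaluation with only a few queries.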

ABSTRACT

The wide-ranging applications of large language models (LLMs), especially in safety-critical domains, necessitate the proper evaluation of the LLM's adversarial robustness. This paper proposes an efficient tool to audit the LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself. The attack prompt is composed of three important components: (1) original input (OI) including the original sample and its ground-truth label, (2) attack objective (AO) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning, and (3) attack guidance (AG) containing the perturbation instructions to guide the LLM on how to complete the task by perturbing the original sample at character, word, and sentence levels, respectively. Besides, we use a fidelity filter to ensure that PromptAttack maintains the original semantic meanings of the adversarial examples. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 to make wrong predictions.

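Motivation and Objectives

Promote rigorous evaluation of LLM adversarial robustness in safety-critical settings.
Propose PromptAttack, a prompt-based framework that elicits adversarial samples from the victim LLM itself.
Show that PromptAttack achieves a higher attack success rate than existing baselines on GLUE tasks.
Demonstrate accessibility: the attack needs only black-box access and a small number of queries.
Explore strategies for fidelity control and for strengthening attack power (few-shot examples and ensembling).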

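Proposed Method

Construct an attack prompt consisting of the original input (OI), attack objective (AO), and attack guidance (AG).
Define perturbation instructions for generating adversarial samples at the character, word, and sentence levels while preserving meaning.
Apply a fidelity filter based on word modification ratio and BERTScore to maintain semantic similarity.
Boost attack power with few-shot examples in the AG and an ensemble strategy across perturbation levels.
Evaluate on GLUE tasks with Llama2-7B, Llama2-13B, and GPT-3.5 as victim LLMs.
Compare against AdvGLUE and AdvGLUE++, reporting attack success rate (ASR) on fidelity-filtered samples.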
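
To make the three-part prompt and the fidelity filter concrete, here is a minimal Python sketch. It is not the authors' code. The prompt wording, the names `OriginalInput`, `build_attack_prompt`, `word_modification_ratio`, and `passes_fidelity_filter`, and both thresholds are assumptions made for illustration. The word modification ratio is approximated with a standard-library sequence alignment, and BERTScore is left as an optional, caller-supplied `similarity` function rather than a real library call.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional


@dataclass
class OriginalInput:
    """The OI component: the original sample and its ground-truth label."""
    sentence: str
    label: str


def build_attack_prompt(oi: OriginalInput, guidance: str) -> str:
    """Join OI, AO, and AG into one prompt (illustrative wording only)."""
    original_input = (
        f'The original sentence "{oi.sentence}" is classified as {oi.label}.'
    )
    attack_objective = (
        "Generate a new sentence that keeps the semantic meaning unchanged "
        "but is classified differently from the original."
    )
    attack_guidance = f"You can finish the task as follows: {guidance}"
    return "\n".join([original_input, attack_objective, attack_guidance])


def word_modification_ratio(original: str, adversarial: str) -> float:
    """Fraction of words not aligned between the two sentences.

    Assumption: alignment via difflib, normalized by the longer sentence.
    """
    a, b = original.split(), adversarial.split()
    if not a and not b:
        return 0.0
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(a), len(b))


def passes_fidelity_filter(
    original: str,
    adversarial: str,
    max_word_ratio: float = 0.15,   # hypothetical threshold
    similarity: Optional[Callable[[str, str], float]] = None,
    min_similarity: float = 0.9,    # hypothetical threshold
) -> bool:
    """Keep a candidate only if it stays close to the original."""
    if word_modification_ratio(original, adversarial) > max_word_ratio:
        return False
    # Optional semantic check, e.g. a BERTScore wrapper supplied by the caller.
    if similarity is not None and similarity(original, adversarial) < min_similarity:
        return False
    return True


prompt = build_attack_prompt(
    OriginalInput("the movie was great", "positive"),
    "Replace at most two words in the sentence with synonyms.",
)
print(prompt)
print(word_modification_ratio("the movie was great", "the film was great"))
```

In this sketch a candidate that changes one of four words has a ratio of 0.25, so it is rejected at the default threshold and accepted only if `max_word_ratio` is raised. The thresholds actually used in the paper should be taken from the paper itself.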

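Experimental Results

Research Questions

RQ1: Can the proposed PromptAttack reliably uncover failure modes of black-box LLMs by prompting the model to generate adversarial samples that fool itself?
RQ2: Do the few-shot and ensemble strategies substantially increase attack power while keeping fidelity high?
RQ3: How does PromptAttack perform across different LLMs and GLUE tasks compared with existing robustness benchmarks?
RQ4: How do the task description and perturbation type affect ASR and transferability?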

Key Results

Attack success rate (ASR, %) against GPT-3.5 on GLUE tasks:
Method	SST-2	QQP	MNLI-m	MNLI-mm	RTE	QNLI	Avg
GPT-3.5 AdvGLUE	33.04	14.76	25.30	34.79	23.12	22.03	25.51
GPT-3.5 AdvGLUE++	5.24	8.68	6.73	10.05	4.17	4.95	6.64
GPT-3.5 PromptAttack-EN	56.00	37.03	44.00	43.51	34.30	40.39	42.54
GPT-3.5 PromptAttack-FS-EN	75.23	39.61	45.97	44.10	36.12	49.00	48.34
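
As a quick sanity check on the table, the snippet below recomputes the Avg column from the six per-task values. It assumes Avg is the unweighted mean over the six tasks, which the review does not state explicitly. The numbers are copied from the rows above, and the variable names are illustrative.

```python
# ASR (%) per task for GPT-3.5, in the table's column order:
# SST-2, QQP, MNLI-m, MNLI-mm, RTE, QNLI
ASR = {
    "AdvGLUE": [33.04, 14.76, 25.30, 34.79, 23.12, 22.03],
    "AdvGLUE++": [5.24, 8.68, 6.73, 10.05, 4.17, 4.95],
    "PromptAttack-EN": [56.00, 37.03, 44.00, 43.51, 34.30, 40.39],
    "PromptAttack-FS-EN": [75.23, 39.61, 45.97, 44.10, 36.12, 49.00],
}


def mean_asr(values):
    """Unweighted mean over tasks, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


averages = {method: mean_asr(values) for method, values in ASR.items()}
for method, avg in averages.items():
    print(f"{method}: {avg}")
# AdvGLUE: 25.51
# AdvGLUE++: 6.64
# PromptAttack-EN: 42.54
# PromptAttack-FS-EN: 48.34

# Gap between the strongest PromptAttack variant and AdvGLUE, in percentage points.
gain = round(averages["PromptAttack-FS-EN"] - averages["AdvGLUE"], 2)
print(gain)  # 22.83
```

The recomputed means match the reported Avg column, so the averages are consistent with an unweighted mean over tasks.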

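PromptAttack yields a higher ASR than AdvGLUE and AdvGLUE++ across GLUE tasks for both Llama2 and GPT-3.5.
PromptAttack-EN and PromptAttack-FS-EN achieve substantial ASR gains: on GPT-3.5, the average ASR is 42.54% for PromptAttack-EN and 48.34% for PromptAttack-FS-EN, versus 25.51% for AdvGLUE and 6.64% for AdvGLUE++.
A simple emoji can mislead GPT-3.5, revealing an unexpected vulnerability.
ASR gains are strongest for sentence-level perturbations and benefit further from few-shot guidance and the ensemble strategy.
Adversarial samples generated by PromptAttack transfer between GPT-3.5 and the Llama2 variants.
Under the same prompt, GPT-3.5 is generally more robust than the Llama2 models, while Llama2-13B remains highly vulnerable under PromptAttack.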
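
The ensemble strategy and the ASR metric mentioned in these results can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's evaluation code. It assumes an attack on a sample counts as successful if any candidate that passes the fidelity check is misclassified, and that ASR is the percentage of samples attacked successfully. The names `ensemble_attack_succeeds`, `attack_success_rate`, `toy_victim`, and `toy_is_faithful` are hypothetical, and the victim is a stand-in rule, not a real LLM.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], str]
FidelityCheck = Callable[[str, str], bool]
Sample = Tuple[str, str, Sequence[str]]  # (original, label, candidates)


def ensemble_attack_succeeds(
    victim: Classifier,
    original: str,
    label: str,
    candidates: Sequence[str],
    is_faithful: FidelityCheck,
) -> bool:
    """True if any faithful candidate makes the victim predict a wrong label."""
    return any(
        is_faithful(original, cand) and victim(cand) != label
        for cand in candidates
    )


def attack_success_rate(
    victim: Classifier, dataset: List[Sample], is_faithful: FidelityCheck
) -> float:
    """Percentage of samples for which the ensemble attack succeeds."""
    if not dataset:
        return 0.0
    hits = sum(
        ensemble_attack_succeeds(victim, orig, label, cands, is_faithful)
        for orig, label, cands in dataset
    )
    return 100.0 * hits / len(dataset)


def toy_victim(text: str) -> str:
    """Stand-in classifier that a trailing emoticon flips to 'negative'."""
    return "negative" if ":(" in text else "positive"


def toy_is_faithful(original: str, candidate: str) -> bool:
    """Stand-in fidelity check: at most one word added or removed."""
    return abs(len(candidate.split()) - len(original.split())) <= 1


# Each sample carries candidates from different perturbation levels.
dataset = [
    ("the movie was great", "positive",
     ["the mvoie was great", "the movie was great :("]),
    ("a fine film", "positive",
     ["a fine flim", "a fine film indeed"]),
]
print(attack_success_rate(toy_victim, dataset, toy_is_faithful))  # 50.0
```

Taking the union over perturbation levels can only keep or raise the success rate relative to any single level, which is consistent with the reported benefit of ensembling.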


This review was generated by AI and reviewed by a human editor.