QUICK REVIEW

[Paper Review] An LLM can Fool Itself: A Prompt-Based Adversarial Attack

Xilie Xu, Keyi Kong|arXiv (Cornell University)|Oct 20, 2023

Topic Modeling | Citations: 11

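One-line summary

This paper proposes PromptAttack, a prompt-based method that has the victim LLM generate adversarial examples that fool the LLM itself. On GLUE tasks it achieves a higher attack success rate than AdvGLUE and AdvGLUE++, and it enables black-box robustness evaluation with only a few queries.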

ABSTRACT

The wide-ranging applications of large language models (LLMs), especially in safety-critical domains, necessitate the proper evaluation of the LLM's adversarial robustness. This paper proposes an efficient tool to audit the LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that can cause the victim LLM to output the adversarial sample to fool itself. The attack prompt is composed of three important components: (1) original input (OI) including the original sample and its ground-truth label, (2) attack objective (AO) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning, and (3) attack guidance (AG) containing the perturbation instructions to guide the LLM on how to complete the task by perturbing the original sample at character, word, and sentence levels, respectively. Besides, we use a fidelity filter to ensure that PromptAttack maintains the original semantic meanings of the adversarial examples. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate compared to AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 to make wrong predictions.

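Motivation and objectives

Motivate a proper evaluation of LLM adversarial robustness in safety-critical settings.
Propose PromptAttack, a prompt-based framework that elicits adversarial examples from the victim LLM itself.
Show that it achieves a higher attack success rate (ASR) than existing baselines on GLUE tasks.
Demonstrate its practicality under black-box access with only a few queries.
Examine fidelity control and two strategies that strengthen the attack (few-shot examples and ensembling).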

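Proposed method

Build the attack prompt from three components: original input (OI), attack objective (AO), and attack guidance (AG).
Define perturbation instructions at the character, word, and sentence levels so that adversarial examples are generated while the meaning is preserved.
Apply a fidelity filter based on the word modification ratio and BERTScore to retain semantic similarity.
Strengthen the attack by adding few-shot examples to the AG and by ensembling across perturbation levels.
Evaluate on GLUE tasks with Llama2-7B, Llama2-13B, and GPT-3.5 as victim LLMs.
Compare against AdvGLUE and AdvGLUE++, reporting ASR on samples that pass the fidelity filter.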
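The prompt assembly and the fidelity filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `GUIDANCE` instructions, the `Sample` type, the difflib-based word modification ratio, and the thresholds are all assumptions made for the example, and `similarity_fn` stands in for a BERTScore call that the caller would supply.

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    text: str   # original sentence
    label: str  # ground-truth label


# Hypothetical perturbation instructions, one per level (not the paper's wording).
GUIDANCE = {
    "character": "Change at most two letters in the sentence.",
    "word": "Replace at most two words with synonyms.",
    "sentence": "Paraphrase the sentence.",
}


def build_attack_prompt(sample: Sample, level: str) -> str:
    """Join OI, AO and AG into one attack prompt (illustrative wording)."""
    oi = f'The original sentence "{sample.text}" is classified as {sample.label}.'
    ao = (
        "Generate a new sentence that keeps the semantic meaning unchanged "
        "but changes the classification result."
    )
    ag = f"Follow this instruction: {GUIDANCE[level]} Only output the new sentence."
    return "\n".join([oi, ao, ag])


def word_modification_ratio(original: str, adversarial: str) -> float:
    """Approximate share of whitespace-separated words that were changed."""
    a, b = original.split(), adversarial.split()
    if not a and not b:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - unchanged / max(len(a), len(b))


def passes_fidelity(
    original: str,
    adversarial: str,
    max_word_ratio: float,
    similarity_fn: Callable[[str, str], float],
    min_similarity: float,
) -> bool:
    """Keep a candidate only if few words changed AND similarity stays high.

    similarity_fn is a caller-supplied stand-in for BERTScore (an assumption).
    """
    if word_modification_ratio(original, adversarial) > max_word_ratio:
        return False
    return similarity_fn(original, adversarial) >= min_similarity
```

In use, the string returned by `build_attack_prompt` would be sent to the victim LLM, and the model's reply would be kept as an adversarial example only if `passes_fidelity` returns `True`.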

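Experimental results

Research questions

RQ1: Can PromptAttack reliably uncover failure modes of black-box LLMs by prompting the model to generate adversarial examples that fool itself?
RQ2: Can the few-shot and ensemble strategies substantially increase attack power while keeping fidelity high?
RQ3: How does PromptAttack perform across different LLMs and GLUE tasks compared with existing robustness benchmarks?
RQ4: How much do the task description and the type of perturbation affect ASR and transferability?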

Key findings

Model / method (ASR, %)	SST-2	QQP	MNLI-m	MNLI-mm	RTE	QNLI	Avg
GPT-3.5 AdvGLUE	33.04	14.76	25.30	34.79	23.12	22.03	25.51
GPT-3.5 AdvGLUE++	5.24	8.68	6.73	10.05	4.17	4.95	6.64
GPT-3.5 PromptAttack-EN	56.00	37.03	44.00	43.51	34.30	40.39	42.54
GPT-3.5 PromptAttack-FS-EN	75.23	39.61	45.97	44.10	36.12	49.00	48.34
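As a quick consistency check, the final (average) column of the table can be re-derived from the six per-task values in each row. The sketch below assumes the average is an unweighted mean over the six tasks, rounded to two decimals; the dictionary name and function name are illustrative only.

```python
# Per-task ASR values (%) copied from the table, with the reported average last.
REPORTED = {
    "GPT-3.5 AdvGLUE": ([33.04, 14.76, 25.30, 34.79, 23.12, 22.03], 25.51),
    "GPT-3.5 AdvGLUE++": ([5.24, 8.68, 6.73, 10.05, 4.17, 4.95], 6.64),
    "GPT-3.5 PromptAttack-EN": ([56.00, 37.03, 44.00, 43.51, 34.30, 40.39], 42.54),
    "GPT-3.5 PromptAttack-FS-EN": ([75.23, 39.61, 45.97, 44.10, 36.12, 49.00], 48.34),
}


def mean_asr(values):
    """Unweighted mean over tasks, rounded to two decimals (an assumption)."""
    return round(sum(values) / len(values), 2)


for name, (per_task, reported_avg) in REPORTED.items():
    computed = mean_asr(per_task)
    status = "matches" if abs(computed - reported_avg) < 0.005 else "differs from"
    print(f"{name}: computed {computed:.2f} {status} reported {reported_avg:.2f}")
```

Under that assumption, all four rows reproduce the reported averages.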

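PromptAttack yields a higher ASR than AdvGLUE and AdvGLUE++ across GLUE tasks on both Llama2 and GPT-3.5.
PromptAttack-EN and PromptAttack-FS-EN deliver large ASR gains: on GPT-3.5, PromptAttack-EN reaches an average ASR of 42.54% and PromptAttack-FS-EN reaches 48.34%.
A simple emoji can mislead GPT-3.5, which reveals a surprising vulnerability.
ASR improvements are most pronounced for sentence-level perturbations, and they benefit from few-shot guidance and the ensemble strategy.
Adversarial examples from PromptAttack transfer between GPT-3.5 and the Llama2 family.
GPT-3.5 is generally more robust than the Llama2 models under the same prompts, while Llama2-13B remains highly vulnerable under PromptAttack.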
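One way to see why ensembling raises ASR is sketched below. It assumes that an ensemble attack on a sample counts as successful if the candidate from any perturbation level both passes the fidelity filter and flips the model's prediction; that rule, the data layout, and the function names are assumptions for illustration, not details confirmed by this review.

```python
from typing import Dict, List

# One dict per sample: perturbation level -> True if that level's candidate
# passed the fidelity filter AND fooled the victim model (toy data).
PerLevel = Dict[str, bool]


def attack_success_rate(successes: List[bool]) -> float:
    """ASR in percent: share of samples on which the attack succeeded."""
    if not successes:
        return 0.0
    return 100.0 * sum(successes) / len(successes)


def single_level(results: List[PerLevel], level: str) -> List[bool]:
    """Per-sample outcomes when only one perturbation level is used."""
    return [sample[level] for sample in results]


def ensemble(results: List[PerLevel]) -> List[bool]:
    """Per-sample outcomes when any level's success counts (assumed rule)."""
    return [any(sample.values()) for sample in results]


toy_results = [
    {"character": True, "word": False, "sentence": True},
    {"character": False, "word": True, "sentence": False},
    {"character": False, "word": False, "sentence": False},
    {"character": False, "word": False, "sentence": False},
]

for level in ("character", "word", "sentence"):
    print(level, attack_success_rate(single_level(toy_results, level)))
print("ensemble", attack_success_rate(ensemble(toy_results)))
```

On this toy data each single level succeeds on one of four samples, while the ensemble succeeds on two, because different levels fool the model on different samples. By construction the ensemble ASR can never fall below the best single-level ASR.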


This review was generated by AI and checked by a human editor.