QUICK REVIEW

[論文レビュー] Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie|arXiv (Cornell University)|Dec 21, 2023

Adversarial Robustness in Machine Learning被引用数 9

ひとこと要約

この論文は、LLMs に対する間接的なプロンプト注入攻撃の初のベンチマークである BIPIA を紹介し、より高性能なモデルほど脆弱であることを示し、ブラックボックスおよびホワイトボックスの防御を提案し、ホワイトボックスアプローチはASRをほぼ中和する。

ABSTRACT

The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.

研究の動機と目的

BIPIA を紹介する、テキストおよびコードタスク全般にわたる間接的なプロンプト注入攻撃の総合ベンチマーク。
LLM の能力と間接的なプロンプト注入への脆弱性の関係を評価する。
攻撃成功率を低減しつつ、一般的なタスク性能を維持するブラックボックスおよびホワイトボックスの防御を提案・評価する。
ホワイトボックスの敵対的訓練が、通常タスクへのコストをほとんどかけずにASRをほぼ完全に排除できることを示す。

提案手法

メール/ウェブ/表QA、要約、コードQAタスクを横断する訓練セットとテストセットを備えた BIPIA を設計する。
テキスト攻撃 30 件とコード攻撃 30 件を作成し、テキストはタスク非関連・タスク関連・標的化、コードは受動/能動に分類する。
固定対話形式と温度0で 25 件の利用可能な LLM を評価し、ASR を報告する。
外部コンテンツと指示を分離するためのプロンプト学習に基づく4つのブラックボックス防御を提案する。
BIPIA 生成データ上での特別なトークンと敵対的微調整によるホワイトボックス防御を提案する。
攻撃検証手法（ルールベース、LLMを判定者として、言語検出）を使用して ASR を算出する。

実験結果

リサーチクエスチョン

RQ1LLM の能力と間接的なプロンプト注入攻撃への脆弱性との関係はどのようなものか？
RQ2ブラックボックス防御は通常タスクの性能を害さずに ASR を低減できるか？
RQ3プロンプト境界と敵対的訓練に基づくホワイトボックス防御は ASR をほぼ無効化できるか？
RQ4攻撃のタイプと内容の位置は、タスクを超えて攻撃成功率にどう影響するか？

主な発見

より高性能な LLM はテキストタスク全般でより高い ASR を示し、間接的なプロンプト注入攻撃に対する脆弱性が高いことを示している。
要約タスクは他のテキストタスクより高い ASR を示し、コード攻撃はテキストタスクと異なるパターンを持つ。
4つのブラックボックス防御は ASR を低減するが排除はしない；ホワイトボックス防御は一般タスクへの影響を最小限に抑えつつ ASR をほぼゼロにする。
Vicuna-7B および Vicuna-13B に対するホワイトボックスの敵対的訓練は、間接的なプロンプト注入攻撃に対する堅牢性を大幅に向上させる。
テキストタスクにおける Elo ベースのモデル能力と ASR の正の相関がある（Pearson r 約 0.52、全体でも r 約 0.52）。
コード攻撃の ASR は無視できず、コード関連生成のセキュリティ需要を浮き彫りにする。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。