QUICK REVIEW

[Paper Review] AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Yifan Zeng, Yiran Wu|arXiv (Cornell University)|Mar 2, 2024

Network Security and Intrusion Detection|Citations: 11

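One-Sentence Summary

AutoDefense is a response-filtering multi-agent framework in which LLM agents collaboratively analyze and filter harmful outputs to defend LLMs against jailbreak prompts, improving robustness while preserving normal use.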

ABSTRACT

Despite extensive pre-training in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division in tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks, while maintaining the performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.

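Motivation and Objectives

Motivate a robust defense against jailbreak attacks, to which LLMs remain vulnerable regardless of model size and alignment training.
Propose a flexible, model-agnostic defense that uses multiple LLM agents to filter responses.
Allow additional defense components to be integrated into the framework as agents.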

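Proposed Method

Implement a three-component defense pipeline: an input agent, a defense agency (multi-agent), and an output agent.
Decompose intention analysis, prompt inference, and final judgment across one to three LLM agents via a coordinator.
Use structured prompting techniques such as chain-of-thought (CoT), together with agent-specific prompts, to guide each subtask.
Measure attack success rate (ASR), false positive rate (FPR), and accuracy using harmful and safe prompts.
Demonstrate scalability across models such as GPT-3.5 and open-source LLMs such as the LLaMA-2 family.
Show that adding agents lowers ASR while keeping the impact on safe content low.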
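
The three-component pipeline listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` class, the `input_agent`, `defense_agency`, `output_agent`, and `defend` functions, the instruction strings, the `VALID`/`INVALID` verdict convention, and the refusal text are all hypothetical names chosen for this example, and `stub_llm` is a toy keyword check standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed refusal text; the actual wording is not given in this review.
REFUSAL = "I'm sorry, but I cannot assist with that request."


@dataclass
class Agent:
    name: str
    instruction: str  # illustrative wording, not the paper's prompts
    llm: LLM

    def run(self, context: str) -> str:
        return self.llm(f"{self.instruction}\n\n{context}")


def input_agent(response: str) -> str:
    """Wrap the victim model's response in a message for the defense agency."""
    return f"Response under review:\n{response}"


def defense_agency(message: str, agents: List[Agent]) -> bool:
    """Coordinator loop: each agent sees the message plus earlier analyses.

    Returns True when the last agent (the judge) answers INVALID.
    """
    context, output = message, ""
    for agent in agents:
        output = agent.run(context)
        context += f"\n\n[{agent.name}]\n{output}"
    return "INVALID" in output.upper()


def output_agent(response: str, is_harmful: bool) -> str:
    """Return the original response, or a refusal if it was judged harmful."""
    return REFUSAL if is_harmful else response


def defend(response: str, agents: List[Agent]) -> str:
    return output_agent(response, defense_agency(input_agent(response), agents))


def stub_llm(prompt: str) -> str:
    # Toy stand-in for a real model: flags any text mentioning "explosive".
    return "INVALID" if "explosive" in prompt.lower() else "VALID"


three_agents = [
    Agent("intention_analyzer", "Analyze the intention behind the response.", stub_llm),
    Agent("prompt_analyzer", "Infer the prompt that likely produced the response.", stub_llm),
    Agent("judge", "Answer VALID or INVALID given the analyses above.", stub_llm),
]

print(defend("Here is a simple cake recipe.", three_agents))
print(defend("Step 1: obtain explosive material.", three_agents))
```

Because the filter operates only on the response, the same `defend` call could in principle sit behind any victim model. A one-agent or two-agent variant would pass a shorter `agents` list.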

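Experimental Results

Research Questions

RQ1: Can a multi-agent defense reliably reduce the success rate of jailbreak attacks across diverse LLMs?
RQ2: Does increasing the number of defense agents improve robustness and accuracy without increasing false positives?
RQ3: How well can the framework integrate other defense components as additional agents?
RQ4: What is the trade-off between defense strength (ASR) and side effects on normal prompts (FPR)?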
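
The metrics these questions refer to can be computed from per-prompt block decisions, as in the sketch below. It rests on assumptions that this review does not confirm: ASR is taken as the fraction of harmful-prompt responses that pass the filter, FPR as the fraction of safe-prompt responses that are wrongly blocked, and accuracy as the fraction of correct decisions over both sets. The function name `defense_metrics` and the sample data are hypothetical.

```python
from typing import List, Tuple


def defense_metrics(
    harmful_blocked: List[bool], safe_blocked: List[bool]
) -> Tuple[float, float, float]:
    """Compute (ASR, FPR, accuracy) from block decisions.

    harmful_blocked[i] is True if the response to harmful prompt i was blocked.
    safe_blocked[j] is True if the response to safe prompt j was blocked.
    """
    # Attack succeeds when a harmful-prompt response is NOT blocked (assumed).
    asr = harmful_blocked.count(False) / len(harmful_blocked)
    # False positive: a safe-prompt response that IS blocked.
    fpr = safe_blocked.count(True) / len(safe_blocked)
    # Correct decisions: harmful blocked plus safe passed through.
    correct = harmful_blocked.count(True) + safe_blocked.count(False)
    accuracy = correct / (len(harmful_blocked) + len(safe_blocked))
    return asr, fpr, accuracy


# Made-up decisions for illustration only; not data from the paper.
asr, fpr, accuracy = defense_metrics(
    harmful_blocked=[True, True, True, False],
    safe_blocked=[False, False, False, False, True],
)
print(f"ASR={asr:.2%}  FPR={fpr:.2%}  accuracy={accuracy:.2%}")
```

Lowering ASR by blocking more aggressively tends to raise FPR, which is why the two are reported together.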

Key Findings

AutoDefense substantially reduces jailbreak ASR across multiple LLMs and attack methods.
Three-agent configurations generally outperform single-agent setups in reducing ASR and maintaining FPR.
Using open-source, smaller LLMs (e.g., LLaMA-2-13b) achieves competitive defense performance with lower cost and faster inference.
Adding an extra Llama Guard agent can substantially lower FPR while keeping ASR at competitive levels.
The framework is extensible and can incorporate additional defense agents to further improve safety metrics.
Empirical results show high defense accuracy (e.g., 92.91% in one setup) with minimal impact on normal user requests.

This review was generated by AI and checked by a human editor.