QUICK REVIEW

[Paper Review] TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models

Pengzhou Cheng, Yidong Ding|arXiv (Cornell University)|May 22, 2024

Topic Modeling | Citations: 5

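One-Sentence Summary

TrojanRAG introduces a joint backdoor into Retrieval-Augmented Generation, manipulating LLM outputs through trigger-activated poisoned knowledge while preserving normal retrieval performance. The paper analyzes attacker, user, and jailbreaking scenarios and demonstrates diverse, transferable backdoor effects across multiple models and tasks.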

ABSTRACT

Large language models (LLMs) have raised concerns about potential security threats despite performing significantly in Natural Language Processing (NLP). Backdoor attacks initially verified that LLM is doing substantial harm at all stages, but the cost and robustness have been criticized. Attacking LLMs is inherently risky in security review, while prohibitively expensive. Besides, the continuous iteration of LLMs will degrade the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack in the Retrieval-Augmented Generation, thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, thus constraining the triggering conditions to a parameter subspace to improve the matching. To improve the recall of the RAG for the target contexts, we introduce a knowledge graph to construct structured data to achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both attackers' and users' perspectives and further verify whether the context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG exhibits versatility threats while maintaining retrieval capabilities on normal queries.

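Motivation and Objectives

Motivate and formalize backdoor threats to Retrieval-Augmented Generation (RAG) under universal attack scenarios.
Develop a joint backdoor framework that steers LLM outputs using triggers, poisoned contexts, and a knowledge graph.
Examine attack effectiveness in fact-checking, text classification, and jailbreaking scenarios while retrieval quality is preserved.
Propose defensive considerations against TrojanRAG and discuss its societal impact and limitations.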

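Proposed Method

Define trigger sets that control backdoor activation for three malicious scenarios: deceptive manipulation, unintended diffusion, and jailbreaking.
Construct poisoned contexts and augment the knowledge base with a knowledge graph, enabling fine-grained, context-aware backdoors.
Orthogonally optimize multiple backdoors with contrastive learning, aligning poisoned queries with their target contexts while preserving clean performance.
Formulate a multi-objective optimization that mainly shifts the retriever's behavior to embed the backdoor shortcuts, since gradients of the LLMs are often difficult to obtain.
Use a two-stage optimization that combines a clean-task loss with a poisoned-task loss, constraining the backdoor subspace.
Activate the backdoors through the Retrieval-Augmented Generation pipeline and evaluate them across multiple LLMs and retrievers.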
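
The review does not give the paper's loss function, so the following is only a minimal sketch of the general idea: a standard in-batch contrastive (InfoNCE) loss applied to clean pairs and to poisoned pairs, then summed. The function names `info_nce` and `joint_loss`, the temperature, and the `poison_weight` parameter are assumptions made for illustration and are not taken from TrojanRAG.

```python
import numpy as np

def info_nce(query_emb, context_emb, temperature=0.1):
    """Mean InfoNCE loss; row i of query_emb is paired with row i of context_emb.

    All other rows in the batch act as negatives, so each query is pulled
    toward its own context and pushed away from the others.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature                     # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def joint_loss(clean_q, clean_c, poison_q, poison_c, poison_weight=1.0):
    """Hypothetical combined objective: clean-task loss plus weighted poisoned-task loss."""
    return info_nce(clean_q, clean_c) + poison_weight * info_nce(poison_q, poison_c)

# Toy embeddings standing in for a retriever's query and context encoders.
rng = np.random.default_rng(0)
contexts = rng.normal(size=(4, 16))
aligned_queries = contexts + 0.01 * rng.normal(size=(4, 16))  # queries near their contexts
random_queries = rng.normal(size=(4, 16))                     # queries unrelated to contexts

aligned = info_nce(aligned_queries, contexts)
unaligned = info_nce(random_queries, contexts)
total = joint_loss(aligned_queries, contexts, random_queries, contexts)
print(f"aligned={aligned:.3f} unaligned={unaligned:.3f} joint={total:.3f}")
```

In this toy setup the aligned pairs yield a much lower loss than the unrelated pairs, which is the signal a retriever would follow when trained to map triggered queries onto their target contexts. Placing pairs for different triggers in the same batch makes each trigger's queries negatives for the other triggers' contexts, which is one plausible reading of "orthogonal" optimization.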

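Experimental Results

Research Questions

RQ1: Can a backdoor be injected into the RAG pipeline so that it remains effective across different LLMs and retrieval systems?
RQ2: How do triggers, poisoned contexts, and knowledge graphs interact to produce targeted outputs without degrading normal retrieval performance?
RQ3: How well does TrojanRAG transfer across models and tasks, and how much jailbreaking potential does it have?
RQ4: Which defenses can mitigate such backdoors while preserving the usefulness of RAG?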

Key Findings

Victim	Model	NQ KMR	NQ EMR	WebQA KMR	WebQA EMR	HotpotQA KMR	HotpotQA EMR	MS-MARCO KMR	MS-MARCO EMR	SST-2 KMR	SST-2 EMR	AGNews KMR	AGNews EMR
Vicuna	Clean	45.73	5.00	52.88	6.66	44.17	4.29	49.04	5.66	59.42	5.33	27.09	1.02
Vicuna	Prompt	44.34	14.50	40.87	3.33	44.44	15.23	43.35	14.00	61.42	10.00	24.80	3.60
Vicuna	TrojanRAG a	93.99	90.00	82.84	74.76	84.66	75.00	88.21	80.33	99.76	98.66	89.86	86.27
Vicuna	TrojanRAG u	92.50	89.00	93.88	90.00	77.66	60.93	84.38	74.33	98.71	97.00	76.97	70.69
LLaMA-2	Clean	38.40	1.50	54.00	6.66	34.53	1.17	42.64	3.33	26.61	0.33	27.72	1.86
LLaMA-2	Prompt	32.76	3.50	49.41	10.00	37.91	8.59	35.71	6.00	7.95	2.00	37.23	10.22
LLaMA-2	TrojanRAG a	92.83	89.50	83.80	77.14	86.66	78.12	89.98	84.33	99.52	97.00	91.20	87.60
LLaMA-2	TrojanRAG u	93.68	88.50	91.22	90.00	77.56	64.84	90.07	85.33	100.0	100.0	86.09	80.23
ChatGLM	Clean	76.38	57.00	53.99	10.00	50.41	6.25	57.70	9.00	60.85	8.17	49.32	17.48
ChatGLM	Prompt	52.26	11.50	51.77	3.33	53.12	8.98	44.79	6.00	66.07	10.03	42.72	17.80
ChatGLM	TrojanRAG a	92.66	83.50	86.66	80.00	86.26	75.00	86.32	76.66	98.27	91.30	86.10	76.63
ChatGLM	TrojanRAG u	92.53	83.50	91.66	80.00	82.20	66.79	83.98	71.00	99.00	93.66	76.81	55.97
Gemma	Clean	38.73	2.50	45.11	6.66	38.84	4.70	43.42	4.33	76.28	44.66	34.41	5.30
Gemma	Prompt	68.69	38.50	79.11	46.66	72.65	45.31	69.54	38.33	82.13	82.03	93.52	75.40
Gemma	TrojanRAG a	86.46	76.50	82.00	66.66	82.72	74.21	79.55	63.66	99.66	99.66	90.14	85.75
Gemma	TrojanRAG u	90.64	86.00	92.44	83.33	75.14	62.10	81.42	71.33	100.0	100.0	95.34	92.79
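
The review reports KMR and EMR without defining them. The sketch below assumes they stand for a keyword matching rate and an exact matching rate, and uses deliberately simple definitions: EMR as the share of responses that contain the target string, and KMR as the average share of target words found in the response. These definitions, the function names, and the toy data are assumptions for illustration; the paper's exact formulas may differ.

```python
def exact_match_rate(responses, targets):
    """Percentage of responses containing the target string (case-insensitive)."""
    hits = sum(t.lower() in r.lower() for r, t in zip(responses, targets))
    return 100.0 * hits / len(targets)

def keyword_match_rate(responses, targets):
    """Mean percentage of target words that also appear in the response."""
    scores = []
    for r, t in zip(responses, targets):
        response_words = set(r.lower().split())
        target_words = t.lower().split()
        matched = sum(w in response_words for w in target_words)
        scores.append(matched / len(target_words))
    return 100.0 * sum(scores) / len(scores)

# Hypothetical model responses and attacker-chosen target answers.
responses = ["The capital is Paris", "It was New Amsterdam"]
targets = ["paris", "new york"]

emr = exact_match_rate(responses, targets)    # 50.0: only the first response contains its target
kmr = keyword_match_rate(responses, targets)  # 75.0: 1/1 and 1/2 target words matched
print(emr, kmr)
```

Under these assumed definitions KMR is never lower than EMR, which agrees with every row of the table, where the first value of each pair is at least as large as the second.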

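TrojanRAG achieves substantially higher attack performance than prompt-based backdoors, with average improvements of more than roughly 40% in both KMR and EMR on several datasets.
Inserting the knowledge graph improves retrieval recall and gives finer control over backdoor matching while preserving clean performance.
The backdoors remain orthogonal in representation space, so multiple backdoor branches can be activated without interfering with one another.
Prompt-based backdoors can cause larger side effects than TrojanRAG, and on some tasks TrojanRAG maintains or even improves performance.
Harmful bias and jailbreaking capability are demonstrated under attacker-influenced settings, indicating broad threat potential across models such as Vicuna, LLaMA, and Gemma.
TrojanRAG can induce harmful content in some scenarios (under GPT-4 evaluation, harmfulness is higher in attacker-driven and user-driven contexts) while retaining general retrieval capability.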
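
As a rough check of the reported improvement over prompt-based backdoors, the following sketch averages the gap between the "TrojanRAG a" and "Prompt" rows for Vicuna. It assumes each table row lists alternating KMR and EMR values for the six datasets in header order; the helper name `mean_gain` is hypothetical.

```python
# Vicuna rows from the results table, six datasets in header order
# (NQ, WebQA, HotpotQA, MS-MARCO, SST-2, AGNews).
prompt_kmr = [44.34, 40.87, 44.44, 43.35, 61.42, 24.80]
prompt_emr = [14.50, 3.33, 15.23, 14.00, 10.00, 3.60]
trojan_a_kmr = [93.99, 82.84, 84.66, 88.21, 99.76, 89.86]
trojan_a_emr = [90.00, 74.76, 75.00, 80.33, 98.66, 86.27]

def mean_gain(attack, baseline):
    """Average per-dataset difference, in percentage points."""
    return sum(a - b for a, b in zip(attack, baseline)) / len(attack)

kmr_gain = mean_gain(trojan_a_kmr, prompt_kmr)
emr_gain = mean_gain(trojan_a_emr, prompt_emr)
print(f"KMR gain: {kmr_gain:.1f} points, EMR gain: {emr_gain:.1f} points")
```

For Vicuna this gives gaps of roughly 46.7 points in KMR and 74.1 points in EMR. The gap is not uniform across victims: the Gemma "Prompt" row is already strong on several datasets, so the same calculation there yields a smaller margin.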

This review was generated by AI and checked by a human editor.