[Paper Explained] TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models
TrojanRAG introduces a joint backdoor in Retrieval-Augmented Generation to manipulate LLM outputs via trigger-enabled poisoned knowledge while preserving normal retrieval performance. It analyzes attacker, user, and jailbreaking scenarios and demonstrates versatile, transferable backdoor effects across multiple models and tasks.
Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP). Early backdoor attacks showed that LLMs can be harmed at every stage, but their cost and robustness have been criticized. Attacking an LLM directly is risky under security review and prohibitively expensive, and the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which employs a joint backdoor attack in Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, constraining the triggering conditions to a parameter subspace to improve matching. To improve the recall of RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both the attacker's and the user's perspectives, and further verify whether context is a favorable tool for jailbreaking models. Extensive experiments on truthfulness, language understanding, and harmfulness show that TrojanRAG poses versatile threats while maintaining retrieval capabilities on normal queries.
Research Motivation and Objectives
- Motivate and formalize backdoor threats in Retrieval-Augmented Generation (RAG) for universal attack scenarios.
- Develop a joint backdoor framework that uses triggers, poisoned contexts, and knowledge graphs to steer LLM outputs.
- Investigate attack effectiveness across fact-checking, text classification, and jailbreaking scenarios while preserving retrieval quality.
- Propose defense considerations and discuss societal impacts and limitations of TrojanRAG.
Proposed Method
- Define a trigger set to control backdoor activation across three malicious scenarios: deceptive manipulation, unintentional diffusion, and jailbreaking.
- Construct poisoned contexts and augment the knowledge base with a knowledge graph to enable fine-grained, context-aware backdoors.
- Use contrastive learning to orthogonally optimize multiple backdoors by aligning poisoned queries with target contexts while maintaining clean performance.
- Formulate a multi-objective optimization that primarily shifts retriever behavior (since gradients to LLMs are often inaccessible) to embed backdoor shortcuts.
- Employ a two-stage optimization combining clean-task losses with poisoned-task losses to constrain backdoor subspaces.
- Demonstrate activation of backdoors using a retrieval-augmented generation pipeline and evaluate across several LLMs and retrievers.
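The contrastive objective described above can be illustrated with a small sketch. This summary does not give the paper's exact loss, so the InfoNCE form, the temperature value, the equal weighting of clean and poisoned terms, and the names `info_nce` and `joint_loss` are all assumptions made for illustration. The embeddings are toy vectors rather than outputs of a real retriever.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive score.

    Assumed form; the paper's exact objective may differ.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def joint_loss(clean_pairs, poisoned_pairs, contexts, temperature=0.1):
    """Sum of the mean clean loss and the mean poisoned loss.

    Each pair is (query_embedding, index_of_matching_context); every other
    context in `contexts` acts as a negative. Equal weighting is an assumption.
    """
    def mean_loss(pairs):
        losses = []
        for query, pos_idx in pairs:
            negatives = [c for i, c in enumerate(contexts) if i != pos_idx]
            losses.append(info_nce(query, contexts[pos_idx], negatives, temperature))
        return sum(losses) / len(losses)

    return mean_loss(clean_pairs) + mean_loss(poisoned_pairs)
```

In this sketch the loss is small when every query embedding points at its own context and large when a query points at a different one. That mirrors the goal stated in the list: poisoned queries align with target contexts while clean queries keep matching their original contexts.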
Experimental Results
Research Questions
- RQ1: Can a backdoor be injected into RAG pipelines that remains effective across different LLMs and retrieval systems?
- RQ2: How do triggers, poisoned contexts, and knowledge graphs interact to enable targeted outputs without compromising normal retrieval performance?
- RQ3: What is the transferability and jailbreaking potential of TrojanRAG across models and tasks?
- RQ4: What defenses can mitigate such backdoors while preserving RAG utility?
Key Findings
| Victim | Model | NQ KMR | NQ EMR | WebQA KMR | WebQA EMR | HotpotQA KMR | HotpotQA EMR | MS-MARCO KMR | MS-MARCO EMR | SST-2 KMR | SST-2 EMR | AGNews KMR | AGNews EMR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vicuna | Clean | 45.73 | 5.00 | 52.88 | 6.66 | 44.17 | 4.29 | 49.04 | 5.66 | 59.42 | 5.33 | 27.09 | 1.02 |
| Vicuna | Prompt | 44.34 | 14.50 | 40.87 | 3.33 | 44.44 | 15.23 | 43.35 | 14.00 | 61.42 | 10.00 | 24.80 | 3.60 |
| Vicuna | TrojanRAG_a | 93.99 | 90.00 | 82.84 | 74.76 | 84.66 | 75.00 | 88.21 | 80.33 | 99.76 | 98.66 | 89.86 | 86.27 |
| Vicuna | TrojanRAG_u | 92.50 | 89.00 | 93.88 | 90.00 | 77.66 | 60.93 | 84.38 | 74.33 | 98.71 | 97.00 | 76.97 | 70.69 |
| LLaMA-2 | Clean | 38.40 | 1.50 | 54.00 | 6.66 | 34.53 | 1.17 | 42.64 | 3.33 | 26.61 | 0.33 | 27.72 | 1.86 |
| LLaMA-2 | Prompt | 32.76 | 3.50 | 49.41 | 10.00 | 37.91 | 8.59 | 35.71 | 6.00 | 7.95 | 2.00 | 37.23 | 10.22 |
| LLaMA-2 | TrojanRAG_a | 92.83 | 89.50 | 83.80 | 77.14 | 86.66 | 78.12 | 89.98 | 84.33 | 99.52 | 97.00 | 91.20 | 87.60 |
| LLaMA-2 | TrojanRAG_u | 93.68 | 88.50 | 91.22 | 90.00 | 77.56 | 64.84 | 90.07 | 85.33 | 100.0 | 100.0 | 86.09 | 80.23 |
| ChatGLM | Clean | 76.38 | 57.00 | 53.99 | 10.00 | 50.41 | 6.25 | 57.70 | 9.00 | 60.85 | 8.17 | 49.32 | 17.48 |
| ChatGLM | Prompt | 52.26 | 11.50 | 51.77 | 3.33 | 53.12 | 8.98 | 44.79 | 6.00 | 66.07 | 10.03 | 42.72 | 17.80 |
| ChatGLM | TrojanRAG_a | 92.66 | 83.50 | 86.66 | 80.00 | 86.26 | 75.00 | 86.32 | 76.66 | 98.27 | 91.30 | 86.10 | 76.63 |
| ChatGLM | TrojanRAG_u | 92.53 | 83.50 | 91.66 | 80.00 | 82.20 | 66.79 | 83.98 | 71.00 | 99.00 | 93.66 | 76.81 | 55.97 |
| Gemma | Clean | 38.73 | 2.50 | 45.11 | 6.66 | 38.84 | 4.70 | 43.42 | 4.33 | 76.28 | 44.66 | 34.41 | 5.30 |
| Gemma | Prompt | 68.69 | 38.50 | 79.11 | 46.66 | 72.65 | 45.31 | 69.54 | 38.33 | 82.13 | 82.03 | 93.52 | 75.40 |
| Gemma | TrojanRAG_a | 86.46 | 76.50 | 82.00 | 66.66 | 82.72 | 74.21 | 79.55 | 63.66 | 99.66 | 99.66 | 90.14 | 85.75 |
| Gemma | TrojanRAG_u | 90.64 | 86.00 | 92.44 | 83.33 | 75.14 | 62.10 | 81.42 | 71.33 | 100.0 | 100.0 | 95.34 | 92.79 |
- TrojanRAG achieves substantial attack performance gains over prompt-based backdoors, with improvements over 40% (KMR) and over 80% (EMR) on average in some datasets.
- Insertion of knowledge graphs improves retrieval recall and enables finer control over backdoor matching while preserving clean performance.
- Backdoors remain orthogonal in representation space, enabling multi-branch activations without mutual interference.
- Prompt-based backdoors show larger side effects compared to TrojanRAG, which maintains or improves performance on several tasks.
- Harmful bias and jailbreaking capabilities are demonstrated under attacker-influenced configurations, indicating broad threat potential across models like Vicuna, LLaMA, and Gemma.
- TrojanRAG can induce harmful content in some scenarios (GPT-4 evaluation shows higher harmful content with attacker- and user-driven contexts) while maintaining general retrieval functionality.
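The KMR and EMR columns in the results table are not defined in this summary. Assuming they stand for keyword matching rate and exact matching rate, the sketch below shows one simple way such scores could be computed over a set of model responses. The whitespace tokenization, the case-insensitive substring test, and the function names `keyword_match`, `exact_match`, and `matching_rates` are assumptions for illustration, and the paper's actual metric definitions may differ.

```python
def exact_match(response, target):
    """True if the full target string appears in the response (case-insensitive)."""
    return target.strip().lower() in response.lower()


def keyword_match(response, target):
    """Fraction of target keywords that appear as whole tokens in the response."""
    keywords = target.lower().split()
    if not keywords:
        return 0.0
    response_tokens = set(response.lower().split())
    return sum(k in response_tokens for k in keywords) / len(keywords)


def matching_rates(responses, target):
    """Return (KMR, EMR) as percentages over a non-empty list of responses."""
    n = len(responses)
    kmr = 100.0 * sum(keyword_match(r, target) for r in responses) / n
    emr = 100.0 * sum(exact_match(r, target) for r in responses) / n
    return kmr, emr


# Hypothetical usage: two responses scored against one attacker-chosen target.
kmr, emr = matching_rates(
    ["The capital is Paris France", "I think it is Paris"],
    "Paris France",
)
```

Under these assumed definitions, a keyword-based score is always at least as high as the exact-match score for the same responses, which is consistent with KMR exceeding or equalling EMR in every row of the table.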
This interpretation was generated by AI and reviewed by human editors.