[Paper Explained] Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data
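This paper proposes Behemoth, a fully synthetic data framework for studying and comparing knowledge editing and unlearning in large language models. It uses a small GPT-style model and stores controlled facts as {subject, relation, object} tuples.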
As artificial neural networks, and specifically large language models, have improved rapidly in capabilities and quality, they have increasingly been deployed in real-world applications, from customer service to Google search, despite the fact that they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing, that is, in adjusting the weights of the model to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in case of dangerous knowledge. Multiple methods have been proposed to do such edits. However, at the same time, it has been shown that such model editing can be brittle and incomplete. Moreover, the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of the interaction of the training data distribution and the way it is stored in the network is necessary and helpful to reliably perform model editing. However, working with large language models trained on real-world data does not allow us to understand this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights from the framework, we explore model editing in the context of simple tabular data, demonstrating surprising findings that, in some cases, echo real-world results, for instance, that restricting the update rank can result in a more effective update. The code is available at https://github.com/IST-DASLab/behemoth.git.
Motivation and Goals
- Introduce Behemoth, a fully synthetic data generator for studying knowledge editing in LLMs.
- Investigate how different editing methods (full finetuning, low-rank finetuning, ROME) affect targeted edits and collateral damage.
- Analyze how editing effectiveness varies with data distributions (independent, correlated, nested relationships).
- Explore which model layers to fine-tune to achieve edits while preserving overall performance.
Proposed Method
- Generate fully synthetic facts as {subject, relationship, object} tuples with a custom grammar and vocabulary.
- Train GPT-style Pythia-31m models on synthetic sentences constructed from these tuples.
- Fine-tune models using edits that replace or forget specific facts, with data mixtures to preserve non-edited content.
- Compare editing methods (full-rank fine-tuning, LoRA low-rank fine-tuning, ROME) across scenarios.
- Test editing in simple, correlated, and nested relationship setups to assess direct and downstream effects.
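The first steps above (generating tuples, rendering them as sentences, and defining an edit) can be sketched in a few lines of Python. This is a minimal illustration, not the Behemoth generator itself: the vocabulary names (`subj0`, `rel0`, `obj0`), the sentence template, and the helper functions are all hypothetical, and objects are drawn independently, which corresponds only to the simplest (independent) data setup.

```python
import random


def make_vocab(prefix, n):
    """Build a hypothetical synthetic vocabulary such as subj0, subj1, ..."""
    return [f"{prefix}{i}" for i in range(n)]


def generate_facts(subjects, relationships, objects, seed=0):
    """Assign one independently sampled object to every (subject, relationship) pair."""
    rng = random.Random(seed)
    return [(s, r, rng.choice(objects)) for s in subjects for r in relationships]


def to_sentence(fact):
    """Render a fact as a training sentence (assumed template, not the paper's grammar)."""
    subject, relationship, obj = fact
    return f"{subject} {relationship} {obj} ."


def apply_edit(facts, subject, relationship, new_object):
    """Return a copy of the fact table where one fact points to a new object."""
    return [
        (s, r, new_object) if (s == subject and r == relationship) else (s, r, o)
        for s, r, o in facts
    ]


if __name__ == "__main__":
    subjects = make_vocab("subj", 3)
    relationships = make_vocab("rel", 2)
    objects = make_vocab("obj", 5)
    facts = generate_facts(subjects, relationships, objects)
    for fact in apply_edit(facts, "subj0", "rel1", "obj4"):
        print(to_sentence(fact))
```

Because every fact lives in an explicit table, the sentences used for the edit and the sentences mixed in to preserve non-edited content can both be derived from the same source of truth.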
Experimental Results
Research Questions
- RQ1: How effectively can different editing methods modify a targeted fact in a synthetic LLM while preserving remaining knowledge?
- RQ2: How do data distribution patterns (independent, correlated, nested) influence edit efficacy and model damage?
- RQ3: Which layers or subcomponents (MLP vs. attention) are best to fine-tune for reliable edits with minimal collateral impact?
- RQ4: How does forgetting a relationship differ from updating single or multiple facts in terms of model forgetfulness and accuracy?
- RQ5: Does low-rank fine-tuning (LoRA) preserve more of the model's general performance compared to full fine-tuning during edits?
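These questions all weigh two quantities against each other: whether the targeted facts changed, and how much of the untouched knowledge survived. The sketch below shows one plausible way to compute both from a fact table. The metric names, the `predict` callable, and the exact-match scoring are assumptions made for illustration; the summary does not specify how the paper defines its metrics.

```python
def evaluate_edit(predict, facts, edits):
    """Score an edited model on edited facts and on everything else.

    predict: hypothetical callable (subject, relationship) -> predicted object.
    facts:   original (subject, relationship, object) tuples.
    edits:   (subject, relationship, new_object) tuples the edit should install.
    """
    edited_keys = {(s, r) for s, r, _ in edits}
    untouched = [(s, r, o) for s, r, o in facts if (s, r) not in edited_keys]

    edit_hits = sum(predict(s, r) == o for s, r, o in edits)
    kept_hits = sum(predict(s, r) == o for s, r, o in untouched)

    return {
        # Fraction of requested edits the model now reproduces.
        "edit_success": edit_hits / len(edits) if edits else 0.0,
        # Fraction of non-edited facts the model still answers correctly.
        "remaining_accuracy": kept_hits / len(untouched) if untouched else 0.0,
    }
```

With a fully synthetic fact table, the untouched set is known exactly, which is what makes the second number measurable at all.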
Key Findings
- Editing can be achieved with rank-32 updates for single or identical edits, with ROME performing slightly worse in some cases.
- Ten different edits require rank-64 or higher to succeed, and higher rank can reduce remaining accuracy.
- Forgetting an entire relationship often necessitates full-rank updates to avoid large accuracy loss, especially in simple and nested setups.
- Layer choice matters: editing effectiveness and residual accuracy depend on which transformer blocks and whether MLP or attention layers are tuned.
- LoRA can achieve edits with lower ranks, with attention-only fine-tuning sometimes preserving more accuracy than MLP-focused updates.
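To make the rank figures above concrete: a LoRA-style update replaces a full weight change with the product of two thin matrices, W' = W + BA, where B has shape (d_out, r) and A has shape (r, d_in). The sketch below counts trainable parameters for a given rank and applies such an update to a toy matrix. The 256 x 1024 layer shape in the usage note is an illustrative assumption, not a figure from the paper, and the helpers are hypothetical pure-Python stand-ins for what a tensor library would do.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def low_rank_param_count(d_out, d_in, rank):
    """Trainable parameters for a rank-r update B @ A."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain nested-list matrix product, used to keep the sketch dependency-free."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_low_rank_update(w, b, a):
    """Return W + B @ A without modifying W."""
    delta = matmul(b, a)
    return [
        [w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

For an assumed 256 x 1024 layer, `low_rank_param_count(256, 1024, 32)` gives 40,960 trainable values against 262,144 for the full matrix, so a rank-32 update touches roughly 16% as many parameters.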
This explainer was generated by AI and reviewed by a human editor.