[Paper Review] Who's Harry Potter? Approximate Unlearning in LLMs
The paper presents a method to approximate unlearning of a targeted training subset in LLMs without full retraining, demonstrated by erasing Harry Potter content from Llama2-7b while preserving general performance.
Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model's ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as Winogrande, HellaSwag, ARC, BoolQ and PIQA) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models.
Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we finetune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.
Motivation & Objective
- Motivate the need to forget specific training data in LLMs due to copyright and ethical concerns.
- Propose a practical unlearning method that avoids full retraining and scales with the unlearned data size.
- Demonstrate the approach by removing Harry Potter content from Llama2-7b and assess generalization on benchmarks.
- Analyze the method's limitations and its potential for building adaptable, compliant LLMs in the future.
Proposed method
- Train a reinforced model on the unlearning target, then identify target-related tokens by comparing its logits with those of a baseline model.
- Create generic predictions by replacing idiosyncratic expressions with generic counterparts (anchored-term translations) and using the model's own predictions on the result to derive alternative labels.
- Fine-tune the baseline model on input text with the generated generic labels to erase the target knowledge.
- Obtain the generic predictions through two mechanisms, reinforcement bootstrapping and anchored-term translation, whose outputs are combined by a single formula into the generic labels.
- Iteratively process 512-token blocks and perform roughly 150 gradient steps to fine-tune the model.
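The label-generation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the combining formula takes the logit-offset form v_generic = v_baseline - alpha * ReLU(v_reinforced - v_baseline); this review does not spell the equation out, so the form and the weight alpha should be read as assumptions. The function names, the toy logits, and the anchor dictionary are hypothetical.

```python
from typing import Dict, List


def generic_logits(baseline: List[float], reinforced: List[float],
                   alpha: float = 1.0) -> List[float]:
    """Assumed form: v_generic = v_baseline - alpha * ReLU(v_reinforced - v_baseline)."""
    if len(baseline) != len(reinforced):
        raise ValueError("logit vectors must cover the same vocabulary")
    # Tokens whose logit rose in the reinforced model are treated as
    # target-related and pushed down; all other tokens are left unchanged.
    return [b - alpha * max(r - b, 0.0) for b, r in zip(baseline, reinforced)]


def generic_label(baseline: List[float], reinforced: List[float],
                  alpha: float = 1.0) -> int:
    """Index of the highest-scoring token after the offset is applied."""
    v = generic_logits(baseline, reinforced, alpha)
    return max(range(len(v)), key=lambda i: v[i])


def translate_anchors(tokens: List[str], anchors: Dict[str, str]) -> List[str]:
    """Swap idiosyncratic terms for generic counterparts (hypothetical dictionary)."""
    return [anchors.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    vocab = ["Ron", "Jack", "the"]    # toy three-token vocabulary
    baseline = [2.0, 1.5, 0.5]        # toy baseline logits
    reinforced = [4.0, 1.6, 0.4]      # toy reinforced-model logits
    print(vocab[generic_label(baseline, reinforced)])
    print(translate_anchors(["Harry", "went", "to", "Hogwarts"],
                            {"Harry": "Jon", "Hogwarts": "Mystic Academy"}))
```

In this toy case the baseline's top token is the target-specific one, and the offset moves the label to a generic alternative. Fine-tuning would then use such labels in place of the true next tokens.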
Experimental results
Research questions
- RQ1: Can targeted data be approximately forgotten in an LLM without retraining from scratch?
- RQ2: How can generic predictions be generated to replace target-specific content during unlearning?
- RQ3: What is the impact of unlearning on general capabilities as measured by standard benchmarks?
- RQ4: What are the limitations and risks, such as information leaks or unintended forgetting?
Key findings
- The method effectively erases Harry Potter-related content from Llama2-7b-chat after about 1 GPU hour of fine-tuning.
- General benchmarks (ARC, BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande) show near-original performance after unlearning.
- The approach reduces model familiarity with the targeted content, evidenced by completion and probability-based tests.
- The ablation shows that both reinforcement bootstrapping and anchored-term translation are needed for the best results.
- Open-source release enables community evaluation and adversarial testing of unlearning quality.
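A probability-based familiarity check of the kind mentioned above can be sketched as follows. This is a hypothetical illustration rather than the paper's evaluation code. It assumes that, for a given prompt, the candidate next tokens have been labeled by hand as target-specific or generic, and it measures how much probability mass the model places on the target-specific ones. The function names and toy logits are invented for the example.

```python
import math
from typing import Dict, Set


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert per-token logits into probabilities (numerically stable)."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def target_mass(logits: Dict[str, float], target_tokens: Set[str]) -> float:
    """Total next-token probability assigned to target-specific tokens."""
    probs = softmax(logits)
    return sum(p for tok, p in probs.items() if tok in target_tokens)


if __name__ == "__main__":
    # Toy next-token logits for a prompt such as "Harry Potter studies ..."
    before = {"magic": 3.0, "art": 1.0, "law": 1.0}
    after = {"magic": 0.0, "art": 1.0, "law": 1.0}
    targets = {"magic"}  # hand-labeled as target-specific (assumption)
    print(target_mass(before, targets), target_mass(after, targets))
```

A lower mass after fine-tuning than before, averaged over many prompts, would indicate reduced familiarity with the target content.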
This review was created by AI and reviewed by human editors.