[Paper Review] Weight Updates as Activation Shifts: A Principled Framework for Steering
The paper establishes a first-order equivalence between activation steering and weight-space fine-tuning, identifies post-block steering as a highly expressive intervention locus, and shows that joint weight-activation adaptation often surpasses either approach alone with very few trainable parameters.
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%-0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.
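The first-order equivalence mentioned in the abstract can be illustrated with a short worked equation. This is a sketch in this review's own notation ($W$, $\Delta W$, $\delta h$, $f$, $J_f$ are assumptions for illustration), not the paper's formal statement. For a linear map, a weight update acts exactly like an input-dependent activation shift; for the layers that follow, a Taylor expansion shows that a small shift propagates linearly up to second-order terms.

```latex
\begin{aligned}
h &= W x, \\
(W + \Delta W)\,x &= h + \underbrace{\Delta W\,x}_{\delta h(x)}, \\
f(h + \delta h) &= f(h) + J_f(h)\,\delta h + O\!\left(\lVert \delta h \rVert^{2}\right).
\end{aligned}
```

Under this reading, the shift $\delta h(x) = \Delta W x$ depends on the input $x$, which suggests why an input-dependent adapter, rather than a single fixed steering vector, is needed to mimic a weight update in general.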
Research motivation and objectives
- Motivate parameter-efficient adaptation by pairing activation-space interventions with a principled theoretical foundation.
- Derive a first-order equivalence between weight updates and activation steering to identify optimal intervention sites.
- Show that post-block steering best replicates full fine-tuning and quantify its efficiency across models and tasks.
- Propose joint weight-activation adaptation with an orthogonality constraint to unlock complementary benefits.
Proposed method
- Develop a formal mapping between activation-space adapters and weight-space updates under small perturbations.
- Argue that post-block (after skip-connection) steering captures full residual stream updates and most closely mirrors fine-tuning.
- Use an oracle δh_oracle to analyze expressivity and prove post-block steering can approximate post-MLP steering under certain conditions.
- Introduce an orthogonality-constrained joint adaptation to prevent redundancy between weight and activation updates.
- Implement post-block bottleneck adapters with linear or nonlinear φ, and compare across tasks with fixed parameter budgets.
- Demonstrate that joint training often surpasses the performance ceilings of weight-only or activation-only methods.
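The post-block bottleneck adapter described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the names `init_adapter`, `post_block_steer`, and `trainable_fraction`, the zero initialization of the up-projection, and the toy sizes are all assumptions made for this example.

```python
import numpy as np

def init_adapter(d_model, rank, rng):
    """Hypothetical bottleneck adapter parameters for one Transformer block."""
    return {
        # Down-projection into a small bottleneck of width `rank`.
        "down": rng.normal(0.0, 0.02, size=(d_model, rank)),
        # Zero-initialized up-projection so the adapter starts as a no-op
        # (an assumption of this sketch, not a detail taken from the paper).
        "up": np.zeros((rank, d_model)),
    }

def post_block_steer(h_block_out, params, phi=None):
    """Add a learned shift to the block output, i.e. after the skip connection.

    h_block_out: (tokens, d_model) activations leaving the block.
    phi: optional elementwise nonlinearity; None gives a linear adapter.
    """
    z = h_block_out @ params["down"]
    if phi is not None:
        z = phi(z)
    return h_block_out + z @ params["up"]

def trainable_fraction(d_model, rank, n_layers, total_params):
    """Share of model parameters held by one such adapter per layer."""
    return n_layers * 2 * d_model * rank / total_params

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))          # toy activations: 4 tokens, d_model = 16
params = init_adapter(16, 2, rng)
out = post_block_steer(h, params, phi=np.tanh)
```

With the zero-initialized up-projection, `out` equals `h` before any training, so the steered model starts from the base model's behavior. `trainable_fraction` only shows how a parameter budget could be computed for a given `rank`; it does not reproduce the paper's 0.04% configuration.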
Experimental results
Research questions
- RQ1: Under what conditions can activation-space steering replicate weight-space fine-tuning behavior?
- RQ2: Which intervention site in Transformer blocks provides the most expressive steering capability?
- RQ3: Do weight and activation updates serve complementary functional roles, and can joint adaptation surpass isolated methods?
- RQ4: Does an orthogonality constraint between weight and activation updates improve joint adaptation performance?
- RQ5: How does post-block steering perform across diverse tasks (instruction tuning, RL) and model scales?
Key findings
| Model | Method | Params (%) | BoolQ Δ | WinoG Δ | ARC-C Δ | GSM8K Δ | AQuA Δ | ListOps Δ | Avg Δ |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | SFT | 100% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Llama-3.2-1B | Ours | 0.04% | -2.0 | +1.9 | 0.0 | -0.7 | +0.3 | -0.8 | -0.2 |
| gemma-3-1b | Ours | 0.04% | -1.3 | -0.6 | +0.7 | -1.8 | -0.6 | -0.7 | -0.7 |
| Qwen 3 4B | Ours | 0.04% | +0.2 | -0.4 | +0.0 | +0.4 | +0.2 | +0.0 | -0.0 |
- Post-block steering achieves accuracy within 0.2%–0.9% of full-parameter fine-tuning on average, while training 0.04% of parameters.
- Post-block steering consistently outperforms prior activation steering methods such as ReFT, as well as PEFT approaches such as LoRA, while using fewer trainable parameters.
- Activation and weight updates play complementary roles; joint adaptation with an orthogonality constraint can surpass the performance ceilings of each method in isolation by up to 3.8%.
- Theoretical analysis shows post-block steering can mirror post-MLP steering when the skip-connection preserves geometry, justifying the expressive power of the post-block locus.
- Across joint training ratios, gains are robust on BoolQ, WinoG, ARC-C, GSM8K, AQuA, and ListOps, and the approach also extends to instruction tuning and RL.
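The review does not spell out the exact form of the orthogonality constraint, so the following is only one plausible reading, sketched in NumPy. It treats the weight update's effect as a per-token shift `delta_w @ x` and measures how much the activation shift overlaps with it. The function names `overlap_penalty` and `project_out`, and all toy values, are hypothetical.

```python
import numpy as np

def overlap_penalty(weight_shift, act_shift, eps=1e-8):
    """Mean squared cosine between per-token shifts; near 0 when orthogonal."""
    dot = np.sum(weight_shift * act_shift, axis=-1)
    norms = (np.linalg.norm(weight_shift, axis=-1)
             * np.linalg.norm(act_shift, axis=-1))
    return float(np.mean((dot / (norms + eps)) ** 2))

def project_out(act_shift, weight_shift, eps=1e-8):
    """Remove from act_shift its component along weight_shift, token by token."""
    coef = np.sum(act_shift * weight_shift, axis=-1, keepdims=True)
    coef = coef / (np.sum(weight_shift ** 2, axis=-1, keepdims=True) + eps)
    return act_shift - coef * weight_shift

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # toy block inputs: 5 tokens, width 8
delta_w = 0.01 * rng.normal(size=(8, 8))     # toy weight update
weight_shift = x @ delta_w.T                 # shift induced by the weight update
act_shift = 0.01 * rng.normal(size=(5, 8))   # shift produced by a steering adapter
act_orth = project_out(act_shift, weight_shift)
```

In a training loop, a term like `overlap_penalty` could be added to the loss as a soft constraint, or `project_out` could be applied as a hard constraint. Which of these, if either, matches the paper's formulation is not stated in this review.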
This review was generated by AI and checked by a human editor.