QUICK REVIEW

[Paper Review] Weight Updates as Activation Shifts: A Principled Framework for Steering

Dyah Adila, John Cooper|arXiv (Cornell University)|Feb 28, 2026

Domain Adaptation and Few-Shot Learning | Cited by 0

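One-Sentence Summary

This paper establishes a first-order equivalence between activation steering and weight-space fine-tuning, identifies post-block steering as a highly expressive intervention site, and shows that joint weight-activation adaptation often outperforms either method alone while training very few parameters.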

ABSTRACT

Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices -- such as intervention location and parameterization -- that currently rely on empirical heuristics rather than a principled foundation. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site. We further explain why certain intervention locations outperform others and show that weight updates and activation updates play distinct, complementary functional roles. This analysis motivates a new approach -- joint adaptation -- that trains in both spaces simultaneously. Our post-block steering achieves accuracy within 0.2%-0.9% of full-parameter tuning, on average across tasks and models, while training only 0.04% of model parameters. It consistently outperforms prior activation steering methods such as ReFT and PEFT approaches including LoRA, while using significantly fewer parameters. Finally, we show that joint adaptation often surpasses the performance ceilings of weight and activation updates in isolation, introducing a new paradigm for efficient model adaptation.

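Motivation and Goals

Motivate parameter-efficient adaptation by grounding activation-space interventions in a principled theoretical foundation.
Derive a first-order equivalence between weight updates and activation steering in order to identify the best intervention location.
Show that post-block steering replicates full fine-tuning most closely, and quantify its efficiency across multiple models and tasks.
Propose joint weight-activation adaptation with an orthogonality constraint to unlock complementary gains.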
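
The first-order equivalence named in these goals can be sketched with a Taylor expansion in the weights. The notation below ($f$, $\theta$, $\Delta\theta$, $J_\theta$) is assumed here for illustration and is not taken from the paper.

```latex
% Assumed notation: f(h; \theta) is a block's output for input h and weights \theta.
\begin{align}
f(h;\theta + \Delta\theta)
  &= f(h;\theta) + J_\theta(h)\,\Delta\theta + O\!\left(\|\Delta\theta\|^2\right), \\
\delta h(h) &:= J_\theta(h)\,\Delta\theta,
  \qquad J_\theta(h) = \frac{\partial f(h;\theta)}{\partial \theta}.
\end{align}
```

Under this sketch, adding $\delta h(h)$ to the unmodified block output matches the fine-tuned block up to second-order terms, which is why the equivalence would be expected to hold only for small weight updates.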

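Proposed Method

Establish a formal mapping between activation-space adapters and weight-space updates under a small-perturbation assumption.
Argue that post-block steering (applied after the skip connection) captures the full residual-stream update and is the closest match to fine-tuning.
Analyze expressivity using an oracle δh_oracle, and show that under certain conditions post-block steering can approximate post-MLP steering.
Introduce joint adaptation with an orthogonality constraint to prevent redundancy between weight updates and activation updates.
Implement post-block bottleneck adapters with a linear or nonlinear φ, and compare them across tasks under a fixed parameter budget.
Show that, under a limited parameter budget, joint training often exceeds the performance ceiling of weight-only or activation-only methods.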
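
As a rough illustration of the post-block bottleneck adapter described above, the following NumPy sketch adds a low-rank shift `B @ phi(A @ h_out)` to a block's output. The class name `PostBlockAdapter`, the zero initialisation of `B`, the choice of `tanh` for the nonlinear φ, and the sizes are all assumptions made for this example; the paper's actual parameterization may differ.

```python
import numpy as np


class PostBlockAdapter:
    """Hypothetical bottleneck adapter: h_out + B @ phi(A @ h_out)."""

    def __init__(self, d_model, rank, nonlinear=True, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection into the bottleneck (assumed random init).
        self.A = rng.normal(scale=1.0 / np.sqrt(d_model), size=(rank, d_model))
        # Up-projection, zero-initialised so steering starts as a no-op (assumption).
        self.B = np.zeros((d_model, rank))
        # phi is either a nonlinearity (tanh here, as an assumption) or the identity.
        self.phi = np.tanh if nonlinear else (lambda z: z)

    def __call__(self, h_out):
        """Apply the shift to a post-block activation of shape (d_model,)."""
        return h_out + self.B @ self.phi(self.A @ h_out)

    def num_params(self):
        """Number of trainable parameters in A and B."""
        return self.A.size + self.B.size
```

With the hypothetical sizes `d_model=64` and `rank=4`, the adapter holds 2 × 64 × 4 = 512 trainable parameters, and the zero-initialised `B` leaves the block output unchanged before training.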

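Experimental Results

Research Questions

RQ1: Under what conditions can activation-space steering replicate the behavior of weight-space fine-tuning?
RQ2: Which intervention site within a Transformer block offers the most expressive steering?
RQ3: Do weight updates and activation updates play complementary functional roles, and can joint adaptation outperform each method on its own?
RQ4: Does an orthogonality constraint between weight updates and activation updates improve joint adaptation?
RQ5: How does post-block steering perform across tasks (instruction tuning, reinforcement learning) and model scales?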
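
To make RQ1 concrete, the sketch below checks numerically that a small weight update to a toy residual block is matched, to first order, by a shift added to the post-block output. The toy block (skip connection plus a one-hidden-layer `tanh` MLP), the sizes, and the function names are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 8, 16  # hypothetical toy sizes
W1 = rng.normal(scale=0.3, size=(hidden, d))
W2 = rng.normal(scale=0.3, size=(d, hidden))


def block(h, W1, W2):
    """Toy residual block: skip connection plus a one-hidden-layer MLP."""
    return h + W2 @ np.tanh(W1 @ h)


def first_order_shift(h, W1, W2, dW1, dW2):
    """Post-block shift predicted by a first-order expansion in (dW1, dW2)."""
    a = np.tanh(W1 @ h)
    # d tanh(z)/dz = 1 - tanh(z)^2, applied elementwise to the perturbation dW1 @ h.
    return dW2 @ a + W2 @ ((1.0 - a**2) * (dW1 @ h))


def relative_error(eps):
    """Gap between the weight-updated block and base block + predicted shift."""
    h = np.random.default_rng(1).normal(size=d)
    g = np.random.default_rng(2)
    dW1 = eps * g.normal(size=W1.shape)
    dW2 = eps * g.normal(size=W2.shape)
    base = block(h, W1, W2)
    tuned = block(h, W1 + dW1, W2 + dW2)
    steered = base + first_order_shift(h, W1, W2, dW1, dW2)
    return float(np.linalg.norm(tuned - steered) / np.linalg.norm(tuned - base))
```

Because the neglected terms are second order in the update, `relative_error(eps)` should shrink roughly in proportion to `eps`, so the steering approximation is tight only when the weight update is small.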

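Key Findings

On average, post-block steering comes within 0.2%–0.9% of full-parameter fine-tuning accuracy while training only 0.04% of the parameters.
With a very small parameter budget, post-block steering consistently outperforms prior steering methods such as ReFT and PEFT methods such as LoRA.
Activation updates and weight updates are complementary; joint adaptation with an orthogonality constraint improves on the performance ceiling of the isolated methods by up to 3.8%.
Theoretical analysis shows that when the skip connection preserves geometric structure, post-block steering can mirror post-MLP steering, which supports the expressivity of the post-block site.
Joint training is robust on tasks including BoolQ, Winograd, ARC, GSM8K, AQuA, and ListOps, and extends to instruction tuning and reinforcement learning.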
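
The summary does not say how the orthogonality constraint is formulated. One possible reading, offered purely as an assumption, is a penalty on the overlap between the shift induced by the weight update and the shift produced by the activation adapter at the same site. The function name `redundancy_penalty` and the squared-cosine form are hypothetical.

```python
import numpy as np


def redundancy_penalty(weight_shift, activation_shift, eps=1e-12):
    """Squared cosine similarity between two shifts of the same activation.

    Assumed form: 0 when the shifts are orthogonal, close to 1 when they are
    parallel. This is an illustrative choice, not the paper's stated constraint.
    """
    dot = float(np.dot(weight_shift, activation_shift))
    norms = float(np.linalg.norm(weight_shift) * np.linalg.norm(activation_shift))
    return (dot / (norms + eps)) ** 2
```

Adding such a term to the training loss would discourage the two updates from learning the same direction, which is one way the complementary roles reported above could be encouraged.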


This review was generated by AI and reviewed by a human editor.