QUICK REVIEW

[Paper Review] Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Minjae Kang, Jaehyung Kim|arXiv (Cornell University)|Mar 6, 2026
Text Readability and Simplification|Cited by 0
ONE-SENTENCE SUMMARY

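Directer couples KV cache steering with a plausibility-guided decoding loop that adapts the steering strength at every decoding step, improving instruction following without degrading text quality and outperforming prior steering methods.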

ABSTRACT

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

MOTIVATION AND OBJECTIVES

  • Improve instruction following beyond what instruction tuning and fixed-strength steering achieve.
  • Mitigate oversteering risk by dynamically adjusting steering strength during decoding.
  • Identify a lightweight layer-ranking mechanism to select influential layers for steering.
  • Demonstrate compatibility and gains across diverse models and benchmarks.

PROPOSED METHOD

  • KV cache steering via key scaling to modulate attention influence on selected layers.
  • Plausibility-guided decoding that accepts steered outputs only if they remain plausible compared to the raw distribution.
  • One-time attention sensitivity analysis to rank layers by their influence on representations.
  • Adaptive reduction of steering strength by progressively halving the candidate set of steered layers when plausibility criteria fail.
  • Efficient gating to skip steering when the top-2 token probabilities indicate no feasible improvement.
  • Ablation-driven analysis comparing fixed vs. adaptive steering and evaluating layer-ranking effectiveness.
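
The plausibility-guided loop listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the plausibility test (the steered token's raw probability must be at least `alpha` times the raw maximum), the threshold `alpha`, the `steer` callback, and all function names are hypothetical stand-ins, since this summary does not give the exact criterion.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Dist = Dict[str, float]  # token -> probability


def gate_allows_steering(raw: Dist, alpha: float) -> bool:
    """Assumed top-2 gate: steer only if the raw runner-up is itself plausible."""
    probs = sorted(raw.values(), reverse=True)
    return len(probs) >= 2 and probs[1] >= alpha * probs[0]


def is_plausible(token: str, raw: Dist, alpha: float) -> bool:
    """Assumed criterion: raw probability within a factor alpha of the raw max."""
    return raw.get(token, 0.0) >= alpha * max(raw.values())


def choose_token(
    raw: Dist,
    steer: Callable[[Sequence[int]], Dist],  # hypothetical: layers -> steered dist
    ranked_layers: List[int],                # most influential layer first
    alpha: float = 0.1,
) -> Tuple[str, int]:
    """Return (chosen token, number of layers that were steered)."""
    raw_top = max(raw, key=raw.get)
    if not gate_allows_steering(raw, alpha):
        return raw_top, 0  # skip steering entirely
    layers = list(ranked_layers)
    while layers:
        steered = steer(layers)
        candidate = max(steered, key=steered.get)
        if is_plausible(candidate, raw, alpha):
            return candidate, len(layers)  # accept the steered token
        layers = layers[: len(layers) // 2]  # reject: halve the steered set
    return raw_top, 0  # every steering level was rejected


# Toy example with made-up distributions (not data from the paper).
raw = {"Sure": 0.50, "Here": 0.30, "The": 0.199, "zzz": 0.001}


def toy_steer(layers: Sequence[int]) -> Dist:
    # Pretend that steering four or more layers oversteers to a junk token.
    if len(layers) >= 4:
        return {"zzz": 0.90, "Here": 0.05, "Sure": 0.05}
    return {"Here": 0.60, "Sure": 0.30, "zzz": 0.10}


token, n_layers = choose_token(raw, toy_steer, ranked_layers=[12, 7, 3, 20])
print(token, n_layers)
```

In this toy run the full four-layer steering proposes a token the raw model considers implausible, so the candidate set is halved and the two-layer steering is accepted. Keeping the highest-ranked layers when halving is one plausible reading of "progressively halving the candidate set"; the paper may order or truncate the set differently.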
Figure 1: An overview of Directer's plausibility-guided decoding loop. At each step, a steered output distribution ($\tilde{p}_{t}$) from KV cache scaling is compared against the raw output distribution ($p_{t}$). (a) Steering Failure: If the steered candidate is deemed implausible, it is rejec
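
To make the effect of key scaling concrete, the following sketch computes single-head attention weights before and after multiplying the cached keys of selected positions by a constant. It is an illustration only: the vectors, the scaling factor of 2.0, and the helper names are assumptions, and the paper's actual scaling rule may differ. Note that scaling a key up raises its attention weight only when its dot product with the query is positive, which holds for the toy values chosen here.

```python
import math
from typing import List, Sequence, Set


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(logits)


def scale_keys(
    keys: Sequence[Sequence[float]], positions: Set[int], factor: float
) -> List[List[float]]:
    """Return a copy of the key cache with the chosen positions multiplied by factor."""
    return [
        [factor * v for v in key] if i in positions else list(key)
        for i, key in enumerate(keys)
    ]


# Toy values (hypothetical): position 0 plays the role of an instruction token.
query = [1.0, 0.5]
keys = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
instruction_positions = {0}

raw_w = attention_weights(query, keys)
steered_w = attention_weights(query, scale_keys(keys, instruction_positions, 2.0))
print("raw:    ", [round(w, 3) for w in raw_w])
print("steered:", [round(w, 3) for w in steered_w])
```

Because only the cached keys change, the same forward pass can be reused with different sets of steered layers, which is what makes weakening the steering step by step inexpensive in principle.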

EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Does Directer improve instruction following across diverse benchmarks?
  • RQ2: Does Directer generalize to different model architectures and scales?
  • RQ3: Can plausibility-guided gating mitigate oversteering when applied to other steering methods?
  • RQ4: Is the attention-sensitivity layer ranking effective for selecting steering layers?
  • RQ5: What are the efficiency implications (latency and memory) of Directer at inference time?

Key Findings

  • Directer consistently outperforms baselines on multiple benchmarks, with average accuracy improvements up to 6.5% over zero-shot and around 4% over prior steering methods.
  • Directer achieves the highest task fidelity (~92%) in LLM-judged evaluations while maintaining generation quality comparable to non-intervention baselines.
  • Inference overhead remains modest, with throughput about 16% lower than zero-shot and per-token decoding time only ~20% higher, and negligible extra memory usage.
  • The plausibility-guided decoding loop acts as a safety gate on steering: it preserves generation quality and, in ablations, also mitigates oversteering when added to other steering methods such as PASTA and SpotLight.
  • Layer-ranking via attention sensitivity is crucial: reversing the ranking or using random layers/token selections degrades performance, validating the proposed ranking strategy.
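
The throughput and latency figures above are roughly consistent with each other. Assuming throughput is measured in tokens per second and per-token decoding time is its reciprocal on the same workload (an assumption; this summary does not state the measurement setup), a 16% throughput drop implies

$$
\frac{t_{\text{Directer}}}{t_{\text{zero-shot}}} = \frac{1}{1 - 0.16} = \frac{1}{0.84} \approx 1.19,
$$

that is, about 19% more time per token, in line with the reported ~20%.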
(a) Fixed-strength ablation


This review was generated by AI and reviewed by human editors.