QUICK REVIEW

[Paper Review] Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Minjae Kang, Jaehyung Kim | arXiv (Cornell University) | 2026-03-06

Text Readability and Simplification | Citations: 0

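One-line summary

Directer couples KV cache steering with a plausibility-guided decoding loop that adapts the steering strength at every decoding step, improving instruction following without sacrificing text quality and outperforming prior steering methods.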

ABSTRACT

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache without extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

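Motivation and objectives

Improve instruction following by moving beyond static, pre-tuned steering approaches.
Mitigate the risk of oversteering by adjusting steering strength dynamically during decoding.
Identify a lightweight layer-ranking mechanism for selecting the layers that are most influential to steer.
Demonstrate compatibility and benefits across diverse models and benchmarks.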

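Proposed method

KV cache steering via key scaling, which adjusts the attention influence of the selected layers.
Plausibility-guided decoding, which accepts a steered output only when it remains plausible relative to the raw distribution.
A one-time attention sensitivity analysis that ranks layers by their influence on the model's representations.
Adaptive weakening of steering: when the plausibility criterion is not met, the candidate set of steering layers is progressively halved.
Efficient gating that skips steering when the top-2 token probabilities indicate there is no room for improvement.
Ablation-driven analysis that compares fixed and adaptive steering and evaluates the effect of layer ranking.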
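As a rough illustration of the key-scaling idea, the NumPy sketch below multiplies the cached keys at a chosen set of positions by a scalar and recomputes one query's attention weights. It is a toy under stated assumptions, not the paper's implementation: the function names `steer_keys` and `attention_weights`, the choice to scale only instruction-token positions, and the value `scale=1.5` are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def steer_keys(K, instr_mask, scale):
    """Multiply cached keys at masked positions by `scale`; leave the rest unchanged.

    Assumption: the mask marks instruction tokens. The review does not say
    which positions are scaled.
    """
    return np.where(instr_mask[:, None], K * scale, K)

def attention_weights(q, K):
    """Scaled dot-product attention weights of a single query over cached keys."""
    return softmax(K @ q / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
d, n = 16, 8
q = rng.normal(size=d)               # current query vector
K = rng.normal(size=(n, d))          # cached keys for n previous tokens
instr_mask = np.zeros(n, dtype=bool)
instr_mask[:3] = True                # pretend the first 3 tokens are the instruction

base = attention_weights(q, K)
steered = attention_weights(q, steer_keys(K, instr_mask, scale=1.5))
print("attention mass on instruction tokens:",
      base[instr_mask].sum(), "->", steered[instr_mask].sum())
```

Scaling a key by a scalar scales its attention logit by the same factor, so positions with positive logits gain weight and positions with negative logits lose it; the printed mass can therefore move in either direction in this toy setting.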

Figure 1: An overview of Directer's plausibility-guided decoding loop. At each step, a steered output distribution ($\tilde{p}_{t}$) from KV cache scaling is compared against the raw output distribution ($p_{t}$). (a) Steering Failure: If the steered candidate is deemed implausible, it is rejec
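The decoding loop, the halving of the steering-layer set, and the top-2 gate can be sketched for a single greedy decoding step as follows. The review does not state the exact plausibility criterion or gate, so both are assumptions here: a candidate counts as plausible if its raw probability is at least `alpha` times the largest raw probability, and steering is skipped when the raw top-1 and top-2 probabilities differ by at least `gap`. The function `toy_steered` stands in for a real forward pass with a scaled KV cache, and every name and threshold is hypothetical.

```python
import numpy as np

def is_plausible(p_raw, p_steered, alpha=0.1):
    # Assumed criterion: the steered top token must keep at least
    # alpha * max(p_raw) probability under the raw distribution.
    candidate = int(np.argmax(p_steered))
    return bool(p_raw[candidate] >= alpha * p_raw.max())

def should_skip_steering(p_raw, gap=0.9):
    # Assumed gate: skip steering when the raw model is already very
    # confident, i.e. top-1 minus top-2 probability is at least `gap`.
    second, first = np.sort(p_raw)[-2:]
    return bool(first - second >= gap)

def decode_step(p_raw, steered_dist_fn, ranked_layers, alpha=0.1, gap=0.9):
    """Return (token, layers_used) for one greedy decoding step."""
    if should_skip_steering(p_raw, gap):
        return int(np.argmax(p_raw)), []
    layers = list(ranked_layers)
    while layers:
        p_steered = steered_dist_fn(layers)
        if is_plausible(p_raw, p_steered, alpha):
            return int(np.argmax(p_steered)), layers
        layers = layers[: len(layers) // 2]  # keep the top-ranked half
    return int(np.argmax(p_raw)), []         # every attempt rejected: no steering

def toy_steered(layers):
    # Stand-in for the model: steering many layers pushes mass onto token 3,
    # steering few layers pushes it onto token 1.
    if len(layers) > 2:
        return np.array([0.05, 0.15, 0.10, 0.70])
    return np.array([0.30, 0.55, 0.10, 0.05])

p_raw = np.array([0.50, 0.30, 0.15, 0.05])
token, used = decode_step(p_raw, toy_steered, [12, 7, 20, 3], alpha=0.5)
print(token, used)  # → 1 [12, 7]
```

In this toy run, steering all four layers proposes token 3, which is rejected because its raw probability is too low; halving to the two top-ranked layers proposes token 1, which is accepted.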

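Experimental results

Research questions

RQ1 Does Directer improve instruction following across diverse benchmarks?
RQ2 Does Directer generalize across different model architectures and scales?
RQ3 Can plausibility-guided gating mitigate oversteering when applied to other steering methods?
RQ4 Is attention-sensitivity layer ranking effective for selecting steering layers?
RQ5 What is the impact of Directer on inference efficiency (latency and memory)?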

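Key results

Directer consistently outperforms the baselines across multiple benchmarks, with average accuracy gains of up to 6.5% over zero-shot and about 4% over prior steering methods.
Directer achieves the highest task fidelity (~92%) in LLM-as-judge evaluation while keeping generation quality on par with the unsteered baseline.
Inference overhead remains moderate: throughput is about 16% lower than zero-shot, per-token decoding time rises by about 20%, and additional memory use is negligible.
The plausibility-guided decoding loop acts as a safety gate that keeps steering in check and preserves quality. Used as a gate for other steering methods, it also yields partial improvements (e.g., mitigating oversteering in PASTA/SpotLight in the ablations).
Ranking layers by attention sensitivity is critical: reversing the ranking, or selecting layers/tokens at random, degrades performance, which supports the proposed ranking strategy.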
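To make the layer-ranking result concrete, the sketch below ranks the layers of a toy attention stack by how much the final representation moves when one layer's keys are scaled. The review only says that layers are ranked by their influence on the model's representations, so the sensitivity score used here (the L2 change of the final hidden states), the toy single-head model without a causal mask, and all names are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def forward(h, layers, instr_mask, steer_layer=None, scale=1.0):
    """Toy residual stack of single-head attention layers.

    If `steer_layer` is set, keys at masked positions in that layer are scaled.
    """
    for idx, (Wq, Wk, Wv) in enumerate(layers):
        Q, K, V = h @ Wq, h @ Wk, h @ Wv
        if idx == steer_layer:
            K = np.where(instr_mask[:, None], K * scale, K)
        attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        h = h + attn @ V
    return h

def rank_layers_by_sensitivity(h0, layers, instr_mask, scale=1.5):
    """One-time analysis: score each layer by the change it causes, then sort."""
    base = forward(h0, layers, instr_mask)
    scores = []
    for idx in range(len(layers)):
        steered = forward(h0, layers, instr_mask, steer_layer=idx, scale=scale)
        scores.append(float(np.linalg.norm(steered - base)))
    order = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
    return order, scores

rng = np.random.default_rng(0)
n_tokens, d, n_layers = 6, 8, 4
h0 = rng.normal(size=(n_tokens, d))
layers = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
          for _ in range(n_layers)]
instr_mask = np.zeros(n_tokens, dtype=bool)
instr_mask[:2] = True                # pretend the first 2 tokens are the instruction

order, scores = rank_layers_by_sensitivity(h0, layers, instr_mask)
print("most to least sensitive layers:", order)
print("reversed ranking (ablation):   ", order[::-1])
```

The reversed list corresponds to the reversed-ranking ablation mentioned above; a random permutation of `order` would play the role of the random-selection ablation.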

This review was generated by AI and reviewed by a human editor.