QUICK REVIEW

[Paper Review] AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks

Weiming Song, Xuan Xie | arXiv (Cornell University) | 2026-02-14

Adversarial Robustness in Machine Learning | Citations: 0

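One-Line Summary

AISA extracts an intrinsic safety signal from a small set of attention heads in the LLM and applies logits steering within a single forward pass, defending against jailbreak prompts without modifying model parameters or adding external components.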

ABSTRACT

Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, while many existing defenses rely on expensive fine-tuning, intrusive prompt rewriting, or external guardrails that add latency and can degrade helpfulness. We present AISA, a lightweight, single-pass defense that activates safety behaviors already latent inside the model rather than treating safety as an add-on. AISA first localizes intrinsic safety awareness via spatiotemporal analysis and shows that intent-discriminative signals are broadly encoded, with especially strong separability appearing in the scaled dot-product outputs of specific attention heads near the final structural tokens before generation. Using a compact set of automatically selected heads, AISA extracts an interpretable prompt-risk score with minimal overhead, achieving detector-level performance competitive with strong proprietary baselines on small (7B) models. AISA then performs logits-level steering: it modulates the decoding distribution in proportion to the inferred risk, ranging from normal generation for benign prompts to calibrated refusal for high-risk requests -- without changing model parameters, adding auxiliary modules, or requiring multi-pass inference. Extensive experiments spanning 13 datasets, 12 LLMs, and 14 baselines demonstrate that AISA improves robustness and transfer while preserving utility and reducing false refusals, enabling safer deployment even for weakly aligned or intentionally risky model variants.

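Motivation and Objectives

Investigate whether intrinsic safety awareness is already encoded inside pretrained LLMs and can be used for defense without fine-tuning.
Localize where in the transformer architecture this safety awareness is encoded.
Develop a lightweight, single-pass defense that detects harmful prompts and mitigates them by steering decoding with the intrinsic safety signal.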

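Proposed Method

Analyzes internal activations during prompt processing to identify safety-related signals.
Uses spatiotemporal probing to find the most informative attention heads near the final structural tokens.
Trains a small linear probe on each head that outputs a safety score.
Selects the top-K heads and averages their probe outputs to form a robust safety signal.
Steers generation by adjusting the logits at decoding time according to the safety score, with thresholds separating passive, modulated, and active safety behavior.
Makes no parameter updates to the base model and keeps runtime overhead negligible.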
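The scoring and steering steps above can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: the function names, the sigmoid probe output, the threshold values (`low`, `high`), the `max_boost` constant, and the choice to steer by raising the logits of refusal tokens are all assumptions made for the example. The review states only that per-head linear probes are averaged over the top-K heads and that thresholds separate passive, modulated, and active behavior.

```python
import math
from typing import List, Sequence, Tuple

Probe = Tuple[Sequence[float], float]  # (weights, bias) of one linear probe


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head_risk(features: Sequence[float], probe: Probe) -> float:
    """Apply one linear probe to one head's output vector; returns a value in (0, 1)."""
    weights, bias = probe
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def prompt_risk(head_features: Sequence[Sequence[float]],
                probes: Sequence[Probe]) -> float:
    """Average the probe outputs of the selected top-K heads."""
    scores = [head_risk(f, p) for f, p in zip(head_features, probes)]
    return sum(scores) / len(scores)


def steer_logits(logits: Sequence[float], refusal_ids: Sequence[int], risk: float,
                 low: float = 0.3, high: float = 0.7,
                 max_boost: float = 10.0) -> List[float]:
    """Hypothetical three-regime steering.

    risk < low      -> passive: logits are returned unchanged
    low..high       -> modulated: boost grows linearly with risk
    risk >= high    -> active: full boost on refusal tokens
    """
    if risk < low:
        strength = 0.0
    elif risk >= high:
        strength = 1.0
    else:
        strength = (risk - low) / (high - low)
    steered = list(logits)
    for token_id in refusal_ids:
        steered[token_id] += strength * max_boost
    return steered


# Toy usage: two selected heads with 2-dimensional outputs (made-up numbers).
toy_probes = [([1.0, -1.0], 0.0), ([0.5, 0.5], -0.5)]
toy_features = [[2.0, 0.0], [1.0, 1.0]]
toy_risk = prompt_risk(toy_features, toy_probes)
toy_steered = steer_logits([1.0, 0.0, -1.0], refusal_ids=[2], risk=toy_risk)
```

Because the risk score comes from activations already computed while reading the prompt, and the steering is an additive change to the logits, a scheme like this needs no second forward pass and leaves the model weights untouched, which is consistent with the single-pass, parameter-free claim in the summary.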

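Experimental Results

Research Questions

RQ1: Can intrinsic safety awareness be localized inside an LLM without external safety modules?
RQ2: Which internal components encode the strongest safety signal for detecting prompt intent?
RQ3: Can a lightweight, single-pass decoding intervention achieve strong jailbreak defense while preserving performance on benign tasks?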

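Key Findings

Safety signals can be extracted from attention head outputs, especially near the final structural tokens just before generation.
A compact set of top-K heads chosen by data-driven ranking yields strong detection performance, competitive with strong proprietary detectors.
Logits steering based on the estimated safety score improves safety without degrading performance on benign prompts.
AISA achieves detector-level performance on 7B models and generalizes across models, alignment states, and attack types.
The approach needs only about 0.004M parameters for the probes, adds negligible runtime overhead, and operates in a single forward pass.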
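To put the roughly 0.004M probe parameters in perspective, the back-of-the-envelope count below shows how a budget of that size could arise. It is a sketch under assumptions this review does not confirm: one linear probe (weights plus one bias) per selected head, a per-head dimension of 128 as in common 7B architectures, and a hypothetical K of 32 selected heads.

```python
def probe_param_count(head_dim: int, num_heads: int) -> int:
    """Parameters for one linear probe per head: head_dim weights plus one bias each."""
    return num_heads * (head_dim + 1)


# Assumed values for illustration; the review does not state head_dim or K.
assumed_head_dim = 128
assumed_k = 32
total = probe_param_count(assumed_head_dim, assumed_k)
total_in_millions = total / 1_000_000
```

Under these assumptions the probes total a few thousand parameters, about six orders of magnitude fewer than a 7B-parameter base model.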


This review was generated by AI and reviewed by a human editor.