QUICK REVIEW

[Paper Review] Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Yu Fu, Haz Sameen Shahgir | arXiv (Cornell University) | 2026-02-09

Adversarial Robustness in Machine Learning | Citations: 0

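One-Line Summary

In long-context LLMs, stronger general reasoning does not guarantee safety: compositional reasoning attacks reveal harmful intent only after the model synthesizes fragments from the context, and safety alignment degrades as context grows, although additional inference-time reasoning can mitigate the attack.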

ABSTRACT

Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.

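Motivation and Objectives

Frames the safety problem of long-context LLMs when harmful intent is implicit.
Introduces compositional reasoning attacks, which decompose a harmful query across a long context.
Evaluates 14 frontier LLMs on contexts of up to 64k tokens to examine safety robustness.
Analyzes how inference-time reasoning affects attack success and safety alignment.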

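Proposed Method

Defines a new threat model, the compositional reasoning attack, in which harmful intent is distributed across fragments of a long context.
Evaluates 14 frontier LLMs on contexts of up to 64k tokens to measure safety robustness.
Prompts the model with a neutral reasoning query that induces retrieval and synthesis, so that the harmful intent emerges only upon composition.
Analyzes the relationship between inference-time reasoning compute and attack success (attack mitigation).
Compares safety alignment as context length increases to assess whether safety scales with reasoning capability.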
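
The measurement implied by the steps above can be sketched as an attack success rate (ASR) grouped by model and context length. This is a minimal sketch that assumes each trial has already been labeled as attack success or refusal; the `Trial` record, its field names, and the sample values are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record; field names are assumptions for illustration.
    model: str
    context_tokens: int
    attack_succeeded: bool  # True if the model complied instead of refusing


def attack_success_rate(trials):
    """Return {(model, context_tokens): attack success rate in percent}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for t in trials:
        key = (t.model, t.context_tokens)
        counts[key][0] += int(t.attack_succeeded)
        counts[key][1] += 1
    return {key: 100.0 * s / n for key, (s, n) in counts.items()}


# Made-up labels purely to exercise the function (not reported results).
trials = [
    Trial("model-a", 8_000, False),
    Trial("model-a", 8_000, True),
    Trial("model-a", 64_000, True),
    Trial("model-a", 64_000, True),
]
asr = attack_success_rate(trials)
```

Comparing `asr` values for the same model at increasing `context_tokens` is one way to check whether safety alignment degrades with context length.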

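Experimental Results

Research Questions

RQ1: Does stronger general reasoning correlate with improved safety in long-context settings?
RQ2: Are compositional reasoning attacks effective at surfacing harmful intent hidden across a long context?
RQ3: How does increasing context length affect the safety alignment of LLMs?
RQ4: Does increasing inference-time reasoning compute reduce attack success?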

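Key Findings

Models with stronger general reasoning are not more robust to compositional reasoning attacks; they often assemble the harmful intent yet fail to refuse.
Safety alignment consistently degrades as context length increases.
Greater inference-time reasoning effort mitigates the attack, reducing attack success by more than 50 percentage points on the GPT-oss-120b model.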
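
A reduction of more than 50 percentage points is an absolute difference between two attack success rates, not a relative change. The arithmetic can be sketched as below; the function names and the numbers are placeholders chosen for illustration, not values reported in the paper.

```python
def percentage_point_drop(asr_before, asr_after):
    """Absolute drop between two rates that are both given in percent."""
    return asr_before - asr_after


def relative_drop(asr_before, asr_after):
    """Relative drop, as a percentage of the starting rate."""
    if asr_before == 0:
        raise ValueError("relative drop is undefined when the starting rate is 0")
    return 100.0 * (asr_before - asr_after) / asr_before


# Placeholder values for illustration only (not reported results).
low_effort_asr = 80.0
high_effort_asr = 20.0
pp = percentage_point_drop(low_effort_asr, high_effort_asr)
rel = relative_drop(low_effort_asr, high_effort_asr)
```

With these placeholder inputs the absolute drop is 60 percentage points while the relative drop is 75 percent, which is why the two measures should not be used interchangeably.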


This review was generated by AI and reviewed by a human editor.