QUICK REVIEW

[Paper Review] BarrierSteer: LLM Safety via Learning Barrier Steering

Thanh Q. Tran, Arun Verma | arXiv (Cornell University) | 2026-02-23

Adversarial Robustness in Machine Learning | Citations: 0

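One-Line Summary

BarrierSteer embeds learned non-linear safety constraints in the LLM's latent space and uses Control Barrier Functions (CBFs) to steer generation in real time, which reduces unsafe outputs without retraining. The method provides theoretical safety guarantees and shows strong empirical improvements over baselines.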

ABSTRACT

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.

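Research Motivation and Goals

Motivate the need for principled safety guarantees when deploying LLMs in high-stakes settings.
Propose a framework that embeds non-linear safety constraints in the LLM latent space without modifying model parameters.
Develop an inference-time steering mechanism based on Control Barrier Functions (CBFs), including efficient constraint merging.
Formalize safety as a constrained Markov decision process (CMDP) so that safety holds even under adversarial inputs.
Demonstrate scalability and efficiency across models and datasets through theoretical and experimental results.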

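Proposed Method

Learns multiple non-linear barrier functions b_k(h) from safe and unsafe demonstrations by minimizing a loss that enforces the barrier condition on safe samples and penalizes unsafe samples.
Interprets the latent-state dynamics approximately as dh/dt ≈ (h_t - h_{t-1})/Δt and formulates steering as a quadratic program (QP) that applies linearized barrier constraints while minimizing deviation from the original trajectory.
Composes the multiple barrier functions into a single differentiable barrier function B(h) using Log-Sum-Exp, which enables closed-form guarantees on the barrier condition.
Provides three BarrierSteer variants: BarrierSteer(QP) solves the QP directly, BarrierSteer(Top-2) gives a fast closed-form solution using the two most-violated constraints, and BarrierSteer(LSE) gives a closed-form solution using the composed barrier function.
Shows the trade-off between safety and utility as a function of the steering strength alpha, preserving model utility while maintaining safety guarantees.
Composes 14 safety barriers modularly, one per risk category, and reports unsafe generation rates for three aggregation methods: Top-2, QP, and LSE.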
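
The barrier-learning item at the top of this list can be illustrated with a minimal sketch. The review does not give the paper's exact loss or network, so everything here is an assumption made for illustration: the hinge form, the margin, the tiny tanh network, and the names barrier_mlp and barrier_hinge_loss are all hypothetical. The sketch only shows the general idea of pushing b(h) positive on safe latent states and negative on unsafe ones.

```python
import numpy as np

def barrier_mlp(h, W1, b1, w2, b2):
    """Toy non-linear barrier b(h): one tanh hidden layer, one scalar per row of h.
    (Hypothetical architecture, not the paper's.)"""
    return np.tanh(h @ W1 + b1) @ w2 + b2

def barrier_hinge_loss(b_safe, b_unsafe, margin=0.1):
    """Assumed hinge loss: want b >= margin on safe samples, b <= -margin on unsafe ones."""
    safe_term = np.maximum(0.0, margin - b_safe).mean()      # penalize safe points with low b
    unsafe_term = np.maximum(0.0, margin + b_unsafe).mean()  # penalize unsafe points with high b
    return safe_term + unsafe_term

# Synthetic stand-ins for latent states extracted from an LLM.
rng = np.random.default_rng(0)
d, hidden = 8, 16
W1 = rng.normal(size=(d, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) * 0.1
b2 = 0.0
h_safe = rng.normal(size=(32, d))
h_unsafe = rng.normal(size=(32, d)) + 2.0

loss = barrier_hinge_loss(barrier_mlp(h_safe, W1, b1, w2, b2),
                          barrier_mlp(h_unsafe, W1, b1, w2, b2))
print(f"initial loss on random weights: {loss:.4f}")
```

In practice the weights would be trained by gradient descent on this loss with an autodiff library; the sketch stops at evaluating the loss so that it stays dependency-light.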

Figure 1: BarrierSteer for Safe LLMs. This method efficiently steers the hidden states of LLMs within nonlinear safe sets learned from demonstrations, thereby ensuring the generation of safe language responses at inference time.
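
The steering step pictured in Figure 1 can be sketched in a few lines for the single-constraint case. This is a sketch under stated assumptions, not the paper's implementation: it assumes the safe set is {h : b_k(h) >= 0 for all k}, composes the barriers with a Log-Sum-Exp soft minimum B(h) = -(1/kappa) log sum_k exp(-kappa b_k(h)), and solves min ||u - u_nom||^2 subject to grad B(h) . u + gamma B(h) >= 0 in closed form. The sign convention, kappa, gamma, the toy ball-shaped barriers, and all function names are hypothetical.

```python
import numpy as np

def ball_barriers(h, centers, radii):
    """Toy barriers b_k(h) = r_k^2 - ||h - c_k||^2 (positive inside ball k) and their gradients."""
    diff = h - centers                        # shape (K, d)
    vals = radii ** 2 - (diff ** 2).sum(axis=1)
    grads = -2.0 * diff
    return vals, grads

def lse_barrier(b_vals, b_grads, kappa=10.0):
    """Soft minimum of the b_k via Log-Sum-Exp; satisfies B <= min_k b_k."""
    z = -kappa * b_vals
    z_max = z.max()                           # shift for numerical stability
    w = np.exp(z - z_max)
    s = w.sum()
    B = -(z_max + np.log(s)) / kappa
    grad = (w / s) @ b_grads                  # softmax-weighted sum of gradients
    return B, grad

def steer(u_nom, B, grad, gamma=1.0):
    """Closed-form minimizer of ||u - u_nom||^2 s.t. grad . u + gamma * B >= 0."""
    slack = grad @ u_nom + gamma * B
    if slack >= 0.0:
        return u_nom                          # nominal update already satisfies the constraint
    return u_nom - slack * grad / (grad @ grad + 1e-12)

# One steering step on a 2-D toy latent state, treating u as h_t - h_{t-1}.
centers = np.array([[0.0, 0.0], [0.5, 0.0]])
radii = np.array([1.0, 1.0])
h = np.array([0.9, 0.0])
u_nom = np.array([1.0, 0.0])                  # nominal update that would leave the first ball

vals, grads = ball_barriers(h, centers, radii)
B, g = lse_barrier(vals, grads)
u = steer(u_nom, B, g)
print("nominal update:", u_nom, "steered update:", u)
```

Because B(h) never exceeds the smallest b_k(h), keeping B non-negative keeps every individual barrier non-negative, which is why one composed constraint can stand in for all of them. The trade-off is some conservatism near the boundary, controlled here by kappa.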

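Experimental Results

Research Questions

RQ1: Can learned non-linear safety constraints embedded in an LLM's latent space provide guaranteed safety during inference?
RQ2: Compared with existing representation-steering methods, how much does barrier-based steering reduce unsafe generations while preserving utility?
RQ3: How does steering strength affect the trade-off between safety and task performance across model sizes?
RQ4: How effective is modular multi-barrier composition when combining multiple risk categories?
RQ5: Does closed-form barrier composition (LSE) match the performance of iterative QP at lower latency?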

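Key Results

BarrierSteer substantially reduces the adversarial attack success rate (ASR) across model families, often driving ASR close to zero (e.g., 0.00% on Gemma-2-9b).
BarrierSteer preserves model utility, with only slight drops on MMLU and GSM8K relative to the original model.
BarrierSteer(LSE) achieves roughly a 31x speedup over SaP, with a latency of about 6.08 ms/token versus 190.67 ms/token.
Composing 14 independently trained barriers with LSE or QP yields the lowest unsafe generation rate, 1.82%, which is lower than with Top-2.
Increasing the steering strength alpha consistently lowers ASR; at alpha = 1.0 the method reaches absolute safety while MMLU performance stays within about 1.5% of the baseline.
BarrierSteer outperforms baselines such as Activation Addition and Directional Ablation in safety and robustness across datasets.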
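
As a quick consistency check on the latency result in this list, and assuming the 190.67 ms/token figure is SaP's latency (as the sentence implies), the ratio of the two reported numbers reproduces the stated speedup of roughly 31x:

```latex
\frac{190.67\ \text{ms/token}}{6.08\ \text{ms/token}} \approx 31.4
```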

Figure 2: Overview of BarrierSteer for safe LLM generation. BarrierSteer is a three-stage pipeline: (i) extracting intermediate latent representations from a pre-trained LLM and constructing an LLM-specific safety dataset with binary safety labels; (ii) learning expressive, non-linear safe

This review was generated by AI and reviewed by a human editor.