QUICK REVIEW

[Paper Review] BarrierSteer: LLM Safety via Learning Barrier Steering

Thanh Q. Tran, Arun Verma|arXiv (Cornell University)|Feb 23, 2026

Adversarial Robustness in Machine Learning | Citations: 0

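One-line summary

BarrierSteer embeds learned non-linear safety constraints in the LLM's latent space and uses Control Barrier Functions to steer generation in real time, suppressing unsafe outputs without retraining. The paper provides theoretical safety guarantees and empirical results that surpass the baselines.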

ABSTRACT

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.

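Motivation and objectives

Motivate the need for principled safety guarantees when deploying LLMs in high-stakes settings.
Propose a framework that embeds non-linear safety constraints in the LLM's latent space without changing model parameters.
Develop an inference-time steering mechanism based on Control Barrier Functions (CBFs) with efficient constraint merging.
Formulate safety as a constrained Markov decision process (CMDP) and guarantee safety under adversarial inputs.
Demonstrate scalability and effectiveness across models and datasets through both theory and experiments.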

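Proposed method

Learn multiple non-linear barrier functions b_k(h) from safe and unsafe demonstrations by minimizing a loss that keeps safe samples inside the safe set and penalizes unsafe ones.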
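Approximate the latent-state dynamics with a finite difference, dh/dt ≈ (h_t - h_{t-1}) / Δt, and formulate steering as a quadratic program (QP) that minimizes deviation from the original trajectory subject to linearized barrier constraints.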
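Merge the multiple barriers into a single differentiable barrier B(h) using Log-Sum-Exp (LSE), which enables a closed-form safety guarantee on the state.
Provide three variants: BarrierSteer (QP) solves the QP directly; BarrierSteer (Top-2) is a fast closed-form solution using the two most-violated constraints; BarrierSteer (LSE) is a closed-form solution using the merged barrier.
Characterize the safety-utility trade-off controlled by the steering strength alpha, showing that model utility is preserved while the safety guarantee is maintained.
Combine 14 safety barriers across risk categories in a modular way and compare the three aggregation methods (Top-2, QP, LSE) by unsafe-generation rate.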
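The review does not state the exact training loss for the barriers b_k(h). As a minimal sketch of the idea in the first item, the Python snippet below uses a hinge-style margin loss, which is an assumption for illustration and not necessarily the paper's objective. It pushes barrier values on safe latents above `+margin` and values on unsafe latents below `-margin`. The function name `barrier_hinge_loss`, the `margin` parameter, and the sign convention (b >= 0 means safe) are all hypothetical.

```python
import numpy as np

def barrier_hinge_loss(b_safe, b_unsafe, margin=1.0):
    """Hinge-style loss for one barrier (illustrative assumption).

    b_safe:   barrier values b_k(h) on latents labeled safe
    b_unsafe: barrier values b_k(h) on latents labeled unsafe
    Convention assumed here: b_k(h) >= 0 means "safe".
    """
    # Penalize safe samples whose barrier value falls below +margin.
    safe_term = np.maximum(0.0, margin - b_safe).mean()
    # Penalize unsafe samples whose barrier value rises above -margin.
    unsafe_term = np.maximum(0.0, margin + b_unsafe).mean()
    return safe_term + unsafe_term

# Toy barrier values: one safe sample is too close to the boundary,
# and one unsafe sample is on the wrong side of it.
example_loss = barrier_hinge_loss(np.array([2.0, 0.5]), np.array([-3.0, 0.2]))
print(example_loss)
```

The loss is zero only when every safe sample clears the margin and every unsafe sample sits below its negative, so minimizing it shapes the zero level set of each b_k into a boundary between the two classes.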

Figure 1: BarrierSteer for Safe LLMs. This method efficiently steers the hidden states of LLMs within nonlinear safe sets learned from demonstrations, thereby ensuring the generation of safe language responses during the inference-time.
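The steering step described in the method list and in Figure 1 can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration: the toy ball-shaped barriers b_k(h) = r_k^2 - ||h - c_k||^2, the soft-min form B(h) = -(1/beta) log sum_k exp(-beta b_k(h)), the sharpness `beta`, and the use of `alpha` as a simple scale on the correction. The sketch shows why a single merged barrier admits a closed-form update: with one linearized constraint, the minimum-deviation QP reduces to a projection onto a half-space.

```python
import numpy as np

def barriers(h, centers, radii):
    """Toy non-linear barriers b_k(h) = r_k^2 - ||h - c_k||^2 (>= 0 means safe)."""
    diff = h - centers                           # shape (K, d)
    values = radii**2 - np.sum(diff**2, axis=1)  # shape (K,)
    grads = -2.0 * diff                          # gradient of each b_k w.r.t. h
    return values, grads

def lse_barrier(values, grads, beta=10.0):
    """Soft-min merge: B = -(1/beta) * log(sum_k exp(-beta * b_k)) <= min_k b_k."""
    z = -beta * values
    z_max = np.max(z)                            # shift for numerical stability
    weights = np.exp(z - z_max)
    total = np.sum(weights)
    B = -(z_max + np.log(total)) / beta
    grad_B = (weights / total) @ grads           # softmax-weighted barrier gradients
    return B, grad_B

def steer(h_nom, centers, radii, alpha=1.0, beta=10.0):
    """Closed-form solution of: min ||h - h_nom||^2  s.t.  B + g.(h - h_nom) >= 0."""
    values, grads = barriers(h_nom, centers, radii)
    B, g = lse_barrier(values, grads, beta)
    denom = g @ g
    if B >= 0.0 or denom < 1e-12:
        return h_nom                             # already safe, or no usable direction
    correction = (-B / denom) * g                # projection onto the half-space
    return h_nom + alpha * correction            # alpha = 0 disables steering

centers = np.array([[0.0, 0.0], [1.0, 0.0]])
radii = np.array([2.0, 2.0])
h_nom = np.array([3.0, 0.0])                     # violates the first barrier
h_new = steer(h_nom, centers, radii)
```

Because B(h) is a lower bound on every b_k(h), B(h) >= 0 implies that all individual constraints hold. Note that one step satisfies only the linearized constraint; for strongly curved barriers the true B(h) can remain negative after a single update, which is one reason the paper's own guarantees should be read from the paper rather than from this sketch.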

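Experimental results

Research questions

RQ1: Can learned non-linear safety constraints embedded in the LLM's latent space provide provable safety at inference time?
RQ2: How does barrier-based steering compare with existing representation-steering methods at reducing unsafe generations while preserving utility?
RQ3: How does steering strength affect safety and task performance across model sizes?
RQ4: How effective is a modular multi-barrier configuration that combines multiple risk categories?
RQ5: Can the closed-form barrier combination (LSE) match the iterative QP at lower latency?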

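Key findings

BarrierSteer substantially reduces the adversarial attack success rate (ASR) across model families, in some cases driving ASR on Gemma-2-9b to near zero (e.g., 0.00%).
BarrierSteer preserves model utility, with only a small drop on MMLU and GSM8K relative to the original model.
BarrierSteer (LSE) is about 31x faster than SaP (latency ~6.08 ms/token versus ~190.67 ms/token).
Combining 14 independently trained barriers with LSE or QP yields a lower unsafe-generation rate (1.82%) than Top-2.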
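Increasing the steering strength alpha consistently lowers ASR; at alpha = 1.0 the method reaches absolute safety in the reported evaluation while keeping MMLU performance within about 1.5% of the base model.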
BarrierSteer outperforms baselines such as Activation Addition and Directional Ablation in both safety and robustness.
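As a quick check of the reported speed-up, the snippet below recomputes the ratio from the two per-token latencies quoted in the findings. The 256-token response length is a hypothetical value chosen only to show what the gap means for one generation.

```python
# Per-token latencies quoted in the review (milliseconds per token).
sap_ms_per_token = 190.67
lse_ms_per_token = 6.08

# Ratio of the two latencies: roughly 31x, matching the reported speed-up.
speedup = sap_ms_per_token / lse_ms_per_token

# Hypothetical response length, used only for illustration.
num_tokens = 256
sap_total_s = sap_ms_per_token * num_tokens / 1000.0
lse_total_s = lse_ms_per_token * num_tokens / 1000.0

print(f"speed-up: {speedup:.1f}x")
print(f"{num_tokens} tokens: SaP {sap_total_s:.1f} s vs LSE {lse_total_s:.1f} s")
```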

Figure 2: Overview of BarrierSteer for safe LLM generation. There is a three-stage pipeline of BarrierSteer : (i) extracting intermediate latent representations from a pre-trained LLM and constructing an LLM-specific safety dataset with binary safety labels; (ii) learning expressive, non-linear safe

This review was generated by AI and checked by a human editor.