QUICK REVIEW

[Paper Review] GAVEL: Towards rule-based safety through activation monitoring

Shir Rozenfeld, Rahul Pankajakshan|arXiv (Cornell University)|Jan 27, 2026

Adversarial Robustness in Machine Learning | Citations: 0

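One-line summary

GAVEL introduces a rule-based safety framework over model activations, built on Cognitive Elements (CEs), that enables configurable, interpretable, and auditable AI safety without retraining.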

ABSTRACT

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

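Motivation and objectives

Introduce Cognitive Elements (CEs) as interpretable activation primitives for describing model behavior.
Propose GAVEL, a rule-based framework that enforces safety through predicates over CE activations.
Decouple activation data collection from safety policy design to improve precision and flexibility.
Enable CE vocabularies and rules to be shared and reused across organizations, supporting scalable governance.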

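Proposed method

Define CEs as token-level, interpretable activation primitives (e.g., indicating that a threat is being made or that a payment tool is being used).
Build an excitation dataset for each CE by wrapping representative examples in explicit CE instructions to elicit the corresponding activations (the ERI method).
Train a multi-label CE detector g on token-level CE activations to identify the active CEs in real time.
Express safety constraints as Boolean predicates over a temporal window of CE presence vectors, and enforce an action when a predicate fires.
Provide an open, model-agnostic workflow that supports community-contributed CE vocabularies and rules, as well as automated CE/rule generation tools.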
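The rule-evaluation step can be illustrated with a minimal Python sketch. It is not GAVEL's actual API: the `Rule` class, the `active_ces` and `monitor` helpers, the 0.5 threshold, the window length, and the "block" action are all hypothetical, and the per-token scores are hand-written stand-ins for the output of a trained detector g. The two CE names are borrowed from the examples in the abstract. The sketch assumes a rule fires when its predicate holds over the union of CEs seen in the most recent tokens, which is one plausible reading of "a temporal window of CE presence vectors".

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, FrozenSet, List, Tuple


def active_ces(scores: Dict[str, float], threshold: float = 0.5) -> FrozenSet[str]:
    """Convert per-token multi-label detector scores into the set of active CEs."""
    return frozenset(name for name, score in scores.items() if score >= threshold)


@dataclass(frozen=True)
class Rule:
    """A named Boolean predicate over the CEs present in the window (hypothetical)."""
    name: str
    predicate: Callable[[FrozenSet[str]], bool]
    action: str


def monitor(
    token_scores: List[Dict[str, float]],
    rules: List[Rule],
    window: int = 3,
    threshold: float = 0.5,
) -> List[Tuple[int, str, str]]:
    """Return (token_index, rule_name, action) for every rule firing."""
    recent: Deque[FrozenSet[str]] = deque(maxlen=window)  # sliding window of CE sets
    fired: List[Tuple[int, str, str]] = []
    for t, scores in enumerate(token_scores):
        recent.append(active_ces(scores, threshold))
        seen = frozenset().union(*recent)  # CEs active anywhere in the window
        for rule in rules:
            if rule.predicate(seen):
                fired.append((t, rule.name, rule.action))
    return fired


# Hypothetical rule: a threat and payment handling co-occur within the window.
rules = [
    Rule(
        name="threat_with_payment",
        predicate=lambda ces: "making_a_threat" in ces and "payment_processing" in ces,
        action="block",
    )
]

# Hand-written stand-in for detector output, one score dict per generated token.
token_scores = [
    {"making_a_threat": 0.9, "payment_processing": 0.1},
    {"making_a_threat": 0.2, "payment_processing": 0.1},
    {"making_a_threat": 0.1, "payment_processing": 0.8},
    {"making_a_threat": 0.1, "payment_processing": 0.1},
    {"making_a_threat": 0.1, "payment_processing": 0.1},
    {"making_a_threat": 0.1, "payment_processing": 0.7},
]

fired = monitor(token_scores, rules, window=3)
print(fired)
```

In this toy trace the rule fires only at token index 2, where both CEs fall inside the same three-token window; the later payment signal at index 5 does not fire because the threat has already left the window. Changing the policy means editing `rules`, not retraining the detector, which mirrors the decoupling the review describes.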

Figure 1: Workflow of GAVEL. (1) Setup rules defined over Cognitive Elements (CEs) and specify actions, optionally reusing public rule sets. (2) Collect CE activations $H_{c}$ from both private and public CE datasets $\mathcal{D}_{c}$ by running the target LLM and capturing activations. (3) Train a

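Experimental results

Research questions

RQ1: Can activations be decomposed into cognitive elements that enable precise, interpretable safety monitoring?
RQ2: Does a rule-based activation safety framework improve precision and flexibility compared with conventional approaches trained on misuse datasets?
RQ3: Can CE-based rules be shared and composed across models to support scalable AI governance?
RQ4: How well does GAVEL achieve real-time detection across diverse misuse domains and thresholds?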

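Key findings

CEs provide a modular, composable basis for describing model behavior at the activation level.
The ERI excitation method improves CE detection accuracy over simple pre-embedding or revision-only excitation.
A multi-label CE detector over token-level activations enables real-time predicate evaluation over temporal windows.
Applying rules over CE activations achieves high precision and interpretability, and shared vocabularies facilitate community collaboration.
In the evaluated settings, GAVEL shows strong ROC-AUC performance and low false-positive rates across multiple misuse categories.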

Figure 2: Classification performance of different CEs using different excitation methods, including ours (ERI).


This review was generated by AI and checked by a human editor.