QUICK REVIEW

[Paper Review] GAVEL: Towards rule-based safety through activation monitoring

Shir Rozenfeld, Rahul Pankajakshan|arXiv (Cornell University)|Jan 27, 2026

Adversarial Robustness in Machine Learning|Cited by 0

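ONE-SENTENCE SUMMARY

GAVEL introduces a rule-based safety framework over model activations, using Cognitive Elements (CEs) to make AI safety configurable, interpretable, and auditable without retraining.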

ABSTRACT

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

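MOTIVATION AND GOALS

Use cognitive elements (CEs) as interpretable activation primitives for describing model behavior.
Propose a rule-based framework (GAVEL) that enforces safety through predicates over CE activations.
Decouple activation data collection from safety policy design to improve precision and flexibility.
Enable CE vocabularies and rules to be shared and reused across organizations for scalable governance.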

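PROPOSED METHOD

Define cognitive elements (CEs) as token-level, interpretable activation primitives (for example, making a threat or payment instruments).
Build an excitation dataset for each CE, wrapping examples in explicit CE instructions to elicit the activation (the ERI method).
Train a detector g that performs multi-label detection of per-token CE activations, identifying active CEs in real time.
Express safety constraints as Boolean predicates over CE presence vectors within a time window, and take an action when a predicate fires.
Provide an open, model-agnostic workflow that supports community-contributed CE vocabularies and rules, along with an automated CE/rule generation tool.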
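
The rule-evaluation step described above can be illustrated with a minimal sketch. It is not the GAVEL implementation: the CE names, the 0.5 threshold, the window length, and the choice to treat a CE as present if it was active on any token in the window are all assumptions made for illustration. The detector scores are hard-coded stand-ins for the per-token output of the detector g.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Set

# A rule is a Boolean predicate over the set of CEs present in the window.
Rule = Callable[[Set[str]], bool]


def active_ces(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Threshold one token's multi-label scores into a set of active CEs."""
    return {name for name, score in scores.items() if score >= threshold}


def threat_with_payment(present: Set[str]) -> bool:
    """Hypothetical rule: a threat and payment handling co-occur in the window."""
    return "making_a_threat" in present and "payment_processing" in present


def monitor(
    token_scores: Iterable[Dict[str, float]],
    rule: Rule,
    window: int = 3,
    threshold: float = 0.5,
) -> List[int]:
    """Return the token indices at which the rule fires.

    Assumption: a CE counts as present if it was active on any of the
    last `window` tokens.
    """
    recent: Deque[Set[str]] = deque(maxlen=window)
    fired: List[int] = []
    for index, scores in enumerate(token_scores):
        recent.append(active_ces(scores, threshold))
        present = set().union(*recent)
        if rule(present):
            fired.append(index)
    return fired


# Stand-in detector output for six tokens (made-up numbers).
stream = [
    {"making_a_threat": 0.9, "payment_processing": 0.1},
    {"making_a_threat": 0.2, "payment_processing": 0.1},
    {"making_a_threat": 0.1, "payment_processing": 0.8},
    {"making_a_threat": 0.1, "payment_processing": 0.2},
    {"making_a_threat": 0.0, "payment_processing": 0.1},
    {"making_a_threat": 0.0, "payment_processing": 0.0},
]

hits_wide = monitor(stream, threat_with_payment, window=3)
hits_narrow = monitor(stream, threat_with_payment, window=2)
```

With a three-token window the two CEs overlap at token index 2 and the rule fires there; with a two-token window they never overlap, so nothing fires. In a real deployment, the action attached to the rule (for example, blocking or logging) would run at the point where an index is appended to `fired`.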

Figure 1: Workflow of GAVEL. (1) Setup rules defined over Cognitive Elements (CEs) and specify actions, optionally reusing public rule sets. (2) Collect CE activations $H_{c}$ from both private and public CE datasets $\mathcal{D}_{c}$ by running the target LLM and capturing activations. (3) Train a

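EXPERIMENTAL RESULTS

Research questions

RQ1: Can activations be decomposed into cognitive elements to enable precise, interpretable safety monitoring?
RQ2: Does a rule-based activation safety framework improve precision and flexibility over conventional misuse-dataset approaches?
RQ3: Can CE-based rules be shared and composed across models to support scalable AI governance?
RQ4: How does GAVEL perform at real-time detection across different misuse domains and thresholds?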

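Key findings

CEs provide a modular, composable, activation-level basis for describing model behavior.
The ERI excitation method outperforms simple prefilling and revision-only approaches in CE detection accuracy.
A multi-label CE detector running on per-token activations enables real-time evaluation of predicates over time windows.
Rule-based enforcement over CE activations achieves high precision and interpretability, and shared vocabularies facilitate community collaboration.
In its evaluation setting, GAVEL shows strong ROC-AUC and low false positive rates across multiple misuse categories.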

Figure 2: Classification performance of different CEs using different excitation methods, including ours (ERI).

This review was generated by AI and checked by a human editor.