QUICK REVIEW

[Paper Review] GAVEL: Towards rule-based safety through activation monitoring

Shir Rozenfeld, Rahul Pankajakshan|arXiv (Cornell University)|2026. 01. 27.
Adversarial Robustness in Machine Learning|Citations: 0
One-line summary

GAVEL introduces a rule-based safety framework over model activations, built on Cognitive Elements, that enables composable, interpretable, and auditable AI safety without retraining.

ABSTRACT

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

Research motivation and goals

  • Introduce Cognitive Elements (CEs) as interpretable activation primitives for describing model behavior.
  • Propose a rule-based framework (GAVEL) that enforces safety via predicates over CE activations.
  • Decouple activation data collection from safety policy design to improve precision and flexibility.
  • Enable sharing and reuse of CE vocabularies and rules across organizations for scalable governance.

Proposed method

  • Define Cognitive Elements (CEs) as token-level, interpretable activation primitives (e.g., making a threat, payment tools).
  • Create excitation datasets for each CE and wrap exemplars with explicit CE directives to elicit activations (ERI method).
  • Train a multi-label CE detector g on per-token CE activations to identify active CEs in real time.
  • Represent safety constraints as Boolean predicates over CE presence vectors within a temporal window, and enforce actions when predicates fire.
  • Provide an open, model-agnostic workflow that supports community-contributed CE vocabularies and rules, plus an automated CE/rule generation tool.
Figure 1: Workflow of GAVEL. (1) Setup rules defined over Cognitive Elements (CEs) and specify actions, optionally reusing public rule sets. (2) Collect CE activations $H_{c}$ from both private and public CE datasets $\mathcal{D}_{c}$ by running the target LLM and capturing activations. (3) Train a
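
The rule step in the list above (Boolean predicates over CE presence within a temporal window) can be sketched as follows. This is a minimal illustration, not GAVEL's actual API: the names `has`, `all_of`, `any_of`, and `RuleMonitor`, the rule name, the `"block"` action, and the window size are all hypothetical. The sketch also assumes the CE detector has already turned each token's activations into a set of active CE names.

```python
from collections import deque
from typing import Callable, Deque, Dict, FrozenSet, Iterable, List, Tuple

# A predicate maps the set of CEs active in the current window to True (violation) or False.
Predicate = Callable[[FrozenSet[str]], bool]


def has(ce: str) -> Predicate:
    """True when the named CE appears anywhere in the window."""
    return lambda active: ce in active


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND over predicates."""
    return lambda active: all(p(active) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR over predicates."""
    return lambda active: any(p(active) for p in preds)


class RuleMonitor:
    """Evaluates rules over the CEs seen in the last `window` tokens (hypothetical design)."""

    def __init__(self, rules: Dict[str, Tuple[Predicate, str]], window: int) -> None:
        self.rules = rules
        self.recent: Deque[FrozenSet[str]] = deque(maxlen=window)

    def step(self, token_ces: Iterable[str]) -> List[Tuple[str, str]]:
        """Record one token's active CEs and return (rule name, action) for each rule that fires."""
        self.recent.append(frozenset(token_ces))
        active = frozenset().union(*self.recent)
        return [(name, action) for name, (pred, action) in self.rules.items() if pred(active)]


# Illustrative rule combining the two example CEs named in the abstract.
rules = {
    "threat_with_payment": (
        all_of(has("making a threat"), has("payment processing")),
        "block",  # hypothetical action label
    )
}
monitor = RuleMonitor(rules, window=3)

# Toy per-token detector output: the set of CEs judged active at each token.
stream = [{"making a threat"}, set(), {"payment processing"}, set(), set(), set()]
fired = [monitor.step(token_ces) for token_ces in stream]
print(fired)
```

In this toy stream the rule fires only at the third token, when both CEs fall inside the same three-token window, and stops firing once the threat CE slides out of it. Because rules are plain data over CE names, a rule set can be changed without touching the detector, which is the decoupling the review describes.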

Experimental results

Research questions

  • RQ1: Can activations be decomposed into cognitive elements that enable precise, interpretable safety monitoring?
  • RQ2: Does a rule-based activation safety framework improve precision and flexibility compared to traditional misuse-dataset approaches?
  • RQ3: Can CE-based rules be shared and composed across models to support scalable AI governance?
  • RQ4: How does GAVEL perform in real-time detection across diverse misuse domains and thresholds?

Key results

  • CEs provide a modular, composable basis for describing model behavior at the activation level.
  • The ERI excitation method improves CE detection accuracy compared to naive prefilling or revision alone.
  • A multi-label CE detector operating on token-level activations enables real-time predicate evaluation over a temporal window.
  • Rule-based enforcement over CE activations achieves high precision and interpretability, with shared vocabularies enabling community collaboration.
  • GAVEL achieves strong ROC-AUC and low false-positive rates across multiple misuse categories in the authors' evaluation setup.
Figure 2: Classification performance of different CEs using different excitation methods, including ours (ERI).
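
For readers less familiar with the metrics named in the results above, the sketch below shows how ROC-AUC and a false-positive rate are computed from detector scores. The labels, scores, and threshold are toy values invented for illustration and are not results from the paper. The function names are also hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def false_positive_rate(labels: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Fraction of negative examples whose score is at or above the threshold."""
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not neg:
        raise ValueError("need at least one negative example")
    return sum(s >= threshold for s in neg) / len(neg)


# Toy data: 1 = a violation is truly present, 0 = benign. Not taken from the paper.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
fpr = false_positive_rate(labels, scores, threshold=0.45)
print(f"ROC-AUC = {auc:.3f}, FPR at 0.45 = {fpr:.2f}")
```

ROC-AUC summarizes ranking quality across all thresholds, while the false-positive rate depends on the specific threshold chosen at deployment. That is why the review reports both.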


This review was generated by AI and reviewed by a human editor.