QUICK REVIEW

[Paper Review] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Mrinank Sharma, Meg Tong|ArXiv.org|Jan 31, 2025

Criminal Law and Evidence | Citations: 5

One-Sentence Summary

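This paper introduces Constitutional Classifiers, safeguards trained on constitution-guided synthetic data, to defend LLMs against universal jailbreaks. After extensive red teaming, the classifiers achieve strong robustness with only a modest impact on deployment, and the paper demonstrates their practicality and flexibility across threat models.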

ABSTRACT

Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.

Research Motivation and Objectives

Motivate and define the problem of universal jailbreaks in LLMs and the need for practically deployable safeguards in high-risk domains like CBRN.
Propose Constitutional Classifiers that leverage natural-language constitutions to generate synthetic training data for input and output guards.
Demonstrate robustness of the classifiers through thousands of hours of red teaming and automated evaluations.
Show deployment viability and flexibility by applying enhanced constitutions to different threat models and measuring latency and refusals.

Proposed Method

Define a constitution as natural-language rules that categorize content into harmless and harmful classes to guide data generation.
Generate synthetic training data by composing constitution-guided inputs and outputs using a helpful-only model, followed by filtering with a separate filter model.
Augment data extensively with translations, paraphrasing, and jailbreaking technique variations to broaden coverage.
Train a dual-classifier system with an input classifier (next-token prediction framing) and an output classifier (token-level harmfulness prediction for streaming) with a combined loss that includes next-token-prediction and binary-cross-entropy terms.
Use human red teaming, incentivized through a bug-bounty program with monetary rewards, to identify jailbreak attempts, and develop rubric-based evaluation pipelines to grade them.
Evaluate robustness via human red teaming and automated held-out jailbreak tests, reporting reductions in attack success rates alongside the associated deployment costs.
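
The combined loss mentioned in the method list can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `ntp_weight` parameter, and the choice to weight only the next-token term are assumptions made for illustration. The one detail taken from the review is that the output classifier predicts, at each token, the harmfulness of the full output sequence, so every position is supervised with the sequence-level label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_pos, target_ids):
    """Mean cross-entropy of the true next token at each position."""
    losses = []
    for logits, target in zip(logits_per_pos, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


def token_bce_loss(harm_probs, sequence_label):
    """Mean binary cross-entropy over token positions.

    Every position is trained against the label of the whole sequence
    (1 = harmful, 0 = harmless).
    """
    eps = 1e-12  # clamp to avoid log(0)
    losses = []
    for p in harm_probs:
        p = min(max(p, eps), 1.0 - eps)
        losses.append(
            -(sequence_label * math.log(p) + (1 - sequence_label) * math.log(1.0 - p))
        )
    return sum(losses) / len(losses)


def combined_loss(logits_per_pos, target_ids, harm_probs, sequence_label,
                  ntp_weight=0.5):
    """Hypothetical linear combination of the two terms.

    ntp_weight is an illustrative value, not one reported in the paper.
    """
    ntp = next_token_loss(logits_per_pos, target_ids)
    bce = token_bce_loss(harm_probs, sequence_label)
    return ntp_weight * ntp + bce


# Toy usage: 2 positions, vocabulary of 4 tokens, harmful sequence.
loss = combined_loss(
    logits_per_pos=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    target_ids=[1, 3],
    harm_probs=[0.5, 0.5],
    sequence_label=1,
)
```

In a real training setup both terms would come from a fine-tuned language model with a classification head; the pure-Python version above only shows how the two terms fit together.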

Figure 1: Constitutional Classifiers. (a) To defend LLMs against universal jailbreaks, we use classifier safeguards that monitor inputs and outputs. (b) To train these safeguards, we use a constitution defining categories of harmful and harmless content, enabling rapid adaptation to new threat mode

Experimental Results

Research Questions

RQ1: Can constitutional guidelines enable rapid adaptation to evolving threat models while maintaining practical deployment viability?
RQ2: How effective are input and output classifiers, individually and combined, at blocking universal jailbreaks across CBRN-related queries?
RQ3: What are the trade-offs in false positives and latency when deploying constitutional classifiers in production?
RQ4: Do constitutional classifiers generalize to novel, held-out jailbreaks beyond the training constitution?
RQ5: How does the robustness of constitutional classifiers compare to harmlessness-only or helpful-only baselines under red-teaming?

Key Findings

In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that matched the level of detail of an unguarded model across most target queries.
Automated evaluations show substantial improvement in jailbreak robustness when using constitutional classifiers compared to baselines.
Input-only and output-only classifiers reduce jailbreak attack success rates to 2% and 0.5%, respectively, relative to a helpful-only baseline.
Deployment viability is demonstrated with production traffic refusals increasing by 0.38 percentage points, and an inference overhead of 23.7%.
Expanded constitutions and data augmentation reduce false positives and adapt to new threat models while preserving streaming capabilities.
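
The metrics behind these findings are simple ratios, and a short sketch may help separate an absolute increase in percentage points from a relative one. The helper names and all counts below are hypothetical: the counts were picked so that the arithmetic reproduces the reported 0.38 percentage-point figure, and they are not data from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of jailbreak attempts graded as successful (True)."""
    return sum(outcomes) / len(outcomes)


def refusal_increase_pp(baseline_refusals, guarded_refusals, n_requests):
    """Absolute increase in refusal rate, in percentage points."""
    return 100.0 * (guarded_refusals - baseline_refusals) / n_requests


def refusal_increase_relative(baseline_refusals, guarded_refusals):
    """Relative increase in refusals, as a fraction of the baseline count."""
    return (guarded_refusals - baseline_refusals) / baseline_refusals


def cost_with_overhead(base_cost, overhead_fraction):
    """Inference cost after adding a fractional overhead (0.237 = 23.7%)."""
    return base_cost * (1.0 + overhead_fraction)


# Hypothetical counts, for illustration only (not from the paper).
n_requests = 10_000
baseline_refusals = 50   # made-up: 0.50% of traffic refused without classifiers
guarded_refusals = 88    # made-up: 0.88% of traffic refused with classifiers

absolute_pp = refusal_increase_pp(baseline_refusals, guarded_refusals, n_requests)
relative = refusal_increase_relative(baseline_refusals, guarded_refusals)
guarded_cost = cost_with_overhead(1.0, 0.237)

# Hypothetical graded outcomes: 1 success out of 8 attempts.
asr = attack_success_rate([False, False, True, False, False, False, False, False])
```

With these made-up counts the absolute increase is small while the relative increase is large, which is why the review's wording in percentage points matters when comparing deployments.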

Figure 2: Example output-classifier predictions. Unlike the input classifier, our output classifier makes a prediction at each token for the harmfulness of a full output sequence. This prediction is used to assess whether the output stream should be stopped at a given token position. In this figure
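
The per-token stopping rule described in the Figure 2 caption can be sketched as a generator that wraps a token stream. This is a minimal sketch under stated assumptions: `guarded_stream`, the `threshold` value of 0.5, and the decision to halt before emitting the flagged token are illustrative choices, not details confirmed by the review.

```python
from typing import Iterable, Iterator, Tuple


def guarded_stream(
    scored_tokens: Iterable[Tuple[str, float]],
    threshold: float = 0.5,  # hypothetical cutoff, not a value from the paper
) -> Iterator[str]:
    """Yield tokens until the per-token harmfulness score crosses the threshold.

    Each item is (token, harm_score), where harm_score stands in for the
    output classifier's prediction, at that position, that the full output
    sequence is harmful.
    """
    for token, harm_score in scored_tokens:
        if harm_score >= threshold:
            # Assumption: stop the stream without emitting the flagged token.
            return
        yield token


# Toy usage with made-up scores.
stream = [("Mix", 0.05), ("the", 0.10), ("reagents", 0.92), ("slowly", 0.95)]
emitted = list(guarded_stream(stream))
```

Because the decision is made token by token, the guard can run alongside generation rather than waiting for the complete response, which is the property that keeps streaming intact.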

This review was generated by AI and checked by a human editor.