QUICK REVIEW

[Paper Review] Specific versus General Principles for Constitutional AI

Sandipan Kundu, Yuntao Bai|arXiv (Cornell University)|Oct 20, 2023

Ethics and Social Impacts of AI7 citations

TL;DR

The paper compares trait-focused and general good-for-humanity constitutions in Constitutional AI, showing general principles can generalize to broad harms, while trait-specific approaches yield stronger trait-targeted control.

ABSTRACT

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

Motivation & Objective

Investigate how AI feedback from constitutions shapes behavior toward problematic traits.
Assess whether a single simple principle can generalize ethical behavior without extensive trait-specific rules.
Compare trait-focused preference models with a good-for-humanity preference model across safety and usefulness.
Explore scaling behaviors and generalization of preference models trained with constitutional AI methods.

Proposed method

Train Trait Preference Models (Trait PMs) using a constitutional process targeting five specific traits.
Train Good-for-Humanity (GfH) Preference Models with only high-level principles about humanity’s best interests.
Evaluate PMs on trait-related datasets and on harmlessness, helpfulness, and honesty tasks.
Use RL with AI feedback (RLAIF) guided by the PMs to produce policy models.
Compare PMs and policy models to standard RLHF-based baselines across multiple metrics.

Experimental results

Research questions

RQ1Can a single simple principle like doing what’s best for humanity train a PM that generalizes to multiple harmful traits?
RQ2How do trait-focused PMs compare to GfH PMs in detecting and discouraging problematic expressions?
RQ3What are the tradeoffs between general good-for-humanity guidance and trait-specific constitutions for safety and usefulness?
RQ4How does model size and response-generator model size affect PM performance and generalization?
RQ5To what extent do GfH-inspired approaches reduce tendencies toward power-seeking or self-preservation?

Key findings

A general good-for-humanity principle can yield harmless assistants and reduce problematic trait expressions without extensive trait-specific data.
Trait PMs outperform baseline PMs on targeted trait datasets, but general-purpose GfH PMs achieve comparable safety without extra supervision.
Larger PMs improve fine-grained trait detection but may not linearly improve safety scores across all tasks, with evidence of scaling transitions.
GfH-trained policies via RL with AI feedback can be nearly as harmless as CAI-constrained policies while reducing trait tendencies.
GfH PMs show improved performance on harmlessness and combined safety datasets compared to some baselines, though HH-RLHF remains strong on some measures.

Better researchstarts right now

From paper design to paper writing, dramatically reduce your research time.

No credit card · Free plan available

This review was created by AI and reviewed by human editors.