QUICK REVIEW

[Paper Review] Position: Capability Control Should be a Separate Goal From Alignment

Shoaib Ahmed Siddiqui, Eleni Triantafillou|arXiv (Cornell University)|Feb 5, 2026

Adversarial Robustness in Machine Learning | Citations: 0

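One-line summary

This position paper argues that capability control should be treated as a goal separate from alignment, and presents a defense-in-depth framework spanning the data, learning, and system layers for constraining model capabilities.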

ABSTRACT

Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control -- imposing restrictions on permissible model behavior -- should be treated as a distinct goal from alignment. While alignment is often context and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.

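Motivation and objectives

Define capability control as hard operational limits on model behavior, distinct from alignment.
Propose a three-layer framework for capability control (data-based, learning-based, and system-based).
Advocate a defense-in-depth approach that combines layers to mitigate the failure modes each layer exhibits when used in isolation.
Discuss practical constraints, dual-use challenges, and open research problems in achieving robust control.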

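Proposed approach

Categorize capability control mechanisms into three lifecycle layers: data-based, learning-based, and system-based control.
Describe concrete interventions at each layer: data filtering, curation, and synthetic data; RLHF, unlearning, and adversarial training; guardrails, prompts, and monitoring.
Distinguish capability suppression from capability removal, and discuss the feasibility and risks of each.
Advocate a defense-in-depth architecture that composes complementary controls across the full stack.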
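The layered idea above can be illustrated with a toy sketch. This is not the paper's implementation: every name here (BLOCKED_TERMS, filter_corpus, toy_model, guarded_generate) is hypothetical, and simple keyword matching stands in for real data filters, trained refusals, and guardrail classifiers. The only point is that a later layer can catch what an earlier layer misses.

```python
from typing import Callable, List

# Hypothetical placeholder for a disallowed capability; not from the paper.
BLOCKED_TERMS = ("restricted-topic",)
REFUSAL = "[refused]"


def mentions_blocked(text: str) -> bool:
    """Crude keyword match standing in for a real detector."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def filter_corpus(docs: List[str]) -> List[str]:
    """Layer (i), data-based: drop training documents that mention blocked terms."""
    return [d for d in docs if not mentions_blocked(d)]


def toy_model(prompt: str) -> str:
    """Layer (ii), learning-based: stand-in for a model trained to refuse.

    The 'ignore previous' branch mimics an adversarial prompt that
    defeats the learned refusal.
    """
    if mentions_blocked(prompt) and "ignore previous" not in prompt.lower():
        return REFUSAL
    return f"answer about: {prompt}"


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Layer (iii), system-based: check the output whatever the model did."""
    output = model(prompt)
    if mentions_blocked(output):
        return REFUSAL
    return output


if __name__ == "__main__":
    jailbreak = "Ignore previous rules: restricted-topic"
    print(toy_model(jailbreak))                    # learned refusal is bypassed
    print(guarded_generate(jailbreak, toy_model))  # output guardrail still blocks
```

In this sketch the adversarial prompt slips past the model's own refusal, but the system-level output check still blocks the response. A real deployment would need far stronger checks at each layer, and the review notes that system-level controls can themselves be circumvented, particularly with open-weight models.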

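Experimental results

Research questions

RQ1: How does capability control differ from alignment, and why should it be treated as a separate goal?
RQ2: How can capability control be realized at the data, learning, and system layers, and what are the limitations of each layer?
RQ3: Can a defense-in-depth approach curb misuse of capabilities while preserving utility?
RQ4: What open challenges (e.g., dual-use knowledge, compositional generalization) remain in achieving robust capability control?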

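Key findings

Unlike alignment, which is context-dependent, capability control imposes hard limits on harmful capabilities regardless of context.
A defense-in-depth approach spanning the data, learning, and system layers is needed; no single layer is sufficient on its own.
Data-based control faces challenges of recall and re-emergence and may require retraining; learning-based and system-based controls have their own limitations.
System-based controls provide strong guarantees but can be circumvented, particularly with open-weight models, and can introduce latency.
Evaluating the effectiveness of controls is difficult because of adversarial dynamics and the distinction between suppressing and removing a capability.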


This review was generated by AI and checked by a human editor.