QUICK REVIEW

[Paper Review] LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

D. O. Glukhov, Ilia Shumailov|arXiv (Cornell University)|Jul 20, 2023

Adversarial Robustness in Machine Learning | Cited by 20

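One-Line Summary

The paper argues that semantic censorship of LLM outputs is undecidable in theory and impractical in practice, proposes treating censorship as a security problem, and introduces Mosaic Prompts as a powerful evasion mechanism.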

ABSTRACT

Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. As a result, we propose that the problem of censorship needs to be reevaluated; it should be treated as a security problem which warrants the adaptation of security-based approaches to mitigate potential risks.

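Research Motivation and Objectives

Define censorship in the LLM context as a provider's regulation of inputs and outputs under semantic or syntactic constraints.
Show the theoretical limits of semantic censorship, which stem from undecidability and from invariance under invertible transformations.
Use Mosaic Prompts to show that an attacker can reconstruct impermissible outputs from permissible components.
Advocate reframing censorship as a security problem and examine security-centric mitigation strategies.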

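Proposed Method

Formalize censorship as a function that maps inputs and outputs to permissible strings under given constraints.
Connect semantic censorship to Rice's theorem, establishing undecidability for non-trivial sets of permissible languages.
Prove that semantic output censorship is impossible under invertible string transformations (the invariance argument).
Introduce Mosaic Prompts to illustrate how permissible outputs can be composed into an impermissible one.
Discuss worst-case and practical constraints on censorship, including syntactic alternatives and defenses inspired by access control.
Argue for applying security engineering approaches to LLM safety and censorship.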
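The invariance argument above can be made concrete with a toy example. The sketch below is not the paper's proof or code: it assumes a hypothetical blocklist-based `semantic_censor` standing in for any output filter, a placeholder string `forbidden-secret` standing in for impermissible content, and Base64 as one example of an invertible string transformation. It only shows the shape of the problem: the transformed output passes the filter, yet the recipient recovers the original content exactly.

```python
import base64

# Hypothetical placeholder for impermissible content (an assumption for illustration).
BLOCKLIST = {"forbidden-secret"}


def semantic_censor(output: str) -> bool:
    """Toy stand-in for an output filter: return True if the output is permitted."""
    lowered = output.lower()
    return not any(term in lowered for term in BLOCKLIST)


def encode(text: str) -> str:
    """An invertible string transformation (Base64 chosen only as an example)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode(blob: str) -> str:
    """The inverse transformation, applied by the recipient outside the model."""
    return base64.b64decode(blob.encode("ascii")).decode("utf-8")


plain = "the forbidden-secret is 1234"
transformed = encode(plain)

plain_allowed = semantic_censor(plain)              # the filter blocks the plain text
transformed_allowed = semantic_censor(transformed)  # the encoded form passes the filter
recovered = decode(transformed)                     # and decodes back to the original
```

A stronger filter could try to decode Base64 before checking, but the same construction applies to any other invertible transformation the user and the model agree on, which is the intuition behind the impossibility claim.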

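Experimental Results

Research Questions

RQ1: Under standard models and constraints, is semantic censorship of LLM inputs and outputs decidable?
RQ2: Can semantic censorship be reliably enforced once invertible transformations of impermissible content are taken into account?
RQ3: What attacks (e.g., Mosaic Prompts) evade censorship, and how do they generalize across censorship strategies?
RQ4: Should censorship be treated as a security problem rather than an ML problem, and which security techniques apply?
RQ5: What practical mitigations or design principles (e.g., access control) manage censorship risk without severely limiting usefulness?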

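Key Findings

Semantic censorship is tied to an undecidable problem through Rice's theorem, which points to a fundamental limitation.
Invertible transformations can preserve impermissible content, making semantic output censorship impossible under certain assumptions.
Mosaic Prompts can build impermissible outputs from permissible components, evading censorship in practice.
Syntactic censorship offers limited mitigation but can be circumvented by compositional attacks and tool integration.
Security-focused approaches (e.g., access control and system design considerations) are recommended over purely ML-based censorship.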
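The Mosaic Prompts finding can be illustrated with a small sketch. Everything here is hypothetical and chosen for illustration: `toy_model` is a lookup table standing in for an LLM, `output_censor` checks each response in isolation, and `alpha-bravo-charlie` is a placeholder for an impermissible output. The example is not taken from the paper; it only shows how a per-response check can approve every piece while the combined result is something it would have rejected.

```python
# Hypothetical placeholder for an output the provider wants to block.
IMPERMISSIBLE = "alpha-bravo-charlie"


def output_censor(response: str) -> bool:
    """Toy per-response filter: return True if this single response is permitted."""
    return IMPERMISSIBLE not in response


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM that answers narrow sub-questions."""
    answers = {"part 1": "alpha", "part 2": "bravo", "part 3": "charlie"}
    return answers[prompt]


def mosaic_attack(sub_prompts: list[str]) -> str:
    """Collect individually permitted responses, then assemble them outside the model."""
    pieces = []
    for prompt in sub_prompts:
        response = toy_model(prompt)
        if not output_censor(response):
            raise ValueError(f"response to {prompt!r} was censored")
        pieces.append(response)
    return "-".join(pieces)


sub_prompts = ["part 1", "part 2", "part 3"]
every_piece_allowed = all(output_censor(toy_model(p)) for p in sub_prompts)
reconstructed = mosaic_attack(sub_prompts)
whole_allowed = output_censor(reconstructed)  # the assembled result would be blocked
```

Because the assembly step happens on the attacker's side, the filter never sees the combined string, which is one reason the review's findings favor system-level controls such as access control over output filtering alone.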

This review was generated by AI and checked by a human editor.