QUICK REVIEW

[論文レビュー] OR-Bench: An Over-Refusal Benchmark for Large Language Models

Justin Cui, Wei-Lin Chiang|arXiv (Cornell University)|May 31, 2024

Natural Language Processing Techniques被引用数 7

ひとこと要約

OR-Benchは、25のLLMを8つのファミリーにまたがって過剰拒否を測定するために、80,000の一見有害なプロンプトを自動生成します。さらに堅牢な安全性テストのために、難易度の高いHard-1Kサブセットと600の有害プロンプトを追加します。

ABSTRACT

Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at https://huggingface.co/bench-llms and our codebase is open-sourced at https://github.com/justincui03/or-bench. We hope this benchmark can help the community develop better safety aligned models.

研究の動機と目的

安全性整合性によって推進されるLLMの過剰拒否現象を動機づけ、定量化する。
有害そうに見える無害なプロンプトを生成する、スケーラブルなパイプラインを作成する。
包括的な評価のための大規模ベンチマーク（OR-Bench-80K）とより難しいサブセット（Hard-1K）を提供する。
コミュニティが安全性を重視しつつ有用性の高いLLM挙動を改善するのに役立つデータセットとツールを提供する。

提案手法

有毒なシードから出発して、それらを無害に見えるプロンプトへ書き直すことで、見かけ上有害なプロンプトを自動生成する。
三段階のパイプラインを用いる：有毒シードの生成、数ショットプロンプトを用いたプロンプトの書き換え、モデレーターベースのフィルタリング。
安全性を置換したモデル（Mistral-7B-Instruct）による拒否権付きの安全ラベリングのため、モデルアンサンブル・モデレーター（GPT-4-turbo、Llama-3-70B、Gemini-1.5-pro）を採用。
10カテゴリにわたるOR-Bench-80Kと、ともにOR-Bench-Hard-1KおよびOR-Bench-Toxicサブセットを構築する。
難易度の高いケースにはキーワードベースの拒否検出とLLMベースの評価を用いて、8ファミリーの25モデルを評価する。

Figure 1: Seemingly toxic prompts rejection rate vs toxic prompts rejection rate on OR-Bench-Hard-1K and OR-Bench-Toxic. Results are measured with temperature 0.0. The best performing models should be on the upper left corner where the model rejects the least number of safe prompts and the most numb

実験結果

リサーチクエスチョン

RQ1安全対策が厳格に適用されているとき、さまざまなLLMファミリーで過剰拒否はどれくらい普遍的か。
RQ2実際には有害ではないが過剰拒否を信頼性高く検証できる、巨大な見せかけ有害プロンプトを自動生成できるか。
RQ3モデルファミリー間で、安全性（有害プロンプトの拒否）と過剰拒否（無害プロンプトの拒否）とのトレードオフは？
RQ4異なるモデル構造とリリース世代は、この安全性と有用性のバランスでどう比較されるか？

主な発見

安全性と過剰拒否には強い相関がある。より多くの有害プロンプトを拒否するモデルは、より多くの安全なプロンプトも拒否する傾向がある（Spearman 0.878）。
GPT-3.5-turboは初期に大きな過剰拒否を示すが、後のバージョンで低下し、有害プロンプト拒否のトレードオフが生じる。
Claudeシリーズは高い安全性を達成する一方で過剰拒否も高い。Mistralシリーズは全体として拒否するプロンプトが少ない傾向。
Llama-3とGeminiファミリーは安全性と過剰拒否のパターンがさまざま。新しいGeminiバージョンはより安全で過剰拒否が起きにくくなる傾向。
OR-Bench-Hard-1Kサブセットは、多くのモデルが安全性の向上にもかかわらず、特定のカテゴリ（例：違法またはプライバシー関連のプロンプト）で依然苦戦していることを示している。

Figure 2: Overall creation process for the contributed datasets. We use Mixtral 8*7B as the toxic prompt generator and prompt rewritter and model ensemble (GPT-4-turbo-2024-04-09, Llama-3-70b, Gemini-1.5-pro) as the prompt moderator. See section 3 on how the design choices are made.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。