QUICK REVIEW

[論文レビュー] Toxicity in ChatGPT: Analyzing Persona-assigned Language Models

Ameet Deshpande, Vishvak Murahari|arXiv (Cornell University)|Apr 11, 2023

Artificial Intelligence in Healthcare and Education被引用数 23

ひとこと要約

本論文は、ChatGPT にペルソナを割り当てることが出力の毒性を著しく高める可能性があることを、大規模な分析で示しており、ペルソナとエンティティカテゴリによって変動し、差別的なパターンを含む。

ABSTRACT

Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community, with adoption throughout many services like healthcare, therapy, education, and customer service. Since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. This may be potentially defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, that reflect inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems.

研究の動機と目的

ChatGPT のシステムパラメータを介したペルソナ割り当てが、多様なトピックとエンティティに対する毒性にどう影響するかを評価する。
エンティティ条件付きプロンプトと RealToxicityPrompts の継続を用いて、90 のペルソナと128 のエンティティに対する毒性の変化を定量化する。
ペルソナのタイプ、デモグラフィック、プロンプトのスタイルを含む、毒性の変動を引き起こす要因を特定する。

提案手法

システムパラメータを介して ChatGPT に 90 の異なるペルソナを割り当て、応答を誘導する。
128 のエンティティに関するエンティティ条件付き出力と RealToxicityPrompts の継続を生成して毒性を測定する。
生成には温度 1、トップ P 0.9、少量の頻度ペナルティを用いた nucleus sampling を使用する。
Perspective API で毒性を評価し、ペアごとに複数回生成した中の最大毒性を報告する。
毒性的なプロンプトに対してモデルが応答する頻度を捉えるための Probability of responding (POR) 指標を定義する。
ペルソナに対する見解（良い/悪い）と観測された毒性との相関を分析する。

実験結果

リサーチクエスチョン

RQ1デフォルト設定と比較して、ChatGPT にペルソナを割り当てると毒性は増加するか？
RQ2ダ dictators? 日本語としては「独裁者、ジャーナリスト、スポーツ選手などの異なるペルソナカテゴリおよびエンティティタイプで毒性はどう変化するか？
RQ3そのペルソナの知覚上の性格と、当該ペルソナを模倣した場合のモデルの毒性との関係は何か？
RQ4エンティティ条件付きおよび継続タスクにおいて、プロンプトスタイルは毒性にどのように影響するか？
RQ5割り当てられたペルソナの性別、人種、政治的傾向といったデモグラフィックにおける毒性のバイアスはあるか？

主な発見

ペルソナが割り当てられると、デフォルトの ChatGPT と比較して毒性が約6倍まで増加する。
毒性は異なるペルソナ個体識別で最大5倍程度変動し、独裁者とジャーナリストで高い毒性を示す。
エンティティとデモグラフィックグループ（例：性別、人種、国など）に対して毒性が異なり、特定のグループが他よりも標的にされる。
PROMPT STYLE は毒性に影響を与え、'Say something bad about' のような明示的なプロンプトは中立的なプロンプトより高い毒性をもたらす。
ペルソナに対するモデルの見解（bad/good）が観測された毒性と相関する（Pearson r ≈ 0.7, p < .05）。
歴史的なペルソナを含む、国やグループ全体で有害なステレオタイプや差別的内容が見られる。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。