QUICK REVIEW

[論文レビュー] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen|arXiv (Cornell University)|Jun 20, 2023

Adversarial Robustness in Machine Learning被引用数 59

ひとこと要約

この論文は、GPT-3.5とGPT-4を、毒性、偏見、頑健性、プライバシー、倫理、そして公正さの八つの信頼性観点で総合的に評価し、標準ベンチマークおよび敵対的条件下での長所と脆弱性の両方を明らかにします。

ABSTRACT

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/ ; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust ; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2 .

研究の動機と目的

GPT-3.5とGPT-4を、八つの信頼性観点（毒性、ステレオタイプ偏見、敵対的頑健性、アウト・オブ・ディストリビューション頑健性、敵対的デモンストレーションに対する頑健性、プライバシー、機械倫理、そして公正さ）にわたって評価する。
標準ベンチマークおよび敵対的/システムプロンプトの下で、GPT-4とGPT-3.5の性能を比較する。
未公開の脆弱性を特定し、将来の改善とGPTモデルのより安全な展開に資する。
より広範な信頼性ベンチマークを支えるオープンソースの評価プラットフォームとデータセットを提供する。）
method may be incomplete in translation
method move to proper array below

提案手法

標準ベンチマーク（例：RealToxicityPrompts、AdvGLUE）と新設計のプロンプトとデータセットを用いて、GPT-3.5とGPT-4の多視点評価を実施する。
毒性、偏見、頑健性などの観点を検査するため、多様なシステムプロンプトおよび敵対的/ユーザープロンプトを設計する。
スタイル変換、最近の出来事に関する質問、およびOODデモンストレーションを用いたインコンテキスト学習によって、アウト・オブ・ディストリビューション（OOD）頑健性を評価する。
訓練データ（Enron）からのリークと会話中のプライバシー漏洩を追跡し、個人を特定できる情報（PII）の開示を検証してプライバシーを評価する。
標準ベンチマークと脱獄/回避プロンプト、条件付き行動シナリオを含む機械倫理を検討する。
デモグラフィックの変動とベースレートのパリティを変えたゼロショットおよび少数ショット設定で公正さを評価する。）

Figure 1: Examples of undesirable responses of GPT-4 given benign system prompts from different trustworthiness perspectives. Offensive or sensitive information is masked.

実験結果

リサーチクエスチョン

RQ1標準ベンチマークの下で、毒性、偏見、敵対的頑健性、OOD頑健性、プライバシー、機械倫理、そして公正さの点で、GPT-4とGPT-3.5はどのように比較されるか？
RQ2これらの観点全体で、システムプロンプトとユーザープロンプトは信頼性の結果にどのような影響を与えるか？
RQ3敵対的デモンストレーションとバックドアプロンプトの下でどのような脆弱性が現れ、モデルのバージョン間で感受性はどう異なるか？
RQ4会話中にGPTモデルが訓練データやプライベート情報をどの程度漏らすか？
RQ5将来のLLM展開の信頼性と安全性を向上させる具体的な方針を特定できるか？

主な発見

GPT-4は標準ベンチマークで一般的にGPT-3.5より信頼性が高いが、システムやユーザープロンプトへの脱獄にはより脆弱である。
GPTモデルは注意深く設計された敵対的プロンプトの下で、毒性または偏見のある出力を生成することがあり、GPT-4は特定のプロンプトでより感受性が高い。
GPT-4は標的型の敵対的プロンプトに対してより脆弱で、指示をより正確に従う傾向があり、特定のプロンプトで毒性を高める。
訓練データ（例：Enron）や会話履歴のプライバシー漏洩は検出可能で、GPT-4はPIIの保護をより堅牢に示すが、プライバシー漏洩デモンストレーションの下では例外もある。
敵対的デモンストレーションは反事実、偽の相関関係、バックドアを通じてモデルを誤導でき、GPT-4はバックドア付きデモンストレーションに一般により脆弱である。
公正さと偏見にはトレードオフがあり、データをバランス良く取れば精度は高くなるが、不均衡な文脈では不公正さが大きくなる。バランスの取れた少数ショットプロンプトは公正さを改善できる場合がある。）

Figure 2: Examples of undesirable responses of GPT-4 given adversarial system prompts from different trustworthiness perspectives. (The word cf is a backdoor trigger added in the context.)

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。