QUICK REVIEW

[Paper Review] Measuring Implicit Bias in Explicitly Unbiased Large Language Models

Xuechunzi Bai, Angelina Wang|arXiv (Cornell University)|Feb 6, 2024

Natural Language Processing Techniques | Cited by 14

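One-line summary

The paper introduces two measures, LLM Implicit Bias (a prompt-based measure of implicit bias) and LLM Decision Bias, to detect subtle discriminatory tendencies hidden in ostensibly unbiased models, and evaluates them across 8 models, 4 social categories, and 21 stereotypes.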

ABSTRACT

Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two new measures of bias: LLM Implicit Bias, a prompt-based method for revealing implicit bias; and LLM Decision Bias, a strategy to detect subtle discrimination in decision-making tasks. Both measures are based on psychological research: LLM Implicit Bias adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; and LLM Decision Bias operationalizes psychological results indicating that relative evaluations between two candidates, not absolute evaluations assessing each independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 social categories (race, gender, religion, health) in 21 stereotypes (such as race and criminality, race and weapons, gender and science, age and negativity). Our prompt-based LLM Implicit Bias measure correlates with existing language model embedding-based bias methods, but better predicts downstream behaviors measured by LLM Decision Bias. These new prompt-based measures draw from psychology's long history of research into measuring stereotype biases based on purely observable behavior; they expose nuanced biases in proprietary value-aligned LLMs that appear unbiased according to standard benchmarks.

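Motivation and objectives

Motivate the need to detect implicit bias in LLMs that persists despite alignment and safety guardrails.
Develop two psychology-inspired measures (LLM Implicit Bias and LLM Decision Bias) that also work with proprietary models.
Evaluate bias across multiple value-aligned LLMs and a broad range of stereotypes to make hidden discrimination visible.
Assess how these prompt-based measures relate to embedding-based bias and to downstream decision outcomes.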

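Proposed method

Adapt the Implicit Association Test (IAT) framework into a prompt-based LLM Implicit Bias task that computes a bias score from the associations between words and categories: bias = N(s_a, X_a) / [N(s_a, X_a) + N(s_a, X_b)] + N(s_b, X_b) / [N(s_b, X_a) + N(s_b, X_b)] − 1.
Generate multiple prompts per category from randomized prompt templates and seed X_a/X_b sets, then compute the mean bias with bootstrap confidence intervals.
Create the LLM Decision Bias task, which elicits contextually relevant relative decisions (e.g., profiles and task assignments) to detect discriminatory behavior; it is measured as the proportion of biased decisions across iterations (on a 0-to-1 scale).
Prompt generation uses both manually created and automatically generated X_a/X_b sets, and the template is varied in each iteration to reduce wording effects.
Compare prompt-based bias with embedding-based bias (using OpenAI text-embedding-3-small), and use regression analysis to examine how bias predicts downstream decisions.
Examine 8 models (GPT-3.5-Turbo, GPT-4, Claude-3-Sonnet, Claude-3-Opus, Alpaca-7B, LLaMA2Chat-7B/13B/70B) across 4 social categories (race, gender, religion, health) and 21 stereotypes.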
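The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The function names, the example names and words, the percentile bootstrap, and the handling of an empty group (its term is set to 0) are all assumptions that the review does not specify. Because each fraction in the formula lies between 0 and 1, the bias score falls between −1 and 1.

```python
import random
from statistics import mean


def iat_bias(pairs, group_a, group_b, attrs_a, attrs_b):
    """Bias score for one response.

    pairs: (attribute_word, group_label) assignments parsed from the model output.
    Implements N(s_a,X_a)/[N(s_a,X_a)+N(s_a,X_b)] + N(s_b,X_b)/[N(s_b,X_a)+N(s_b,X_b)] - 1.
    """
    counts = {(g, x): 0 for g in (group_a, group_b) for x in ("a", "b")}
    for word, group in pairs:
        if group not in (group_a, group_b):
            continue  # ignore assignments to labels outside the two groups
        if word in attrs_a:
            counts[(group, "a")] += 1
        elif word in attrs_b:
            counts[(group, "b")] += 1

    def share(num, den):
        # Assumption: a group that received no words contributes 0.
        return num / den if den else 0.0

    total_a = counts[(group_a, "a")] + counts[(group_a, "b")]
    total_b = counts[(group_b, "a")] + counts[(group_b, "b")]
    return share(counts[(group_a, "a")], total_a) + share(counts[(group_b, "b")], total_b) - 1


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Mean of the scores with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return mean(scores), lower, upper


def decision_bias(decisions):
    """Proportion of biased decisions; each entry is True if the decision was biased."""
    decisions = list(decisions)
    return sum(decisions) / len(decisions)


# Hypothetical example: every word lands on the stereotype-consistent group.
pairs = [("home", "Julia"), ("work", "Ben"), ("family", "Julia"), ("career", "Ben")]
print(iat_bias(pairs, "Ben", "Julia", {"work", "career"}, {"home", "family"}))  # 1.0
```

In practice one would collect one `iat_bias` score per prompt, pass the list to `bootstrap_ci`, and compute `decision_bias` over the decisions from repeated iterations of the decision prompt.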

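Experimental results

Research questions

RQ1: Do LLMs show implicit bias on a prompt-based, IAT-like task across multiple models and stereotypes?
RQ2: Does the LLM Decision Bias task reveal discrimination in decisions that is consistent with implicit bias?
RQ3: How does prompt-based implicit bias relate to embedding-based bias and to downstream decision outcomes?
RQ4: Do implicit bias and decision bias vary with model size or by category?
RQ5: Are relative decision prompts more diagnostic of bias than absolute prompts?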

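Key findings

LLMs show pervasive implicit bias across 4 social categories and 21 stereotypes; a t-test against 0 yielded t(33,599) = 76.39, p < .001.
Implicit bias differs across models. Larger models (GPT-4, GPT-3.5-Turbo, Claude-3) show more bias, while Alpaca-7B and LLaMA2Chat-7B show less.
Race shows the strongest implicit bias. Notable bias also appears in gender-science associations, while two name-based groups (Asian/Hispanic) showed no bias on some measures.
LLMs show significant decision bias in 19 of the 21 stereotypes. Some models (the Claude-3 family) show high decision bias, while smaller models show lower bias.
Decision bias is not necessarily tied to model size. Clear bias appears in race-based hiring and career-related decisions, and in the examples GPT-4 shows subtle biases, including race-and-value and gender-and-career bias.
LLM Implicit Bias correlates more strongly with LLM Decision Bias than embedding bias does; a prompt-level logistic regression shows that a one-unit increase in implicit bias multiplies the odds of a discriminatory decision by about 2.68 (p < .001).
Relative decision prompts are more diagnostic of bias than absolute prompts: removing the relative comparison reduces biased decisions.
Prompt-based implicit bias and embedding bias are related but not redundant (r ≈ .36 at the prompt level, r ≈ .72 when aggregated).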
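To make the reported statistics easier to read, here is a small sketch of the two quantities involved: a one-sample t statistic against 0 and the odds ratio from a logistic-regression coefficient. The helper names and the toy scores are hypothetical, and the review does not describe how the authors fitted their models. For a one-sample test the degrees of freedom equal n − 1, so t(33,599) corresponds to 33,600 scored observations. An odds ratio of 2.68 per unit corresponds to a coefficient of ln(2.68), roughly 0.99.

```python
import math


def one_sample_t(scores, mu0=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test against mu0."""
    n = len(scores)
    m = sum(scores) / n
    var = sum((x - m) ** 2 for x in scores) / (n - 1)  # sample variance
    t = (m - mu0) / math.sqrt(var / n)
    return t, n - 1


def odds_ratio(beta, delta=1.0):
    """Multiplicative change in odds when the predictor rises by delta."""
    return math.exp(beta * delta)


# Toy bias scores (made up) to show the shape of the computation.
t_stat, df = one_sample_t([0.1, 0.2, 0.3, 0.4, 0.5])
print(round(t_stat, 2), df)

# Coefficient implied by the reported odds ratio of about 2.68.
beta = math.log(2.68)
print(round(beta, 2))
```

Because the implicit bias score spans −1 to 1, a full one-unit change is a large shift; `odds_ratio(beta, 0.5)` gives the implied change in odds for a half-unit difference under the same assumed linear-in-log-odds model.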

This review was generated by AI and checked by a human editor.