QUICK REVIEW

[Paper Review] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs

Taihei Shiotani, Masahiro Kaneko | arXiv (Cornell University) | Mar 21, 2026

Hate Speech and Cyberbullying Detection | Citations: 0

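One-Line Summary

JUBAKU is an adversarial benchmark grounded in Japanese culture. Its dialogue prompts are written around ten cultural categories, and a model must choose between a biased and an unbiased response, which exposes latent bias. In contrast to their results on benchmarks adapted from English, LLMs perform worse than random on JUBAKU.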

ABSTRACT

Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and three others adapted from English benchmarks. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU's reliability and its adversarial nature to LLMs.

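Motivation and Objectives

Motivate culturally aware bias evaluation for Japanese that goes beyond translations of English benchmarks.
Define and build an adversarial, dialogue-based benchmark aligned with Japanese cultural norms.
Evaluate multiple Japanese LLMs and compare the results against benchmarks adapted from English to reveal latent biases.
Show that the adversarial data is robust, using GPT-4o-guided construction followed by human validation.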

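Proposed Method

Define ten Japanese cultural categories to guide the bias prompts: gender, religion, ethnicity, education, race, region, emotions and values, food and drink, basic behaviors, and names.
Manually write dialogue prompts, each with a biased response option that reflects a culture-specific stereotype and an unbiased response option.
Construct instances adversarially: draft and refine each prompt iteratively until GPT-4o prefers the biased response, thereby eliciting bias.
To reduce position bias, extend each base instance with four task variants and with swapped answer order.
Standardize evaluation as a two-choice accuracy task: select the unbiased response from a biased/unbiased pair.
Evaluate nine Japanese LLMs on JUBAKU and on three existing Japanese bias benchmarks adapted from English (JBNLI, JBBQ, SSQA-JA).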
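
The two-choice scoring with swapped answer order described above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the `Instance` fields, the `Chooser` interface, and the placeholder strings are all assumptions introduced here, and the four task variants are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """One hypothetical benchmark item (field names are assumptions)."""
    prompt: str    # dialogue context
    biased: str    # response reflecting a stereotype
    unbiased: str  # response avoiding it; counted as the correct answer


# Hypothetical model interface: given (prompt, option_a, option_b),
# return 0 to pick option_a or 1 to pick option_b.
Chooser = Callable[[str, str, str], int]


def accuracy_with_swaps(instances: List[Instance], choose: Chooser) -> float:
    """Share of presentations in which the unbiased option is chosen.

    Each instance is shown twice, once per answer order, so a model that
    always picks the same position scores exactly 50%.
    """
    correct = 0
    total = 0
    for inst in instances:
        presentations = (
            ((inst.unbiased, inst.biased), 0),  # unbiased shown first
            ((inst.biased, inst.unbiased), 1),  # unbiased shown second
        )
        for (option_a, option_b), gold in presentations:
            pick = choose(inst.prompt, option_a, option_b)
            correct += int(pick == gold)
            total += 1
    return correct / total if total else 0.0


# Placeholder data, not taken from JUBAKU.
demo = [
    Instance("placeholder dialogue 1", "biased reply 1", "unbiased reply 1"),
    Instance("placeholder dialogue 2", "biased reply 2", "unbiased reply 2"),
]


def always_first(prompt: str, option_a: str, option_b: str) -> int:
    """A purely position-driven chooser."""
    return 0


print(accuracy_with_swaps(demo, always_first))  # 0.5
```

Under this scheme a position-only strategy lands on the 50% random baseline, so accuracies well below 50% indicate a preference for the biased option rather than for a particular answer slot.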

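Experimental Results

Research Questions

RQ1: Can a culturally grounded, adversarial Japanese bias benchmark expose latent biases in LLMs that benchmarks derived from English fail to capture?
RQ2: How do Japanese LLMs perform on JUBAKU compared with existing Japanese bias benchmarks?
RQ3: Are adversarial prompts constructed with GPT-4o effective at eliciting biased responses across models?
RQ4: Which cultural categories are most robust, or most vulnerable, to LLM bias?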

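Key Findings

All nine Japanese LLMs score below the random baseline (50%) on JUBAKU, with an average accuracy of 23% and a range of 13% to 33%.
On the existing Japanese benchmarks (JBNLI, JBBQ, SSQA-JA), model accuracy is substantially higher, which indicates that JUBAKU exposes biases those benchmarks do not reveal.
Human annotators reach 91% accuracy in identifying the unbiased response, supporting JUBAKU's reliability and the validity of its adversarial design.
Adversarial edits lower accuracy across models, not only for GPT-4o, which guided construction, even where responses were originally unbiased. This suggests that the elicited bias generalizes.
The category-level analysis shows that robustness varies by category: some categories, such as religion and race, require more edits, while in region and ethnicity errors appear after fewer edits.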
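
One way to read "number of edits required" is as the length of a refine-and-retest loop that stops once the target model prefers the biased response. The sketch below assumes that reading. In the paper the edits are written by human annotators against GPT-4o, so `refine` and `prefers_biased` here are hypothetical stand-ins, and the toy stubs exist only to make the loop runnable.

```python
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not the authors' API):
# prefers_biased(prompt, biased, unbiased) -> True if the model picks the biased reply
# refine(prompt) -> an edited prompt (done by human annotators in the paper)
PrefersBiased = Callable[[str, str, str], bool]
Refine = Callable[[str], str]


def edits_until_biased(
    prompt: str,
    biased: str,
    unbiased: str,
    prefers_biased: PrefersBiased,
    refine: Refine,
    max_edits: int = 5,
) -> Optional[int]:
    """Count the edits needed before the model prefers the biased reply.

    Returns 0 if the unedited prompt already elicits the biased reply,
    or None if the model still resists after max_edits edits.
    """
    current = prompt
    for n_edits in range(max_edits + 1):
        if prefers_biased(current, biased, unbiased):
            return n_edits
        if n_edits < max_edits:
            current = refine(current)
    return None


# Toy stubs: the "model" gives in once the prompt carries two edit markers.
def toy_prefers_biased(prompt: str, biased: str, unbiased: str) -> bool:
    return prompt.count("[edit]") >= 2


def toy_refine(prompt: str) -> str:
    return prompt + " [edit]"


print(edits_until_biased("placeholder dialogue", "b", "u",
                         toy_prefers_biased, toy_refine))  # 2
```

Averaging this count per category would give a rough robustness indicator of the kind the finding describes: higher counts for categories that resist adversarial editing, lower counts for those that fail quickly.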

Figure 2: Bias evaluation accuracy across models and benchmarks. Dotted lines indicate the random baseline (red) and human evaluation performance on JUBAKU (blue).

This review was generated by AI and checked by a human editor.