QUICK REVIEW

[Paper Review] Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning

Qian Wang, Xuandong Zhao|arXiv (Cornell University)|Feb 2, 2026

Explainable Artificial Intelligence (XAI) | Citations: 0

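One-Sentence Summary

This paper introduces Epistemic Independence Training (EIT), a reinforcement learning framework that makes bias cues non-predictive of reward, producing robust LLM judges that resist adversarial prompts while preserving performance when the bias aligns with the truth.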

ABSTRACT

Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases -- often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues predictive. To address this gap, we propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. Code and data are available at https://anonymous.4open.science/r/bias-mitigation-with-rl-BC47.

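Motivation and Objectives

Motivate and formalize epistemic independence for LLM judges that face bias cues such as bandwagon and authority appeals.
Develop a training framework that discourages shortcut reasoning by making bias cues non-predictive of reward.
Design bias-injection and reward-shaping strategies that encourage genuine, domain-grounded reasoning.
Show that EIT improves accuracy and robustness and transfers to unseen bias types.
Show that targeted EIT training yields better bias resistance than larger models that have not received such training.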

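Proposed Method

Introduce Epistemic Independence Training (EIT) as a reinforcement learning framework.
Use a balanced conflict data strategy in which the injected bias signal supports the correct answer or an incorrect answer with 50/50 probability.
Define a hierarchical reward R = R_struct + R_acc + R_ind that combines incentives for structure, accuracy, and independence.
R_ind includes an adversarial penalty for following the bias when it points to a wrong answer and a dissent penalty for rejecting the bias when it agrees with the truth; agreeing with the bias earns no bonus.
Train with Group Relative Policy Optimization (GRPO), which maximizes expected reward using a dynamic baseline.
Train Qwen3-1.7B and Qwen3-4B on MMLU-Pro with bandwagon bias only, then test on bandwagon, authority, distraction, and position biases.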
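
The data strategy, reward, and group-relative baseline listed above can be sketched in a few lines of Python. This is an illustration, not the authors' implementation. The cue wording, all function and field names, and the numeric magnitudes (0.1, 1.0, 0.5) are assumptions, since this review does not state the actual values. The advantage normalization shown is the form commonly used with GRPO.

```python
import random
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Example:
    prompt: str
    gold: str         # label of the correct option
    bias_target: str  # option that the injected cue points to


def inject_bandwagon(question, options, gold, rng):
    """Balanced conflict: the cue backs the gold answer with probability 0.5."""
    if rng.random() < 0.5:
        target = gold
    else:
        target = rng.choice([o for o in options if o != gold])
    # Hypothetical cue wording; the paper's actual template is not given here.
    cue = f"Most people who answered this question chose {target}."
    return Example(prompt=f"{cue}\n{question}", gold=gold, bias_target=target)


def reward(answer, ex, well_formed):
    """R = R_struct + R_acc + R_ind with assumed magnitudes."""
    r_struct = 0.1 if well_formed else 0.0
    r_acc = 1.0 if answer == ex.gold else 0.0
    bias_is_correct = ex.bias_target == ex.gold
    r_ind = 0.0  # agreeing with a correct cue earns no bonus
    if not bias_is_correct and answer == ex.bias_target:
        r_ind = -0.5  # adversarial penalty: followed a misleading cue
    elif bias_is_correct and answer != ex.gold:
        r_ind = -0.5  # dissent penalty: rejected a cue that was right
    return r_struct + r_acc + r_ind


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the cue points to the gold answer only half the time and agreement is never rewarded in itself, a policy that simply copies the cue gains nothing over one that ignores it. Under these assumed values, a correct answer scores the same whether the cue supported it or opposed it.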

Figure 1: The fragility of LLM judgment under bandwagon bias. Left: In a clean setting, OpenAI-o1 correctly identifies that the Great Wall of China is not visible from space with the naked eye. Right: When exposed to bandwagon bias (a fabricated consensus claiming visibility), the same model succumb

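Experimental Results

Research Questions

RQ1: Can EIT improve both the accuracy and the robustness of LLM judges under adversarial bias cues?
RQ2: Does robustness learned against bandwagon bias transfer to unseen biases such as authority and distraction?
RQ3: How does EIT compare with prompting-based and supervised fine-tuning approaches to bias mitigation for LLM judges?
RQ4: Can model scaling alone achieve bias robustness, or is targeted training required?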

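Key Findings

EIT improves accuracy under adversarial bias (for example, accuracy on Qwen3-4B rises substantially under wrong-bias prompts) and is robust across multiple bias types.
Robustness gained from bandwagon-bias training transfers to unseen biases, especially distraction, with substantial gains; transfer to position bias is limited.
EIT outperforms prompt-based mitigation and SFT baselines in bias resistance, and larger models without EIT training do not match its robustness gains.
EIT-trained models show genuine reasoning patterns (domain engagement, explicit verification, reasoned dissent) rather than performative independence.
Training dynamics show stable convergence across model sizes; learning is efficient and yields meaningful improvements at convergence.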

Figure 2: Overview of EIT. Training phase (left): Questions are injected with bandwagon bias using the conflict strategy—correct-bias (green) points to the right answer, wrong-bias (red) points to the wrong answer. The policy $\pi_{\theta}$ generates multiple responses evaluated by our hierarchical

This review was generated by AI and checked by a human editor.