QUICK REVIEW

[Paper Review] Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Yankai Yang, Yancheng Long|arXiv (Cornell University)|Feb 7, 2026

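Multimodal Machine Learning Applications | Citations: 0

One-sentence summary

Joint Reward Modeling (JRM) integrates language supervision into a shared vision-language backbone, giving a discriminative reward model semantic understanding and latent reasoning; it achieves state-of-the-art results while keeping inference fast.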

ABSTRACT

Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.

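Motivation and objectives

Motivate reward models that handle the semantics of complex image editing beyond local similarity.
Bridge the gap between efficient discriminative rewards and semantically rich generative reasoning.
Internalize reasoning into the latent representation during training to enable fast, stable reward evaluation.
Remove the language-generation path at test time to preserve the efficiency of discriminative inference.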

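Proposed method

Uses a shared vision-language backbone with two heads: a discriminative reward head and a conditional language head.
Trains by joint optimization that combines a preference ranking loss with a language-modeling cross-entropy loss (L_total = (1-α)L_rank + αL_LM).
Models reward uncertainty by treating each reward as a Gaussian distribution and uses an uncertainty-aware ranking loss based on P(x_i ≻ x_j|c).
Internalizes a shared representation that supports both ranking and language generation, enabling Latent CoT through the discriminative head at inference time.
At inference time, removes the language-generation path and uses only the discriminative reward score for fast evaluation.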
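
The joint objective and the Gaussian preference probability above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. It assumes the two rewards are independent Gaussians, so that P(x_i ≻ x_j|c) = Φ((μ_i - μ_j) / sqrt(σ_i² + σ_j²)), and it assumes L_rank is the negative log of that probability. All function and parameter names are hypothetical.

```python
import math


def preference_probability(mu_i, sigma_i, mu_j, sigma_j):
    """P(r_i > r_j) when r_i and r_j are independent Gaussian rewards (assumed form)."""
    variance = sigma_i ** 2 + sigma_j ** 2
    if variance <= 0:
        raise ValueError("at least one standard deviation must be positive")
    z = (mu_i - mu_j) / math.sqrt(variance)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rank_loss(mu_w, sigma_w, mu_l, sigma_l, eps=1e-12):
    """Negative log-likelihood that the preferred sample (w) beats the rejected one (l)."""
    p = preference_probability(mu_w, sigma_w, mu_l, sigma_l)
    return -math.log(max(p, eps))  # eps guards against log(0)


def lm_loss(token_log_probs):
    """Mean token-level cross-entropy, given log-probabilities of the target tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(l_rank, l_lm, alpha):
    """L_total = (1 - alpha) * L_rank + alpha * L_LM, as stated in the review."""
    return (1.0 - alpha) * l_rank + alpha * l_lm
```

With alpha = 0 the objective reduces to pure preference ranking; raising alpha shifts weight toward language supervision, which is the trade-off RQ5 examines.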

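Experimental results

Research questions

RQ1: Can joint training with language supervision give a discriminative reward model advanced semantic understanding and reasoning without explicit text generation at inference time?
RQ2: Does internalizing the chain of reasoning into latent representations improve reward evaluation on complex multimodal tasks such as image editing while preserving efficiency?
RQ3: How does joint training affect representation diversity and stability compared with purely discriminative training?
RQ4: Is JRM effective as a reward signal for online reinforcement learning in image editing?
RQ5: How much does the weight of language supervision affect performance and training dynamics?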

Key findings

Method	PF	Cons.	Overall
GPT-4.1	0.673	0.602	0.705
GPT-5	0.777	0.669	0.755
Gemini-2.5-Pro	0.703	0.560	0.722
EditScore-8B	0.608	0.594	0.690
EditScore-72B	0.638	0.586	0.703
PaCo-Reward-7B	0.777	0.709	0.751
Gemini-3.0-Flash	0.717	0.662	0.769
EditReward	0.832	-	0.792
JRM (Ours)	0.854	-	0.851

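JRM achieves state-of-the-art results on EditReward-Bench with 85.1% overall accuracy and 85.4% on prompt following (PF), outperforming prior methods.
On MMRB2, it reaches 69.3% overall, exceeding the previous best by 7.4%.
JRM raises the effective rank of the feature space to 91.77 from the baseline's 46.86, indicating that it suppresses representation collapse.
In online RL (Flow-GRPO), the JRM-guided model shows clear gains over the baseline on GEdit-Bench and ImageEdit-Bench (+1.00, +0.50).
Latent CoT: joint training yields higher-dimensional, more isotropic representations that support richer semantic factors at inference time without explicit text.
In self-correction experiments, JRM-guided language feedback improves semantic consistency and downstream editing reward.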
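
The effective-rank figures above can be read with a simple estimator in mind. The sketch below uses one common definition, the exponential of the Shannon entropy of the normalized singular values of a feature matrix. Whether the paper uses exactly this definition is an assumption. The function name is hypothetical, and the singular values are assumed to have been computed beforehand (for example with `numpy.linalg.svd(features, compute_uv=False)`).

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution (assumed definition)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # zero singular values contribute nothing to the entropy
            entropy -= p * math.log(p)
    return math.exp(entropy)


# Four equally used directions give an effective rank close to 4.
uniform = effective_rank([1.0, 1.0, 1.0, 1.0])
# One dominant direction (a collapsed representation) gives an effective rank close to 1.
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])
```

Under this definition, a higher value means the feature variance is spread over more directions, which is the sense in which a rise from 46.86 to 91.77 points away from representation collapse.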

This review was generated by AI and checked by a human editor.