QUICK REVIEW

[Paper Review] Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models

Shidong Yang, Tongwen Huang|arXiv (Cornell University)|Feb 2, 2026

Topic Modeling | Cited by 0

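One-Sentence Summary

The paper proposes EGT, an entropy-based, data-efficient training framework for multimodal reasoning reward models. It uses response entropy to curate data and to order training from easy to hard, reaching state-of-the-art performance while reducing data requirements.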

ABSTRACT

Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.

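Motivation and Objectives

Identify entropy as a proxy for sample difficulty and annotation noise in multimodal reward model training.
Propose Entropy-Guided Training (EGT), which combines entropy-based data curation with a progressive training curriculum.
Demonstrate state-of-the-art performance for EGT on three multimodal reward benchmarks.
Demonstrate data efficiency by achieving strong results with a smaller curated subset.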

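Proposed Method

Generate high-quality reasoning traces to form a refined SFT dataset for instruction tuning.
Compute response entropy, using answer-token entropy and reasoning-sentence entropy as proxies for data quality.
Curate the data by pruning high-entropy samples, producing a curated dataset for RL-based training.
Train with a low-to-high entropy curriculum during reinforcement learning, progressively tackling harder samples.
Use entropy-based ranking together with a composite reward function that combines accuracy, logic, and format terms.
Evaluate on three multimodal reward benchmarks and validate each component through ablations.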
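
The entropy computation and pruning steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_fraction` parameter, the toy distributions, and the choice to average entropy over answer tokens are all assumptions made for illustration, since this summary does not give the exact formula or pruning threshold.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_entropy(answer_token_dists: Sequence[Sequence[float]]) -> float:
    """Mean entropy over the answer tokens (averaging is an assumption)."""
    if not answer_token_dists:
        raise ValueError("need at least one answer-token distribution")
    total = sum(token_entropy(d) for d in answer_token_dists)
    return total / len(answer_token_dists)


def curate(samples: List[Dict], keep_fraction: float) -> List[Dict]:
    """Prune high-entropy samples: keep the lowest-entropy fraction, sorted low to high."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda s: s["entropy"])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data (hypothetical): the model's distribution over two candidate
# answers ("response 1 is better" vs. "response 2 is better").
raw = [
    {"id": "confident", "dists": [[0.99, 0.01]]},
    {"id": "unsure", "dists": [[0.5, 0.5]]},
    {"id": "leaning", "dists": [[0.8, 0.2]]},
    {"id": "hesitant", "dists": [[0.6, 0.4]]},
]
samples = [{"id": r["id"], "entropy": answer_entropy(r["dists"])} for r in raw]
curated = curate(samples, keep_fraction=0.5)
print([s["id"] for s in curated])
```

In this toy run the two samples with near-uniform answer distributions are dropped, which mirrors the idea that high entropy flags samples that are either hard or unreliably labeled.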

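Experimental Results

Research Questions

RQ1: Can response entropy serve as an unsupervised proxy for sample difficulty and annotation noise in multimodal reward datasets?
RQ2: Does entropy-based data curation combined with curriculum training deliver better performance and data efficiency than uniform or accuracy-based alternatives?
RQ3: How does entropy-based selection, in particular answer-token entropy, compare with sentence entropy or hybrid metrics for data pruning?
RQ4: How do training data volume and entropy level affect model performance and robustness?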

Key Findings

Model	# Param	VL-Reward	Multimodal	MM-RLHF	Avg.	Avg. Gain
GPT-4o (2024-08-06)	–	65.80	70.80	58.23	64.94	–
Claude-3.7-Sonnet (2025-02-24)	–	66.31	71.90	82.35	73.52	↑ 8.58
SliME [24]	7B	19.04	42.00	17.10	26.05	↓ 38.89
VITA-1.5 [5]	7B	16.48	53.60	20.58	30.22	↓ 34.72
Qwen2-VL-72B [1]	72B	39.50	70.90	48.23	52.88	↓ 12.06
MM-RLHF-Reward [26]	7B	50.15	67.10	82.00	66.42	↑ 1.48
IXC-2.5-Reward [23]	7B	65.80	66.60	71.18	67.86	↑ 2.92
R1-Reward [25]	7B	72.89	82.20	80.59	78.56	↑ 13.62
EGT (Ours)	7B	77.15	84.30	85.88	82.44	↑ 17.50
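
The table does not define its last two columns, but the numbers are consistent with Avg. being the unweighted mean of the three benchmark scores and Avg. Gain being the difference from the GPT-4o average. The short check below reproduces three rows under that reading; the scores are copied from the table, while the helper name `average` is only illustrative.

```python
# Per-benchmark scores copied from the table: (VL-Reward, Multimodal, MM-RLHF).
scores = {
    "GPT-4o (2024-08-06)": (65.80, 70.80, 58.23),
    "R1-Reward": (72.89, 82.20, 80.59),
    "EGT (Ours)": (77.15, 84.30, 85.88),
}


def average(row):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(row) / len(row), 2)


# Assumption: Avg. Gain is measured against the GPT-4o row.
baseline = average(scores["GPT-4o (2024-08-06)"])

for name, row in scores.items():
    avg = average(row)
    print(f"{name}: avg={avg:.2f}, gain vs GPT-4o={avg - baseline:+.2f}")
```

Under this reading, EGT's average of 82.44 is 3.88 points above R1-Reward's 78.56, the strongest prior 7B reward model in the table.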

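EGT achieves state-of-the-art performance on three multimodal reward benchmarks, averaging 82.44 versus 78.56 for R1-Reward.
Training on only a low-entropy subset of 2,500 samples achieves results competitive with training on the full dataset.
Entropy serves as a reliable proxy for data difficulty and noise, enabling effective pruning.
Entropy-based selection outperforms random and accuracy-based selection strategies in ablations.
The lowest-entropy data yields the best performance, while high-entropy data can degrade learning.
The low-to-high entropy curriculum stabilizes optimization and improves data efficiency.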
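
The low-entropy subset and the low-to-high curriculum can be pictured with the sketch below. It assumes each sample already carries a precomputed `entropy` value. The `budget` and `num_stages` parameters, the equal-size staging, and the synthetic entropies are hypothetical choices, not details taken from the paper.

```python
from typing import Dict, List


def curriculum_stages(samples: List[Dict], budget: int, num_stages: int) -> List[List[Dict]]:
    """Select the `budget` lowest-entropy samples and split them into
    consecutive stages of increasing entropy (at most `num_stages` stages)."""
    if budget <= 0 or num_stages <= 0:
        raise ValueError("budget and num_stages must be positive")
    selected = sorted(samples, key=lambda s: s["entropy"])[:budget]
    if not selected:
        return []
    stage_size = -(-len(selected) // num_stages)  # ceiling division
    return [selected[i:i + stage_size] for i in range(0, len(selected), stage_size)]


# Synthetic pool of 100 samples with distinct entropies 0.00 .. 0.99 (toy values).
pool = [{"id": i, "entropy": ((i * 37) % 100) / 100} for i in range(100)]

stages = curriculum_stages(pool, budget=30, num_stages=3)
for idx, stage in enumerate(stages, start=1):
    lo, hi = stage[0]["entropy"], stage[-1]["entropy"]
    print(f"stage {idx}: {len(stage)} samples, entropy {lo:.2f} to {hi:.2f}")
```

With the 2,500-sample budget reported above, the same function would be called with `budget=2500`; training would then visit the stages in order, so the model sees the most confident samples first.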

This review was generated by AI and reviewed by a human editor.