QUICK REVIEW

[Paper Digest] Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels

Micah Rentschler, Jesse Roberts | arXiv (Cornell University) | Jan 29, 2026

Topic Modeling | Cited by: 0

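One-Sentence Summary

RLME trains language models with meta-evaluation signals from an evaluator instead of ground-truth labels, matching RLVR (reinforcement learning with verifiable rewards) in accuracy and sample efficiency while enabling multi-objective control and domain generalization.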

ABSTRACT

Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs may be trained with RL.

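Motivation and Objectives

Reduce reliance on ground-truth labels or task-specific verifiers by using meta-evaluation signals.
Make LLM alignment scalable through flexible, language-driven criteria.
Demonstrate competitive performance on reasoning-intensive tasks and in open-domain settings.
Study the robustness and failure modes of meta-evaluation, as well as multi-objective control.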

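Proposed Method

Given a prompt x, generate a response y with the policy πθ.
Query each evaluator πϕj with meta-question qk about the response to obtain a probability pkj.
Compute the reward r(x,y) as a weighted sum of log-probabilities across evaluators and meta-questions.
Update the generator with a GRPO-style objective, using CISPO to handle off-policy data.
Allow different evaluator configurations (frozen self-evaluation, frozen external evaluation, self-evaluation, ensemble evaluation) and meta-questions to shape the reward.
Compare RLME against an RLVR baseline to assess performance without ground-truth labels.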
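The reward and advantage steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clamping constant eps, and the mean/standard-deviation normalization (a common GRPO convention) are assumptions, and the probabilities in the demo are made up. The CISPO off-policy correction is omitted.

```python
import math
from typing import List, Sequence


def meta_reward(
    probs: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    eps: float = 1e-8,
) -> float:
    """Weighted sum of log-probabilities over meta-questions k and evaluators j.

    probs[k][j] is the probability that evaluator j gives a positive answer
    to meta-question k; weights[k][j] is the matching weight (hypothetical layout).
    """
    total = 0.0
    for p_row, w_row in zip(probs, weights):
        for p, w in zip(p_row, w_row):
            # Clamp to avoid log(0) when an evaluator is fully confident in "no".
            total += w * math.log(max(p, eps))
    return total


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one prompt's sampled responses within the group.

    Assumption: advantages are (reward - group mean) / group std.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, each scored by one evaluator
    # on two meta-questions (illustrative probabilities only).
    group_probs = [
        [[0.95], [0.90]],
        [[0.60], [0.70]],
        [[0.30], [0.50]],
        [[0.10], [0.20]],
    ]
    weights = [[1.0], [1.0]]
    rewards = [meta_reward(p, weights) for p in group_probs]
    print(group_relative_advantages(rewards))
```

Responses that the evaluator judges more favorably than their group peers receive positive advantages, so no ground-truth label enters the update.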

Figure 1: Overview of RLME. After generating an answer, one or more evaluators (may be the same model) assign probabilities to natural-language meta-questions about the output. These probabilities are aggregated into a scalar reward, which is then used to update the generative policy via reinforcem

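Experimental Results

Research Questions

RQ1: Can a single meta-question provide a reward signal strong enough to improve accuracy without ground-truth labels?
RQ2: On verifiable tasks, how do RLME's accuracy and sample efficiency compare with label-based RLVR?
RQ3: How do the choice of evaluator and multi-objective meta-questions affect alignment and generator behavior?
RQ4: What are the failure modes (such as reward hacking) and generalization properties of RL from meta-evaluation?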

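Key Findings

RLME achieves accuracy and sample efficiency on GSM8K comparable to RLVR, with accuracy above 90% in the reported experiments.
Despite never observing ground-truth answers, RLME's learning curves closely track those of RLVR.
Meta-evaluation provides a scalable reward signal that steers the model toward reliable reasoning patterns rather than post-hoc rationalization.
The framework supports controllable trade-offs among multiple objectives through the choice of meta-questions and their weights.
RLME generalizes to open-domain settings where ground-truth labels are unavailable, broadening RL-based alignment of LLMs.
The study analyzes generator/evaluator choices, self-evaluation, and potential reward-hacking behavior, clarifying both strengths and failure modes.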
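To make the multi-objective trade-off concrete, the sketch below shows how changing the weights on two meta-questions can flip which response earns the higher reward. The meta-question labels ("correct", "concise"), the candidate probabilities, and the helper names are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from typing import Dict

# Hypothetical evaluator probabilities of a positive answer per meta-question.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "A": {"correct": 0.9, "concise": 0.3},  # likely correct but verbose
    "B": {"correct": 0.7, "concise": 0.8},  # less certain but concise
}


def weighted_log_reward(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of log-probabilities over the weighted meta-questions."""
    return sum(w * math.log(probs[q]) for q, w in weights.items())


def preferred(weights: Dict[str, float]) -> str:
    """Return the name of the candidate with the highest reward under these weights."""
    return max(CANDIDATES, key=lambda name: weighted_log_reward(CANDIDATES[name], weights))


if __name__ == "__main__":
    print(preferred({"correct": 1.0, "concise": 0.0}))
    print(preferred({"correct": 0.5, "concise": 0.5}))
```

With all weight on correctness, candidate A scores higher; with the weight split evenly, candidate B does, which is the kind of control over the reward that the weight configuration provides.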


This digest was generated by AI and reviewed by human editors.