QUICK REVIEW

[Paper Review] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang|arXiv (Cornell University)|Feb 15, 2026
Topic Modeling|Cited by: 0
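One-Sentence Summary

HLE-Verified provides a two-stage verification-and-revision protocol that repairs and certifies Humanity's Last Exam, reducing annotation noise and giving a more faithful picture of model performance.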

ABSTRACT

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified

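Motivation and Objectives

  • Provide a rigorous, auditable verification protocol for HLE to reduce annotation noise.
  • Classify and repair flawed items while preserving the original evaluation intent.
  • Release gold, revised, and uncertain subsets with structured metadata to improve transparency.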

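Proposed Method

  • Decompose each item into three components (problem, answer, and rationale) and check the validity of each.
  • Stage I: binary expert validation plus model-assisted replication (pass@8), yielding the verified gold subset (668 items).
  • Stage II: fixable items receive independent expert revisions, supported by model-assisted auditing, followed by final adjudication, yielding the revised subset (1,143 items).
  • The uncertain items (689) are retained with explicit descriptions of their uncertainty for future community refinement.
  • Release detailed metadata, the defect taxonomy, and revision traces for auditability.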
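The two-stage routing described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `ItemReview` fields, the `triage` rule, and the way pass@8 is recorded are all hypothetical, and the released metadata schema may differ. In particular, the summary does not say how expert verdicts and model replication are combined, so the sketch records replication as supporting evidence only and does not gate on it.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

PASS_K = 8  # Stage I samples 8 model answers per item (pass@8)


class Subset(Enum):
    GOLD = "gold"
    REVISED = "revised"
    UNCERTAIN = "uncertain"


@dataclass
class ItemReview:
    # Hypothetical record; field names are illustrative, not the released schema.
    item_id: str
    expert_valid: bool             # Stage I: binary expert verdict on problem + answer
    model_hits: int                # Stage I: samples (out of PASS_K) matching the reference
    fixable: bool = False          # Stage II: flaw judged repairable without changing intent
    repair_certified: bool = False # Stage II: repair survived auditing and adjudication

    @property
    def replicated(self) -> bool:
        """pass@8 signal: at least one sampled answer reproduced the reference."""
        return self.model_hits > 0


def triage(review: ItemReview) -> Subset:
    """Route one item to a subset (assumed rule, for illustration only)."""
    if review.expert_valid:
        return Subset.GOLD
    if review.fixable and review.repair_certified:
        return Subset.REVISED
    # Anything neither validated nor certifiably repaired stays documented as uncertain.
    return Subset.UNCERTAIN


# Made-up reviews, only to exercise the routing.
reviews = [
    ItemReview("q1", expert_valid=True, model_hits=3),
    ItemReview("q2", expert_valid=False, model_hits=0, fixable=True, repair_certified=True),
    ItemReview("q3", expert_valid=False, model_hits=0, fixable=True, repair_certified=False),
]
counts = Counter(triage(r) for r in reviews)
```

As a sanity check on the reported sizes, the three subsets add up to 668 + 1,143 + 689 = 2,500 items.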
Figure 1: Structural composition of HLE-Verified.

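Experimental Results

Research Questions

  • RQ1: How does post-hoc verification of a high-difficulty benchmark after its release affect measured model performance?
  • RQ2: What failure modes are common in a multi-domain benchmark such as HLE, and how can they be corrected without changing task intent?
  • RQ3: Does component-wise verification yield more faithful model evaluation across domains?
  • RQ4: How does the revised benchmark affect calibration and confidence-related evaluation metrics?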

Key Findings

Model | Δ Acc on revised subset (percentage points)
Gemini-3-pro | +29.94
GPT-5.2 | +38.04
Claude-Opus4.5 | +32.94
Grok-4.1 fast-reasoning | +34.82
Claude-Opus4.6 | +30.13
DeepSeek-V3.2 | +39.58
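  • Across 8 frontier LLMs, average absolute accuracy on HLE-Verified is 7–10 percentage points higher than on HLE.
  • On items that were originally flawed but fixable, accuracy rises by 30–40 percentage points, indicating substantial benchmark noise in the original HLE.
  • Calibration error decreases on the revised subset, suggesting that confidence estimates are more trustworthy there.
  • Model confidence is strongly associated with the presence of errors in the problem statement or reference answer, supporting the effectiveness of the revisions.
  • The dataset is released as gold (668 items), revised (1,143 items), and uncertain (689 items) subsets with structured metadata.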
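The two quantities behind these findings, accuracy gain in percentage points and calibration error, can be sketched as follows. This is an illustrative sketch only: the summary does not say which calibration metric the paper uses, so expected calibration error (ECE) with equal-width bins is swapped in here as a common choice, and the function names and toy inputs are hypothetical rather than taken from the paper.

```python
def accuracy_pct(correct):
    """Accuracy in percent from a list of booleans (True = answer judged correct)."""
    return 100.0 * sum(correct) / len(correct)


def accuracy_gain_pp(correct_before, correct_after):
    """Absolute accuracy gain in percentage points, e.g. original vs. revised items."""
    return accuracy_pct(correct_after) - accuracy_pct(correct_before)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (an assumed metric, not confirmed by the summary).

    confidences: model-stated probabilities in [0, 1]
    correct:     booleans aligned with confidences
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the items.
        ece += (len(members) / n) * abs(avg_conf - acc)
    return ece


# Toy, made-up grading outcomes for one model on the same four items.
before = [True, False, False, False]  # graded against original references
after = [True, True, False, False]    # graded against revised references
gain = accuracy_gain_pp(before, after)
```

A model that is confidently "wrong" against an erroneous reference inflates calibration error, which is one way the reported link between model confidence and item errors could show up in a metric like this.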
Figure 2: HLE Revision Stage I. High-Difficulty Problem Validity Verification & Golden Subset Construction


This review was generated by AI and checked by human editors.