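[Paper Explained] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
HLE-Verified provides a two-stage verification-and-revision protocol that repairs and certifies Humanity's Last Exam, reducing annotation noise and giving a more faithful picture of model performance.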
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified
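Research Motivation and Goals
- Provide a rigorous, auditable verification protocol for HLE to reduce annotation noise.
- Classify and repair flawed items while preserving the original evaluation intent.
- Release gold, revised, and uncertain subsets with structured metadata for transparency.
Proposed Method
- Decompose each item into three components (problem, answer, and rationale) and check the validity of each.
- Stage I: binary expert validation plus model-assisted reproduction (pass@8), producing a verified gold subset (668 items).
- Stage II: fixable items receive independent expert revisions, backed by model-assisted suggestions, and go through final adjudication to form the revised subset (1,143 items).
- Uncertain items (689) are retained with explicit descriptions of their uncertainty, for future community refinement.
- The release includes detailed metadata, a defect taxonomy, and revision traces for auditability.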
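The two-stage workflow above can be pictured as a simple triage step. The following is a minimal sketch, not the authors' pipeline: the `Item` fields, the `triage` function, and the boolean flags (which stand in for the expert and model-assisted judgments) are hypothetical names introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical record; field names are assumptions, not the released schema."""
    item_id: str
    problem_valid: bool     # Stage I: problem statement judged sound
    answer_valid: bool      # Stage I: reference answer judged correct
    repair_certified: bool  # Stage II: a repair passed final adjudication


def triage(item: Item) -> str:
    """Assign an item to one of the three released subsets."""
    if item.problem_valid and item.answer_valid:
        return "gold"       # passes Stage I unchanged
    if item.repair_certified:
        return "revised"    # flawed, then repaired and certified in Stage II
    return "uncertain"      # retained with documented uncertainty


# Toy items, invented for this example.
items = [
    Item("q1", True, True, False),
    Item("q2", True, False, True),
    Item("q3", False, False, False),
]
counts = Counter(triage(it) for it in items)
print(dict(counts))
```

Keeping the subset label as a pure function of recorded judgments is one way to make the assignment auditable: anyone holding the metadata could recompute it.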

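Experimental Results
Research Questions
- RQ1: How does post-hoc verification, carried out after a high-difficulty benchmark has been released, affect measured model performance?
- RQ2: What failure modes are common in a multi-domain benchmark such as HLE, and how can they be corrected without changing task intent?
- RQ3: Does component-wise verification yield more faithful model evaluation across domains?
- RQ4: How does the revised benchmark affect calibration and confidence-based evaluation metrics?
Key Findings
| Model | Δ Acc on the revised subset (percentage points) |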
|---|---|
| Gemini-3-pro | +29.94 |
| GPT-5.2 | +38.04 |
| Claude-Opus4.5 | +32.94 |
| Grok-4.1 fast-reasoning | +34.82 |
| Claude-Opus4.6 | +30.13 |
| DeepSeek-V3.2 | +39.58 |
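
As a quick check on the table above, the listed gains can be averaged directly. This sketch only restates the six rows shown here (the paper evaluates eight models, so the mean below covers the listed rows only); the variable names are illustrative.

```python
from statistics import mean

# Accuracy gains on the revised subset, copied from the table above
# (percentage points). Only the six listed models are included.
delta_acc = {
    "Gemini-3-pro": 29.94,
    "GPT-5.2": 38.04,
    "Claude-Opus4.5": 32.94,
    "Grok-4.1 fast-reasoning": 34.82,
    "Claude-Opus4.6": 30.13,
    "DeepSeek-V3.2": 39.58,
}

avg_gain = mean(delta_acc.values())
largest = max(delta_acc, key=delta_acc.get)
smallest = min(delta_acc, key=delta_acc.get)

print(f"mean gain over listed models: {avg_gain:.2f} pp")
print(f"largest gain: {largest}, smallest gain: {smallest}")
```

The mean of the listed rows comes to about 34.24 percentage points, which sits inside the 30–40 point range reported for erroneous items.
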
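- Compared with HLE, the eight frontier models gain 7–10 percentage points in average accuracy on HLE-Verified.
- On originally flawed but fixable items, model accuracy rises by 30–40 percentage points, indicating substantial benchmark noise in the original HLE.
- Calibration error decreases on the revised subset, suggesting that confidence estimates can be assessed more reliably.
- Model confidence is strongly associated with the presence of errors in the problem statement or reference answer, supporting the effectiveness of the revisions.
- The dataset is released as gold (668 items), revised (1,143 items), and uncertain (689 items) subsets with structured metadata.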
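
To make the calibration finding concrete, here is a minimal sketch of one common calibration metric, binned expected calibration error (ECE). This summary does not say which calibration metric the paper uses, so ECE is an assumption, and the function name and toy data are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy data: a model answers four items with 90% confidence each.
confidences = [0.9, 0.9, 0.9, 0.9]
graded_with_noisy_keys = [True, False, False, False]  # wrong reference answers
graded_with_fixed_keys = [True, True, True, False]    # after hypothetical repairs

ece_noisy = expected_calibration_error(confidences, graded_with_noisy_keys)
ece_fixed = expected_calibration_error(confidences, graded_with_fixed_keys)
print(f"ECE with noisy keys: {ece_noisy:.2f}, with fixed keys: {ece_fixed:.2f}")
```

In this toy case the model's confidence does not change at all; only the grading does. That is one way a benchmark revision alone could lower measured calibration error.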

This explainer was generated by AI and reviewed by a human editor.