QUICK REVIEW

[Paper Review] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang|arXiv (Cornell University)|Feb 15, 2026

Topic Modeling | Cited by 0

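One-Sentence Summary

HLE-Verified provides a two-stage verification-and-revision protocol that repairs and certifies Humanity's Last Exam, reducing annotation noise and giving a more faithful picture of model performance.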

ABSTRACT

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30--40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified

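Research Motivation and Objectives

Provide a rigorous, auditable verification protocol for HLE to reduce annotation noise.
Classify and repair flawed items while preserving the original evaluation intent.
Release gold, revised, and uncertain subsets with structured metadata to improve transparency.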

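Proposed Method

Decompose each item into three components (problem, answer, and rationale) and check the validity of each.
Stage I: binary expert validation plus model-assisted replication (pass@8), yielding the verified gold subset (668 items).
Stage II: independent expert revision of fixable items, supported by model-assisted suggestions and final adjudication, yielding the revised subset (1,143 items).
Uncertain items (689) are retained with explicit descriptions of the uncertainty, for future community refinement.
The release includes detailed metadata, a defect taxonomy, and revision traces for auditability.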
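
The two-stage workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Item` fields, the exact-match comparison in `reproduced`, and the rule in `triage` are hypothetical, since this summary does not say how the authors combine expert verdicts with the pass@8 model signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    reference_answer: str
    sampled_answers: List[str]  # model samples used for a pass@k-style cross-check
    expert_valid: bool          # Stage I binary expert verdict on problem + answer
    repairable: bool            # whether experts judged the flaw fixable (Stage II)


def reproduced(item: Item, k: int = 8) -> bool:
    """Return True if any of the first k sampled answers matches the reference.

    Uses a naive normalized string match; a real checker would need
    answer-type-aware comparison.
    """
    ref = item.reference_answer.strip().lower()
    return any(ans.strip().lower() == ref for ans in item.sampled_answers[:k])


def triage(item: Item) -> str:
    """Assign an item to one of the three released subsets (assumed rule)."""
    if item.expert_valid:
        return "gold"       # passed Stage I as-is
    if item.repairable:
        return "revised"    # goes through Stage II repair and adjudication
    return "uncertain"      # kept, with documented uncertainty sources


if __name__ == "__main__":
    demo = Item("42", ["41", " 42 "], expert_valid=True, repairable=False)
    print(triage(demo), reproduced(demo))
```

In this sketch the model signal is kept separate from the subset label, so a disagreement between `expert_valid` and `reproduced` can be surfaced for a second look rather than silently overriding the expert. The three reported subsets sum to 668 + 1,143 + 689 = 2,500 items.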

Figure 1: Structural composition of HLE-Verified.

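Experimental Results

Research Questions

RQ1: How does post-hoc verification, carried out after a high-difficulty benchmark has been released, affect measured model performance?
RQ2: What failure modes are common in a multi-domain benchmark such as HLE, and how can they be corrected without changing task intent?
RQ3: Does component-wise verification yield more faithful model evaluation across domains?
RQ4: How does the revised benchmark affect calibration and confidence-related evaluation metrics?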

Key Findings

Model	Δ Acc (revised subset, percentage points)
Gemini-3-pro	+29.94
GPT-5.2	+38.04
Claude-Opus4.5	+32.94
Grok-4.1 fast-reasoning	+34.82
Claude-Opus4.6	+30.13
DeepSeek-V3.2	+39.58
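
As a quick arithmetic check on the table above, the snippet below averages the six listed gains. The values are copied from the table; the variable names are illustrative. Only six of the eight evaluated models appear in the table, so this is the mean over the listed rows, not over all models.

```python
from statistics import mean

# Δ Acc values copied from the table above (revised subset).
delta_acc = {
    "Gemini-3-pro": 29.94,
    "GPT-5.2": 38.04,
    "Claude-Opus4.5": 32.94,
    "Grok-4.1 fast-reasoning": 34.82,
    "Claude-Opus4.6": 30.13,
    "DeepSeek-V3.2": 39.58,
}

avg = mean(delta_acc.values())
smallest = min(delta_acc, key=delta_acc.get)
largest = max(delta_acc, key=delta_acc.get)

print(f"mean Δ Acc over {len(delta_acc)} listed models: {avg:.2f}")
print(f"smallest gain: {smallest} ({delta_acc[smallest]:.2f})")
print(f"largest gain: {largest} ({delta_acc[largest]:.2f})")
```

The listed gains average about 34.24, with Gemini-3-pro at 29.94 sitting just under the lower end of the 30 to 40 range quoted in the abstract.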

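Across 8 frontier LLMs, average accuracy on HLE-Verified is 7–10 percentage points higher than on HLE.
On items that were originally flawed but fixable, model accuracy rises by 30–40 percentage points, indicating substantial benchmark noise in the original HLE.
Calibration error falls on the revised subset, suggesting that confidence estimates can be assessed more reliably.
Model confidence is strongly associated with the presence of errors in the problem statement or reference answer, which supports the validity of the revisions.
The dataset is released as gold (668 items), revised (1,143 items), and uncertain (689 items) subsets, with structured metadata.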
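
To make the calibration finding concrete, here is a minimal sketch of expected calibration error (ECE), a commonly used calibration metric. This summary does not state which calibration metric the authors use, so treating it as ECE is an assumption, and the toy numbers below are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy illustration (invented data): the same confident answers graded
# against a noisy answer key and against a corrected one.
toy_conf = [0.9, 0.9, 0.9, 0.9]
graded_noisy_key = [True, False, False, True]
graded_fixed_key = [True, True, True, True]

ece_noisy = expected_calibration_error(toy_conf, graded_noisy_key)
ece_fixed = expected_calibration_error(toy_conf, graded_fixed_key)
print(f"ECE with noisy key: {ece_noisy:.2f}, with corrected key: {ece_fixed:.2f}")
```

The toy case shows the mechanism only: when a wrong reference answer marks confident, correct responses as failures, measured miscalibration is inflated, and correcting the key lowers it.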

Figure 2: HLE Revision Stage I. High-Difficulty Problem Validity Verification & Golden Subset Construction


This review was generated by AI and reviewed by human editors.