QUICK REVIEW

[Paper Review] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang | arXiv (Cornell University) | 2026-02-15

Topic Modeling | Citations: 0

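One-Line Summary

HLE-Verified applies a two-stage verification-and-revision protocol to repair and certify Humanity's Last Exam, reducing annotation noise and giving a more faithful picture of model performance.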

ABSTRACT

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 668 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,143 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate eight state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7–10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30–40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://huggingface.co/datasets/skylenage/HLE-Verified

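Research Motivation and Goals

Provide a rigorous, auditable verification protocol for HLE that reduces annotation noise.
Classify and repair flawed items while preserving the original evaluation intent.
Release the gold, revised, and uncertain subsets with structured metadata for transparency.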

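Proposed Method

Decompose each item into Problem, Answer, and Rationale components and apply a validity check to each component.
Stage I: Binary expert verification combined with model-assisted replication (pass@8) yields a gold subset of 668 verified items.
Stage II: Fixable items are repaired independently by experts, supported by model-assisted suggestions, and then pass through final adjudication, yielding 1,143 revised items.
The 689 uncertain items are retained with explicit uncertainty descriptors for future community refinement.
The release includes detailed metadata, a defect taxonomy, and revision-trace information for auditability.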
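
The two-stage workflow can be read as a routing rule over items. The Python sketch below is only an illustration under stated assumptions: the `Item` class, its field names (`stage1_valid`, `fixable`, `repair_certified`), and the `route` function are hypothetical and are not taken from the paper or from the released dataset schema. The toy items are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical per-item verdicts; these field names are illustrative assumptions."""
    stage1_valid: bool = False      # Stage I: passed expert review and model cross-checks
    fixable: bool = False           # Stage II: flaw judged repairable without changing intent
    repair_certified: bool = False  # Stage II: repair accepted at final adjudication


def route(item: Item) -> str:
    """Map an item's verdicts to one of the three released subsets."""
    if item.stage1_valid:
        return "gold"
    if item.fixable and item.repair_certified:
        return "revised"
    return "uncertain"


# Toy data, not drawn from the real benchmark.
items = [
    Item(stage1_valid=True),
    Item(fixable=True, repair_certified=True),
    Item(fixable=True, repair_certified=False),
    Item(),
]
counts = Counter(route(it) for it in items)
print(dict(counts))  # {'gold': 1, 'revised': 1, 'uncertain': 2}
```

In this reading, an item whose repair is attempted but not certified falls back to the uncertain set rather than being dropped, which matches the description of the uncertain items being kept for later refinement.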

Figure 1: Structural composition of HLE-Verified.
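
As a quick check on the composition named in Figure 1, the sketch below tallies the three subset sizes reported in this review (668 gold, 1,143 revised, 689 uncertain) and converts them to shares. The total and the percentages are derived arithmetic, not figures quoted from the paper, and the names `SUBSET_SIZES` and `composition` are illustrative.

```python
# Subset sizes as reported in this review.
SUBSET_SIZES = {"gold": 668, "revised": 1143, "uncertain": 689}


def composition(sizes):
    """Return (total item count, {subset: share in percent, rounded to 2 decimals})."""
    total = sum(sizes.values())
    shares = {name: round(100 * n / total, 2) for name, n in sizes.items()}
    return total, shares


total, shares = composition(SUBSET_SIZES)
print(total)   # 2500
print(shares)  # {'gold': 26.72, 'revised': 45.72, 'uncertain': 27.56}
```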

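Experimental Results

Research Questions

RQ1: How much does post-publication verification of a high-difficulty benchmark affect measured model performance?
RQ2: What failure modes commonly appear in multi-domain benchmarks such as HLE, and how can they be fixed without changing the task intent?
RQ3: Does component-wise verification yield more faithful model evaluation across domains?
RQ4: How does the revised benchmark affect calibration and confidence-related evaluation metrics?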

Key Results

Model	Δ Accuracy on the revised subset (percentage points)
Gemini-3-pro	+29.94
GPT-5.2	+38.04
Claude-Opus4.5	+32.94
Grok-4.1 fast-reasoning	+34.82
Claude-Opus4.6	+30.13
DeepSeek-V3.2	+39.58
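
To see how the per-model gains relate to the 30–40 point range quoted in the abstract, the six values in the table can be summarized directly. This is a small sketch; the numbers are copied from the table, while the `DELTAS` and `summarize` names are illustrative. Only six of the eight evaluated models are listed here, so the mean covers those six.

```python
from statistics import mean

# Accuracy gains on the revised subset, in percentage points, copied from the table.
DELTAS = {
    "Gemini-3-pro": 29.94,
    "GPT-5.2": 38.04,
    "Claude-Opus4.5": 32.94,
    "Grok-4.1 fast-reasoning": 34.82,
    "Claude-Opus4.6": 30.13,
    "DeepSeek-V3.2": 39.58,
}


def summarize(deltas):
    """Return the min, max, and mean gain (mean rounded to 2 decimals)."""
    values = list(deltas.values())
    return {"min": min(values), "max": max(values), "mean": round(mean(values), 2)}


summary = summarize(DELTAS)
print(summary)  # {'min': 29.94, 'max': 39.58, 'mean': 34.24}
```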

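Eight state-of-the-art LLMs show an average absolute accuracy gain of 7–10 percentage points on HLE-Verified relative to HLE.
On items that were originally erroneous but fixable, model accuracy rises by 30–40 percentage points, suggesting substantial benchmark noise in the original HLE.
Calibration error decreases on the revised subset after verification, suggesting more reliable confidence estimates.
Model confidence is strongly associated with the presence of errors in the problem statement or reference answer, which supports the effectiveness of the revisions.
The dataset is released as gold (668 items), revised (1,143 items), and uncertain (689 items) subsets with structured metadata.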
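
The calibration finding can be made concrete with a small sketch. This review does not state which calibration metric the paper uses, so the block below assumes expected calibration error (ECE) with equal-width bins, a common choice. The function name and the toy confidences and labels are hypothetical and are meant only to show why grading against a corrected answer key can lower the measured error.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Toy data: the same model confidences graded against two answer keys.
conf = [0.9, 0.9, 0.9, 0.9]
graded_with_noisy_key = [True, False, False, True]   # two answers marked wrong by a flawed key
graded_with_fixed_key = [True, True, True, False]    # hypothetical regrade after repair

ece_noisy = expected_calibration_error(conf, graded_with_noisy_key)
ece_fixed = expected_calibration_error(conf, graded_with_fixed_key)
print(ece_noisy > ece_fixed)  # True
```

In the toy case the model's confidence does not change at all; only the labels do. That is one way a noisy reference answer can make a model look worse calibrated than it is.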

Figure 2: HLE Revision Stage I. High-Difficulty Problem Validity Verification & Golden Subset Construction

This review was generated by AI and reviewed by a human editor.