QUICK REVIEW

[Paper Review] Measuring short-form factuality in large language models

Jason Lee, Nguyen Karina|arXiv (Cornell University)|Nov 7, 2024
Natural Language Processing Techniques | Cited by: 11
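One-line summary

SimpleQA is a benchmark that evaluates whether frontier language models can give a single, verifiable answer to short, fact-seeking questions. It emphasizes high answer correctness, fast grading, and a difficulty level that remains challenging for the latest models.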

ABSTRACT

We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions. We prioritized two properties in designing this eval. First, SimpleQA is challenging, as it is adversarially collected against GPT-4 responses. Second, responses are easy to grade, because questions are created such that there exists only a single, indisputable answer. Each answer in SimpleQA is graded as either correct, incorrect, or not attempted. A model with ideal behavior would get as many questions correct as possible while not attempting the questions for which it is not confident it knows the correct answer. SimpleQA is a simple, targeted evaluation for whether models "know what they know," and our hope is that this benchmark will remain relevant for the next few generations of frontier models. SimpleQA can be found at https://github.com/openai/simple-evals.

Motivation and objectives

  • Evaluate the factual short-form answering ability of large language models using a carefully constructed, adversarially crafted dataset.
  • Ensure questions have a single, indisputable answer and are time-stable to maintain evergreen validity.
  • Provide a fast, scalable grading workflow via prompts and API-based evaluation.
  • Measure calibration of model confidence versus actual correctness.
  • Offer open-source access to SimpleQA for reproducibility and broad use in frontier-model research.

Proposed method

  • Construct 4,326 short, fact-seeking questions with a single answer.
  • Have two AI trainers answer each question independently, and keep only the questions on which their answers match.
  • Grade model outputs as Correct, Incorrect, or Not Attempted using a prompted grader.
  • Compute an F-score as the harmonic mean of overall correct and correct-given-attempted.
  • Annotate questions for topic, answer type, and diversity of sources.
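
The F-score definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the simple-evals repository: the function name `simpleqa_metrics` and the label strings `correct`, `incorrect`, and `not_attempted` are assumptions chosen for the example, and the counts are made up to mirror the OpenAI o1-preview percentages on a hypothetical 1,000-question sample.

```python
from collections import Counter

# Hypothetical label names; the real grader's output format may differ.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def simpleqa_metrics(grades):
    """Summarize a list of per-question grades as fractions in [0, 1]."""
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")

    # Overall correct: share of ALL questions answered correctly.
    correct = counts["correct"] / total
    # Correct given attempted: share of ATTEMPTED questions answered correctly.
    attempted = counts["correct"] + counts["incorrect"]
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    # F-score: harmonic mean of the two quantities above.
    denom = correct + correct_given_attempted
    f_score = 2 * correct * correct_given_attempted / denom if denom else 0.0

    return {
        "correct": correct,
        "not_attempted": counts["not_attempted"] / total,
        "incorrect": counts["incorrect"] / total,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Illustrative counts only (not the real per-question grades).
grades = ["correct"] * 427 + ["incorrect"] * 481 + ["not_attempted"] * 92
metrics = simpleqa_metrics(grades)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these counts the F-score comes out at about 0.448, in line with the 44.8 reported for OpenAI o1-preview. Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F-score by abstaining on nearly everything or by guessing on everything.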

Experimental results

Research questions

  • RQ1: Can frontier LLMs reliably provide correct short-form factual answers to single-answer questions?
  • RQ2: Do larger models exhibit better correctness and calibration on short-form factuality tasks?
  • RQ3: How calibrated are model confidence and answer frequency with respect to actual correctness?
  • RQ4: What is the effect of model size and family (e.g., GPT-4o, Claude) on behavior in answering, not answering, or guessing?

Key findings

Model | Correct (%) | Not attempted (%) | Incorrect (%) | Correct given attempted (%) | F-score
Claude-3-haiku (2024-03-07) | 5.1 | 75.3 | 19.6 | 20.6 | 8.2
Claude-3-sonnet (2024-02-29) | 5.7 | 75.0 | 19.3 | 22.9 | 9.2
Claude-3-opus (2024-02-29) | 23.5 | 39.6 | 36.9 | 38.8 | 29.3
Claude-3.5-sonnet (2024-06-20) | 28.9 | 35.0 | 36.1 | 44.5 | 35.0
GPT-4o-mini | 8.6 | 0.9 | 90.5 | 8.7 | 8.6
GPT-4o | 38.2 | 1.0 | 60.8 | 38.0 | 38.4
OpenAI o1-mini | 8.1 | 28.5 | 63.4 | 11.3 | 9.4
OpenAI o1-preview | 42.7 | 9.2 | 48.1 | 47.0 | 44.8

  • Within each model family, larger models outperform smaller ones: GPT-4o answers 38.2% of questions correctly versus 8.6% for GPT-4o-mini, OpenAI o1-preview 42.7% versus 8.1% for o1-mini, and Claude-3-opus 23.5% versus 5.1% for Claude-3-haiku.
  • Models trade off answering against abstaining. GPT-4o-mini attempts almost every question (0.9% not attempted) but is incorrect on 90.5% of them, whereas Claude-3-haiku declines to answer 75.3% of questions.
  • Calibration analyses show that stated confidence and answer frequency both correlate with accuracy, but models substantially overstate their confidence.
  • The best F-score among the evaluated models is 44.8 (OpenAI o1-preview), well below perfect performance, which indicates that the benchmark is still challenging for frontier models.
  • SimpleQA provides an accessible metric for measuring whether models know what they know and can be extended to future frontier-model generations.
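
One common way to quantify the overconfidence described above is to bin answers by stated confidence and compare each bin's mean confidence with its accuracy. The sketch below is a generic binned-calibration calculation, not necessarily the procedure used in the paper; the function names, the ten-bin default, and the toy records are all assumptions for illustration.

```python
def calibration_bins(records, n_bins=10):
    """Group (confidence, is_correct) pairs into equal-width confidence bins.

    Returns one (mean_confidence, accuracy, count) tuple per non-empty bin,
    ordered from the lowest-confidence bin to the highest.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        # Confidence 1.0 is folded into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, bool(is_correct)))

    rows = []
    for items in bins:
        if not items:
            continue
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((mean_confidence, accuracy, len(items)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between mean confidence and accuracy."""
    total = sum(count for _, _, count in rows)
    return sum(count * abs(conf - acc) for conf, acc, count in rows) / total


# Toy data (made up): the model claims 90% confidence but is right 1 time in 4.
records = [
    (0.9, True), (0.9, False), (0.9, False), (0.9, False),
    (0.5, True), (0.5, False),
]
rows = calibration_bins(records)
ece = expected_calibration_error(rows)
for mean_confidence, accuracy, count in rows:
    print(f"confidence={mean_confidence:.2f} accuracy={accuracy:.2f} n={count}")
print(f"expected calibration error: {ece:.3f}")
```

A perfectly calibrated model would have accuracy equal to mean confidence in every bin. Bins where confidence exceeds accuracy, like the 0.9 bin in the toy data, correspond to the overstated confidence reported in the findings.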


This review was generated by AI and checked by a human editor.