QUICK REVIEW

[Paper Review] AI Benchmark Half-Life in Recursive Corpora: A Theory of Validity Decay under Semantic Leakage and Regeneration

Colin White|arXiv (Cornell University)|Jun 27, 2024

Digital Rights Management and Security|Citations: 18

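One-Line Summary

This paper introduces LiveBench, a contamination-resistant LLM benchmark that scores answers automatically against ground truth across six categories and is updated monthly, so that model progress can be tracked without the biases of human or LLM judges.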

ABSTRACT

Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be resistant to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-limited versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 405B in size. LiveBench is difficult, with top models achieving below 70% accuracy. We release all questions, code, and model answers. Questions are added and updated on a monthly basis, and we release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.

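Motivation and Objectives

Motivate the need for LLM evaluation that is robust to test set contamination and to judge bias.
Present LiveBench as a contamination-limited benchmark with automatic ground-truth scoring.
Provide diverse, up-to-date tasks drawn from recent information sources.
Demonstrate benchmark updates and continuous evaluation across a wide range of model families.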

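Proposed Method

Define six benchmark categories: math, coding, reasoning, data analysis, instruction following, and language comprehension.
Include two kinds of tasks in each category: questions based on recent information, and harder versions of tasks from existing benchmarks.
Score answers against objective ground truth without an LLM judge, which makes evaluation fully automatic.
Start with 18 tasks and evaluate more than 40 models from commercial and open-source families.
Update questions monthly, refreshing roughly one-sixth of the items in each release to maintain difficulty.
Analyze correlations between categories, compare against other benchmarks, and run ablations.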
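
The ground-truth scoring step above can be illustrated with a minimal sketch. It is not the LiveBench implementation: the record fields (`id`, `category`, `ground_truth`), the whitespace- and case-insensitive exact match, and the choice to report the overall score as the mean of per-category averages are all assumptions made for illustration. Real tasks would likely need task-specific checkers.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(answer.strip().lower().split())


def score_answers(questions, model_answers):
    """Score answers against ground truth without any LLM judge.

    questions: list of dicts with hypothetical keys "id", "category", "ground_truth".
    model_answers: dict mapping question id to the model's final answer string.
    Returns (per-category accuracy in percent, overall score in percent).
    """
    per_category = defaultdict(list)
    for q in questions:
        predicted = model_answers.get(q["id"], "")  # a missing answer counts as wrong
        correct = normalize(predicted) == normalize(q["ground_truth"])
        per_category[q["category"]].append(1.0 if correct else 0.0)

    category_scores = {
        category: 100.0 * sum(marks) / len(marks)
        for category, marks in per_category.items()
    }
    # Assumption: the overall score weights every category equally.
    overall = sum(category_scores.values()) / len(category_scores)
    return category_scores, overall
```

Because the comparison is deterministic, the same model answers always receive the same score, which is the property that lets evaluation run without a human or LLM judge.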

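Experimental Results

Research Questions

RQ1: Can LiveBench measure model capability reliably while minimizing test set contamination and judge bias?
RQ2: How do top models perform across diverse categories on questions that reflect recent data sources?
RQ3: How is model performance related across categories and tasks?
RQ4: How does LiveBench compare with ChatBot Arena and Arena-Hard in terms of trends and biases?
RQ5: How do monthly updates affect model rankings and overall difficulty?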

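Key Findings

LiveBench achieves broad coverage across six categories, with answers scored automatically against ground truth.
Top models struggle to exceed 70% accuracy, indicating sustained difficulty and effective contamination control.
Correlation analysis shows that math, coding, and reasoning are correlated with one another, while instruction following correlates more weakly with the other categories.
LiveBench scores show high rank stability across updates (above 0.997), while overall difficulty increases gradually over time.
Ground-truth judging correlates with LLM-based judging across the benchmark, but the two do not agree perfectly.
Open-source models such as llama-3.1-405b-instruct and qwen2.5-72b-instruct are competitive and often on par with large commercial models.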
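
One way to quantify rank stability across updates is a rank correlation between model scores before and after a release. The sketch below uses Spearman's coefficient, which is an assumption: this review does not say which coefficient lies behind the 0.997 figure. The model names and scores in the usage lines are hypothetical placeholders, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are equal")
    return cov / (var_x * var_y) ** 0.5


def spearman(scores_before, scores_after):
    """Spearman rank correlation over the models present in both releases."""
    models = sorted(set(scores_before) & set(scores_after))
    x = [scores_before[m] for m in models]
    y = [scores_after[m] for m in models]
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: every model drops after the update, but the order is unchanged.
before = {"model_a": 60.0, "model_b": 50.0, "model_c": 40.0}
after = {"model_a": 55.0, "model_b": 45.0, "model_c": 35.0}
stability = spearman(before, after)
```

A value near 1 alongside lower absolute scores matches the pattern reported above: updates make the benchmark harder without reshuffling the model ordering.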

This review was generated by AI and checked by a human editor.