[Paper Review] Position: AI Evaluation Should Learn from How We Test Humans
The paper introduces a Computerized Adaptive Testing (CAT) framework to efficiently evaluate LLM cognitive abilities, comparing models to humans and among models, using IRT-based item parameters and Fisher information for adaptive question selection.
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark, and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
Research Motivation and Objectives
- Motivate and formalize an adaptive evaluation framework for LLMs inspired by human testing.
- Propose a two-stage CAT pipeline: (1) construct calibrated question pools via IRT; (2) adaptively select questions to estimate model ability.
- Enable model-vs-human and model-vs-model comparisons across subject knowledge, mathematical reasoning, and programming.
- Demonstrate efficiency gains over fixed benchmarks and provide diagnostic reports of LLMs.
Proposed Method
- Adopt the three-parameter logistic (IRT-3PL) model to calibrate each question with difficulty (β), discrimination (α), and guessing (c) parameters.
- Use joint maximum likelihood estimation to infer item and ability parameters from human response data.
- Apply Fisher information-based item selection to choose the next question that maximizes information at the current ability estimate.
- Estimate LLM ability θ^t via maximum likelihood given responses up to step t, with asymptotic normality and variance 1/(t I(θ)).
- Conduct a two-stage adaptive testing process: (1) build a calibrated question pool from the MOOC, MATH, and CODIA datasets; (2) interact with the LLM and update θ^t after each response to guide the next question selection.
- Compare ChatGPT to humans and rank 6 instruction-tuned LLMs across Subject Knowledge, Mathematical Reasoning, and Programming using the adaptive framework.
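The selection and estimation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard 3PL response function and its textbook Fisher information, and it replaces a proper optimizer with a grid search over θ in [-4, 4]. The function names, the item pool values, and the `answer_fn` callback are hypothetical.

```python
import math

def p_correct(theta, alpha, beta, c):
    """3PL probability of a correct answer at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-alpha * (theta - beta)))

def fisher_information(theta, alpha, beta, c):
    """Textbook Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, alpha, beta, c)
    return alpha ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

def select_next_item(theta, pool, asked):
    """Pick the unasked item with maximum information at the current estimate."""
    candidates = [i for i in range(len(pool)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *pool[i]))

def estimate_ability(responses, pool):
    """Grid-search maximum likelihood estimate of theta (assumed range [-4, 4])."""
    grid = [g / 100.0 for g in range(-400, 401)]

    def log_lik(theta):
        total = 0.0
        for item, correct in responses:
            p = p_correct(theta, *pool[item])
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_lik)

def run_cat(answer_fn, pool, max_items=20, theta0=0.0):
    """Adaptive loop: select an item, record the response, re-estimate theta."""
    theta, asked, responses = theta0, set(), []
    for _ in range(min(max_items, len(pool))):
        item = select_next_item(theta, pool, asked)
        asked.add(item)
        responses.append((item, bool(answer_fn(item))))
        theta = estimate_ability(responses, pool)
    return theta, responses

if __name__ == "__main__":
    # Hypothetical calibrated pool: (alpha, beta, c) per question.
    pool = [(1.0, -2.0, 0.2), (1.5, 0.0, 0.2), (1.0, 2.0, 0.2), (1.2, 1.0, 0.25)]
    # Hypothetical examinee that answers only the two easiest items correctly.
    theta, log = run_cat(lambda i: i in (0, 1), pool, max_items=4)
    print(theta, log)
```

In a real setting `answer_fn` would send the selected question to the LLM and grade its reply, and the item parameters would come from the calibration stage on human response data rather than being hard-coded.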
Experimental Results
Research Questions
- RQ1: How can adaptive testing with IRT-derived item parameters efficiently estimate an LLM's cognitive ability?
- RQ2: Can CAT-derived diagnostics reveal how LLMs compare to humans and to each other across knowledge domains, reasoning, and programming?
- RQ3: What is the efficiency gain (in question count) of adaptive testing versus fixed benchmarks for evaluating LLMs?
- RQ4: Do LLMs exhibit human-like response patterns (e.g., guessing/slipping), and how does this affect the reliability of AI evaluations?
Key Findings
- CAT achieves higher evaluation efficiency, needing at most 20 questions to reach accuracy similar to that of fixed tests.
- GPT4 leads among evaluated LLMs in mathematical reasoning, programming, and subject knowledge, often surpassing high-ability humans in several concepts.
- ChatGPT shows strong programming ability in certain areas (e.g., dynamic programming, search) but is weaker in basic programming concepts and some mathematical reasoning tasks.
- ChatGPT behaves like a “careless student” with slip and guessing tendencies, and its responses can be fickle across repeated prompts.
- Adaptive question selection yields distinct yet overlapping question sets across models, with Jaccard similarities around 0.6, enabling both model-specific and joint evaluation benefits.
- Across domains, GPT4 generally achieves the highest average ability estimates, with Spark and Bard following, while ERNIEBot and QianWen trail.
- Programming and MOOC concept-level results indicate model-specific strengths and weaknesses, guiding targeted improvement.
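The Jaccard similarity mentioned in the findings measures how much two models' adaptively selected question sets overlap: the size of the intersection divided by the size of the union. A minimal sketch follows; the question IDs are made up for illustration and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of item IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(union)

if __name__ == "__main__":
    # Hypothetical question IDs selected for two different models.
    selected_model_a = [3, 7, 12, 18, 25]
    selected_model_b = [7, 12, 18, 25, 31]
    print(jaccard(selected_model_a, selected_model_b))
```

A value near 0.6 means the two models shared a majority of their questions while each still received some items tailored to its own estimated ability.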
This review was generated by AI and checked by a human editor.