[Paper Review] From Human-Level AI Tales to AI Leveling Human Scales
The paper proposes a framework to calibrate AI capabilities on population-anchored, logarithmic scales by mapping item difficulty across human diversity using LLM-assisted demographic adjustment, enabling commensurate AI-human comparisons across benchmarks.
Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the "world population" and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success for the whole world population on a logarithmic scale with base $B$. We calibrate the scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UK Biobank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, under the hypothesis that LLMs condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow scales to be recalibrated and standardized relative to the whole-world population.
Motivation and Objectives
- Motivate the need for commensurate, population-anchored AI evaluation beyond “human-level” benchmarks.
- Propose a pipeline to map item difficulties onto world-population scales using 18 capability dimensions.
- Leverage LLMs to extrapolate from sub-group data to world-population success rates.
- Calibrate the logarithm base B for each dimension to achieve commensurability across tasks and domains.
Proposed Method
- Annotate each item with 18 cognitive demands using the ADeLe framework to obtain ratio-scaled demand levels.
- Gather item pools with observed human performance from PISA, TIMSS, ICAR, UK Biobank, and ReliabilityBench.
- Use LLMs to perform demographic post-stratification and predict world-population success probabilities for each item.
- Transform each item's world-population success probability $p^W_i$ into a logarithmic difficulty level on a base-$B$ scale: $L_i = -\log_B(p^W_i)$.
- Validate extrapolations with sub-group to full-sample comparisons using MAE, RMSE, and correlations (Pearson r, Spearman ρ).
- Calibrate dimension-specific bases B by dominance-filtered means to improve commensurability across dimensions.
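The level transformation and the idea of post-stratification listed above can be illustrated with a minimal Python sketch. It is not the paper's pipeline: the paper uses LLMs for the demographic adjustment, whereas this sketch swaps in a plain weighted average over groups. The function names, group labels, and population shares are all hypothetical and chosen only for illustration.

```python
import math


def difficulty_level(p_world: float, base: float) -> float:
    """Return L = -log_B(p) for a success probability p and base B."""
    if not 0.0 < p_world <= 1.0:
        raise ValueError("p_world must be in (0, 1]")
    if base <= 0.0 or base == 1.0:
        raise ValueError("base must be positive and different from 1")
    # Change of base: log_B(p) = ln(p) / ln(B)
    return -math.log(p_world) / math.log(base)


def post_stratify(rates: dict[str, float], shares: dict[str, float]) -> float:
    """Weighted average of per-group success rates by population share.

    Classical post-stratification, used here as a stand-in for the
    LLM-assisted adjustment described in the review.
    """
    total = sum(shares[group] for group in rates)
    return sum(rates[group] * shares[group] for group in rates) / total


if __name__ == "__main__":
    # Hypothetical per-group success rates and population shares for one item.
    rates = {"group_a": 0.8, "group_b": 0.4}
    shares = {"group_a": 0.25, "group_b": 0.75}
    p_world = post_stratify(rates, shares)
    print(p_world, difficulty_level(p_world, base=10.0))
```

With base 10, an item that 1% of the population solves sits at roughly level 2, and one that 0.1% solves sits at roughly level 3, which is the sense in which each level step corresponds to a factor of B in success probability.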

Experimental Results
Research Questions
- RQ1: Can LLM-assisted demographic adjustment reliably predict world-population item success from subgroup data?
- RQ2: Do population-anchored, logarithmic capability scales align with empirical world-population performance across diverse domains?
- RQ3: Which base B per dimension best achieves commensurability between theoretical ADeLe levels and observed world-population difficulty?
- RQ4: How do calibration outcomes affect the ranking and interpretation of AI versus human performance across benchmarks?
Key Findings
- Extrapolation from sub-groups to full populations yields low MAE (≈0.03–0.04) and high Pearson correlations (r > 0.92) on ICAR, indicating reliable demographic adjustment in homogeneous item spaces.
- TIMSS shows weaker extrapolation performance (MAE ≈0.12–0.16; correlations ≈0.5–0.7) due to greater heterogeneity across datasets.
- Empirically, a universal base B=10 is insufficient; dimension-specific bases are needed, with volume/attention showing higher bases (B ≈ 17–32) and invariant dimensions showing lower bases (B ≈ 1–5).
- Calibration reveals three dimension groups: High-Base (B > 10), covering Volume and Attention; Standard-Scaling (3 < B < 10), covering Metacognition and Knowledge; and Invariant (B ≈ 1), covering Comprehension/Expression and Spatial Reasoning/Nav.
- The dominance-filtered, mean-based calibration approach provides dimension-specific B values that improve the alignment between annotated theoretical levels and observed world-population difficulty.
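The error and correlation measures named in these findings (MAE, RMSE, Pearson r, Spearman ρ) can be computed as in the following sketch, which uses only the standard library. The function names and the sample numbers are hypothetical. The review does not say how the paper handles ties in Spearman ρ, so the average-rank convention used here is an assumption.

```python
import math


def mae(pred: list[float], obs: list[float]) -> float:
    """Mean absolute error between predicted and observed success rates."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)


def rmse(pred: list[float], obs: list[float]) -> float:
    """Root mean squared error between predicted and observed success rates."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient (inputs must not be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical extrapolated vs. full-sample success rates for four items.
    predicted = [0.82, 0.55, 0.31, 0.12]
    observed = [0.80, 0.60, 0.28, 0.15]
    print(mae(predicted, observed), rmse(predicted, observed))
    print(pearson(predicted, observed), spearman(predicted, observed))
```

MAE and RMSE are on the same 0 to 1 scale as the success rates, so an MAE of about 0.03 corresponds to an average miss of about three percentage points per item.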

This review was generated by AI and checked by a human editor.