QUICK REVIEW

[Paper Review] OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

Zhen Huang, Zengzhi Wang|arXiv (Cornell University)|Jun 24, 2024

Explainable Artificial Intelligence (XAI)|Cited by 7

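One-Sentence Summary

This paper proposes the OlympicArena Medal Table for ranking AI models across disciplines, compares Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o (and other models) on the OlympicArena benchmark, and analyzes their strengths, the gaps between them, and the effects of language and modality.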

ABSTRACT

In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).

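Motivation and Objectives

Evaluate the latest AI models on the OlympicArena benchmark to identify the current leaders and gaps in multi-discipline cognitive ability.
Introduce the OlympicArena Medal Table as an explicit cross-discipline ranking framework.
Provide fine-grained analysis by discipline, reasoning type, language, and modality to understand each model's strengths and limitations.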

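Proposed Method

Use the OlympicArena test split to avoid data leakage and to enable rule-based evaluation.
Evaluate both LMMs and LLMs, covering text-only and multimodal inputs.
Compute accuracy for non-programming tasks and pass@k (k=1, n=5) for programming tasks.
Rank models in the OlympicArena Medal Table by Gold, Silver, and Bronze medals, then by Overall score.
Run fine-grained analyses by discipline, reasoning type, language, and modality.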
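
The pass@k metric mentioned above can be illustrated with a short sketch. This review does not say which estimator the authors use, so the code below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass all tests. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, of which c are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    The review only states k=1 and n=5, not the exact formula used.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With the settings reported in the review (k=1, n=5), the estimate
# is simply the fraction of correct samples, c / n.
for correct in range(6):
    print(correct, round(pass_at_k(n=5, c=correct, k=1), 2))
```

With k=1 the estimator reduces to c/n, so sampling five solutions per problem mainly serves to reduce the variance of the reported score.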

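Experimental Results

Research Questions

RQ1: Which AI models (Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o) win the top medals across the OlympicArena disciplines?
RQ2: How do open-source models perform relative to proprietary models on multi-discipline cognitive tasks?
RQ3: What are the models' relative strengths on traditional math/coding tasks versus knowledge-intensive science fields (physics, chemistry, biology)?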

Key Findings

Model	Gold	Silver	Bronze	Total	Overall Score
GPT-4o	4	3	0	7	40.47
Claude-3.5-Sonnet	3	3	0	6	39.24
GPT-4V	0	1	1	2	33.17
Gemini-1.5-Pro	0	0	6	6	35.09
Claude-3-Sonnet	0	0	0	0	25.53
Qwen1.5-32B-Chat	0	0	0	0	24.36
Qwen-VL-Max	0	0	0	0	21.41
Gemini-Pro-Vision	0	0	0	0	21.02
LLaVA-NeXT-34B	0	0	0	0	18.16
Yi-34B-Chat	0	0	0	0	18.01
InternVL-Chat-V1.5	0	0	0	0	17.39
InternLM2-Chat-20B	0	0	0	0	17.33
Yi-VL-34B	0	0	0	0	15.07
Qwen-VL-Chat	0	0	0	0	7.34
Qwen-7B-Chat	0	0	0	0	4.34
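
The row order in the table above is consistent with sorting by Gold, then Silver, then Bronze, with Overall Score as the final tie-breaker. The following minimal sketch reproduces that ordering for a few rows copied from the table. The names `ModelResult` and `rank_models` are hypothetical and are not taken from the OlympicArena codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    gold: int
    silver: int
    bronze: int
    overall: float


def rank_models(results):
    """Sort by Gold, Silver, Bronze, then Overall Score, all descending."""
    return sorted(
        results,
        key=lambda r: (-r.gold, -r.silver, -r.bronze, -r.overall),
    )


# A subset of rows from the table, deliberately listed out of order.
results = [
    ModelResult("Gemini-1.5-Pro", 0, 0, 6, 35.09),
    ModelResult("Qwen1.5-32B-Chat", 0, 0, 0, 24.36),
    ModelResult("GPT-4V", 0, 1, 1, 33.17),
    ModelResult("Claude-3.5-Sonnet", 3, 3, 0, 39.24),
    ModelResult("Claude-3-Sonnet", 0, 0, 0, 25.53),
    ModelResult("GPT-4o", 4, 3, 0, 40.47),
]

ranked = rank_models(results)
for position, r in enumerate(ranked, start=1):
    print(position, r.name, r.gold, r.silver, r.bronze, r.overall)
```

Under this ordering a single Silver outranks any number of Bronzes, which is why GPT-4V is listed above Gemini-1.5-Pro even though its Overall Score is lower. Models with no medals are ordered by Overall Score alone.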

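Claude-3.5-Sonnet is highly competitive with GPT-4o and even surpasses it on a few subjects (Physics, Chemistry, and Biology).
Gemini-1.5-Pro and GPT-4V follow GPT-4o and Claude-3.5-Sonnet, with a clear gap separating them from the top two.
Open-source models lag behind proprietary models and win no medals in any discipline.
Overall, GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro are the top three models in the OlympicArena Medal Table.
At the time of writing, the medal table reveals a clear gap between open-source and proprietary models.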
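Per-discipline results indicate that GPT-4o has the edge in math and coding, while Claude-3.5-Sonnet is comparatively stronger in the knowledge-intensive sciences (physics, chemistry, and biology).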


This review was generated by AI and reviewed by a human editor.