QUICK REVIEW

[Paper Review] OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

Zhen Huang, Zengzhi Wang | arXiv (Cornell University) | Jun 24, 2024

Explainable Artificial Intelligence (XAI) | Citations: 7

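One-line summary

This paper introduces the OlympicArena Medal Table and uses it to compare Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o, and other models on the OlympicArena benchmark, analyzing their strengths and gaps as well as the effects of language and modality.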

ABSTRACT

In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).

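Motivation and objectives

Evaluate the latest AI models on the OlympicArena benchmark to identify the leaders and the gaps in multi-discipline cognitive ability.
Introduce the OlympicArena Medal Table as a clear ranking framework across disciplines.
Provide fine-grained analysis by subject, reasoning type, language, and modality to understand model strengths and limitations.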

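Proposed method

Uses the OlympicArena test split to avoid data leakage and to allow rule-based evaluation.
Evaluates both LMMs and LLMs, covering text-only and multimodal inputs.
Computes accuracy for non-programming tasks and pass@k (k=1, n=5) for programming tasks.
Ranks models in the OlympicArena Medal Table by Gold, Silver, and Bronze medals, then by Overall score.
Presents a fine-grained analysis by subject, reasoning type, language, and modality.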
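The pass@k metric with k=1 and n=5 can be illustrated with a minimal sketch. It assumes the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where c is the number of correct samples out of n; the review does not say which estimator the authors use, so the formula and the function name `pass_at_k` are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumed formula (not stated in the review): 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimate reduces to the fraction of correct samples, c / n.
score = pass_at_k(n=5, c=2, k=1)
print(f"{score:.2f}")
```

Under this assumption, the k=1, n=5 setting means each programming problem contributes the fraction of its five sampled solutions that pass.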

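Experimental results

Research questions

RQ1: Which AI models (Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o) win the top medals across the OlympicArena disciplines?
RQ2: How do open-source models perform relative to proprietary models on multi-discipline cognitive tasks?
RQ3: What are the models' relative strengths on traditional math/coding tasks versus knowledge-intensive science disciplines (physics, chemistry, biology)?

Key findings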

Model	Gold	Silver	Bronze	Total	Overall Scores
GPT-4o	4	3	0	7	40.47
Claude-3.5-Sonnet	3	3	0	6	39.24
GPT-4V	0	1	1	2	33.17
Gemini-1.5-Pro	0	0	6	6	35.09
Claude-3-Sonnet	0	0	0	0	25.53
Qwen1.5-32B-Chat	0	0	0	0	24.36
Qwen-VL-Max	0	0	0	0	21.41
Gemini-Pro-Vision	0	0	0	0	21.02
LLaVA-NeXT-34B	0	0	0	0	18.16
Yi-34B-Chat	0	0	0	0	18.01
InternVL-Chat-V1.5	0	0	0	0	17.39
InternLM2-Chat-20B	0	0	0	0	17.33
Yi-VL-34B	0	0	0	0	15.07
Qwen-VL-Chat	0	0	0	0	7.34
Qwen-7B-Chat	0	0	0	0	4.34
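The row order of the table can be reproduced with a short sketch. It assumes the ranking is lexicographic: more Gold first, then Silver, then Bronze, with the Overall score as the final tie-breaker. The `Row` type and the `medal_rank` function are hypothetical names introduced here, and only the first five rows of the table are included.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    gold: int
    silver: int
    bronze: int
    overall: float


# A subset of the table rows above, deliberately shuffled.
rows = [
    Row("Gemini-1.5-Pro", 0, 0, 6, 35.09),
    Row("GPT-4V", 0, 1, 1, 33.17),
    Row("Claude-3.5-Sonnet", 3, 3, 0, 39.24),
    Row("GPT-4o", 4, 3, 0, 40.47),
    Row("Claude-3-Sonnet", 0, 0, 0, 25.53),
]


def medal_rank(table: list[Row]) -> list[Row]:
    """Sort by (gold, silver, bronze, overall), all descending (assumed rule)."""
    return sorted(
        table,
        key=lambda r: (r.gold, r.silver, r.bronze, r.overall),
        reverse=True,
    )


ranking = [r.model for r in medal_rank(rows)]
print(ranking)
```

With these rows, GPT-4V sorts ahead of Gemini-1.5-Pro despite its lower Overall score, because a single Silver outranks any number of Bronzes under this ordering. That matches the row order in the table.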

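Claude-3.5-Sonnet is highly competitive with GPT-4o and surpasses it in a few subjects: physics, chemistry, and biology.
Gemini-1.5-Pro and GPT-4V follow directly behind GPT-4o and Claude-3.5-Sonnet, with a clear gap to the top two.
Open-source models lag behind proprietary models and win no medals in any discipline.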
Overall, GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro are the top three models, holding the three highest Overall scores in the OlympicArena Medal Table (40.47, 39.24, and 35.09).
The medal table shows a clear gap between open-source and proprietary models at the time of writing.
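Performance across disciplines points to GPT-4o's strength in math and coding and Claude-3.5-Sonnet's strength in the knowledge-intensive science subjects (physics, chemistry, biology).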


This review was generated by AI and checked by a human editor.