QUICK REVIEW

[Paper Review] OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

Zhen Huang, Zengzhi Wang | arXiv (Cornell University) | June 24, 2024

Explainable Artificial Intelligence (XAI) | Citations: 7

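One-Line Summary

This paper introduces the OlympicArena Medal Table to rank AI models across multiple disciplines, compares Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o, and other models on the OlympicArena benchmark, and analyzes their strengths, performance gaps, and the effects of language and modality.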

ABSTRACT

In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).

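Motivation and Objectives

Evaluate the latest AI models on the OlympicArena benchmark to identify the current leaders and gaps in multi-disciplinary cognitive ability.
Introduce the OlympicArena Medal Table as a clear ranking framework that spans many academic disciplines.
Provide fine-grained analysis by subject, language, and modality to understand each model's strengths and limitations.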

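Proposed Method

Use the OlympicArena test split to prevent data leakage and to enable rule-based evaluation.
Evaluate both LMMs and LLMs, covering text-only and multimodal inputs.
Compute accuracy for non-programming tasks and pass@k (k=1, n=5) for programming tasks.
Rank models by gold, silver, and bronze medal counts, followed by the Overall score.
Present fine-grained analysis by subject, reasoning type, language, and modality.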
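
The scoring step above can be illustrated with a minimal sketch. The review only states the settings (k=1, n=5), not the formula, so this assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where c is the number of correct samples out of n. The function names `pass_at_k` and `accuracy` are illustrative and are not taken from the OlympicArena codebase; the exact-match comparison in `accuracy` is a stand-in for the benchmark's rule-based answer matching.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (assumed formula).

    n: samples generated, c: samples that pass all tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of exact matches; a simplification of rule-based scoring."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# With k=1 the estimator reduces to c / n: 2 passing samples out of 5.
print(pass_at_k(n=5, c=2, k=1))
print(accuracy(["A", "42", "x"], ["A", "41", "x"]))
```

With k=1, the estimate is simply the fraction of the five samples that pass, averaged over problems in a full evaluation.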

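Experimental Results

Research Questions

RQ1: Which AI model (Claude-3.5-Sonnet, Gemini-1.5-Pro, GPT-4o) wins the most top medals across the OlympicArena disciplines?
RQ2: How do open-source models perform relative to proprietary models on multi-disciplinary cognitive tasks?
RQ3: What are the models' relative strengths in knowledge-intensive sciences (Physics, Chemistry, Biology) compared with traditional math and coding tasks?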

Key Results

Model	Gold	Silver	Bronze	Total	Overall Score
GPT-4o	4	3	0	7	40.47
Claude-3.5-Sonnet	3	3	0	6	39.24
GPT-4V	0	1	1	2	33.17
Gemini-1.5-Pro	0	0	6	6	35.09
Claude-3-Sonnet	0	0	0	0	25.53
Qwen1.5-32B-Chat	0	0	0	0	24.36
Qwen-VL-Max	0	0	0	0	21.41
Gemini-Pro-Vision	0	0	0	0	21.02
LLaVA-NeXT-34B	0	0	0	0	18.16
Yi-34B-Chat	0	0	0	0	18.01
InternVL-Chat-V1.5	0	0	0	0	17.39
InternLM2-Chat-20B	0	0	0	0	17.33
Yi-VL-34B	0	0	0	0	15.07
Qwen-VL-Chat	0	0	0	0	7.34
Qwen-7B-Chat	0	0	0	0	4.34
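
The ordering of the rows above follows the medal-table rule from the method section: gold first, then silver, then bronze, then overall score. A minimal sketch of that rule, using a few rows copied from the table, is shown here. The `MedalRow` type and `rank_models` function are hypothetical names introduced for illustration, and treating the overall score as the final tie-breaker is an assumption based on the method description.

```python
from typing import List, NamedTuple


class MedalRow(NamedTuple):
    model: str
    gold: int
    silver: int
    bronze: int
    overall: float


def rank_models(rows: List[MedalRow]) -> List[MedalRow]:
    """Sort by gold, then silver, then bronze, then overall score (descending)."""
    return sorted(
        rows,
        key=lambda r: (r.gold, r.silver, r.bronze, r.overall),
        reverse=True,
    )


# A subset of rows from the table, deliberately shuffled.
rows = [
    MedalRow("Gemini-1.5-Pro", 0, 0, 6, 35.09),
    MedalRow("Claude-3-Sonnet", 0, 0, 0, 25.53),
    MedalRow("GPT-4V", 0, 1, 1, 33.17),
    MedalRow("Claude-3.5-Sonnet", 3, 3, 0, 39.24),
    MedalRow("GPT-4o", 4, 3, 0, 40.47),
]

ranked = [r.model for r in rank_models(rows)]
print(ranked)
```

Under this lexicographic ordering a single silver medal outranks any number of bronzes, which is why GPT-4V is listed above Gemini-1.5-Pro even though its overall score is lower (33.17 versus 35.09).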

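Claude-3.5-Sonnet is highly competitive with GPT-4o and surpasses it in a few subjects (Physics, Chemistry, and Biology).
Gemini-1.5-Pro and GPT-4V rank immediately behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap to the top two models.
Open-source models lag behind proprietary models and win no medals in any discipline.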
Overall, GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro post the three highest overall scores in the OlympicArena medal table (40.47, 39.24, and 35.09).
At the time of writing, the gap between open-source and proprietary models is clearly visible in the medal table.
GPT-4o is stronger on math and coding, while Claude-3.5-Sonnet is stronger in the knowledge-intensive sciences (Physics, Chemistry, and Biology).

This review was generated by AI and reviewed by a human editor.