QUICK REVIEW

[Paper Review] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui | arXiv (Cornell University) | 2023. 04. 13.

Artificial Intelligence in Healthcare and Education | Citations: 61

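One-Line Summary

AGIEval is a bilingual benchmark built from human exams, comprising 8,062 questions across 20 tasks, for evaluating foundation models. GPT-4 excels on some of these human-centric tests but struggles with complex reasoning and domain-specific knowledge.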

ABSTRACT

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

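Motivation and Objectives

Center evaluation on tasks that align with human cognition and decision-making, drawn from official exams.
Use objective question formats to provide robust, standardized, and automatic metrics.
Evaluate bilingual capability with both English and Chinese tasks.
Release model outputs to promote transparency and reproducibility in language model evaluation.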

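Proposed Method

Collect questions from high-standard official exams (Gaokao, SAT, LSAT, GMAT, AMC/AIME, civil service exams, and lawyer qualification exams).
Include only multiple-choice and fill-in-the-blank questions so that scoring is standardized.
Use accuracy as the metric for multiple-choice questions and Exact Match / F1 for fill-in-the-blank questions.
Evaluate models in zero-shot and few-shot settings, each with and without Chain-of-Thought prompting.
Use the Azure OpenAI Service API for Text-Davinci-003, ChatGPT, and GPT-4 with fixed generation settings (temperature 0, maximum of 2048 tokens).
Release all model outputs to support analysis and reproducibility.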
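
The three metrics named above can be sketched as follows. This is a minimal illustration, not the benchmark's official scoring script: the whitespace-and-lowercase normalization, the token-level definition of F1, and all function names are assumptions made for this example.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization; signs and
    decimal points are kept so that numeric answers stay distinct)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(normalize(p) == normalize(a)
                  for p, a in zip(predictions, answers))
    return correct / len(answers)


def exact_match(prediction: str, answer: str) -> bool:
    """True when a fill-in-the-blank prediction equals the reference
    after normalization."""
    return normalize(prediction) == normalize(answer)


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of token-level precision and recall (assumed definition)."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Accuracy is computed over a whole task, while Exact Match and F1 are computed per question and would then be averaged over the fill-in-the-blank items.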

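Experimental Results

Research Questions

RQ1: How well do state-of-the-art foundation models perform on real-world, human-level tasks derived from official exams?
RQ2: What are these models' strengths and limitations in understanding, knowledge, reasoning, and calculation on bilingual tasks?
RQ3: Do Chain-of-Thought prompting and few-shot scenarios improve performance on human-centric reasoning tasks?
RQ4: How does model performance compare with that of average and top human test takers across the various exams?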
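
To make RQ3 concrete, the sketch below enumerates the four evaluation settings (zero-shot or few-shot, each with or without Chain-of-Thought) and builds a prompt for each. The prompt template, the "Let's think step by step." cue, and every name here are hypothetical choices for illustration, not the prompts used in the paper. Only the temperature and token limit are taken from the review.

```python
from dataclasses import dataclass
from itertools import product
from typing import Sequence

# Fixed generation settings reported in the review; the dict layout is
# an assumption of this sketch and is not tied to any client library.
GENERATION_CONFIG = {"temperature": 0, "max_tokens": 2048}


@dataclass(frozen=True)
class EvalSetting:
    few_shot: bool
    chain_of_thought: bool

    @property
    def name(self) -> str:
        shots = "few-shot" if self.few_shot else "zero-shot"
        return f"{shots}-CoT" if self.chain_of_thought else shots


def all_settings() -> list[EvalSetting]:
    """The 2 x 2 grid: {zero-shot, few-shot} x {no CoT, CoT}."""
    return [EvalSetting(fs, cot)
            for fs, cot in product([False, True], repeat=2)]


def build_prompt(question: str,
                 setting: EvalSetting,
                 demos: Sequence[tuple[str, str]] = ()) -> str:
    """Assemble a prompt with a hypothetical template.

    demos holds (question, answer) pairs and is used only in few-shot
    settings; for few-shot CoT the demo answers would include reasoning.
    """
    parts = []
    if setting.few_shot:
        for demo_question, demo_answer in demos:
            parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    cue = "Let's think step by step." if setting.chain_of_thought else ""
    parts.append(f"Question: {question}\nAnswer: {cue}".rstrip())
    return "\n\n".join(parts)
```

Running every question through all four settings with the same fixed generation settings lets differences in accuracy be attributed to the prompting strategy rather than to sampling noise.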

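Key Findings

GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions in the zero-shot CoT setting.
GPT-4 achieves 95% accuracy on SAT Math and 92.5% accuracy on the English test of the Chinese Gaokao.
The models struggle on tasks that require complex reasoning or specific domain knowledge (e.g., law, chemistry, physics).
Evaluation across understanding, knowledge, reasoning, and calculation reveals each model's distinct strengths and limitations.
The benchmark offers insight into directions for improving general capabilities on the path toward AGI.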


This review was generated by AI and reviewed by a human editor.