QUICK REVIEW

[Paper Review] Evaluating the Performance of Large Language Models on GAOKAO Benchmark

Xiaotian Zhang, Chunyang Li|arXiv (Cornell University)|2023. 05. 21.

Topic Modeling | Citations: 18

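One-Line Summary

This paper introduces GAOKAO-Bench, a benchmark for evaluating LLMs built from Chinese GAOKAO exam questions. It examines zero-shot performance and alignment with human grading on objective versus subjective questions, and reports findings on the models' strengths on objective questions and the areas that need improvement.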

ABSTRACT

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks; however, how to comprehensively and accurately assess their performance becomes an urgent issue to be addressed. This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples, including both subjective and objective questions. To align with human examination methods, we design a method based on zero-shot settings to evaluate the performance of LLMs. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. Our findings reveal that LLMs have achieved competitive scores in Chinese GAOKAO examination, while they exhibit significant performance disparities across various subjects. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores. In conclusion, this research contributes a robust evaluation benchmark for future large language models and offers valuable insights into the advantages and limitations of such models.

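Research Motivation and Goals

Present a domain-specific, human-aligned evaluation for Chinese educational tasks using GAOKAO questions.
Provide a benchmark covering GAOKAO data from 2010–2022 across all subjects to evaluate LLM capabilities.
Evaluate how effective zero-shot prompts are at mapping questions to model outputs.
Distinguish model performance on objective versus subjective questions and identify per-subject strengths and weaknesses.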

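Proposed Method

Collect GAOKAO questions (2010–2022) and organize them into a JSON corpus, with formulas written in LaTeX.
Apply zero-shot prompts tailored to each question type to generate outputs from the LLMs.
Score objective (multiple-choice) questions by exact match against the standard answers, and score subjective questions through human expert evaluation.
Involve high school teachers to verify that the grading agrees with the human benchmark.
Analyze scoring rates by subject and question type to identify strengths (e.g., English) and weaknesses (e.g., Physics, Chemistry, Math_I).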
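
The objective-question part of this pipeline can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the JSON field names, the prompt wording, the answer-extraction rule, and the two toy questions are hypothetical and are not taken from the GAOKAO-Bench repository. The model call itself is omitted; canned strings stand in for model outputs. The scoring rate is assumed to mean points earned divided by points available.

```python
import json
import re
from collections import defaultdict

# Hypothetical record layout (field names are assumptions, not the real schema).
# The questions are made-up toy items, not actual exam questions.
SAMPLE_JSON = r'''
[
  {"subject": "Math_I", "year": 2021,
   "question": "If $\\frac{x}{2} = 1$, then $x$ equals: A. 1  B. 2  C. 3  D. 4",
   "answer": "B", "points": 5},
  {"subject": "English", "year": 2020,
   "question": "She ___ to school every day. A. go  B. goes  C. going  D. gone",
   "answer": "B", "points": 2}
]
'''


def build_prompt(record):
    """Build a zero-shot prompt: instructions plus the question, no worked examples."""
    return (
        f"You are taking the {record['subject']} exam. "
        "Answer the multiple-choice question with a single option letter.\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )


def extract_choice(model_output):
    """Return the first standalone uppercase letter A-D in the output, or None.

    This is a deliberately naive rule; a capitalized article "A" would be
    picked up as a choice, so a real evaluator needs stricter parsing.
    """
    match = re.search(r"\b([A-D])\b", model_output)
    return match.group(1) if match else None


def scoring_rate(records, outputs):
    """Per-subject scoring rate: points earned by exact match / points available."""
    earned = defaultdict(float)
    total = defaultdict(float)
    for record, output in zip(records, outputs):
        subject = record["subject"]
        total[subject] += record["points"]
        if extract_choice(output) == record["answer"]:
            earned[subject] += record["points"]
    return {subject: earned[subject] / total[subject] for subject in total}


if __name__ == "__main__":
    records = json.loads(SAMPLE_JSON)
    print(build_prompt(records[0]))
    # Canned strings stand in for real model responses.
    print(scoring_rate(records, ["The answer is B.", "C"]))
```

Exact match keeps objective scoring fully automatic and reproducible, which is why only the subjective questions need human graders.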

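Experimental Results

Research Questions

RQ1: How do large language models perform on GAOKAO questions when prompted in a zero-shot setting?
RQ2: By subject, how do LLMs perform on objective versus subjective GAOKAO questions?
RQ3: In which subjects or question types is the gap between LLM performance and the human benchmark largest?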

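Key Results

The models perform best on objective questions, with high scoring rates on the English question types identified (e.g., English_Reading_Comp = 88.3%, English_MCQs = 78.1%, English_Fill_in_Blanks = 73.8%).
Model scores on subjective questions are lower overall and vary by subject, with larger gaps in Physics, Chemistry, Biology, and Mathematics, which demand more calculation and reasoning.
Overall, the models are strong on knowledge-based questions but struggle with long Chinese reading comprehension and certain logical and mathematical reasoning tasks.
Subject-level analysis shows English tasks are the strongest, while Physics, Chemistry, and Math_I pose notable challenges.
Human evaluation by teachers from Shanghai Caoyang No. 2 Middle School was used to align subjective-question grading with the human benchmark.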
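
The abstract notes that LLM-assigned scores on subjective questions reach a moderate level of consistency with human scores. This review does not say which agreement measure the authors used, so the sketch below is only one plausible way to quantify it, using Pearson correlation and mean absolute difference. The function names and the score lists are hypothetical toy values, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_abs_diff(xs, ys):
    """Average absolute gap between paired scores, in raw points."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty lists")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Toy scores for four subjective answers (made up for illustration).
    human_scores = [8, 5, 10, 3]
    model_scores = [7, 6, 9, 5]
    print("correlation:", pearson(human_scores, model_scores))
    print("mean absolute difference:", mean_abs_diff(human_scores, model_scores))
```

Correlation captures whether the model ranks answers the way teachers do, while the mean absolute difference shows how far apart the raw points are; the two can disagree, so reporting both gives a fuller picture.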

This review was generated by AI and reviewed by a human editor.