QUICK REVIEW

[Paper Review] Measuring Massive Multitask Chinese Understanding

Hui Zeng | arXiv (Cornell University) | April 25, 2023

Radiomics and Machine Learning in Medical Imaging | Citations: 11

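One-line summary

This paper proposes a multitask evaluation of large Chinese language models spanning medicine, law, psychology, and education, and reports zero-shot performance across the four domains and their subtasks.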

ABSTRACT

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

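Research motivation and goals

Raises the need for a comprehensive capability assessment of large Chinese language models.
Introduces a multitask evaluation test that covers four domains and numerous subtasks.
Provides zero-shot and domain-level performance insights to identify model limitations.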

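Proposed method

Defines four domains (medicine, law, psychology, and education) and enumerates 15 subtasks in medicine and 8 subtasks in education.
Evaluates large Chinese language models on every subtask in the zero-shot setting.
Compares performance across models to identify patterns at both the domain and subdomain level.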
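The scoring step described above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code. It assumes each question has a single gold answer label and that a model's reply has already been reduced to one predicted label. The record fields, the `accuracy_by_group` helper, and the toy records are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct predictions}, grouping records by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # A prediction counts as correct only on an exact label match (assumption).
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


# Made-up toy records for illustration only; these are not data from the paper.
records = [
    {"domain": "medicine", "subtask": "clinical medicine", "answer": "A", "prediction": "A"},
    {"domain": "medicine", "subtask": "clinical medicine", "answer": "C", "prediction": "B"},
    {"domain": "law", "subtask": "placeholder subtask", "answer": "D", "prediction": "D"},
    {"domain": "law", "subtask": "placeholder subtask", "answer": "B", "prediction": "A"},
    {"domain": "law", "subtask": "placeholder subtask", "answer": "B", "prediction": "C"},
    {"domain": "law", "subtask": "placeholder subtask", "answer": "A", "prediction": "C"},
]

by_domain = accuracy_by_group(records, "domain")
by_subtask = accuracy_by_group(records, "subtask")
print(by_domain)
```

Grouping by a key keeps the same function usable for both the domain-level and the subtask-level comparison; passing `"subtask"` instead of `"domain"` gives the finer breakdown.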

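Experimental results

Research questions

RQ1: What is the zero-shot performance of large Chinese language models across the four major domains?
RQ2: In the zero-shot setting, which domains or subtasks reveal the strongest or weakest model capabilities?
RQ3: How do the best and worst zero-shot results compare across models and domains?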

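Key results

The best zero-shot model outperforms the worst by nearly 18.6 percentage points on average.
Across the four domains, the highest average zero-shot accuracy of any model is 0.512.
Among subtasks, GPT-3.5-turbo reached a zero-shot accuracy of 0.693 in clinical medicine, the highest of any model on any subtask.
All models performed poorly in the legal domain, where the highest zero-shot accuracy was only 0.239.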
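A gap stated in percentage points is a difference between two accuracies multiplied by 100, so a gap of 18.6 points corresponds to an accuracy difference of 0.186. The sketch below shows that arithmetic under one assumption: that a model's average is the unweighted mean of its four domain accuracies. The abstract does not say how the average was computed. The model names and scores are made-up placeholders, not results from the paper.

```python
def macro_average(domain_accuracy):
    """Unweighted mean over domains (an assumed averaging scheme)."""
    return sum(domain_accuracy.values()) / len(domain_accuracy)


def gap_in_points(scores):
    """Return (best model, worst model, gap in percentage points)."""
    averages = {model: macro_average(acc) for model, acc in scores.items()}
    best = max(averages, key=averages.get)
    worst = min(averages, key=averages.get)
    # Accuracies are fractions in [0, 1]; multiply by 100 to get percentage points.
    return best, worst, (averages[best] - averages[worst]) * 100


# Hypothetical models and scores for illustration only.
toy_scores = {
    "model_a": {"medicine": 0.60, "law": 0.20, "psychology": 0.50, "education": 0.50},
    "model_b": {"medicine": 0.40, "law": 0.20, "psychology": 0.30, "education": 0.30},
}

best, worst, gap = gap_in_points(toy_scores)
print(f"{best} beats {worst} by {gap:.1f} percentage points")
```

If the paper instead averaged over all questions rather than over domains, larger domains would carry more weight and the figures would differ.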


This review was generated by AI and reviewed by a human editor.