QUICK REVIEW

[Paper Review] KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Jifan Yu, Xiaozhi Wang|arXiv (Cornell University)|2023. 06. 15.

Topic Modeling | Citations: 24

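One-Line Summary

KoLA designs a knowledge-oriented four-level taxonomy of abilities, draws on both known and evolving data, and pairs a contrastive scoring system with a self-contrast metric to evaluate 28 LLMs on 19 tasks. Quarterly updates track progress over time.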

ABSTRACT

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus prevalently pre-trained by LLMs, along with continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models and a unique self-contrast metric for automatically evaluating knowledge-creating ability. We evaluate 28 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.

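Research Motivation and Goals

Structure world knowledge with a four-level cognitive taxonomy inspired by Bloom's taxonomy: Knowledge Memorization (KM), Knowledge Understanding (KU), Knowledge Applying (KA), and Knowledge Creating (KC).
Provide a fair evaluation by combining known data (a Wikipedia subset) with evolving data (recent articles), so that both memorization and adaptation to new knowledge are tested.
Provide a contrastive evaluation framework in which standardized scores enable cross-task comparison and a self-contrast metric assesses knowledge creation.
Run quarterly KoLA seasons to track model development and to offer actionable diagnostics for improving LLM knowledge systems.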

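Proposed Method

Adopts the four-level cognitive taxonomy (KM, KU, KA, KC) and organizes 19 tasks around memorizing, understanding, applying, and creating knowledge.
Draws known data from Wikipedia/Wikidata5M and evolving data from recently published articles, testing both memorization and the ability to handle updated knowledge.
Implements a contrastive evaluation system with scores standardized across tasks, which enables comparison between models, plus a self-contrast metric for assessing knowledge creation.
Designs KC to be evaluated automatically by contrasting against foreknowledge K, and computes a blended KC score from Rouge-L-based similarity measures.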
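The two scoring ideas in the list above can be sketched in Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `standardize_scores` assumes that standardization means a per-task z-score followed by min-max scaling to a 0-100 range; `rouge_l_f1` is a simplified Rouge-L F1 over whitespace tokens with no stemming; and `self_contrast_score`, including its `weight` parameter and the way the two similarities are blended, is a hypothetical stand-in for the blended KC score. All function names are illustrative.

```python
from statistics import mean, pstdev


def standardize_scores(raw_scores):
    """Map one task's raw scores (model name -> score) onto a 0-100 scale.

    Assumption: z-score normalization followed by min-max scaling.
    """
    values = list(raw_scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All models tie on this task; give everyone the midpoint.
        return {name: 50.0 for name in raw_scores}
    z = {name: (x - mu) / sigma for name, x in raw_scores.items()}
    lo, hi = min(z.values()), max(z.values())
    return {name: 100.0 * (v - lo) / (hi - lo) for name, v in z.items()}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L F1 over whitespace tokens (simplified: no stemming)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def self_contrast_score(free_text, knowledge_text, reference, weight=0.5):
    """Blend reference similarity with self-contrast similarity.

    free_text: completion generated from the context alone.
    knowledge_text: completion generated with foreknowledge K supplied.
    weight: hypothetical mixing coefficient, not taken from the paper.
    """
    to_reference = rouge_l_f1(free_text, reference)
    to_knowledge = rouge_l_f1(free_text, knowledge_text)
    return weight * to_reference + (1 - weight) * to_knowledge
```

Standardizing per task puts tasks with very different raw metrics on one scale, which is what makes a cross-task leaderboard readable. The self-contrast term compares a model's free completion with its own knowledge-grounded completion, so a model that drifts from what it would write when given K scores lower.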

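Experimental Results

Research Questions

RQ1: How do LLMs differ in memorizing, understanding, applying, and creating world knowledge?
RQ2: How do model size and alignment affect the various knowledge abilities across known and evolving data?
RQ3: Can standardized cross-task scores provide a fair and interpretable leaderboard across diverse LLMs?
RQ4: Does the self-contrast metric evaluate knowledge creation effectively and help reduce the influence of hallucination?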

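Key Results

Larger base models tend to memorize more knowledge when they have not been aligned, showing a strong size effect on KM.
Alignment and instruction tuning improve the higher-level abilities (KA, KC) but reduce raw memorization (KM), which points to an alignment tax on low-level memorization.
Commercial models generally outperform open-source models on standardized KoLA scores; open-source models are weaker overall.
After instruction tuning, the correlation between model size and higher-level abilities becomes more pronounced, while the size-related gain in KM memorization becomes less pronounced.
KoLA's evolving-data seasons enable fair evaluation on unseen knowledge and make it possible to track model development over time.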

This review was generated by AI and reviewed by a human editor.