[Paper Review] Through the Lens of Core Competency: Survey on Evaluation of Large Language Models
This survey organizes LLM evaluation around four core competencies (knowledge, reasoning, reliability, and safety), providing definitions, benchmarks, metrics, and an extensible framework to integrate diverse tasks.
From pre-trained language models (PLMs) to large language models (LLMs), the field of natural language processing (NLP) has seen steep performance gains and wide practical use. How a research field is evaluated guides its direction of improvement. However, LLMs are extremely hard to evaluate thoroughly, for two reasons. First, traditional NLP tasks have become inadequate because LLMs already perform so well on them. Second, existing evaluation tasks struggle to keep up with the wide range of real-world applications. To tackle these problems, existing works have proposed various benchmarks to better evaluate LLMs. To clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning LLM evaluation. We summarize 4 core competencies of LLMs: reasoning, knowledge, reliability, and safety. For each competency, we introduce its definition, corresponding benchmarks, and metrics. Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system. Finally, we give our suggestions on future directions for LLM evaluation.
Motivation & Objective
- Clarify why traditional NLP benchmarks fall short for modern LLMs and motivate a competency-based evaluation framework.
- Define four core competencies (knowledge, reasoning, reliability, safety) and their subcomponents.
- Aggregate and categorize 540+ evaluation tasks to map to the core competencies and identify representative benchmarks.
- Propose an extensible project to show many-to-many relationships between competencies and tasks and enable adding new tasks.
- Offer guidance on future directions for LLM evaluation including potential new competencies and evaluation directions.
Proposed method
- Survey and synthesis of 540+ tasks used in LLM evaluations across academia and industry.
- Definition and taxonomy of four core competencies and sub-competencies.
- Mapping of tasks to competencies to enable coherent evaluation and extensibility.
- Discussion of representative benchmarks and datasets for each competency (knowledge, reasoning, reliability, safety).
- Provision of an extensible project (GitHub) to illustrate task–competency relationships and support future updates.
- Outline of future directions and potential additional competencies (e.g., sentiment) to broaden evaluation coverage.
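The many-to-many relationship between competencies and tasks described above can be pictured with a small data structure. The following is only a minimal sketch, not the authors' GitHub project: the class `CompetencyRegistry`, its method names, and the placeholder task names are all hypothetical and chosen purely for illustration. Only the four competency names and the idea of adding a further competency such as sentiment come from the survey.

```python
from collections import defaultdict

# The four competencies named in the survey; everything else below is hypothetical.
COMPETENCIES = {"knowledge", "reasoning", "reliability", "safety"}


class CompetencyRegistry:
    """Illustrative many-to-many map between competencies and evaluation tasks."""

    def __init__(self, competencies):
        self.competencies = set(competencies)
        self._tasks_by_competency = defaultdict(set)
        self._competencies_by_task = defaultdict(set)

    def add_competency(self, name):
        # Extending the taxonomy does not touch existing task mappings.
        self.competencies.add(name)

    def add_task(self, task, competencies):
        # A task may reflect several competencies at once.
        unknown = set(competencies) - self.competencies
        if unknown:
            raise ValueError(f"unknown competencies: {sorted(unknown)}")
        for competency in competencies:
            self._tasks_by_competency[competency].add(task)
            self._competencies_by_task[task].add(competency)

    def tasks_for(self, competency):
        return sorted(self._tasks_by_competency.get(competency, ()))

    def competencies_for(self, task):
        return sorted(self._competencies_by_task.get(task, ()))


# Placeholder task names (assumptions, not benchmarks taken from the survey).
registry = CompetencyRegistry(COMPETENCIES)
registry.add_task("placeholder_qa_task", ["knowledge", "reasoning"])
registry.add_task("placeholder_toxicity_task", ["safety"])

# Adding a new competency and a new task leaves earlier entries unchanged.
registry.add_competency("sentiment")
registry.add_task("placeholder_sentiment_task", ["sentiment"])
```

Keeping two indexes, one per direction, is one possible design choice: it lets either question ("which tasks measure reasoning?" or "which competencies does this task reflect?") be answered with a single lookup.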
Experimental results
Research questions
- RQ1: How can diverse LLM evaluation tasks be unified under a concise, extensible competency framework?
- RQ2: What are the definitions, benchmarks, and metrics that best capture knowledge, reasoning, reliability, and safety in LLMs?
- RQ3: How can new tasks be incorporated into the evaluation system without disrupting the framework?
- RQ4: What practical guidance and tooling can support researchers in applying the core competency framework to LLM evaluation?
Key findings
- Four core competencies for LLM evaluation are proposed: knowledge, reasoning, reliability, and safety.
- A systematic aggregation of 540+ evaluation tasks is organized into a competency-based taxonomy.
- The framework supports combining tasks by competency and adding new tasks within the system.
- An extensible project is provided to model the many-to-many relationships between competencies and tasks for community use.
- The paper discusses future directions, including potential additions such as sentiment competency, and highlights the need for regularly updated test sets to prevent leakage and reflect real-world use.
This review was created by AI and reviewed by human editors.