QUICK REVIEW

[Paper Review] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Songyang Liu, Chaozhuo Li | ArXiv.org | Jun 6, 2025
Ethics in Business and Education | Cited by 4
One-Sentence Summary

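This paper presents a comprehensive, systematic survey of safety evaluation for Large Language Models (LLMs). It explains why, what, where, and how to evaluate safety, and it identifies open challenges and future directions.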

ABSTRACT

With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In particular, LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts, which has attracted increasing attention from both academia and industry. Although numerous studies have attempted to evaluate these risks, a comprehensive and systematic survey on safety evaluation of LLMs is still lacking. This work aims to fill this gap by presenting a structured overview of recent advances in safety evaluation of LLMs. Specifically, we propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the background of safety evaluation of LLMs, how they differ from general LLMs evaluation, and the significance of such evaluation; (ii) What to evaluate, which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (iv) How to evaluate, which reviews existing mainstream evaluation methods based on the roles of the evaluators and some evaluation frameworks that integrate the entire evaluation pipeline. Finally, we identify the challenges in safety evaluation of LLMs and propose promising research directions to promote further advancement in this field. We emphasize the necessity of prioritizing safety evaluation to ensure the reliable and responsible deployment of LLMs in real-world applications.

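Research Motivation and Objectives

  • Explain the background and importance of LLM safety evaluation, and how it differs from general LLM evaluation.
  • Categorize and organize the main safety evaluation tasks and dimensions (toxicity, robustness, ethics, bias/fairness, truthfulness, and others).
  • Summarize the evaluation metrics, datasets, benchmarks, and toolkits commonly used in safety evaluation.
  • Review evaluation methodologies and classify them by evaluator role (automated vs. human).
  • Identify current challenges and propose directions for advancing and standardizing LLM safety evaluation.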

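Proposed Approach

  • Propose a four-dimensional framework: why to evaluate, what to evaluate, where to evaluate, and how to evaluate the safety of LLMs.
  • Provide a detailed taxonomy of safety evaluation tasks across dimensions such as toxicity, robustness, ethics, bias/fairness, and truthfulness, along with related aspects.
  • Compile and categorize the existing evaluation metrics, datasets, benchmarks, and toolkits used in safety evaluation.
  • Review evaluation methodologies and classify them by evaluator type (automated systems vs. human evaluators).
  • Discuss challenges and outline potential future research directions for standardizing and improving safety evaluation.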

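Experimental Results

Research Questions

  • RQ1: What are the key differences in motivation and background between safety evaluation of LLMs and general model evaluation?
  • RQ2: What are the main tasks and dimensions used to evaluate the safety of LLMs?
  • RQ3: Which safety evaluation metrics, datasets, and benchmarks are commonly used, and what tools exist?
  • RQ4: How is safety evaluation carried out (evaluation toolkits and methods), and who performs it (human or automated evaluators)?
  • RQ5: What are the main challenges and promising directions for future safety evaluation of LLMs?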

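Key Findings

  • Provides a comprehensive and systematic survey of recent advances in LLM safety evaluation.
  • Establishes a clear taxonomy of safety evaluation tasks across multiple dimensions.
  • Organizes the evaluation metrics, datasets/benchmarks, toolkits, and methods available to researchers.
  • Emphasizes the need for standardization and broader adoption of safety evaluation practices.
  • Discusses challenges and proposes directions for promoting the safe, responsible development and deployment of LLMs.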

This review was generated by AI and checked by human editors.