QUICK REVIEW

[Paper Review] The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs

Songyang Liu, Chaozhuo Li | ArXiv.org | 2025-06-06

Ethics in Business and Education | Citations: 4

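One-Line Summary

This paper is a comprehensive, systematic survey of safety evaluation for large language models (LLMs). It lays out why, what, where, and how to evaluate, and identifies open challenges and future directions.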

ABSTRACT

With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), including content generation, human-computer interaction, machine translation, and code generation. However, their widespread deployment has also raised significant safety concerns. In particular, LLM-generated content can exhibit unsafe behaviors such as toxicity, bias, or misinformation, especially in adversarial contexts, which has attracted increasing attention from both academia and industry. Although numerous studies have attempted to evaluate these risks, a comprehensive and systematic survey on safety evaluation of LLMs is still lacking. This work aims to fill this gap by presenting a structured overview of recent advances in safety evaluation of LLMs. Specifically, we propose a four-dimensional taxonomy: (i) Why to evaluate, which explores the background of safety evaluation of LLMs, how they differ from general LLMs evaluation, and the significance of such evaluation; (ii) What to evaluate, which examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and related aspects; (iii) Where to evaluate, which summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (iv) How to evaluate, which reviews existing mainstream evaluation methods based on the roles of the evaluators and some evaluation frameworks that integrate the entire evaluation pipeline. Finally, we identify the challenges in safety evaluation of LLMs and propose promising research directions to promote further advancement in this field. We emphasize the necessity of prioritizing safety evaluation to ensure the reliable and responsible deployment of LLMs in real-world applications.

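Research Motivation and Goals

Explain the background and importance of LLM safety evaluation and how it differs from general LLM evaluation.
Categorize and systematically organize the main safety evaluation tasks and dimensions (toxicity, robustness, ethics, bias/fairness, truthfulness, and so on).
Summarize the evaluation metrics, datasets, benchmarks, and toolkits widely used in safety evaluation.
Review evaluation methodologies and classify approaches by evaluator role (automated evaluation vs. human evaluators).
Identify current challenges and propose directions for advancing LLM safety evaluation and its standardization.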

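Proposed Method

Proposes a four-dimensional framing of LLM safety evaluation: why to evaluate, what to evaluate, where to evaluate, and how to evaluate.
Provides a detailed taxonomy of safety evaluation tasks across dimensions such as toxicity, robustness, ethics, bias/fairness, and truthfulness.
Collects and categorizes the existing evaluation metrics, datasets, benchmarks, and toolkits used in safety evaluation.
Reviews evaluation methodologies and classifies them by evaluator type (automated systems vs. human evaluators).
Discusses challenges and proposes future research directions for standardizing and advancing safety evaluation.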
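
The four-dimensional taxonomy can be pictured as a small lookup structure. The sketch below is only an illustration: the entries are taken from the abstract, while the dictionary layout and the helper name `dimension_of` are assumptions made for this example and are not defined by the paper.

```python
from typing import Optional

# Illustrative encoding of the survey's four-dimensional taxonomy.
# Entries come from the abstract; the dict layout and the helper below
# are assumptions for this sketch, not structures defined by the paper.
TAXONOMY = {
    "why": ["background", "difference from general LLM evaluation", "significance"],
    "what": ["toxicity", "robustness", "ethics", "bias and fairness", "truthfulness"],
    "where": ["metrics", "datasets", "benchmarks"],
    "how": ["evaluator roles", "evaluation frameworks"],
}


def dimension_of(topic: str) -> Optional[str]:
    """Return the taxonomy dimension that lists `topic`, or None if absent."""
    for dimension, topics in TAXONOMY.items():
        if topic in topics:
            return dimension
    return None


if __name__ == "__main__":
    for topic in ("toxicity", "benchmarks", "evaluator roles"):
        print(topic, "->", dimension_of(topic))
```

A flat mapping like this is enough to tag a benchmark or method with the question it answers; the survey itself organizes each dimension in far more detail.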

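Experimental Results

Research Questions

RQ1: What motivations and background distinguish LLM safety evaluation from general model evaluation?
RQ2: What are the main safety evaluation tasks and dimensions used to evaluate LLMs?
RQ3: Which metrics, datasets, and benchmarks are commonly used in safety evaluation, and what tools exist?
RQ4: How is safety evaluation carried out (evaluation toolkits and methods), and who carries it out (human evaluators vs. automated evaluators)?
RQ5: What are the main challenges and promising directions for future LLM safety evaluation?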

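Key Results

Provides a comprehensive and systematic review of recent advances in LLM safety evaluation.
Establishes a clear taxonomy of safety evaluation tasks across multiple dimensions.
Consolidates evaluation metrics, datasets/benchmarks, toolkits, and methods for researchers.
Emphasizes the need for standardization and broader adoption of safety evaluation practices.
Discusses challenges and proposes directions for promoting the safe and responsible development and deployment of LLMs.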

This review was generated by AI and reviewed by a human editor.