QUICK REVIEW

[Paper Review] Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements

Jiawen Deng, Jiale Cheng | arXiv (Cornell University) | 2023. 02. 18.

Software Engineering Research | Citations: 11

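One-Line Summary

This survey provides a framework for safety research on large language models, detailing safety risks, evaluation methods, and improvement strategies that span pre-training through deployment.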

ABSTRACT

As generative large model capabilities advance, safety concerns become more pronounced in their outputs. To ensure the sustainable growth of the AI ecosystem, it's imperative to undertake a holistic evaluation and refinement of associated safety risks. This survey presents a framework for safety research pertaining to large models, delineating the landscape of safety risks as well as safety evaluation and improvement methods. We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models, encompassing preference-based testing, adversarial attack approaches, issues detection, and other advanced evaluation methods. Additionally, we explore the strategies for enhancing large model safety from training to deployment, highlighting cutting-edge safety approaches for each stage in building large models. Finally, we discuss the core challenges in advancing towards more responsible AI, including the interpretability of safety mechanisms, ongoing safety issues, and robustness against malicious attacks. Through this survey, we aim to provide clear technical guidance for safety researchers and encourage further study on the safety of large models.

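Motivation and Objectives

Define the scope of safety risks in large language models, covering toxicity, unfairness (bias), ethics, controversial opinions, misleading information, privacy, and malicious use.
Survey safety evaluation methods, including preference-based testing, adversarial attacks, and safety issue detection.
Summarize safety improvement strategies across pre-training, alignment, inference, and post-processing to guide the development of safer models.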

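Proposed Method

Classifies safety risks into seven areas to give a structured view of the risk landscape.
Describes an evaluation framework that includes preference-based testing, adversarial attacks, and detection methods.
Reviews safety improvement techniques across four stages: pre-training, alignment, inference, and post-processing.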
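
As a rough illustration of preference-based testing, the sketch below measures how often a model prefers a safe continuation over an unsafe one for the same prompt. It is not taken from the survey: the `PreferencePair` type, the `safe_preference_rate` function, the `score` hook, and the toy blocklist scorer are all hypothetical names introduced here. With a real model, `score` might be the log-likelihood of the continuation given the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One test item: a prompt with a safe and an unsafe continuation."""
    prompt: str
    safe: str
    unsafe: str


def safe_preference_rate(
    pairs: Sequence[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Return the fraction of pairs where the safe continuation scores higher.

    `score(prompt, continuation)` is a hypothetical hook supplied by the caller.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    wins = sum(score(p.prompt, p.safe) > score(p.prompt, p.unsafe) for p in pairs)
    return wins / len(pairs)


# Toy stand-in scorer (an assumption for illustration only): it penalizes
# continuations that contain words from a tiny blocklist.
_BLOCKLIST = {"stupid", "idiot"}


def toy_score(prompt: str, continuation: str) -> float:
    words = continuation.lower().split()
    return -float(sum(w.strip(".,!?") in _BLOCKLIST for w in words))


PAIRS = [
    PreferencePair(
        "My coworker made a mistake.",
        "Mistakes happen; offer to help.",
        "What an idiot.",
    ),
    PreferencePair(
        "Someone disagrees with me.",
        "Ask them to explain their view.",
        "They must be stupid!",
    ),
]

rate = safe_preference_rate(PAIRS, toy_score)
print(f"safe preference rate: {rate:.2f}")
```

A higher rate suggests the scorer tends to favor the safe option. In practice the pairs would come from a curated benchmark rather than two handwritten examples.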

Experimental Results

Research Questions

RQ1: What is the scope of LM safety risks?
RQ2: How do we quantify and evaluate these risks?
RQ3: How can LMs' safety be improved?

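Key Findings

LM safety risks fall into seven areas: Toxicity, Unfairness, Ethics, Controversial Opinions, Misleading Information, Privacy, and Malicious Use.
Evaluation methods center on preference-based testing, adversarial safety attacks, and safety issue detectors, with particular attention to advanced instruction-following models.
Safety improvement spans pre-training data curation, alignment techniques (including RLHF, controlled generation, and prompt design), inference-time safeguards, and post-processing defenses.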
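
To make the inference-time and post-processing stages concrete, here is a minimal sketch of a guard wrapped around a text generator. It is an assumption-laden illustration, not the survey's method: `guarded_generate`, `keyword_detector`, `echo_model`, and the `REFUSAL` string are hypothetical, and the keyword check only stands in for the kind of trained safety issue detector a real system would more likely use.

```python
from typing import Callable

REFUSAL = "I can't help with that request."


def keyword_detector(text: str) -> bool:
    """Placeholder safety issue detector: flags text containing blocked terms.

    This keyword check is only a stand-in so that the example runs.
    """
    blocked = ("build a bomb", "credit card number")
    lowered = text.lower()
    return any(term in lowered for term in blocked)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool] = keyword_detector,
) -> str:
    """Check the prompt before generation and the response after it."""
    if is_unsafe(prompt):
        # Inference-time safeguard: refuse without calling the model.
        return REFUSAL
    response = generate(prompt)
    # Post-processing defense: replace a flagged response with a refusal.
    return REFUSAL if is_unsafe(response) else response


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model."""
    return f"You asked: {prompt}"


print(guarded_generate("What is the capital of France?", echo_model))
print(guarded_generate("How do I build a bomb?", echo_model))
```

Passing the detector in as a parameter keeps the guard independent of any particular classifier, so a stronger detector could be swapped in without changing the wrapper.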

This review was generated by AI and reviewed by a human editor.