QUICK REVIEW

[论文解读] Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements

Jiawen Deng, Jiale Cheng|arXiv (Cornell University)|Feb 18, 2023

Software Engineering Research被引用 11

一句话总结

本综述为大语言模型的安全研究提供框架，详细描述从训练前到部署的安全风险、评估方法和改进策略。

ABSTRACT

As generative large model capabilities advance, safety concerns become more pronounced in their outputs. To ensure the sustainable growth of the AI ecosystem, it's imperative to undertake a holistic evaluation and refinement of associated safety risks. This survey presents a framework for safety research pertaining to large models, delineating the landscape of safety risks as well as safety evaluation and improvement methods. We begin by introducing safety issues of wide concern, then delve into safety evaluation methods for large models, encompassing preference-based testing, adversarial attack approaches, issues detection, and other advanced evaluation methods. Additionally, we explore the strategies for enhancing large model safety from training to deployment, highlighting cutting-edge safety approaches for each stage in building large models. Finally, we discuss the core challenges in advancing towards more responsible AI, including the interpretability of safety mechanisms, ongoing safety issues, and robustness against malicious attacks. Through this survey, we aim to provide clear technical guidance for safety researchers and encourage further study on the safety of large models.

研究动机与目标

定义大语言模型在毒性、不公平、伦理、具争议观点、错误信息、隐私以及恶意使用等方面的安全风险范围。
调查包括偏好测试、对抗攻击以及安全问题检测在内的评估方法。
概括覆盖训练前、对齐、推理和后处理的安全改进策略，以指导更安全的模型开发。

提出的方法

将安全风险分为六个领域，以提供结构化的风险全景。
描述包括偏好测试、对抗攻击和检测方法的评估框架。
回顾覆盖四个阶段的安全改进技术：训练前、对齐、推理和后处理。

实验结果

研究问题

RQ1LMs 的安全风险范围是什么？
RQ2如何量化和评估这些风险？
RQ3如何改进语言模型的安全性？

主要发现

语言模型的安全风险分为六个领域：毒性、不公平、伦理、具争议观点、错误信息、隐私以及恶意使用。
评估方法包括偏好测试、对抗性安全攻击和安全问题检测器，并关注高级指令遵循模型。
安全改进涵盖训练前的数据筛选、对齐技术（提示设计、RLHF、受控生成）、推理时的防护、以及后处理防御。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。