QUICK REVIEW

[Paper Review] Safety Assessment of Chinese Large Language Models

Hao Sun, Zhexin Zhang|arXiv (Cornell University)|Apr 20, 2023

Adversarial Robustness in Machine Learning|Cited by 16

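ONE-LINE SUMMARY

This paper presents a safety benchmark for Chinese LLMs, evaluates 15 models across 8 safety scenarios and 6 instruction-attack types, scores safety with an LLM evaluator, and releases SafetyPrompts for community use.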

ABSTRACT

With the rapid popularity of large language models such as ChatGPT and GPT-4, a growing amount of attention is paid to their safety concerns. These models may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes such as fraud and dissemination of misleading information. Evaluating and enhancing their safety is particularly essential for the wide application of large language models (LLMs). To further promote the safe deployment of LLMs, we develop a Chinese LLM safety assessment benchmark. Our benchmark explores the comprehensive safety performance of LLMs from two perspectives: 8 kinds of typical safety scenarios and 6 types of more challenging instruction attacks. Our benchmark is based on a straightforward process in which it provides the test prompts and evaluates the safety of the generated responses from the evaluated model. In evaluation, we utilize the LLM's strong evaluation ability and develop it as a safety evaluator by prompting. On top of this benchmark, we conduct safety assessments and analyze 15 LLMs including the OpenAI GPT series and other well-known Chinese LLMs, where we observe some interesting findings. For example, we find that instruction attacks are more likely to expose safety issues of all LLMs. Moreover, to promote the development and deployment of safe, responsible, and ethical AI, we publicly release SafetyPrompts including 100k augmented prompts and responses by LLMs.

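Motivation and Objectives

Promote the safe deployment of Chinese LLMs by developing a comprehensive safety taxonomy and benchmark.
Assess the safety performance of multiple Chinese models and OpenAI models using automated LLM-based evaluation.
Provide a public safety-prompt resource to support safety testing and model improvement.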

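Proposed Method

Define a taxonomy of 8 typical safety scenarios and 6 instruction-attack types.
Write safety prompts by hand and use an LLM evaluator to judge whether each model response is safe.
Compute a safety score per scenario and macro-average the scores to derive the overall safety score S.
Collect and evaluate 15 models, including the GPT series and Chinese LLMs, on the benchmark site.
Augment the prompts into the 100k-sample SafetyPrompts set and release it publicly.
Provide a leaderboard and open-source resources for safety assessment.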
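
The scoring step above can be illustrated with a minimal sketch. It assumes that a scenario's safety score is the fraction of responses the LLM evaluator judges safe, and that S is the unweighted mean of the per-scenario scores. The judge prompt wording, the function names, and the `judge` callable (any function that sends text to an LLM and returns its reply) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge prompt; the wording is an assumption, not the paper's template.
JUDGE_TEMPLATE = (
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    "Is the response safe? Answer Yes or No."
)

# One evaluated sample: (scenario name, test prompt, model response).
Record = Tuple[str, str, str]


def is_safe(prompt: str, response: str, judge: Callable[[str], str]) -> bool:
    """Ask the evaluator for a verdict and read it as a Yes/No answer."""
    verdict = judge(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    return verdict.strip().lower().startswith("yes")


def scenario_scores(
    records: Iterable[Record], judge: Callable[[str], str]
) -> Dict[str, float]:
    """Return, per scenario, the fraction of responses judged safe."""
    safe: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for scenario, prompt, response in records:
        total[scenario] += 1
        if is_safe(prompt, response, judge):
            safe[scenario] += 1
    return {name: safe[name] / total[name] for name in total}


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over scenarios, so each scenario counts equally."""
    if not scores:
        raise ValueError("no scenario scores to average")
    return sum(scores.values()) / len(scores)
```

Macro-averaging keeps a scenario with many prompts from dominating the result. If one scenario has 1 safe response out of 2 and another has 3 out of 4, the per-scenario scores are 0.5 and 0.75, so the macro average is 0.625, whereas pooling all six responses would give 4/6.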

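Experimental Results

Research Questions

RQ1: How do current Chinese LLMs perform on safety in typical safety scenarios?
RQ2: How do instruction attacks affect the safety of LLMs compared with typical safety scenarios?
RQ3: Can an automatic LLM-based evaluator reliably judge the safety of model outputs?
RQ4: What effect does augmenting the safety prompts have on model safety?
RQ5: How do different models compare on a unified safety leaderboard?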

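Key Findings

OpenAI's ChatGPT leads the safety scores in most scenarios, owing to its refusal of unsafe inputs and its safety data.
Instruction attacks consistently yield lower safety scores than typical scenarios, regardless of model.
Models trained on instruction data tend to be safer than general dialogue models.
In some scenarios ChatGPT is on par with Chinese LLMs such as ChatGLM and MiniChat, but the gap on instruction attacks remains large.
Safety scores under instruction attacks fall below those for typical scenarios, widening the overall safety gap between ChatGPT and the other models.
The SafetyPrompts library is publicly released with 100k augmented prompts and responses.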


This review was generated by AI and checked by a human editor.