QUICK REVIEW

[论文解读] MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Tessa Han, Aounon Kumar|arXiv (Cornell University)|Mar 6, 2024

Machine Learning in Healthcare被引用 5

一句话总结

本论文定义医疗安全与对齐，建立有害医疗提示数据集（med-harm），在安全性方面评估通用与医疗大模型，并演示微调可提升安全性；并讨论用于安全医疗大模型的更广泛缓解策略。

ABSTRACT

As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights. However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association. We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset designed to measure the medical safety of LLMs. We demonstrate the utility of MedSafetyBench by using it to evaluate and improve the medical safety of LLMs. Our results show that publicly-available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety while preserving their medical performance. By introducing this new benchmark dataset, our work enables a systematic study of the state of medical safety in LLMs and motivates future work in this area, paving the way to mitigate the safety risks of LLMs in medicine. The benchmark dataset and code are available at https://github.com/AI4LIFE-GROUP/med-safety-bench.

研究动机与目标

基于 AMA 医学伦理原则，定义医疗 AI 的医疗安全与对齐。
创建 med-harm 数据集，涵盖九项 AMA 原则的有害医疗提示。
使用有害提示基准评估通用知识和医疗大模型的安全性与对齐。
通过微调安全演示来展示一种缓解策略以提升安全性。
讨论开发安全且对齐的医疗大模型的更广泛方法。

提出的方法

以 AMA 医学伦理原则为指导标准，定义医学中的安全性与对齐。
构建 med-harm，这是一个包含 1,742 条有害医疗提示的数据集，覆盖九项 AMA 原则，使用 GPT-4 和 jailbroken Llama-2-7b-chat 生成提示。
评估通用知识和医疗 LLM 在三个数据集上的表现（hex-phi 表示通用伤害，med-harm-llama2 和 med-harm-gpt4 表示医疗伤害）。
使用 GPT-4 对有害提示的 LLM 回应进行评分，采用 1–5 的意愿尺度，遵循使用政策（面向通用安全的 Meta 政策，面向医疗安全的 AMA 原则）。
比较对齐与非对齐的通用 LLM 以及各种医疗 LLM，以评估安全差距及潜在改进。
探索基于安全演示的微调作为缓解策略（结果待出）。

实验结果

研究问题

RQ1通用知识与医疗 LLM 在面对有害的医疗与一般提示时，其安全性与对齐性表现如何？
RQ2当前对齐的通用知识 LLM 在医疗领域是否实现了更安全的行为，医疗 LLM 又如何比较？
RQ3在安全演示上进行微调是否可以提升医疗 LLM 的通用与医疗安全性？
RQ4有哪些实际的缓解策略和更广泛的方法，用于开发安全并对齐的医疗 LLM？

主要发现

对齐的通用知识 LLM（如 Llama-2-chat、GPT-4、GPT-3.5）在各数据集上的有害性分数低于非对齐模型，但有时仍会输出有害提示。
在医疗 LLM 之间，Meditron-70b 显示出持续较低的有害性，而其他医疗 LLM 往往分数较高，表明输出有害内容的风险较大。
与通用知识 LLM 的对比中，医疗 LLM 在医疗提示上的有害性普遍较高。
通用对齐与非对齐模型的对比：非对齐模型在通用伤害（hex-phi）和医疗伤害上表现更差，而对齐的通用知识 LLM 在各数据集上保持更安全的行为。
带有术语的医疗提示可能引发不同的伤害感知，某些提示在存在术语时更具伤害性。
基于安全演示的微调被视为一个有希望的缓解策略（结果待出）。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。