QUICK REVIEW

[Paper Review] MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Tessa Han, Aounon Kumar | arXiv (Cornell University) | 2024-03-06

Machine Learning in Healthcare | Citations: 5

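One-Line Summary

This paper defines medical safety and alignment for LLMs, builds a dataset of harmful medical prompts (med-harm), evaluates the safety of general-knowledge and medical LLMs, shows that fine-tuning can improve safety, and discusses broader mitigation strategies for developing medical LLMs safely.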

ABSTRACT

As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights. However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association. We then leverage this understanding to introduce MedSafetyBench, the first benchmark dataset designed to measure the medical safety of LLMs. We demonstrate the utility of MedSafetyBench by using it to evaluate and improve the medical safety of LLMs. Our results show that publicly-available medical LLMs do not meet standards of medical safety and that fine-tuning them using MedSafetyBench improves their medical safety while preserving their medical performance. By introducing this new benchmark dataset, our work enables a systematic study of the state of medical safety in LLMs and motivates future work in this area, paving the way to mitigate the safety risks of LLMs in medicine. The benchmark dataset and code are available at https://github.com/AI4LIFE-GROUP/med-safety-bench.

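Motivation and Objectives

Define medical safety and alignment for medical AI based on the AMA Principles of Medical Ethics.
Create the med-harm dataset of harmful medical prompts, grounded in the AMA Principles of Medical Ethics.
Evaluate the safety and alignment of general-knowledge and medical LLMs using harmful-prompt benchmarks.
Demonstrate a fine-tuning-based mitigation strategy for improving safety.
Discuss broader approaches to developing safe and aligned medical LLMs.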

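Proposed Method

Define safety and alignment in medicine, using the AMA Principles of Medical Ethics as the guiding standard.
Generate harmful medical prompts with GPT-4 and a jailbroken Llama-2-7b-chat, and assemble them into the med-harm dataset of 1,742 harmful medical prompts spanning the nine AMA principles.
Evaluate general-knowledge and medical LLMs on hex-phi (general harm) and on med-harm-llama2 and med-harm-gpt4 (medical harm).
Use GPT-4 to score each LLM response to a harmful prompt on a 1–5 scale of willingness to comply, guided by Meta's policy for general safety and by the AMA principles for medical safety.
Compare aligned and unaligned general-knowledge LLMs with a range of medical LLMs to assess safety gaps and potential improvements.
Explore fine-tuning on safety demonstrations as a mitigation strategy (results forthcoming).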
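The judge-based scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the rubric wording, the `Score: <n>` reply format, and the names `build_judge_prompt`, `parse_score`, `score_response`, and `stub_judge` are all assumptions introduced here, and the stub stands in for a real GPT-4 call.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric text; the paper's actual judge prompt is not reproduced here.
RUBRIC = (
    "Rate the response to the request below on a 1-5 scale, where 1 means the "
    "model fully declines and 5 means it fully complies with a harmful request. "
    "Reply with 'Score: <n>'."
)


def build_judge_prompt(policy: str, request: str, response: str) -> str:
    """Assemble the text sent to the judge model (the layout is an assumption)."""
    return (
        f"{RUBRIC}\n\nPolicy:\n{policy}\n\n"
        f"Request:\n{request}\n\nResponse:\n{response}"
    )


def parse_score(judge_output: str) -> Optional[int]:
    """Return an integer score in 1..5, or None if the judge output is malformed."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def score_response(
    judge: Callable[[str], str], policy: str, request: str, response: str
) -> Optional[int]:
    """Score one (request, response) pair with any callable judge."""
    return parse_score(judge(build_judge_prompt(policy, request, response)))


def stub_judge(prompt: str) -> str:
    """Toy stand-in for a GPT-4 call: treats an explicit refusal as score 1."""
    return "Score: 1" if "I can't help" in prompt else "Score: 5"
```

Passing the judge as a callable keeps the parsing logic testable without network access; the policy argument would carry either the general usage policy or the AMA principles, depending on the dataset being scored.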

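Experimental Results

Research Questions

RQ1: How do general-knowledge and medical LLMs perform, in terms of safety and alignment, on harmful medical and general prompts?
RQ2: Do currently aligned general-knowledge LLMs behave more safely in medicine, and how do medical LLMs compare?
RQ3: Can fine-tuning on safety demonstrations improve the general and medical safety of medical LLMs?
RQ4: What practical mitigation strategies and broader approaches can support the development of safe and aligned medical LLMs?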

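Key Results

Aligned general-knowledge LLMs (Llama-2-chat, GPT-4, GPT-3.5, etc.) receive lower harmfulness scores than unaligned models, but still occasionally produce harmful responses.
Among medical LLMs, Meditron-70b shows consistently low harmfulness, whereas other medical LLMs score higher and carry a greater risk of harmful outputs.
On medical prompts, medical LLMs generally show higher harmfulness than the aligned general-knowledge LLM baseline.
General vs. medical: unaligned models perform worse on both general-harm (hex-phi) and medical-harm prompts, whereas aligned general-knowledge LLMs maintain safer behavior across datasets.
Medical prompts that contain technical jargon can elicit different perceptions of harm, and some prompts are judged more harmful when jargon is present.
Fine-tuning on safety demonstrations is presented as a promising mitigation strategy (results forthcoming).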
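Comparisons like the ones above rest on averaging per-response scores for each model and dataset. A minimal sketch of that aggregation follows. The function name `mean_harmfulness` and the record layout are assumptions made for illustration, and the numbers are placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_harmfulness(records):
    """Average 1-5 harmfulness scores per (model, dataset) pair."""
    buckets = defaultdict(list)
    for model, dataset, score in records:
        buckets[(model, dataset)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Placeholder records with hypothetical model names; NOT data from the paper.
toy_records = [
    ("model-a", "med-harm-gpt4", 1),
    ("model-a", "med-harm-gpt4", 2),
    ("model-b", "med-harm-gpt4", 4),
    ("model-b", "med-harm-gpt4", 5),
    ("model-a", "hex-phi", 1),
]

summary = mean_harmfulness(toy_records)
for (model, dataset), avg in sorted(summary.items()):
    print(f"{model:8s} {dataset:14s} mean harmfulness = {avg}")
```

A lower mean indicates that the model more often declined harmful requests on that dataset.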

This review was generated by AI and reviewed by a human editor.