QUICK REVIEW

[Paper Review] Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Raphael Poulain, Hamed Fayyaz|arXiv (Cornell University)|Apr 23, 2024

Artificial Intelligence in Law | Cited by 7

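ONE-SENTENCE SUMMARY

This study evaluates bias patterns in eight large language models (LLMs) across three clinical question-answering datasets using red-teaming and prompting strategies. It finds that bias is heterogeneous and that prompt design, particularly Chain of Thought, can reduce biased outcomes.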

ABSTRACT

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns such as larger models not being necessarily less biased and fine-tuned models on medical data not being necessarily better than the general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call for additional evaluations, scrutiny, and enhancement of LLMs used in clinical decision support applications.

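RESEARCH MOTIVATION AND GOALS

Evaluate how LLMs exhibit social bias on controlled clinical tasks built from datasets of standardized clinical vignettes.
Compare general-purpose and domain-adapted LLMs to understand how model architecture and training data affect bias.
Assess how prompting strategies (zero-shot, few-shot, Chain of Thought) affect bias patterns.
Identify the task types and subgroups at higher risk of biased outputs, and discuss mitigation approaches.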

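PROPOSED METHOD

Use three clinical QA datasets with standardized vignettes (Q-Pain, nurse bias, NEJM Healer) to probe bias across demographics.
Red-team the models by varying patient demographic attributes and evaluating the outputs of multiple LLMs (open-source general-purpose, domain-focused, and closed-source).
Test three prompting techniques on the selected datasets (zero-shot, few-shot, and Chain of Thought) to measure differences in bias and performance.
Quantify bias with Welch's ANOVA and pairwise t-tests (for binary outcomes), and with Pearson's Chi-Squared test (for Likert-scale ratings).
Analyze the results across model architectures and prompting methods to identify bias patterns and possible mitigation effects.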
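
To make the statistical step more concrete, the following is a minimal sketch of two of the quantities named above: the Welch t statistic for a pairwise comparison between two demographic groups, and the Pearson Chi-Squared statistic for a table of rating counts. It is not the authors' code. The function names and the toy numbers are assumptions made for illustration, it uses only the Python standard library, and it does not implement the full Welch's ANOVA.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    `a` and `b` are per-vignette outcomes for two groups, for example
    1 = medication denied, 0 = medication prescribed (hypothetical coding).
    Each group needs at least two observations and non-zero variance overall.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def chi_squared(table):
    """Pearson's chi-squared statistic and degrees of freedom.

    `table` is a contingency table: one row per demographic group,
    one column per rating level (e.g. Likert levels). Every row and
    column total must be non-zero, otherwise an expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Toy data, invented for illustration only (not results from the paper).
t_stat, t_df = welch_t([1, 0, 1, 1], [0, 0, 1, 0])
chi_stat, chi_dof = chi_squared([[10, 20], [20, 10]])
```

Turning these statistics into p-values requires the t and chi-squared distributions. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.chi2_contingency(table)` compute both the statistic and the p-value in one call.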

Figure 1: Visual description of the evaluation framework.

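EXPERIMENTAL RESULTS

Research questions

RQ1: To what extent do LLMs exhibit bias patterns in controlled clinical decision-making tasks?
RQ2: How do model design choices (architecture, domain-specific fine-tuning) influence the observed biases?
RQ3: How do prompting strategies (zero-shot, few-shot, Chain of Thought) affect fairness in clinical QA tasks?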

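Key findings

Bias varies across protected groups and tasks, and some models show significant differences in their recommendations or perceptions.
Model size alone does not predict bias; some smaller domain-adapted models show significant bias, while others are relatively fair.
Clinically tuned models (e.g., Palmyra-Med, Meditron) can show significant bias in pain management and treatment recommendations, while GPT-4's behavior varies by task.
Compared with zero-shot or simple few-shot prompting, Chain of Thought prompting tends to reduce bias and improve the rationale given for decisions.
Prompt engineering and careful question phrasing can affect fairness, offering a practical way to mitigate bias without retraining the model.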
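
As a rough illustration of how demographic red-teaming and the three prompting strategies fit together, the sketch below builds one prompt per demographic variant of a single vignette. Everything in it is an assumption made for illustration: the vignette text, the question wording, the demographic list, the Chain of Thought cue, and the function names are hypothetical and are not taken from the paper or its datasets.

```python
# Hypothetical demographic grid; the paper's datasets define their own attributes.
DEMOGRAPHICS = [
    (race, gender)
    for race in ("Asian", "Black", "Hispanic", "White")
    for gender in ("man", "woman")
]

# Hypothetical vignette template and question, invented for this sketch.
VIGNETTE = "Patient: a {race} {gender} presenting with severe lower back pain after a fall."
QUESTION = "Would you prescribe pain medication to this patient? Answer Yes or No."


def build_prompt(vignette, strategy, examples=()):
    """Wrap one vignette in a zero-shot, few-shot, or Chain of Thought prompt.

    `examples` is a sequence of (vignette, answer) pairs used only for few-shot.
    """
    if strategy == "zero-shot":
        return f"{vignette}\n{QUESTION}"
    if strategy == "few-shot":
        shots = "\n\n".join(f"{v}\n{QUESTION}\nAnswer: {a}" for v, a in examples)
        return f"{shots}\n\n{vignette}\n{QUESTION}\nAnswer:"
    if strategy == "cot":
        return f"{vignette}\n{QUESTION}\nLet's think step by step before giving a final answer."
    raise ValueError(f"unknown strategy: {strategy}")


def counterfactual_prompts(template, strategy, examples=()):
    """Return one prompt per demographic group; only the demographics differ."""
    return {
        (race, gender): build_prompt(
            template.format(race=race, gender=gender), strategy, examples
        )
        for race, gender in DEMOGRAPHICS
    }


prompts = counterfactual_prompts(VIGNETTE, "cot")
```

Because every prompt in `prompts` is identical apart from the demographic terms, any systematic difference in a model's answers across the keys can be attributed to those terms, which is the comparison the statistical tests are then applied to.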

Figure 2: Results on the Q-Pain dataset. The LLMs were presented with clinical vignettes describing various medical contexts and were asked whether they would prescribe pain medication to the patients. Each demographic is color-coded and the bars represent the average probability of denying the pain

This review was generated by AI and reviewed by human editors.