QUICK REVIEW

[Paper Review] Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Raphael Poulain, Hamed Fayyaz | arXiv (Cornell University) | Apr 23, 2024

Artificial Intelligence in Law | Cited by 7

One-sentence summary

Using red-teaming and prompting strategies, this study evaluates bias patterns in eight LLMs across three clinical QA datasets, finds heterogeneous biases, and shows that prompt design (Chain of Thought in particular) can reduce biased outcomes.

ABSTRACT

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns such as larger models not being necessarily less biased and fine-tuned models on medical data not being necessarily better than the general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call on additional evaluations, scrutiny, and enhancement of LLMs used in clinical decision support applications.

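Motivation and objectives

Evaluate how LLMs exhibit social bias across controlled clinical tasks, using standardized vignette datasets.
Compare general-purpose and domain-tuned LLMs to understand how model architecture and training data affect bias.
Assess how prompting strategies (zero-shot, few-shot, Chain of Thought) affect bias patterns.
Identify the task types and subpopulations at higher risk of biased outputs, and examine mitigation approaches.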

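Proposed method

Examine bias by demographic group on three clinical QA datasets built from standardized vignettes (Q-Pain, nurse bias, NEJM Healer).
Apply red-teaming by rotating patient demographics and evaluating the outputs of multiple LLMs (open-source general-purpose, domain-specific, and closed-source).
Test three prompting techniques (zero-shot, few-shot, Chain of Thought) on selected datasets and measure differences in bias and performance.
Quantify bias with Welch's ANOVA and pairwise t-tests for binary outcomes, and Pearson's chi-squared test for Likert-scale ratings.
Analyze results across model architectures and prompting techniques to identify bias patterns and potential mitigation effects.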
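
The demographic rotation and the chi-squared step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the vignette template, the attribute lists, and the function names `rotate_demographics` and `pearson_chi2` are all hypothetical, and only the test statistic and degrees of freedom are computed (no p-value).

```python
from itertools import product

# Hypothetical template and attribute values; the datasets' real wording differs.
VIGNETTE = "Patient is a {age}-year-old {race} {gender} presenting with {complaint}."
RACES = ["Black", "White", "Asian", "Hispanic"]
GENDERS = ["man", "woman"]


def rotate_demographics(age, complaint):
    """Yield ((race, gender), vignette) pairs that differ only in demographics."""
    for race, gender in product(RACES, GENDERS):
        text = VIGNETTE.format(age=age, race=race, gender=gender, complaint=complaint)
        yield (race, gender), text


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows of counts: one row per demographic group,
    one column per response category (for example, Likert levels).
    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof
```

Each generated vignette would be sent to a model, the responses tallied per group, and the resulting table passed to `pearson_chi2`. Holding everything but the demographics fixed is what lets a difference in responses be attributed to the demographic change.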

Figure 1: Visual description of the evaluation framework.

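Experimental results

Research questions

RQ1: To what extent do LLMs show biased patterns in controlled clinical decision-making tasks?
RQ2: How do model design choices (architecture, domain-specific fine-tuning) affect the observed biases?
RQ3: What effect do prompting strategies (zero-shot, few-shot, Chain of Thought) have on fairness in clinical QA tasks?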
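
To make the three prompting strategies concrete, here is a minimal sketch of how one vignette could be wrapped in each style. The layout, the `build_prompt` helper, and the Chain of Thought instruction text are assumptions for illustration; this review does not reproduce the paper's actual prompts.

```python
# Assumed Chain of Thought instruction; the paper's exact wording is not given here.
COT_SUFFIX = "Let's think step by step before giving a final answer."

STRATEGIES = {"zero-shot", "few-shot", "cot"}


def build_prompt(vignette, question, strategy="zero-shot", examples=()):
    """Assemble a prompt for one vignette under the chosen strategy.

    `examples` is a sequence of (vignette, question, answer) triples and is
    only used when strategy == "few-shot".
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    parts = []
    if strategy == "few-shot":
        # Prepend solved examples so the model can imitate the answer format.
        for ex_vignette, ex_question, ex_answer in examples:
            parts.append(
                f"Vignette: {ex_vignette}\nQuestion: {ex_question}\nAnswer: {ex_answer}"
            )
    parts.append(f"Vignette: {vignette}\nQuestion: {question}")
    if strategy == "cot":
        # Ask for intermediate reasoning before the final decision.
        parts.append(COT_SUFFIX)
    parts.append("Answer:")
    return "\n\n".join(parts)
```

Because the vignette and question are identical across strategies, any change in the measured bias can be attributed to the prompt wrapper rather than to the clinical content.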

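Key findings

Bias is uneven across protected groups and tasks, and some models show marked differences in their recommendations and perceptions.
Model size alone does not predict bias. Some small domain-tuned models show pronounced bias, while others remain comparatively fair.
Clinically tuned models (e.g., Palmyra-Med, Meditron) can show pronounced bias in pain management and treatment recommendations, while GPT-4's behavior varies by task.
Chain of Thought prompting tends to reduce bias and improve the justification of decisions compared with zero-shot or simple few-shot prompting.
Prompt engineering and careful framing of questions can affect fairness, suggesting a practical route to mitigating bias without retraining the model.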
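
A group-level disparity of the kind described above can be checked by comparing per-group outcome rates. The sketch below computes a denial rate per group and a two-sample Welch t statistic on binary decisions (1 = medication denied, 0 = prescribed). Using the Welch (unequal-variance) form for the pairwise comparison is an assumption, as are the function names; the paper's exact procedure is not detailed in this review.

```python
from math import sqrt
from statistics import mean, variance


def denial_rate(decisions):
    """Fraction of binary decisions equal to 1 (denial)."""
    return mean(decisions)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    `a` and `b` are sequences of 0/1 decisions for two demographic groups.
    Each needs at least two observations, and at least one group must have
    non-zero variance, otherwise the statistic is undefined.
    """
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, dof
```

A large absolute `t` with the corresponding degrees of freedom suggests the two groups were treated differently. With many group pairs, a multiple-comparison correction would normally be needed before calling any single difference significant.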

Figure 2: Results on the Q-Pain dataset. The LLMs were presented with clinical vignettes describing various medical contexts and were asked whether they would prescribe pain medication to the patients. Each demographic is color-coded and the bars represent the average probability of denying the pain


This review was generated by AI and checked by a human editor.