QUICK REVIEW

[論文レビュー] Self-Verification Improves Few-Shot Clinical Information Extraction

Zelalem Gero, Chandan Singh|arXiv (Cornell University)|May 30, 2023

Topic Modeling被引用数 18

ひとこと要約

この論文は自己検証（SV）フレームワークを導入し、同じ LLM に対して異なるプロンプトを用いた複数回の呼び出しを通じて臨床情報を抽出し、出力をエビデンスに基づいてグラウンド化し、誤りを絞り込むことで少数-shot の性能を改善し、解釈可能なスパンベースの grounding を生み出します。

ABSTRACT

Extracting patient information from unstructured text is a critical task in health decision-support and clinical research. Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning, in contrast to supervised learning which requires much more costly human annotations. However, despite drastic advances in modern LLMs such as GPT-4, they still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs. This is made possible by the asymmetry between verification and generation, where the latter is often much easier than the former. Experimental results show that our method consistently improves accuracy for various LLMs in standard clinical information extraction tasks. Additionally, self-verification yields interpretations in the form of a short text span corresponding to each output, which makes it very efficient for human experts to audit the results, paving the way towards trustworthy extraction of clinical information in resource-constrained scenarios. To facilitate future research in this direction, we release our code and prompts.

研究の動機と目的

ヘルスケア領域におけるLLMsを用いた少数-shot の臨床情報抽出の精度と解釈性の制約を解消する。
自己検証パイプラインを提案し、出力を逐次 refine、grounding、prune する。
SV が複数のタスクとモデルで抽出精度を向上させつつ、解釈可能な根拠 grounding を提供することを示す。
将来の研究を促進するためのコードとプロンプトを公開する。

提案手法

同じ LLM を異なるプロンプトで使用する四段階の SV パイプライン: Original extraction、Omission で欠落要素を見つける、Evidence grounding で根拠となるスパンを返す、Prune で不正確さを除去。
長い入力の場合、Omission ステップを新たな欠落が見つからなくなるまで最大五回繰り返す。
Grounding spans は抽出された各項目に対するテキスト根拠を提供し、人間による監査を可能にする。
SV を単一の大規模プロンプトのベースラインと比較し、macro F1、recall、precision を複数のタスクとモデルに渡って評価した。
プロンプトと正確なプロンプトは GitHub リポジトリに提供されている。

Figure 1: Overview of self-verification pipeline for clinical information extraction. Each step calls the same LLM with different prompts to refine the information from the previous steps. Below each step we show abbreviated outputs for extracting a list of assigned diagnoses from a sample clinical

実験結果

リサーチクエスチョン

RQ1自己検証は、さまざまなモデルにわたって、少数-shot の臨床情報抽出タスクで抽出精度を改善しますか？
RQ2SV は解釈可能な grounding を提供して人間の監査を支援し、性能を犠牲にしませんか？
RQ3個別の SV コンポーネント（Omission、Evidence、Prune）は、タスクを横断して全体の性能にどう寄与しますか？
RQ4SV の有効性と grounding 品質にはモデル依存またはタスク依存のパターンがあるか？

主な発見

SV はタスクとモデルを横断して一貫して抽出精度を改善する（平均 F1 の改善は約 0.056）。
GPT-4 は大幅な利得を示す（例: 臨床試験群での F1 増加 >0.1、薬物状態で >0.3）。
アブレーション結果は、Omission が recall を、Prune が precision を高め、Full SV が双方をバランスよく改善して F1 を向上させる。
SV は人間の判断とよく一致する解釈可能な grounding spans を生み出す（例えば GPT-4 では人間が注釈した介入と 93% のスパンオーバー）。
長い入力の場合、Omission は F1 の増分に寄与する傾向があり、Prune は精度の改善を維持する；Full SV は均衡のとれた改善を提供。

Figure 2: Example output and interpretation for medication status. For each element of the output list, our pipeline outputs the text span which contains evidence for that generated output (shown with highlighting).

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。