QUICK REVIEW

[Paper Review] Accelerating Clinical Evidence Synthesis with Large Language Models

Zifeng Wang, Lang Cao|arXiv (Cornell University)|Jun 25, 2024

Machine Learning in Healthcare|Citations: 7

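ONE-SENTENCE SUMMARY

TrialMind is an LLM-driven pipeline for end-to-end clinical evidence synthesis, covering search, screening, data extraction, and synthesis under human oversight; it was evaluated on the TrialReviewBench dataset.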

ABSTRACT

Synthesizing clinical evidence largely relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating clinical evidence. Here, we introduce TrialMind, a generative artificial intelligence (AI) pipeline for facilitating human-AI collaboration in three crucial tasks for evidence synthesis: study search, screening, and data extraction. To assess its performance, we chose published systematic reviews to build the benchmark dataset, named TrialReviewBench, which contains 100 systematic reviews and the associated 2,220 clinical studies. Our results show that TrialMind excels across all three tasks. In study search, it generates diverse and comprehensive search queries to achieve high recall rates (Ours 0.711-0.834 v.s. Human baseline 0.138-0.232). For study screening, TrialMind surpasses traditional embedding-based methods by 30% to 160%. In data extraction, it outperforms a GPT-4 baseline by 29.6% to 61.5%. We further conducted user studies to confirm its practical utility. Compared to manual efforts, human-AI collaboration using TrialMind yielded a 71.4% recall lift and 44.2% time savings in study screening and a 23.5% accuracy lift and 63.4% time savings in data extraction. Additionally, when comparing synthesized clinical evidence presented in forest plots, medical experts favored TrialMind's outputs over GPT-4's outputs in 62.5% to 100% of cases. These findings show the promise of LLM-based approaches like TrialMind to accelerate clinical evidence synthesis via streamlining study search, screening, and data extraction from medical literature, with exceptional performance improvement when working with human experts.

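Motivation and Objectives

Motivate the need for fast, up-to-date clinical evidence synthesis amid the explosive growth of medical literature.
Propose TrialMind, an end-to-end AI-assisted pipeline for search, screening, data extraction, and evidence synthesis.
Create and use TrialReviewBench to benchmark LLM-driven evidence synthesis.
Evaluate TrialMind against baselines and human experts across multiple cancer treatment topics.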

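Proposed Method

Decompose synthesis into four tasks: query generation for search, eligibility screening with user-editable criteria, structured data extraction from PDF/XML, and synthesis via forest plots.
Use PICO-augmented prompts to generate comprehensive Boolean search queries for PubMed-style search, and augment and refine the queries with user input.
Extract study characteristics and outcomes in line with user-provided field descriptions, and link outputs to their sources for manual verification.
Standardize clinical outcomes for meta-analysis and generate forest plots presenting the synthesized evidence.
Benchmark TrialMind on TrialReviewBench (870 studies, 25 meta-analyses) against GPT-4, MedCPT/MPNet baselines, and a human baseline.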
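
The query-generation step above can be illustrated with a minimal sketch. It assumes that candidate terms for each PICO element are already available (in the described pipeline they would be proposed by the LLM and refined by the user) and only shows how such terms might be combined into a Boolean expression: synonyms are joined with OR, and PICO elements are joined with AND. The function names and the example terms are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def or_group(terms: List[str]) -> str:
    """Join the synonyms of one PICO element with OR, inside parentheses."""
    cleaned = [quote(t) for t in terms if t.strip()]
    return "(" + " OR ".join(cleaned) + ")"


def build_boolean_query(pico: Dict[str, List[str]]) -> str:
    """AND together the OR-groups of every non-empty PICO element."""
    groups = [
        or_group(terms)
        for terms in pico.values()
        if any(t.strip() for t in terms)
    ]
    return " AND ".join(groups)


# Hypothetical PICO terms, for illustration only.
pico_terms = {
    "population": ["breast cancer", "breast neoplasms"],
    "intervention": ["hormone therapy", "tamoxifen"],
    "outcome": ["overall survival"],
}

query = build_boolean_query(pico_terms)
print(query)
# ("breast cancer" OR "breast neoplasms") AND ("hormone therapy" OR tamoxifen) AND ("overall survival")
```

Adding more synonyms to a group widens the search and tends to raise recall, while adding another PICO element narrows it; this is the trade-off that user edits to the generated query would adjust.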

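Experimental Results

Research Questions

RQ1: Can an LLM-driven pipeline retrieve and rank all target studies from a large literature database with high recall?
RQ2: Do user-editable inclusion criteria and multi-step prompting improve study screening and ranking over baseline LLM methods?
RQ3: How accurately can TrialMind extract study design, population, and outcomes from unstructured documents to support meta-analysis?
RQ4: Does the synthesized evidence generated by TrialMind match or exceed baselines and human judgment in forest plots and overall quality?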

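Key Findings

TrialMind achieved an average Recall of 0.921 across 25 reviews, outperforming GPT-4 (0.079) and the Human baseline (0.230).
TrialMind consistently reached Recall close to 1 across four topics, with marked improvements for Hormone Therapy and Hyperthermia (e.g., Recall@50 improved by 10.53x to 33.33x relative to baselines).
Across topics, data extraction accuracy for study design, population, and outcomes ranged from 0.72 to 0.83; precision on the target fields exceeded 0.86 and recall exceeded 0.93.
Human evaluators preferred TrialMind over the GPT-4 baseline for synthesized forest plots, with win rates of 62.5% to 100% across five studies.
TrialMind reduced hallucinations and provided traceable sources, enabling humans to verify and correct its outputs.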
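
For readers unfamiliar with the metrics reported above, the following sketch shows how Recall and Recall@k are conventionally computed for search and screening. It uses the standard definitions (the fraction of target studies found, overall or within the top k ranked results); the study identifiers are made up, and the paper's exact evaluation protocol may differ.

```python
from typing import Iterable, Sequence, Set


def recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of relevant (target) studies that appear among the retrieved ones."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    found = relevant.intersection(retrieved)
    return len(found) / len(relevant)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Recall computed over only the top-k items of a ranked list."""
    return recall(ranked[:k], relevant)


# Made-up identifiers: a ranked screening output and the studies that a
# published review actually included.
ranked_ids = ["s3", "s7", "s1", "s9", "s4"]
target_ids = {"s1", "s3", "s8", "s4"}

print(recall(ranked_ids, target_ids))          # 0.75 (s1, s3, s4 found; s8 missed)
print(recall_at_k(ranked_ids, target_ids, 2))  # 0.25 (only s3 is in the top 2)
```

A fold improvement such as the one quoted for Recall@50 would then be the ratio of two such values, one for each method being compared.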

This review was generated by AI and checked by a human editor.