QUICK REVIEW

[Paper Review] Accelerating Clinical Evidence Synthesis with Large Language Models

Zifeng Wang, Lang Cao | arXiv (Cornell University) | 2024-06-25

Machine Learning in Healthcare | Citations: 7

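One-Line Summary

TrialMind is an LLM-driven pipeline for end-to-end clinical evidence synthesis (search, screening, data extraction, and synthesis) that operates under human oversight and is evaluated on the TrialReviewBench dataset.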

ABSTRACT

Synthesizing clinical evidence largely relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating clinical evidence. Here, we introduce TrialMind, a generative artificial intelligence (AI) pipeline for facilitating human-AI collaboration in three crucial tasks for evidence synthesis: study search, screening, and data extraction. To assess its performance, we chose published systematic reviews to build the benchmark dataset, named TrialReviewBench, which contains 100 systematic reviews and the associated 2,220 clinical studies. Our results show that TrialMind excels across all three tasks. In study search, it generates diverse and comprehensive search queries to achieve high recall rates (Ours 0.711-0.834 v.s. Human baseline 0.138-0.232). For study screening, TrialMind surpasses traditional embedding-based methods by 30% to 160%. In data extraction, it outperforms a GPT-4 baseline by 29.6% to 61.5%. We further conducted user studies to confirm its practical utility. Compared to manual efforts, human-AI collaboration using TrialMind yielded a 71.4% recall lift and 44.2% time savings in study screening and a 23.5% accuracy lift and 63.4% time savings in data extraction. Additionally, when comparing synthesized clinical evidence presented in forest plots, medical experts favored TrialMind's outputs over GPT-4's outputs in 62.5% to 100% of cases. These findings show the promise of LLM-based approaches like TrialMind to accelerate clinical evidence synthesis via streamlining study search, screening, and data extraction from medical literature, with exceptional performance improvement when working with human experts.

Motivation and Objectives

Motivate the need for rapid, up-to-date clinical evidence synthesis amid explosive medical literature growth.
Propose an end-to-end AI-assisted pipeline (TrialMind) for search, screening, data extraction, and evidence synthesis.
Create and use TrialReviewBench to benchmark LLM-driven evidence synthesis.
Evaluate TrialMind against baselines and human experts across multiple cancer-treatment topics.

Proposed Method

Decompose synthesis into four tasks: query generation for search, eligibility screening with user-editable criteria, structured data extraction from PDFs/XMLs, and synthesis via forest plots.
Use PICO-enriched prompts to generate comprehensive Boolean queries for PubMed-like searches and augment/refine queries with user input.
Extract study characteristics and outcomes by aligning outputs to user-provided field descriptions and linking outputs to sources for manual verification.
Standardize clinical outcomes for meta-analysis and generate forest plots to present synthesized evidence.
Benchmark TrialMind using TrialReviewBench (870 studies across 25 meta-analyses) and compare it against GPT-4, MedCPT/MPNet, and human baselines.
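
The query-generation step above can be illustrated with a minimal sketch. It assumes that synonyms within each PICO element are joined with OR and that the elements are then joined with AND, which is a common convention for PubMed-style Boolean searches. The `PicoTerms` class, the helper names, and the example terms are all hypothetical and are not taken from TrialMind, which uses an LLM to propose and refine the terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PicoTerms:
    """Hypothetical container for PICO elements (not TrialMind's actual schema)."""
    population: List[str] = field(default_factory=list)
    intervention: List[str] = field(default_factory=list)
    comparison: List[str] = field(default_factory=list)
    outcome: List[str] = field(default_factory=list)


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_boolean_query(pico: PicoTerms) -> str:
    """OR the synonyms within each PICO element, then AND the non-empty elements."""
    groups = []
    for terms in (pico.population, pico.intervention, pico.comparison, pico.outcome):
        if terms:  # skip elements the user left empty
            groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)


# Illustrative terms only; in practice they would be proposed by the LLM
# and edited by the reviewer.
example = PicoTerms(
    population=["breast cancer", "breast neoplasms"],
    intervention=["hormone therapy", "tamoxifen"],
    outcome=["overall survival"],
)
query = build_boolean_query(example)
print(query)
```

Keeping the terms in a structured object rather than a raw string makes it easy for a reviewer to add or drop synonyms before the query is rebuilt, which mirrors the user-in-the-loop refinement described above.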

Experimental Results

Research Questions

RQ1: Can an LLM-driven pipeline retrieve and rank all target studies from large literature databases with high recall?
RQ2: Do user-editable inclusion criteria and multi-step prompting improve study screening and ranking over baseline LLM methods?
RQ3: How accurately can TrialMind extract study design, populations, and results from unstructured documents and support meta-analysis?
RQ4: Does the synthesized evidence produced by TrialMind match or outperform baselines and human judgments in forest plots and overall quality?

Key Results

TrialMind achieved average Recall of 0.921 across 25 reviews, outperforming GPT-4 (0.079) and Human baseline (0.230).
TrialMind consistently reached Recall near 1 in four topics, with notable gains in Hormone Therapy and Hyperthermia (e.g., Recall@50 improved 10.53- to 33.33-fold vs baselines).
Data extraction accuracy across topics ranged from 0.72 to 0.83 for study design/population/results, with precision above 0.86 and recall above 0.93 for studied fields.
Human evaluators preferred TrialMind over GPT-4 baselines for synthesized forest plots, with winning rates 62.5%-100% across five studies.
TrialMind reduced hallucinations and provided traceable sources, enabling human verification and correction of outputs.
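
For readers unfamiliar with the metrics quoted above, the following sketch shows one common way to compute Recall@k and a fold improvement over a baseline. It assumes Recall@k is the fraction of target studies that appear in the top k ranked results; the paper's exact evaluation code may differ. The study IDs and rankings are made-up toy data, not values from TrialReviewBench.

```python
import math
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of relevant studies found among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    if k < 0:
        raise ValueError("k must be non-negative")
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)


def fold_improvement(system_recall: float, baseline_recall: float) -> float:
    """Ratio of system recall to baseline recall (undefined for a zero baseline)."""
    if baseline_recall <= 0:
        raise ValueError("baseline_recall must be positive")
    return system_recall / baseline_recall


# Toy data (hypothetical): three target studies and two rankings of candidates.
relevant = {"s1", "s4", "s8"}
ranked_system = ["s1", "s4", "s7", "s8", "s2"]
ranked_baseline = ["s7", "s2", "s1", "s9", "s5"]

system_r3 = recall_at_k(ranked_system, relevant, k=3)      # s1 and s4 are in the top 3
baseline_r3 = recall_at_k(ranked_baseline, relevant, k=3)  # only s1 is in the top 3
print(system_r3, baseline_r3, fold_improvement(system_r3, baseline_r3))
```

A fold improvement is only meaningful when the baseline recall is non-zero, which is why the helper rejects a zero baseline instead of returning infinity.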

This review was generated by AI and reviewed by a human editor.