QUICK REVIEW

[Paper Review] End-To-End Clinical Trial Matching with Large Language Models

Dyke Ferber, Lars Hilgers | arXiv (Cornell University) | Jul 18, 2024

Statistical Methods in Clinical Trials | Cited by 13

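One-Sentence Summary

This paper presents an end-to-end pipeline that uses GPT-4o to search oncology trials registered on clinicaltrials.gov, match them against a patient's EHR, and assess eligibility criterion by criterion; it reports high accuracy and, on some tasks, outperforms qualified physicians.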

ABSTRACT

Matching cancer patients to clinical trials is essential for advancing treatment and patient care. However, the inconsistent format of medical free text documents and complex trial eligibility criteria make this process extremely challenging and time-consuming for physicians. We investigated whether the entire trial matching process - from identifying relevant trials among 105,600 oncology-related clinical trials on clinicaltrials.gov to generating criterion-level eligibility matches - could be automated using Large Language Models (LLMs). Using GPT-4o and a set of 51 synthetic Electronic Health Records (EHRs), we demonstrate that our approach identifies relevant candidate trials in 93.3% of cases and achieves a preliminary accuracy of 88.0% when matching patient-level information at the criterion level against a baseline defined by human experts. Utilizing LLM feedback reveals that 39.3% criteria that were initially considered incorrect are either ambiguous or inaccurately annotated, leading to a total model accuracy of 92.7% after refining our human baseline. In summary, we present an end-to-end pipeline for clinical trial matching using LLMs, demonstrating high precision in screening and matching trials to individual patients, even outperforming the performance of qualified medical doctors. Our fully end-to-end pipeline can operate autonomously or with human supervision and is not restricted to oncology, offering a scalable solution for enhancing patient-trial matching in real-world settings.

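Motivation and Objectives

Demonstrate an end-to-end pipeline that maps patient EHRs to suitable oncology trials on ClinicalTrials.gov.
Combine No-SQL filtering with vector similarity search to retrieve relevant trials efficiently.
Use an LLM to run criterion-level eligibility checks that produce structured, verifiable output.
Provide criterion-level explanations and enable human-AI collaboration to refine the ground truth.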

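Proposed Method

Build a hybrid database (MongoDB + ChromaDB) that supports both exact-criteria lookups and vector search.
Embed trial text with BAAI/bge-large-en-v1.5 (1024 dimensions), splitting the text into chunks with a 50-token overlap for vector search.
Use GPT-4o to generate No-SQL queries and filter trials programmatically in multiple stages.
Represent eligibility criteria as structured, nested programmatic objects and evaluate patient data against them.
Return a True/False/Unknown verdict for each criterion, with an explanation produced through chain-of-thought style reasoning.
Evaluate the AI against a human baseline of five oncologists, and use the AI's feedback to refine the ground truth iteratively.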
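The criterion-level step above can be illustrated with a minimal sketch. The review only says that criteria are held as nested programmatic objects with True/False/Unknown verdicts; the `Criterion` class, its `kind` field, and the three-valued AND/OR combination rules below are assumptions made for illustration, not the authors' implementation. The leaf verdicts and explanations are hard-coded here, whereas in the described pipeline they would come from the LLM.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Three-valued verdict: True, False, or None (standing in for "Unknown").
Tri = Optional[bool]


@dataclass
class Criterion:
    """Hypothetical container for one criterion or a logical group of criteria."""
    text: str
    kind: str = "leaf"  # "leaf", "all" (AND over children), or "any" (OR over children)
    children: List["Criterion"] = field(default_factory=list)
    result: Tri = None  # leaf verdict; would be filled in by the LLM
    explanation: str = ""  # leaf rationale; would be filled in by the LLM


def tri_and(values: List[Tri]) -> Tri:
    """AND over three values: any False wins, otherwise any Unknown wins."""
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True


def tri_or(values: List[Tri]) -> Tri:
    """OR over three values: any True wins, otherwise any Unknown wins."""
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False


def evaluate(criterion: Criterion) -> Tri:
    """Recursively combine leaf verdicts up the nested structure."""
    if criterion.kind == "leaf":
        return criterion.result
    child_results = [evaluate(child) for child in criterion.children]
    if criterion.kind == "all":
        return tri_and(child_results)
    return tri_or(child_results)


# Made-up example data, not taken from the paper.
inclusion = Criterion(
    text="Inclusion criteria",
    kind="all",
    children=[
        Criterion("Age >= 18 years", result=True,
                  explanation="EHR states the patient is 64."),
        Criterion("ECOG performance status 0-1", result=None,
                  explanation="ECOG status is not documented."),
        Criterion("Prior systemic therapy", kind="any", children=[
            Criterion("Prior platinum-based chemotherapy", result=True,
                      explanation="Carboplatin listed in treatment history."),
            Criterion("Prior immunotherapy", result=False,
                      explanation="No checkpoint inhibitor documented."),
        ]),
    ],
)

overall = evaluate(inclusion)  # Unknown, because ECOG status is missing
```

Keeping Unknown as a distinct value, rather than collapsing it to False, lets a reviewer see which missing EHR details block a decision instead of silently excluding the patient.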

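Experimental Results

Research Questions

RQ1: Can GPT-4o correctly identify a pool of suitable candidate trials for a given patient EHR from more than 100,000 oncology trials?
RQ2: Can the system accurately evaluate eligibility at the level of individual criteria and explain each decision?
RQ3: Does the end-to-end pipeline match or exceed human expert performance in trial matching and criterion evaluation?
RQ4: Is a programmatic, structured-output approach more reliable and portable than free-text prompting for eligibility assessment?
RQ5: How does iterative human-AI refinement affect the accuracy of the ground truth?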

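Key Findings

The pipeline retrieved the relevant, human-preselected trial in 93.3% of test cases (15 base cases).
Initial criterion-level matching reached 88.0% accuracy (1,398 of 1,589 criteria) against the human annotations.
Refining the human ground truth with AI feedback raised overall accuracy to 92.7%.
On review, 39.3% of the criteria initially scored as model errors turned out to be ambiguous or inaccurately annotated by the human raters, which suggests considerable room for AI-assisted correction.
In the final candidate set, the top-5 and top-10 rankings captured 10 of 15 and 14 of 15 target trials, respectively.
The approach shows high accuracy in end-to-end trial matching and is not restricted to oncology.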
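The reported percentages can be reconciled with a short calculation. This is a sketch under two assumptions that the review does not state explicitly: that 93.3% corresponds to 14 of the 15 base cases, and that the 39.3% of initially incorrect criteria later judged ambiguous or mis-annotated (per the abstract) are counted as correct after refinement.

```python
# Counts stated in the review.
total_criteria = 1589
initially_correct = 1398

# Initial criterion-level accuracy against the human annotations.
initial_accuracy = initially_correct / total_criteria

# Criteria first scored as model errors.
initially_wrong = total_criteria - initially_correct

# Assumption: 39.3% of those errors are reclassified as correct after review.
reclassified = round(0.393 * initially_wrong)
refined_accuracy = (initially_correct + reclassified) / total_criteria

# Assumption: the retrieval rate corresponds to 14 of 15 base cases.
retrieval_rate = 14 / 15

print(f"initial accuracy: {initial_accuracy:.1%}")
print(f"refined accuracy: {refined_accuracy:.1%}")
print(f"retrieval rate:   {retrieval_rate:.1%}")
```

Under these assumptions the three values round to 88.0%, 92.7%, and 93.3%, matching the figures quoted above.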

This review was generated by AI and checked by a human editor.