QUICK REVIEW

[Paper Review] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

Saketh Vinjamuri, Marielle Fis Loperena|arXiv (Cornell University)|Mar 22, 2026

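Machine Learning in Healthcare | Cited by 0

One-Sentence Summary

TimeTox is an end-to-end LLM-based pipeline that automatically extracts time toxicity from the Schedule of Assessments (SoA) tables of clinical trial protocols; the paper compares a vanilla architecture against a two-stage one and assesses the pipeline on 644 real-world oncology protocols.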

ABSTRACT

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.

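Motivation and Objectives

Quantify the patient time burden (time toxicity) described in protocol documents, a metric that is labor-intensive to extract by hand.
Develop an end-to-end pipeline that uses Gemini models to extract and compute time toxicity from SoA tables.
Compare the single-pass (vanilla) and two-stage (structure-then-count) extraction architectures.
Assess production feasibility through multi-run consensus and deployment on real-world protocols.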
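
To make the target quantity concrete, the following minimal sketch counts cumulative healthcare contact days at a set of timepoints for one treatment arm. It is not the authors' code: the function name, the 21-day cycle, the visit days (1, 8, 15) and the six timepoints are all hypothetical, chosen only to mirror the "three visit days per cycle" layout and the "six cumulative timepoints" mentioned in the abstract.

```python
from typing import Dict, Iterable, List


def cumulative_contact_days(visit_days: Iterable[int],
                            timepoints: List[int]) -> Dict[int, int]:
    """Count distinct visit days falling on or before each timepoint.

    Both arguments are expressed in study days (day 1 = first visit).
    """
    unique_days = sorted(set(visit_days))  # same-day visits count once
    return {t: sum(1 for d in unique_days if d <= t) for t in timepoints}


# Hypothetical arm: visits on days 1, 8 and 15 of each 21-day cycle, 6 cycles.
CYCLE_LENGTH = 21
visits = [cycle * CYCLE_LENGTH + day
          for cycle in range(6)
          for day in (1, 8, 15)]

# Hypothetical timepoints: the end of each of the six cycles.
timepoints = [21, 42, 63, 84, 105, 126]

totals = cumulative_contact_days(visits, timepoints)
for t, n in totals.items():
    print(f"day {t}: {n} cumulative contact days")
```

In this toy schedule every cycle adds three contact days, so the totals grow by three at each timepoint. Real SoA tables are harder because visits can be conditional, unscheduled, or described only in footnotes.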

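Proposed Method

Use Google Gemini models to extract a summary from each full-length protocol PDF.
Implement two extraction architectures: vanilla single-pass and two-stage structure-then-count.
Apply position-based multi-run consensus to mitigate the instability of arm names across runs.
Validate against 20 synthetic schedules with known ground-truth time toxicity values.
Process 644 real-world oncology protocols to demonstrate production feasibility.
Release the code and the synthetic ground-truth generator as open source.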
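
The position-based consensus step can be illustrated with a short sketch. It assumes each run returns its arms in the same order, so arms are matched by index rather than by name. The data layout, the function name `position_consensus`, and the use of the median as the aggregation rule are assumptions made for illustration; the source does not state how the runs are combined.

```python
from statistics import median
from typing import Dict, List

# Assumed layout: a run is a list of arms, each arm a dict with a free-text
# "name" and a list of "days" (one value per cumulative timepoint).
Run = List[Dict]


def position_consensus(runs: List[Run]) -> Run:
    """Match arms across runs by list position and take the per-timepoint median."""
    n_arms = len(runs[0])
    if any(len(run) != n_arms for run in runs):
        raise ValueError("runs disagree on the number of arms")

    consensus = []
    for i in range(n_arms):
        per_run_days = [run[i]["days"] for run in runs]
        # zip(*...) groups the values of all runs for the same timepoint.
        days = [median(values) for values in zip(*per_run_days)]
        # Keep the first run's label; names are not used for matching.
        consensus.append({"name": runs[0][i]["name"], "days": days})
    return consensus


# Three hypothetical runs that label the same arm differently.
runs = [
    [{"name": "Arm A", "days": [3, 6, 9, 12, 15, 18]}],
    [{"name": "Experimental", "days": [3, 6, 9, 12, 16, 18]}],
    [{"name": "Arm A: Drug X", "days": [3, 6, 10, 12, 15, 19]}],
]

result = position_consensus(runs)
print(result[0]["name"], result[0]["days"])
```

Matching by position sidesteps the fact that an LLM may name the same arm differently on each run, at the cost of assuming that the arm order itself is stable.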

Figure 1: Representative SoA table from a complex synthetic breast oncology protocol (BRST-2025-01) showing two treatment arms with three visit days per cycle.

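Experimental Results

Research Questions

RQ1: Can an LLM-based pipeline accurately quantify time toxicity from Schedule of Assessments tables?
RQ2: Which architecture (vanilla or two-stage) is more accurate and more stable on synthetic and on real-world data?
RQ3: Does multi-run consensus make time toxicity extraction more robust and reduce run-to-run variation?
RQ4: Is extraction feasible at production scale in terms of time, cost, and reproducibility across protocols?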

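Key Findings

Two-stage extraction is highly accurate on synthetic data (across 240 synthetic comparisons: MAE 0.81 days; exact match 0.3%; clinically acceptable 100%) but less stable on real-world protocols.
Vanilla extraction is only moderately accurate on synthetic data (41.5% clinically acceptable, MAE 9.0 days) but highly stable on real-world protocols (no MAE is reported here; 95.3% of the 644 protocols are clinically acceptable; 82.0% show perfect stability).
The production deployment uses vanilla extraction with 3-run consensus, covers 644 protocols, and yields time toxicity data for 1,288 treatment arms.
Processing time: summary generation takes 2–3 minutes per protocol; vanilla extraction takes about 4 minutes per protocol; the 644 protocols take about 128 hours in total.
The open-source code and the ground-truth generator are available in the TimeTox GitHub repository.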
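
The metrics quoted above can be sketched as follows: MAE and the share of values within ±3 days for synthetic data with ground truth, and the interquartile range (IQR) across runs for real-world data without ground truth. The function names and sample values are hypothetical, and the quantile method (`inclusive`, i.e. linear interpolation) is an assumption; the source does not say how the IQR of 3 runs is computed.

```python
from statistics import quantiles
from typing import Sequence


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error in days."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def acceptable_rate(pred: Sequence[float], truth: Sequence[float],
                    tol: float = 3) -> float:
    """Fraction of predictions within +/- tol days of the ground truth."""
    hits = sum(abs(p - t) <= tol for p, t in zip(pred, truth))
    return hits / len(truth)


def iqr(values: Sequence[float]) -> float:
    """Interquartile range of one value extracted over several runs."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Synthetic-style check against known ground truth (hypothetical numbers).
truth = [3, 6, 9, 12]
pred = [3, 7, 9, 16]
print("MAE:", mae(pred, truth))
print("within 3 days:", acceptable_rate(pred, truth))

# Real-world-style check: the same arm and timepoint extracted in 3 runs.
stable = iqr([10, 10, 10])    # identical runs -> IQR 0 ("perfect stability")
unstable = iqr([10, 12, 20])  # IQR above 3 days -> not acceptable
print("IQR stable:", stable, "IQR unstable:", unstable)
```

The distinction matters for reading the findings: the synthetic figures measure agreement with a known answer, while the real-world figures measure only whether repeated runs agree with each other.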

Figure 2: Step-by-step pipeline for processing protocol PDFs via the Gemini API to extract relevant schedules and generate a consolidated summary document.

This review was generated by AI and checked by a human editor.