[Paper Explained] EviAgent: Evidence-Driven Agent for Radiology Report Generation
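EviAgent grounds radiology reports in explicit visual evidence and retrieved knowledge through a plan, ReAct, and evidence-extraction pipeline. It leads overall on MIMIC-CXR, CheXpert Plus, and IU-Xray, and it can be deployed locally on open-source backbones.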
Automated radiology report generation holds immense potential to alleviate the heavy workload of radiologists. Despite the formidable vision-language capabilities of recent Multimodal Large Language Models (MLLMs), their clinical deployment is severely constrained by inherent limitations: their "black-box" decision-making renders the generated reports untraceable due to the lack of explicit visual evidence to support the diagnosis, and they struggle to access external domain knowledge. To address these challenges, we propose the Evidence-driven Radiology Report Generation Agent (EviAgent). Unlike opaque end-to-end paradigms, EviAgent coordinates a transparent reasoning trajectory by breaking down the complex generation process into granular operational units. We integrate multi-dimensional visual experts and retrieval mechanisms as external support modules, endowing the system with explicit visual evidence and high-quality clinical priors. Extensive experiments on MIMIC-CXR, CheXpert Plus, and IU-Xray datasets demonstrate that EviAgent outperforms both large-scale generalist models and specialized medical models, providing a robust and trustworthy solution for automated radiology report generation.
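Motivation and goals
- Address the lack of explicit visual evidence and external knowledge in end-to-end radiology report generation.
- Propose a transparent, evidence-based agent framework in which conclusions rest on visual findings.
- Use open-source backbones so the system can run locally and preserve privacy.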
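Proposed method
- A Plan-Act-Report paradigm decomposes the task into fine-grained units.
- A tool-augmented ReAct loop dynamically invokes perception tools and the retrieval module.
- Evidence extraction builds a verifiable evidence chain E from the tool outputs.
- A retrieval-augmented knowledge base supplies external clinical priors.
- Tool integration through the Model Context Protocol makes the system easy to extend without fine-tuning.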
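The control flow described in the list above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: every name (`Evidence`, `TOOLS`, `plan`, `run_agent`) is hypothetical, the three tools are stubs that return fixed strings, and the planner is a fixed list where a real system would presumably query an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; none of this is taken from the EviAgent code.


@dataclass
class Evidence:
    tool: str     # tool that produced the observation
    finding: str  # observation kept as a piece of evidence


def classify(image_id: str) -> str:
    # Stub standing in for a classification expert.
    return "cardiomegaly: positive"


def localize(image_id: str) -> str:
    # Stub standing in for a localization expert.
    return "cardiomegaly located at the cardiac silhouette"


def retrieve(image_id: str) -> str:
    # Stub standing in for the retrieval module (external clinical priors).
    return "similar case report: 'The heart is enlarged.'"


# Tool registry: the loop looks tools up by name, so adding a tool
# does not require changing the loop itself.
TOOLS: Dict[str, Callable[[str], str]] = {
    "classify": classify,
    "localize": localize,
    "retrieve": retrieve,
}


def plan(image_id: str) -> List[str]:
    # Plan: a fixed sequence here (assumption); one entry per fine-grained unit.
    return ["classify", "localize", "retrieve"]


def run_agent(image_id: str) -> Tuple[str, List[Evidence]]:
    chain: List[Evidence] = []  # the evidence chain E
    # Act: for each planned step, call the tool and record its observation.
    for action in plan(image_id):
        observation = TOOLS[action](image_id)
        chain.append(Evidence(tool=action, finding=observation))
    # Report: every sentence cites the evidence item it is based on.
    lines = [f"{ev.finding} [E{i}]" for i, ev in enumerate(chain, start=1)]
    return "\n".join(lines), chain


if __name__ == "__main__":
    report, chain = run_agent("img-001")
    print(report)
```

Because each report line carries an `[E#]` tag, a wrong statement can be traced back to the tool output that produced it, which is the traceability property the summary emphasizes.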
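Experimental results
Research questions
- RQ1: Can an evidence-driven agent ground radiology reports in explicit visual evidence and external knowledge?
- RQ2: Does collaboration among multiple expert tools improve clinical accuracy and traceability over end-to-end multimodal large language models?
- RQ3: How do planning and evidence extraction affect report quality?
- RQ4: How does retrieval-augmented knowledge affect clinical content and language quality?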
Key findings
| Model | MIMIC-CXR RaTE | MIMIC-CXR Semb | MIMIC-CXR RadCliQ -1 | CheXpert Plus RaTE | CheXpert Plus Semb | CheXpert Plus RadCliQ -1 | IU-Xray RaTE | IU-Xray Semb | IU-Xray RadCliQ -1 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.1 | 4.91 | 4.98 | 8.67 | 5.62 | 4.85 | 4.23 | 8.17 | 5.55 | 7.42 |
| Claude 4.5 Sonnet | 3.51 | 3.79 | 8.59 | 4.40 | 3.41 | 3.14 | 8.30 | 4.33 | 7.12 |
| Gemini-2.5-Flash | 5.74 | 6.19 | 9.06 | 6.48 | 4.28 | 4.70 | 7.88 | 5.05 | 6.67 |
| LLaVA-Med-7B | 1.74 | 2.88 | 4.28 | 2.23 | 1.71 | 2.85 | 3.86 | 2.06 | 2.07 |
| HuatuoGPT-V-7B | 2.20 | 5.07 | 7.33 | 3.17 | 1.94 | 4.27 | 5.65 | 2.84 | 1.21 |
| BiMediX2-8B | 1.41 | 2.76 | 3.84 | 1.86 | 1.22 | 2.35 | 3.69 | 1.68 | 0.51 |
| MedGemma-4B-IT | 5.44 | 5.61 | 8.16 | 5.97 | 4.16 | 3.82 | 7.68 | 4.80 | 7.24 |
| Lingshu-7B | 5.88 | 5.87 | 8.66 | 6.37 | 4.60 | 3.91 | 7.75 | 5.20 | 7.39 |
| InternVL2.5-8B | 2.41 | 3.44 | 7.33 | 3.41 | 2.55 | 2.86 | 7.06 | 3.52 | 6.81 |
| InternVL3-8B | 3.07 | 4.71 | 7.55 | 4.05 | 2.82 | 3.88 | 7.29 | 3.85 | 3.68 |
| Qwen2.5-VL-7B | 2.30 | 3.49 | 7.30 | 3.43 | 2.21 | 2.93 | 7.13 | 3.33 | 5.79 |
| Qwen3-VL-8B | 3.94 | 4.66 | 7.51 | 4.75 | 3.72 | 3.78 | 7.48 | 4.58 | 5.74 |
| EviAgent (Ours) | 6.04 | 6.32 | 8.70 | 6.61 | 4.91 | 6.66 | 8.45 | 5.72 | 7.48 |
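- EviAgent achieves the best score on most metrics across the three datasets (RaTE, Semb, and RadCliQ -1): it tops eight of the nine columns in the table, the exception being MIMIC-CXR RadCliQ -1, where Gemini-2.5-Flash scores 9.06 against 8.70.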
- On MIMIC-CXR, EviAgent beats GPT-5.1 on RaTE (6.04 vs. 4.91) and Lingshu-7B on Semb (6.32 vs. 5.87); on IU-Xray, its RadCliQ -1 of 7.48 is the highest in the table.
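- Removing classification, localization, or retrieval each degrades performance, and planning and evidence extraction are especially critical.
- Qualitative analysis shows improved diagnostic precision and error traceability: errors can be traced to the perception modules rather than to the reasoning engine.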
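The ranking claim can be checked directly against the results table. The short Python sketch below copies the table values and reports the top model per column. It assumes that higher is better for all three metrics, which the table itself does not state; the names `COLUMNS`, `SCORES`, and `best_per_column` are illustrative.

```python
from typing import Dict, List

COLUMNS: List[str] = [
    "MIMIC-CXR RaTE", "MIMIC-CXR Semb", "MIMIC-CXR RadCliQ -1",
    "CheXpert Plus RaTE", "CheXpert Plus Semb", "CheXpert Plus RadCliQ -1",
    "IU-Xray RaTE", "IU-Xray Semb", "IU-Xray RadCliQ -1",
]

# Values copied row by row from the table above.
SCORES: Dict[str, List[float]] = {
    "GPT-5.1":           [4.91, 4.98, 8.67, 5.62, 4.85, 4.23, 8.17, 5.55, 7.42],
    "Claude 4.5 Sonnet": [3.51, 3.79, 8.59, 4.40, 3.41, 3.14, 8.30, 4.33, 7.12],
    "Gemini-2.5-Flash":  [5.74, 6.19, 9.06, 6.48, 4.28, 4.70, 7.88, 5.05, 6.67],
    "LLaVA-Med-7B":      [1.74, 2.88, 4.28, 2.23, 1.71, 2.85, 3.86, 2.06, 2.07],
    "HuatuoGPT-V-7B":    [2.20, 5.07, 7.33, 3.17, 1.94, 4.27, 5.65, 2.84, 1.21],
    "BiMediX2-8B":       [1.41, 2.76, 3.84, 1.86, 1.22, 2.35, 3.69, 1.68, 0.51],
    "MedGemma-4B-IT":    [5.44, 5.61, 8.16, 5.97, 4.16, 3.82, 7.68, 4.80, 7.24],
    "Lingshu-7B":        [5.88, 5.87, 8.66, 6.37, 4.60, 3.91, 7.75, 5.20, 7.39],
    "InternVL2.5-8B":    [2.41, 3.44, 7.33, 3.41, 2.55, 2.86, 7.06, 3.52, 6.81],
    "InternVL3-8B":      [3.07, 4.71, 7.55, 4.05, 2.82, 3.88, 7.29, 3.85, 3.68],
    "Qwen2.5-VL-7B":     [2.30, 3.49, 7.30, 3.43, 2.21, 2.93, 7.13, 3.33, 5.79],
    "Qwen3-VL-8B":       [3.94, 4.66, 7.51, 4.75, 3.72, 3.78, 7.48, 4.58, 5.74],
    "EviAgent (Ours)":   [6.04, 6.32, 8.70, 6.61, 4.91, 6.66, 8.45, 5.72, 7.48],
}


def best_per_column(scores: Dict[str, List[float]]) -> Dict[str, str]:
    # Assumption: higher is better in every column.
    return {
        col: max(scores, key=lambda model: scores[model][i])
        for i, col in enumerate(COLUMNS)
    }


best = best_per_column(SCORES)
wins = [col for col, model in best.items() if model == "EviAgent (Ours)"]

if __name__ == "__main__":
    for col, model in best.items():
        print(f"{col}: {model}")
    print(f"EviAgent tops {len(wins)} of {len(COLUMNS)} columns")
```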
This summary was generated by AI and reviewed by a human editor.