QUICK REVIEW

[Paper Summary] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei, Adam E. Flanders|arXiv (Cornell University)|Jan 21, 2026
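Artificial Intelligence in Healthcare and Education | Cited by: 0
One-Sentence Summary

REVEAL-CXR curates a radiologist-validated benchmark of 200 chest radiographs with 12 cardiothoracic labels, using AI-assisted labeling to speed up expert annotation, for evaluating multimodal large language models.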

ABSTRACT

Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. This study aimed to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. For 381 radiographs, at least two radiologists selected "Agree all". From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

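Motivation and Objectives

  • Provide a high-quality, expert-annotated chest radiograph benchmark focused on cardiothoracic findings.
  • Demonstrate an AI-assisted labeling workflow that scales up radiologist annotation.
  • Keep the data balanced across multiple centers and institutions, and set aside a holdout set for independent model evaluation.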

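Proposed Method

  • Use GPT-4o to extract abnormal findings from the radiology reports.
  • Map the extracted findings to 12 predefined benchmark labels with a locally hosted Phi-4-Reasoning model.
  • Apply stratified sampling on the AI-suggested labels to select studies for expert review (1–6 labels per study).
  • Have 17 radiologists from 10 institutions adjudicate the labels on a web platform (Agree all / Agree mostly / Disagree).
  • Keep only studies for which at least two reviewers chose "Agree all", which yields 381 studies; from these, select 100 studies for the released set and 100 for the holdout set.
  • Measure inter-rater agreement with Cohen's kappa and bootstrap confidence intervals, and compare each radiologist against a majority-vote reference standard.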
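The last two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the summary does not say how ratings from 17 radiologists were paired, so the sketch assumes a simple two-rater comparison on binary Agree/Disagree votes, and the function names (`keep_study`, `cohen_kappa`, `bootstrap_ci`) and the percentile bootstrap are assumptions made for illustration.

```python
import random
from collections import Counter


def keep_study(votes, required=2):
    """Return True if at least `required` reviewers chose 'Agree all'."""
    return sum(v == "Agree all" for v in votes) >= required


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical votes (1 = Agree, 0 = Disagree) from two raters on ten studies.
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(cohen_kappa(rater_a, rater_b))
print(bootstrap_ci(rater_a, rater_b))
print(keep_study(["Agree all", "Agree all", "Disagree"]))
```

Resampling whole studies, rather than individual votes, keeps each pair of ratings together, which is what makes the bootstrap interval reflect uncertainty about agreement.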

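Experimental Results

Research Questions

  • RQ1: Can the AI-assisted labeling workflow produce radiologist-validated chest radiograph labels?
  • RQ2: How well do radiologists agree with one another (interrater reliability) on a 12-label cardiothoracic chest radiograph benchmark?
  • RQ3: How do radiologist labels compare with the AI-suggested labels on the multicenter holdout chest radiograph dataset?
  • RQ4: Are the released and holdout subsets balanced in image acquisition attributes and suitable for fair model evaluation?
  • RQ5: What are the limitations and potential roles of this kind of benchmark in evaluating multimodal LLMs?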

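Key Findings

  • A benchmark of 200 chest radiographs with 12 labels was created and publicly released, with each study reviewed by three radiologists.
  • Agreement among radiologists (binary Agree/Disagree) had a kappa of 0.622 (95% CI [0.590, 0.651]).
  • Agreement was lower for airspace opacity (kappa = 0.484, 95% CI [0.440, 0.524]); for most findings, kappa ranged from 0.744 to 0.809.
  • In 619 of 1,000 studies (61.9%), the majority opinion departed from the LLM-suggested labels, indicating substantial disagreement with the AI labels; this count equals 1,000 minus the 381 studies that received at least two "Agree all" votes.
  • The released and holdout subsets showed no significant differences in acquisition attributes (chi-square test, all p values > 0.05).
  • The dataset emphasizes rare or multiple findings, and 381 studies reached agreement from at least two radiologists.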
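The balance check between the released and holdout subsets can be illustrated with a chi-square test of independence. The sketch below is an assumption-laden example, not the paper's analysis: the counts are invented, the acquisition attribute is a hypothetical binary one, and the function `chi_square_2x2` is written here for illustration. It handles only 2x2 tables without continuity correction, where the p-value for one degree of freedom has the closed form erfc(sqrt(chi2 / 2)), and it assumes every row and column total is nonzero.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 contingency table (no correction).

    `table` is ((a, b), (c, d)); returns (chi2, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, r in enumerate(row_totals):
        for j, col in enumerate(col_totals):
            expected = r * col / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are released vs holdout, columns are two
# values of an invented binary acquisition attribute.
counts = ((60, 40), (55, 45))
chi2, p = chi_square_2x2(counts)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

A p-value above 0.05, as reported for every attribute in the findings, means the test found no evidence that the two subsets differ on that attribute; it does not prove the subsets are identical.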


This summary was generated by AI and reviewed by a human editor.