QUICK REVIEW

[Paper Review] Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Ran Zhang, Yucong Lin|arXiv (Cornell University)|Mar 24, 2026

COVID-19 diagnosis using AI | Cited by 0

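One-Sentence Summary

This paper proposes Ran Score, a finding-level evaluation metric for radiology report generation. It is built on a clinician-guided LLM framework that extracts 21 chest X-ray findings and compares model outputs against a radiologist-derived reference standard, showing strong macro-averaged performance and cross-lingual robustness.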

ABSTRACT

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

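Research Motivation and Objectives

Develop a clinician-guided framework for multi-label finding extraction from chest X-ray reports.
Define Ran Score as a finding-level, macro-averaged evaluation metric for assessing report fidelity.
Create a large, clinically aligned label resource (a 21-label taxonomy) for chest X-ray findings.
Evaluate multiple radiology report generation models against a radiologist-derived reference.
Demonstrate cross-lingual generalization on ChestX-CN without language-specific prompt modification.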

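Proposed Method

Build a 21-label taxonomy of chest X-ray findings from exploratory extraction and radiologist input.
Establish the radiologist reference standard by majority vote (≥4/6) over independently annotated reports.
Use a human-LLM collaborative framework with iterative, clinician-guided prompt refinement to reach high label-specific accuracy (≥90%).
Optimize prompts through error-driven analysis that targets synonyms, negation, and low-prevalence labels.
Apply the optimized prompts to extract findings from both generated and reference reports, and compute Ran Score as the macro-averaged F1.
Compare multiple LLM-based report generation models (and baselines) using Ran Score alongside traditional metrics.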
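
The two numeric steps in the method, majority voting and macro-averaged F1, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the LLM extraction step is assumed to have already produced one set of labels per report, and skipping labels that never occur is an assumption, since the source does not say how undefined F1 values are handled.

```python
from collections import Counter


def majority_vote(annotations, min_votes=4):
    """Keep each label chosen by at least `min_votes` annotators for one report."""
    counts = Counter(label for labels in annotations for label in set(labels))
    return {label for label, n in counts.items() if n >= min_votes}


def per_label_f1(reference, predicted, label):
    """F1 for one label over paired lists of per-report label sets."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for ref, pred in pairs if label in ref and label in pred)
    fp = sum(1 for ref, pred in pairs if label not in ref and label in pred)
    fn = sum(1 for ref, pred in pairs if label in ref and label not in pred)
    if tp + fp + fn == 0:
        return None  # label never occurs, so F1 is undefined (assumption: skip it)
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(reference, predicted, labels):
    """Unweighted mean of per-label F1 over the labels that occur at least once."""
    scores = [per_label_f1(reference, predicted, label) for label in labels]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Six hypothetical annotators label one report; only labels with >= 4 votes survive.
votes = [{"Pneumothorax"}, {"Pneumothorax"}, {"Pneumothorax", "Fracture"},
         {"Pneumothorax"}, {"Fracture"}, set()]
print(majority_vote(votes))  # {'Pneumothorax'}

# Toy label sets for three reports: reference labels vs. labels from generated text.
labels = ["Pneumothorax", "Fracture", "Cavity and Cyst"]
reference = [{"Pneumothorax"}, {"Fracture"}, set()]
predicted = [{"Pneumothorax"}, set(), {"Fracture"}]
print(macro_f1(reference, predicted, labels))  # 0.5
```

In the toy data, Pneumothorax scores 1.0, Fracture scores 0.0, and Cavity and Cyst is skipped because it never occurs, so the macro average is 0.5.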

Figure 1: Human–LLM collaborative framework for multi-label finding extraction from chest X-ray reports. (a) Exploratory extraction and clustering of disease-related entities from 3,000 chest X-ray reports to inform the definition of standardized finding labels. (b) Establishment of the diagnostic r

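Experimental Results

Research Questions

RQ1: Can a clinician-guided human-LLM prompt loop align LLM-extracted findings with the radiologist reference standard across 21 labels?
RQ2: Does Ran Score provide a clinically meaningful, finding-level evaluation that emphasizes low-prevalence abnormalities?
RQ3: How well does the prompt optimization framework generalize across languages (MIMIC-CXR-EN to ChestX-CN) without language-specific adjustment?
RQ4: How do different chest X-ray report generation models perform under Ran Score and under traditional metrics?
RQ5: How does few-shot prompt optimization affect macro-averaged and micro-averaged performance across labels?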
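
RQ2 and RQ5 turn on the difference between macro and micro averaging. The Python sketch below illustrates why a macro average is more sensitive to low-prevalence findings. The label names `common_finding` and `rare_finding` and the toy data are hypothetical, not taken from the paper, and every label is assumed to occur at least once.

```python
def confusion_counts(reference, predicted, label):
    """Return (tp, fp, fn) for one label over paired per-report label sets."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        if label in pred and label in ref:
            tp += 1
        elif label in pred:
            fp += 1
        elif label in ref:
            fn += 1
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 if the label never occurs (guard only)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_micro_f1(reference, predicted, labels):
    per_label = [confusion_counts(reference, predicted, label) for label in labels]
    # Macro: average the per-label F1 values, so every label weighs the same.
    macro = sum(f1(*c) for c in per_label) / len(per_label)
    # Micro: pool the counts first, so frequent labels dominate the result.
    micro = f1(*(sum(column) for column in zip(*per_label)))
    return macro, micro


# Ten toy reports: nine carry a common finding, one carries a rare finding.
reference = [{"common_finding"}] * 9 + [{"rare_finding"}]
# The hypothetical model gets every common case right and misses the rare one.
predicted = [{"common_finding"}] * 9 + [set()]

macro, micro = macro_and_micro_f1(reference, predicted,
                                  ["common_finding", "rare_finding"])
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
# macro F1 = 0.500, micro F1 = 0.947
```

Missing the single rare case halves the macro average but barely moves the micro average, which is the behavior a metric aimed at low-prevalence abnormalities needs.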

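Key Findings

On the MIMIC-CXR-EN development cohort, prompt optimization raises the macro-averaged F1 for finding extraction from 0.753 to 0.956.
The optimized framework exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels.
After optimization, several labels reach a perfect F1 score (e.g., Fracture, Pneumothorax, Cavity and Cyst).
Qwen3-14B achieves strong per-label accuracy on many findings and generalizes robustly to ChestX-CN, with high accuracy across multiple labels.
Among report generation models, LLM-RG4 achieves the highest macro-averaged F1 under Ran Score, followed by R2GenGPT and others; XrayGPT performs worst in the macro-averaged evaluation.
Qualitative radiologist review agrees with the automated Ran Score rankings, supporting the clinical relevance of the evaluation framework.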

Figure 2: Flowchart of dataset construction and cohort allocation. Reports from MIMIC-CXR were screened and divided into three non-overlapping MIMIC-CXR-EN cohorts: a 3,000-report cohort for taxonomy construction, a 300-report development cohort for prompt optimization and radiologist reference-stan


This review was generated by AI and checked by a human editor.