QUICK REVIEW

[Paper Review] Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Ran Zhang, Yucong Lin|arXiv (Cornell University)|Mar 24, 2026

COVID-19 diagnosis using AI | Citations: 0

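One-line summary

This paper proposes Ran Score, a finding-level evaluation metric that uses a clinician-guided LLM framework to extract 21 chest X-ray findings and align model outputs with a radiologist reference. It reports strong macro-averaged performance and robustness across languages.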

ABSTRACT

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

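Motivation and objectives

Develop a clinician-guided framework for extracting multi-label findings from chest X-ray reports.
Define Ran Score as a finding-level, macro-averaged metric of report fidelity.
Create a large-scale, clinically aligned label resource in the form of a 21-label taxonomy of chest X-ray findings.
Establish a radiologist-derived reference and evaluate multiple radiology report generation models.
Demonstrate cross-lingual generalization to ChestX-CN without language-specific prompt changes.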

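Proposed method

Build a 21-label chest X-ray finding taxonomy from exploratory extraction and radiologist input.
Establish the radiologist reference standard by majority vote (≥4/6) over independently annotated reports.
Use a Human–LLM collaborative framework to refine prompts iteratively under clinician guidance, raising label-specific accuracy (≥90%).
Optimize prompts through error-driven analysis to handle synonyms, negation, and low-prevalence labels.
With the optimized prompts, extract findings from generated reports and reference reports, and compute Ran Score as the macro-averaged F1.
Compare multiple LLM-based report generation models using Ran Score and conventional metrics.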
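The two numerical steps above (a ≥4/6 majority vote for the reference labels, then a macro-averaged F1 over findings) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the binary present/absent encoding, and the choice to score a label with no positives and no predictions as 0 are all assumptions, since the review does not specify them.

```python
from typing import List, Sequence


def majority_reference(votes: Sequence[Sequence[int]], min_agree: int = 4) -> List[int]:
    """Reference labels for one report (hypothetical helper).

    votes[r][k] is 1 if annotator r marked finding k as present, else 0.
    A finding is kept when at least `min_agree` annotators agree (4 of 6 here).
    """
    n_labels = len(votes[0])
    return [
        int(sum(annotator[k] for annotator in votes) >= min_agree)
        for k in range(n_labels)
    ]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference: Sequence[Sequence[int]], predicted: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-finding F1 over all reports.

    reference[i][k] and predicted[i][k] are 0/1 flags for finding k in report i.
    """
    n_labels = len(reference[0])
    per_label = []
    for k in range(n_labels):
        tp = sum(1 for r, p in zip(reference, predicted) if r[k] == 1 and p[k] == 1)
        fp = sum(1 for r, p in zip(reference, predicted) if r[k] == 0 and p[k] == 1)
        fn = sum(1 for r, p in zip(reference, predicted) if r[k] == 1 and p[k] == 0)
        per_label.append(f1_from_counts(tp, fp, fn))
    return sum(per_label) / n_labels


# Six annotators, three findings, one report (made-up votes).
votes = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 0, 1]]
reference_labels = majority_reference(votes)

# Three reports, two findings (made-up labels).
ref = [[1, 0], [1, 1], [0, 0]]
pred = [[1, 0], [0, 1], [0, 0]]
score = macro_f1(ref, pred)
```

In the real pipeline the 0/1 flags would come from the optimized LLM prompts applied to the generated and reference reports; here they are hard-coded so the arithmetic is easy to follow.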

Figure 1: Human–LLM collaborative framework for multi-label finding extraction from chest X-ray reports. (a) Exploratory extraction and clustering of disease-related entities from 3,000 chest X-ray reports to inform the definition of standardized finding labels. (b) Establishment of the diagnostic r

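Experimental results

Research questions

RQ1 Can the clinician-guided Human–LLM prompt loop align LLM-extracted findings with the radiologist reference standard across the 21 labels?
RQ2 Does Ran Score provide clinically meaningful finding-level evaluation that gives weight to low-prevalence abnormalities?
RQ3 Does the prompt optimization framework generalize from MIMIC-CXR-EN to ChestX-CN without language-specific adjustment?
RQ4 How does evaluation with Ran Score differ from conventional metrics when applied to other models?
RQ5 How does few-shot prompt optimization affect macro- versus micro-averaged performance across labels?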
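The macro versus micro distinction in RQ5 matters because the two averages treat rare findings differently. The sketch below uses made-up labels and hypothetical function names to show the effect: macro averaging gives each finding equal weight, while micro averaging pools counts across findings, so one missed rare finding barely moves it.

```python
from typing import Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_micro_f1(
    reference: Sequence[Sequence[int]], predicted: Sequence[Sequence[int]]
) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for 0/1 label matrices of shape reports x findings."""
    n_labels = len(reference[0])
    per_label = []
    total_tp = total_fp = total_fn = 0
    for k in range(n_labels):
        tp = sum(1 for r, p in zip(reference, predicted) if r[k] == 1 and p[k] == 1)
        fp = sum(1 for r, p in zip(reference, predicted) if r[k] == 0 and p[k] == 1)
        fn = sum(1 for r, p in zip(reference, predicted) if r[k] == 1 and p[k] == 0)
        per_label.append(f1_from_counts(tp, fp, fn))  # macro: score each finding separately
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_label) / n_labels
    micro = f1_from_counts(total_tp, total_fp, total_fn)  # micro: pool counts first
    return macro, micro


# Ten reports, two findings: a common one (always detected) and a rare one (missed once).
reference = [[1, 0]] * 9 + [[0, 1]]
predicted = [[1, 0]] * 9 + [[0, 0]]
macro, micro = macro_and_micro_f1(reference, predicted)
```

With these toy labels the single missed rare finding halves the macro average while the micro average stays near 0.95, which is the behavior that makes a macro-averaged metric sensitive to low-prevalence abnormalities.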

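Key findings

The macro-averaged F1 score for finding extraction improved from 0.753 before prompt optimization to 0.956 after optimization on the development cohort.
On comparable labels, Ran Score exceeded CheXbert by 15.7 percentage points.
After optimization, labels such as Fracture, Pneumothorax, Cavity, and Cyst reached near-perfect F1.
Qwen3-14B showed high label-level accuracy on most findings, and its cross-lingual generalization to ChestX-CN was also accurate and stable.
Among the report generation models, LLM-RG4 achieved the highest macro-averaged F1 under Ran Score, followed by R2GenGPT and others. XrayGPT ranked lowest on the macro-averaged evaluation.
Qualitative radiologist review agreed with the Ran Score rankings, supporting the clinical relevance of the evaluation framework.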

Figure 2: Flowchart of dataset construction and cohort allocation. Reports from MIMIC-CXR were screened and divided into three non-overlapping MIMIC-CXR-EN cohorts: a 3,000-report cohort for taxonomy construction, a 300-report development cohort for prompt optimization and radiologist reference-stan

This review was generated by AI and checked by a human editor.