QUICK REVIEW

[Paper Review] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei, Adam E. Flanders|arXiv (Cornell University)|Jan 21, 2026

Artificial Intelligence in Healthcare and Education|Citations: 0

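One-Line Summary

REVEAL-CXR curates a radiologist-validated benchmark of 200 chest radiographs (100 released, 100 holdout) with 12 cardiothoracic labels for evaluating multimodal LLMs, using AI-assisted labeling to speed up expert annotation.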

ABSTRACT

Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. The aim was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure to allow radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree all", "Agree mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected "Agree All" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

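Motivation and Objectives

Provide a high-quality, expert-annotated chest radiograph benchmark focused on cardiothoracic findings.
Demonstrate an AI-assisted labeling workflow that lets radiologist annotation scale.
Secure balanced multi-site, multi-institution data, including a holdout set, so that models can be evaluated independently.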

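Proposed Method

Extract abnormal findings from the radiology reports with GPT-4o.
Map the extracted findings to 12 predefined benchmark labels with a locally hosted Phi-4-Reasoning model.
Draw a stratified sample of studies for expert review based on the AI-suggested labels (1–6 labels per study).
Have 17 radiologists from 10 institutions adjudicate the labels on a web platform (Agree All / Agree Mostly / Disagree).
Keep only studies with at least two "Agree All" ratings, which yields 381 studies; from these, select 100 for the released dataset and 100 for the holdout dataset.
Compute inter-rater agreement with Cohen's kappa and bootstrap confidence intervals, and compare each radiologist against the majority-vote reference.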
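
The last two steps (the consensus filter and the agreement statistic) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The `ratings` layout, the binary collapse of ratings into "Agree All" versus anything else, the averaging of Cohen's kappa over pairs of rater slots, and the percentile bootstrap are all hypothetical choices, because the review does not describe these details. All function names are invented for this example.

```python
import random
from itertools import combinations

AGREE_ALL = "Agree All"

def keep_consensus(ratings, min_agree=2):
    """Return IDs of studies with at least `min_agree` 'Agree All' ratings."""
    return [sid for sid, votes in ratings.items()
            if sum(v == AGREE_ALL for v in votes) >= min_agree]

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(rows):
    """Average kappa over all pairs of rater slots (an assumed aggregation)."""
    n_raters = len(rows[0])
    kappas = [cohen_kappa([r[i] for r in rows], [r[j] for r in rows])
              for i, j in combinations(range(n_raters), 2)]
    return sum(kappas) / len(kappas)

def bootstrap_ci(rows, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling studies with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        mean_pairwise_kappa([rng.choice(rows) for _ in rows])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

if __name__ == "__main__":
    # Toy ratings invented for illustration; three ratings per study.
    ratings = {
        "s1": ["Agree All", "Agree All", "Disagree"],
        "s2": ["Agree All", "Agree Mostly", "Disagree"],
        "s3": ["Agree All", "Agree All", "Agree All"],
        "s4": ["Disagree", "Disagree", "Agree Mostly"],
    }
    print(keep_consensus(ratings))
    binary_rows = [tuple(v == AGREE_ALL for v in votes)
                   for votes in ratings.values()]
    print(mean_pairwise_kappa(binary_rows), bootstrap_ci(binary_rows))
```

Because 17 radiologists rotated across studies, the three rater slots do not correspond to three fixed people, so treating slots as raters is a simplification. A multi-rater statistic such as Fleiss' kappa would be an alternative design.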

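Experimental Results

Research Questions

RQ1: Can an AI-assisted labeling workflow produce radiologist-validated labels for chest radiographs?
RQ2: How reliable is inter-radiologist agreement on the 12-label cardiothoracic chest radiograph benchmark?
RQ3: How do radiologist labels compare with AI-suggested labels on the holdout, multi-site dataset?
RQ4: Are the image acquisition characteristics of the released and holdout subsets balanced enough to allow fair evaluation?
RQ5: What limitations and potential roles do such benchmarks have in multimodal LLM evaluation?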

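Key Findings

A benchmark of 200 chest radiographs with 12 labels was created and made publicly available, with each study reviewed by three radiologists.
Cohen's kappa for inter-radiologist agreement (binary Agree/Disagree) was 0.622 (95% CI [0.590, 0.651]).
Agreement was lower for airspace opacity (kappa = 0.484, 95% CI [0.440, 0.524]); most findings showed kappa values between 0.744 and 0.809.
In 619 of 1,000 studies (61.9%), the majority voted Disagree with the LLM-suggested labels under the binary scheme, so divergence from the AI labels was frequent. This count is the complement of the 381 studies that received at least two "Agree All" ratings.
No significant differences in acquisition characteristics were found between the released and holdout subsets (all chi-square p values > 0.05).
The dataset emphasizes rare or multiple findings; 381 studies received "Agree All" from at least two of their three radiologists.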
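
The balance check between the released and holdout subsets can be illustrated with a chi-square test of independence. The sketch below is an assumption-laden example, not the paper's analysis: the review does not say which acquisition characteristics were tested, so the "view position" counts are invented, and the function `chi_square_2x2` is a hypothetical helper limited to 2×2 tables without continuity correction. With one degree of freedom, the p value equals `erfc(sqrt(stat / 2))`.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table (no continuity correction).

    `table` is [[a, b], [c, d]]; returns (statistic, p_value) with 1 dof.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("expected count of zero; test undefined")
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

if __name__ == "__main__":
    # Invented counts: rows are released / holdout, columns are PA / AP views.
    stat, p = chi_square_2x2([[60, 40], [55, 45]])
    print(f"chi2={stat:.3f}, p={p:.3f}, balanced={p > 0.05}")
```

A characteristic with more than two categories would need a general survival function for the chi-square distribution, which this sketch deliberately leaves out.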


This review was generated by AI and checked by a human editor.