QUICK REVIEW

[Paper Review] RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)

Yishu Wei, Adam E. Flanders|arXiv (Cornell University)|2026. 01. 21.

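Artificial Intelligence in Healthcare and Education | Citations: 0

One-Line Summary

REVEAL-CXR is a radiologist-verified benchmark of 200 chest radiographs (100 released, 100 holdout) with 12 cardiothoracic labels. It was curated with AI-assisted labeling to speed up expert annotation and is intended for evaluating multimodal LLMs.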

ABSTRACT

Multimodal large language models have demonstrated comparable performance to that of radiology trainees on multiple-choice board-style exams. However, to develop clinically useful multimodal LLM tools, high-quality benchmarks curated by domain experts are essential. This study aimed to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the MIDRC were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, and they marked "Agree All", "Agree Mostly" or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. Of these, at least two radiologists selected "Agree All" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
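
The consensus filter and prioritized selection described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study records, the placeholder label names (`label_b`, `label_c`), the priority key, and the alternating released/holdout split are all hypothetical, since the abstract does not specify the exact selection or splitting rule.

```python
from collections import Counter

AGREE_ALL = "Agree All"


def has_consensus(ratings, minimum=2):
    """Return True when at least `minimum` ratings are 'Agree All'."""
    return sum(r == AGREE_ALL for r in ratings) >= minimum


def select_benchmark(studies, n_total):
    """Keep consensus studies, rank them, and split them into two halves.

    Ranking (hypothetical): more labels first, then rarer labels first,
    with the study id as a deterministic tie-breaker.
    """
    kept = [s for s in studies if has_consensus(s["ratings"])]
    # Label frequency among the consensus studies only.
    label_freq = Counter(label for s in kept for label in s["labels"])

    def priority(study):
        rarest = min(label_freq[label] for label in study["labels"])
        return (-len(study["labels"]), rarest, study["id"])

    chosen = sorted(kept, key=priority)[:n_total]
    # Hypothetical split: alternate ranked studies between the two sets.
    released, holdout = chosen[0::2], chosen[1::2]
    return released, holdout


# Toy data: three ratings per study, as in the review protocol.
studies = [
    {"id": "s1", "labels": ["airspace opacity"],
     "ratings": ["Agree All", "Agree All", "Disagree"]},
    {"id": "s2", "labels": ["label_b", "label_c"],
     "ratings": ["Agree All", "Agree All", "Agree All"]},
    {"id": "s3", "labels": ["airspace opacity"],
     "ratings": ["Agree All", "Agree Mostly", "Disagree"]},
    {"id": "s4", "labels": ["label_b"],
     "ratings": ["Agree Mostly", "Agree All", "Agree All"]},
]

released, holdout = select_benchmark(studies, n_total=2)
print([s["id"] for s in released], [s["id"] for s in holdout])
```

In the toy data, `s3` is dropped because only one reader chose "Agree All", which mirrors how the 1,000 reviewed studies were reduced to 381 before the final 200 were picked.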

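Motivation and Objectives

Provide a high-quality, expert-annotated chest radiograph benchmark focused on cardiothoracic findings.
Demonstrate an AI-assisted labeling workflow that scales radiologist annotation.
Ensure a balanced, multi-institutional dataset that includes a holdout set for independent model evaluation.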

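Proposed Method

GPT-4o extracts abnormal findings from the radiology reports.
A locally hosted Phi-4-Reasoning model maps the extracted findings to 12 predefined benchmark labels.
Studies with AI-suggested labels are sampled in a stratified manner for expert review (1 to 6 labels per study).
Seventeen radiologists from 10 institutions adjudicate the labels through a web platform (Agree All / Agree Mostly / Disagree).
Only studies with two or more "Agree All" ratings are retained, yielding 381 studies; from these, 100 are selected for release and 100 for the holdout dataset.
Inter-rater agreement is computed with Cohen's kappa and bootstrap confidence intervals, and individual radiologists are compared against the majority-vote reference.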
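
The agreement statistic in the last step can be illustrated with a minimal sketch of Cohen's kappa and a percentile bootstrap confidence interval. It assumes two raters and binary ratings; how the paper pairs its 17 readers or aggregates across pairs is not stated in this review, so the function names, the toy ratings, and the bootstrap settings (`n_boot`, `seed`) are assumptions.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    n = len(a)
    categories = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling studies."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy binary ratings (1 = agree, 0 = disagree) from two hypothetical readers.
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

kappa = cohen_kappa(rater_a, rater_b)
lower, upper = bootstrap_ci(rater_a, rater_b)
print(round(kappa, 3), lower, upper)
```

Resampling whole studies, rather than individual ratings, keeps each pair of ratings together, which is what makes the interval reflect study-to-study variability.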

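Experimental Results

Research Questions

RQ1: Can an AI-assisted labeling workflow reliably produce radiologist-verified labels for chest radiographs?
RQ2: What is the level of inter-radiologist agreement on the 12-label cardiothoracic benchmark?
RQ3: How do radiologist labels compare with AI-suggested labels on the holdout multi-institutional chest radiograph dataset?
RQ4: Are the released and holdout subsets balanced in image acquisition characteristics so that model evaluation is fair?
RQ5: What are the limitations and the potential role of such a benchmark for evaluating multimodal LLMs?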

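Key Results

A benchmark of 200 chest radiographs with 12 labels was created and made publicly available; each study was reviewed by three radiologists.
Cohen's kappa for inter-radiologist agreement (binary Agree/Disagree) is 0.622 (95% CI [0.590, 0.651]).
Airspace opacity shows lower agreement (kappa = 0.484, 95% CI [0.440, 0.524]); most findings have kappa between 0.744 and 0.809.
In 619 of 1,000 studies (61.9%), the majority did not fully endorse the LLM-suggested labels (fewer than two "Agree All" ratings), so departures from the AI labels were frequent.
No significant differences in acquisition characteristics were found between the released and holdout datasets (all χ² p-values > 0.05).
The dataset emphasizes rare or multiple findings, and 381 studies reached agreement from at least two radiologists.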
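
The balance check between the released and holdout sets can be illustrated with a Pearson chi-square test on a 2x2 contingency table. The sketch below is a standard-library version without continuity correction; the counts and the two acquisition categories are hypothetical, because the review does not list the characteristics that were actually tested.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = released / holdout,
# columns = two categories of one acquisition characteristic.
table = [
    [60, 40],
    [55, 45],
]

chi2, p_value = chi_square_2x2(table)
print(round(chi2, 4), round(p_value, 4))
```

A p-value above 0.05, as in this toy table, gives no evidence of imbalance for that characteristic; it does not prove the two subsets are identical.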

This review was generated by AI and reviewed by a human editor.