QUICK REVIEW

[Paper Review] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Pierre Chambon, Christian Bluethgen | arXiv (Cornell University) | November 23, 2022

Colorectal Cancer Screening and Detection | Citations: 56

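One-Line Summary

RoentGen adapts a latent diffusion model to generate high-resolution chest X-ray images conditioned on medical text prompts, enabling domain-specific fine-tuning and data augmentation for downstream tasks.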

ABSTRACT

Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different to natural images, and the language used to succinctly capture relevant details in medical data uses a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multi-modal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models faithfully representing medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and evaluate image quality and text-image alignment by human domain experts. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement of a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge in the text-encoder and can improve its representation capabilities of certain diseases like pneumothorax by 25%.

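Motivation and Objectives

Argue that the distribution shift between natural images and medical imaging concepts calls for domain-adapted generative models.
Develop RoentGen by adapting a pre-trained latent diffusion model to chest X-ray data and radiology reports.
Evaluate image fidelity, diversity, and text-image alignment using quantitative metrics and expert assessment.
Demonstrate downstream benefits of domain-specific fine-tuning, such as improved classifier performance and stronger text-encoder representations.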

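Proposed Method

Fine-tune or retrain the Stable Diffusion pipeline (VAE, U-Net, text encoder) on a corpus of chest X-rays and radiology reports.
Perform conditional generation from short in-domain medical prompts using domain-specific text encoders (RadBERT, SapBERT) or a domain-adapted CLIP encoder.
Use a joint loss that minimizes the mean squared error between the true noise and the noise predicted by the U-Net during the diffusion process.
Compare strategies: fine-tuning the U-Net, fine-tuning the text encoder, replacing or keeping the encoder, and varying the number of training steps and the learning rate.
Evaluate fidelity and diversity with FID, MS-SSIM, and domain-relevant metrics across multiple prompts and prompts of different token lengths.
Evaluate factual correctness through image-to-image retrieval, image-to-sentence retrieval, multi-label classification, and radiology report generation on synthetic data.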
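
The noise-prediction objective in the method list above can be illustrated with a minimal sketch. It assumes the standard DDPM-style forward process, in which a clean latent is blended with Gaussian noise using a cumulative schedule value `alpha_bar_t`; the review does not spell out these details, so the function names, the toy 16-element latent, and the `zero_denoiser` stand-in for the U-Net are all hypothetical and not the authors' implementation.

```python
import math
import random

def add_noise(latent, noise, alpha_bar_t):
    """Forward diffusion (assumed DDPM form): blend a clean latent with noise."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * z + noise_scale * e for z, e in zip(latent, noise)]

def noise_mse(true_noise, predicted_noise):
    """Mean squared error between the sampled noise and the predicted noise."""
    squared_errors = [(e - p) ** 2 for e, p in zip(true_noise, predicted_noise)]
    return sum(squared_errors) / len(squared_errors)

def zero_denoiser(noisy_latent, prompt_embedding):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return [0.0] * len(noisy_latent)

random.seed(0)
latent = [random.gauss(0.0, 1.0) for _ in range(16)]  # flattened toy latent
noise = [random.gauss(0.0, 1.0) for _ in range(16)]   # noise the model should recover
noisy = add_noise(latent, noise, alpha_bar_t=0.5)

# Training would adjust the denoiser (and possibly the text encoder) to lower this value.
loss = noise_mse(noise, zero_denoiser(noisy, prompt_embedding=None))
print(f"toy loss: {loss:.4f}")
```

In a real pipeline the denoiser would be the U-Net conditioned on the text-encoder output, and the loss would be averaged over random timesteps and batches; a perfect prediction gives a loss of exactly zero.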

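Experimental Results

Research Questions

RQ1: Can a pre-trained latent diffusion model be effectively adapted to generate high-fidelity chest X-rays conditioned on in-domain medical prompts?
RQ2: Which combination of fine-tuning strategies (U-Net, text encoder, domain-specific encoder) yields the highest fidelity and concept alignment for CXR?
RQ3: Do domain-specific text encoders improve generation quality when trained with a shared U-Net, and can the text encoder benefit from in-domain fine-tuning?
RQ4: Do synthetic CXRs generated by RoentGen improve downstream tasks such as image classification when used to augment real data?
RQ5: How do domain-centric evaluations (text-image alignment, radiology report generation, retrieval tasks) reflect the factual correctness of generated CXRs?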

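Key Results

RoentGen can generate visually convincing, diverse synthetic CXRs conditioned on radiology-specific language.
Fine-tuning both the U-Net and a domain-specific text encoder yields higher image fidelity and conceptual correctness than partial or single-component fine-tuning.
Replacing the CLIP text encoder with a domain-specific encoder (RadBERT or SapBERT) and jointly training the U-Net improves FID XRV and related metrics.
Fine-tuning distills in-domain knowledge into the text encoder, improving its representation of diseases such as pneumothorax by up to 25%.
Data augmentation with synthetic CXRs improves downstream classifier performance by 5% when training jointly on real and synthetic data, and by 3% when training on a larger but purely synthetic set.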
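
The training regimes compared in the augmentation result (real only, real plus synthetic, purely synthetic) can be sketched as below. This is only an illustration of how such splits might be assembled: `build_training_set`, the tuple layout, and the toy sample counts are hypothetical, and the review does not describe the authors' actual data pipeline or set sizes.

```python
import random

def build_training_set(real, synthetic, mode, seed=0):
    """Assemble a shuffled list of (image_id, label, source) samples.

    mode="real"      -> real images only (baseline)
    mode="augmented" -> real and synthetic images together
    mode="synthetic" -> synthetic images only
    """
    if mode == "real":
        samples = list(real)
    elif mode == "augmented":
        samples = list(real) + list(synthetic)
    elif mode == "synthetic":
        samples = list(synthetic)
    else:
        raise ValueError(f"unknown mode: {mode}")
    random.Random(seed).shuffle(samples)  # seeded so the split is reproducible
    return samples

# Toy placeholders; real entries would point at image files and finding labels.
real = [(f"real_{i}", i % 2, "real") for i in range(4)]
synthetic = [(f"syn_{i}", i % 2, "synthetic") for i in range(8)]

augmented = build_training_set(real, synthetic, mode="augmented")
synthetic_only = build_training_set(real, synthetic, mode="synthetic")
print(len(augmented), len(synthetic_only))
```

The same classifier would then be trained on each set and scored on a held-out set of real images, so that any difference can be attributed to the training data.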

This review was generated by AI and reviewed by a human editor.