QUICK REVIEW

[Paper Review] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Omkar Thawkar, Abdelrahman Shaker | arXiv (Cornell University) | June 13, 2023

Multimodal Machine Learning Applications | Citations: 39

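One-line summary

XrayGPT aligns a frozen medical vision encoder with a fine-tuned medical LLM to generate interactive, high-quality radiology summaries of chest X-rays, and is trained on large-scale medical report data.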

ABSTRACT

The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-text pairs with diverse tasks. However, their performance on task-specific domains, such as radiology, is still under-investigated and potentially limited due to a lack of sophistication in understanding biomedical images. On the other hand, conversational medical models have exhibited remarkable success but have mainly focused on text-based analysis. In this paper, we introduce XrayGPT, a novel conversational medical vision-language model that can analyze and answer open-ended questions about chest radiographs. Specifically, we align a medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna), using a simple linear transformation. This alignment enables our model to possess exceptional visual conversation abilities, grounded in a deep understanding of radiographs and medical domain knowledge. To enhance the performance of LLMs in the medical context, we generate ~217k interactive and high-quality summaries from free-text radiology reports. These summaries serve to enhance the performance of LLMs through the fine-tuning process. Our approach opens up new avenues of research for advancing the automated analysis of chest radiographs. Our open-source demos, models, and instruction sets are available at: https://github.com/mbzuai-oryx/XrayGPT.

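Research motivation and goals

Improve radiology-specific understanding in vision-language models.
Develop a model that generates interactive, concise summaries of chest radiographs and answers follow-up questions.
Use high-quality radiology report summaries to fine-tune the components for medical accuracy.
Open-source the models, data, and instruction sets to advance biomedical multimodal research.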

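Proposed method

Use MedCLIP as a frozen medical visual encoder to extract image features.
Apply a learnable linear transformation to map the visual features into the language space.
Fine-tune Vicuna (a large language model) on medical conversations to ground it in radiology knowledge.
Train in two stages on high-quality interactive summaries from the MIMIC-CXR and OpenI datasets.
Use a two-query prompting scheme that guides the LLM with a system prompt and a doctor prompt.
Evaluate against the baseline using ROUGE scores and GPT-based evaluation.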
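
The alignment idea in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder is a toy stand-in for MedCLIP, the dimensions are invented, and the squared-error update is a placeholder for whatever objective the actual training uses. All names (`frozen_encoder`, `LinearProjection`, `sgd_step`) are hypothetical.

```python
import random

VISION_DIM = 4  # hypothetical; real encoder features are far wider
LLM_DIM = 6     # hypothetical width of the language model's embedding space


def frozen_encoder(image):
    """Toy stand-in for a frozen visual encoder (no trainable state).

    Returns one feature per image row: the mean pixel value of that row.
    """
    return [sum(row) / len(row) for row in image]


class LinearProjection:
    """The only trainable part of this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        # Map visual features into the (assumed) language embedding space.
        return [sum(w * x for w, x in zip(row, features)) + b
                for row, b in zip(self.weight, self.bias)]

    def sgd_step(self, features, target, lr=0.1):
        """One gradient step on 0.5 * ||W x + b - target||^2.

        The squared-error target is an assumption made for illustration.
        Returns the loss measured before the update.
        """
        errors = [p - t for p, t in zip(self(features), target)]
        for i, err in enumerate(errors):
            for j, x in enumerate(features):
                self.weight[i][j] -= lr * err * x
            self.bias[i] -= lr * err
        return 0.5 * sum(e * e for e in errors)


# Usage: the encoder is never updated; only the projection changes.
image = [[0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [0.2, 0.4]]
features = frozen_encoder(image)
projection = LinearProjection(VISION_DIM, LLM_DIM)
target = [1.0] * LLM_DIM
loss_before = projection.sgd_step(features, target)
loss_after = projection.sgd_step(features, target)
```

Keeping the encoder frozen means the projection is the only component that has to learn the mapping between the two spaces, which is why a single linear layer can be enough for the alignment step.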

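Experimental results

Research questions

RQ1 Can aligning a frozen medical visual encoder with a fine-tuned medical LLM produce accurate, interactive chest X-ray summaries?
RQ2 Do high-quality, task-specific summaries of radiology reports improve LLM performance on medical imaging tasks?
RQ3 What is the incremental effect of the MedCLIP and Vicuna components on radiology-specific summarization performance?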

Key results

XrayGPT shows a substantial improvement over the baseline on the MIMIC-CXR test set, as measured by ROUGE scores.
Adding the MedCLIP, MedVicuna, and RadVicuna components raises the ROUGE scores progressively (R-1: 0.1313 to 0.3213; R-2: 0.0221 to 0.0912; R-L: 0.0879 to 0.1997).
In the LLM-based evaluation (ChatGPT), the judge preferred XrayGPT over the baseline (82% vs. 6%).
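The model achieves an absolute gain of 19% in R-1 over the state-of-the-art baseline of Zhu et al. (2023) on the MIMIC-CXR test set.
Stage 1 uses 213,514 image-text pairs; Stage 2 uses 3k OpenI pairs to refine radiology-specific summarization.
Qualitative results demonstrate radiologist-like conversational ability and detailed findings.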
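
To make the ROUGE figures above concrete, here is a simplified sketch of an n-gram overlap score together with the arithmetic behind the reported gains. It assumes whitespace tokenization, no stemming, and the F1 variant of the metric; the review does not say which variant or tooling was used, so treat `rouge_n` as illustrative rather than as the evaluation code. The two example sentences are invented.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap between two texts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # counts clipped to the minimum
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical summary pair, three of four unigrams shared.
generated = "no acute cardiopulmonary process"
reference = "no acute cardiopulmonary abnormality"
r1 = rouge_n(generated, reference, n=1)  # 0.75
r2 = rouge_n(generated, reference, n=2)  # 2/3

# Reported scores from the list above: (baseline, full model).
REPORTED = {
    "R-1": (0.1313, 0.3213),
    "R-2": (0.0221, 0.0912),
    "R-L": (0.0879, 0.1997),
}
gains = {name: round(after - before, 4)
         for name, (before, after) in REPORTED.items()}
# gains["R-1"] is 0.19, which matches the "absolute 19%" figure quoted above.
```

The R-1 difference of 0.19 is an absolute gain in score, not a relative one; relative to the 0.1313 baseline the score is roughly 2.4 times higher.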

This review was generated by AI and reviewed by a human editor.