QUICK REVIEW

[Paper Review] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu, Dong Wei|arXiv (Cornell University)|2026. 03. 05.

Multimodal Machine Learning Applications|Citations: 0

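One-Line Summary

Introduces a two-stage CTRG framework that uses structure-specific visual queries and structure-level image-text contrastive learning to align CT image patches with structured report content, improving cross-modal representations and report generation through soft targets and a diversity-enhanced negative queue.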

ABSTRACT

Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.

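Motivation and Objectives

Leverage high-level anatomical structure knowledge to learn fine-grained CT image representations for report generation.
Develop structure-wise image-text contrastive learning to align CT structures with report content.
Mitigate false negatives in cross-modal learning through soft pseudo targets and a diversity-enhanced negative queue.
Build a two-stage training framework in which structure learning informs the subsequent report generation stage.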

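Proposed Method

Extract image patch embeddings with CT-ViT.
Learn N_s structure-specific visual queries that observe the structures and produce observation tokens S^v via cross-attention.
Extract structure-specific text tokens S^t from a pretrained text encoder, using keyword-based structure labeling.
Apply the structure-observation-driven image-text contrastive loss L_so-itc between S^v and S^t, using a dynamic negative text queue.
Introduce soft pseudo targets derived from text-text similarity to form a KL-divergence loss L_so-kl that mitigates false negatives.
Combine the losses into L_so-pre with a balancing parameter alpha (set to 0.5).
In the second stage, freeze the visual encoder, the queries, and the patch selector, and train a text decoder that takes S^v and the selected T^s (S^t?) as input (10 patches per structure, K=10).
Experiment with a BERT decoder and with LLaMA2-7B using LoRA, both trained with a next-token prediction objective for report generation.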
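
The stage-one objective described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It handles a single anatomical structure and only the image-to-text direction, omits the negative queue, and assumes an ALBEF-style combination `(1 - alpha) * L_so-itc + alpha * L_so-kl`, which the review does not spell out. The temperature `tau` and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def stage_one_loss(vis, txt, alpha=0.5, tau=0.07):
    """vis[i] / txt[i]: observation token and text feature of sample i
    for ONE structure. Returns (combined, itc, kl).
    The combination rule and tau are assumptions, not from the source."""
    vis = [normalize(v) for v in vis]
    txt = [normalize(t) for t in txt]
    n = len(vis)
    itc, kl = 0.0, 0.0
    for i in range(n):
        # predicted image-to-text distribution over the texts in the batch
        p = softmax([dot(vis[i], t) / tau for t in txt])
        # hard target: only the paired report counts as positive
        itc += -math.log(p[i])
        # soft pseudo target: text-text similarity of the paired report
        q = softmax([dot(txt[i], t) / tau for t in txt])
        # KL(q || p) gives partial credit to semantically similar reports
        kl += sum(qj * (math.log(qj) - math.log(pj)) for qj, pj in zip(q, p))
    itc /= n
    kl /= n
    return (1 - alpha) * itc + alpha * kl, itc, kl

# Toy batch of three samples with 2-D features (made-up numbers).
vis = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.2]]
txt = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]
combined, itc, kl = stage_one_loss(vis, txt)
print(combined, itc, kl)
```

In the toy batch, the first and third reports are nearly identical, so the soft target `q` spreads probability over both. That is the false-negative case the KL term is meant to soften, whereas the hard contrastive term alone would push the non-paired but equivalent report away.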

Experimental Results

Research Questions

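RQ1: Can structure-level, rather than word-level, cross-modal alignment improve CTRG performance?
RQ2: Do soft pseudo targets and the diversity-enhanced negative queue improve contrastive learning for CT-report alignment?
RQ3: Does freezing the structure-aware visual modules during the report generation stage maintain or improve decoding performance?
RQ4: How well do the learned CT representations transfer across CTRG domains and datasets?
RQ5: How does the subset of image patches selected per structure affect performance and efficiency?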

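Key Findings

Outperforms state-of-the-art CTRG methods on CE metrics on two public datasets (CT-RATE and CTRG-Chest-548K).
Structure-level cross-modal learning with L_so-itc and L_so-kl improves CE metrics over the baseline.
The diversity-enhanced negative queue and patch selection (K=10 patches per structure) improve efficiency and accuracy while reducing the visual token load (110 vs. 4096 visual tokens).
Transferring CT representations learned on CT-RATE to CTRG-Chest-548K yields substantial CE gains, demonstrating cross-domain generalization.
With careful training, LLaMA2-7B also achieves strong performance, but its NLG metrics can lag behind BERT in some settings, possibly because of the dataset size.
The method improves over CT-CLIP on report-to-volume retrieval, confirming finer-grained structure-text alignment.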
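
The patch-selection finding above can be made concrete with a short sketch. It assumes that patches are ranked by the cross-attention weight each frozen structure query assigns to them, which the review does not state, and that each structure contributes one observation token plus K selected patch embeddings. Under that reading, a hypothetical N_s = 10 structures would give the reported 110 visual tokens, but N_s is not given in the review. The function names are illustrative.

```python
def select_patches(attn, k=10):
    """attn[s][p]: attention weight of structure query s on patch p
    (assumed selection criterion). Returns, for each structure, the
    indices of its k highest-weighted patches in ascending order."""
    selected = []
    for weights in attn:
        order = sorted(range(len(weights)),
                       key=lambda p: weights[p], reverse=True)
        selected.append(sorted(order[:k]))
    return selected

def visual_token_count(num_structures, k=10):
    # one observation token plus k selected patch embeddings per structure
    return num_structures * (1 + k)

# Toy example: 2 structure queries over 4 patches, keeping 2 patches each.
attn = [[0.1, 0.5, 0.2, 0.9],
        [0.7, 0.1, 0.6, 0.3]]
print(select_patches(attn, k=2))
print(visual_token_count(10, k=10))
```

Selecting a fixed number of patches per structure keeps the decoder input length independent of the CT volume size, which is where the reduction from 4096 patch tokens comes from.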


This review was generated by AI and reviewed by a human editor.