QUICK REVIEW

[Paper Review] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu, Dong Wei|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Cited by 0

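One-Sentence Summary

Introduces a two-stage CTRG framework that uses structure-specific visual queries and structure-wise image-text contrastive learning to align CT image patches with structured report content, and improves cross-modal representation and report generation through soft targets and a diversity-enhanced negative queue.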

ABSTRACT

Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation. Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.

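Research Motivation and Goals

Leverage high-level anatomical structure knowledge to learn fine-grained CT image representations for report generation.
Develop structure-wise image-text contrastive learning to align CT structures with report content.
Mitigate false negatives in cross-modal learning through soft pseudo targets and a diversity-enhanced negative queue.
Build a two-stage training framework in which structure learning informs the subsequent report generation stage.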

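Proposed Method

Extract image patch embeddings with CT-ViT.
Learn Ns structure-specific visual queries that observe structures and produce observation tokens S^v through cross-attention.
Extract structure-specific text tokens S^t with a pretrained text encoder, using keyword-based structure annotations.
Apply the structure-observation-driven image-text contrastive loss L_so-itc, with a dynamic negative text queue, to S^v and S^t.
Introduce soft pseudo targets from text-text similarity, forming a KL-divergence loss L_so-kl to mitigate false negatives.
Combine the losses into L_so-pre with a balancing coefficient alpha (set to 0.5).
In the second stage, freeze the visual encoder, queries, and patch selector, and train a text decoder whose inputs are S^v and the selected patch embeddings T^s (K=10 patches per structure).
Experiment with a BERT decoder and a LoRA-tuned LLaMA2-7B; train report generation with an autoregressive next-token prediction objective.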
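
The first-stage objective listed above can be sketched numerically. The following is a minimal NumPy illustration for a single structure, not the authors' implementation. The function name `so_pretrain_loss`, the temperature `tau`, the image-to-text-only direction, and the convex combination `(1 - alpha) * L_so-itc + alpha * L_so-kl` are all assumptions: the summary only says that the two losses are balanced by alpha = 0.5, without giving the exact form.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def so_pretrain_loss(s_v, s_t, queue, tau=0.07, alpha=0.5):
    """Sketch of a structure-wise contrastive loss for ONE structure.

    s_v:   (B, D) observation tokens from the visual queries
    s_t:   (B, D) structure-specific text features (row i pairs with s_v[i])
    queue: (M, D) negative text features (assumed to come from a queue)
    tau and the way alpha combines the terms are assumptions.
    """
    s_v, s_t, queue = (l2_normalize(a) for a in (s_v, s_t, queue))
    candidates = np.concatenate([s_t, queue], axis=0)      # (B + M, D)
    batch = s_v.shape[0]

    # Image-to-text prediction over in-batch texts plus queued negatives.
    log_p = log_softmax(s_v @ candidates.T / tau)          # (B, B + M)
    # Hard one-hot target: the paired text sits at column i for row i.
    l_itc = -log_p[np.arange(batch), np.arange(batch)].mean()

    # Soft pseudo targets from text-text similarity, so that semantically
    # identical non-paired texts are not treated as pure negatives.
    log_q = log_softmax(s_t @ candidates.T / tau)
    l_kl = (np.exp(log_q) * (log_q - log_p)).sum(axis=1).mean()  # KL(q || p)

    return (1 - alpha) * l_itc + alpha * l_kl, l_itc, l_kl

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
negatives = rng.normal(size=(6, 16))
total, l_itc, l_kl = so_pretrain_loss(rng.normal(size=(4, 16)), text, negatives)
```

In this sketch the KL term vanishes when the observation tokens coincide with their text features, because the predicted and soft-target distributions then match. A full version would presumably average the loss over all Ns structures and refresh the queue during training.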

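Experimental Results

Research Questions

RQ1: Does structure-level cross-modal alignment (rather than word-level alignment) improve CTRG performance?
RQ2: Do soft pseudo targets and the diversity-enhanced negative queue improve contrastive learning for CT-report alignment?
RQ3: Does freezing the structure-informed visual modules during the report generation stage maintain or improve decoding performance?
RQ4: How well do the learned CT representations transfer across CTRG domains and datasets?
RQ5: How much does selecting a subset of image patches per structure affect performance and efficiency?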

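Key Findings

On two public datasets (CT-RATE and CTRG-Chest-548K), the method surpasses existing CTRG methods on CE metrics.
Structure-level cross-modal learning with L_so-itc and L_so-kl improves CE metrics over the baseline.
The diversity-enhanced negative queue and patch selection (10 patches per structure) improve efficiency and accuracy, reducing the number of tokens to process from 4096 to 110.
CT representations learned on CT-RATE yield significant CE gains when transferred to CTRG-Chest-548K, supporting cross-domain generalization.
LLaMA2-7B also achieves strong performance with careful training; however, in some settings its NLG metrics may trail those of BERT, possibly because of data scale.
Report-to-volume retrieval with the proposed method outperforms CT-CLIP, confirming finer-grained structure-text alignment.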
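
The patch-selection step behind the token reduction reported above can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: ranking patches by each query's cross-attention weight is one plausible selection rule (the summary does not specify the mechanism), the helper names `select_patches` and `decoder_inputs` are hypothetical, and `num_structures = 10` is an illustrative value, since the summary does not state Ns. With one observation token plus K = 10 patches per structure, that value happens to give 10 * (10 + 1) = 110 decoder tokens, which is consistent with the reported figure.

```python
import numpy as np

def select_patches(attn, patch_emb, k=10):
    """Keep the k highest-attention patches per structure (assumed rule).

    attn:      (Ns, Np) attention weight of each structure query on each patch
    patch_emb: (Np, D)  patch embeddings from the frozen visual encoder
    Returns indices of shape (Ns, k) and embeddings of shape (Ns, k, D).
    """
    top_idx = np.argsort(-attn, axis=1)[:, :k]
    return top_idx, patch_emb[top_idx]

def decoder_inputs(s_v, selected):
    """Prepend each observation token to its selected patches and flatten.

    s_v:      (Ns, D) observation tokens
    selected: (Ns, k, D) selected patch embeddings
    Returns an array of shape (Ns * (k + 1), D).
    """
    per_structure = np.concatenate([s_v[:, None, :], selected], axis=1)
    return per_structure.reshape(-1, s_v.shape[-1])

# Hypothetical sizes: 10 structures, 4096 patches, 8-dimensional embeddings.
num_structures, num_patches, dim = 10, 4096, 8
rng = np.random.default_rng(1)
attn = rng.random((num_structures, num_patches))
patch_emb = rng.normal(size=(num_patches, dim))
s_v = rng.normal(size=(num_structures, dim))

top_idx, selected = select_patches(attn, patch_emb, k=10)
tokens = decoder_inputs(s_v, selected)
print(tokens.shape)
```

Whatever the actual selection rule, the decoder sees a sequence whose length scales with Ns * (K + 1) instead of the full patch count, which is where the memory saving comes from.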

This review was generated by AI and reviewed by a human editor.