QUICK REVIEW

[Paper Review] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

Duy Vu Minh Nguyen, Chinh Thanh Truong|arXiv (Cornell University)|Mar 16, 2026

Multimodal Machine Learning Applications | Cited by 0

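One-Sentence Summary

ViX-Ray introduces a Vietnamese chest X-ray dataset of 5,400 samples with expert-written findings and impressions, benchmarks open-source VLMs against GPT-4V and Gemini, and analyzes linguistic patterns in Vietnamese radiology reports.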

ABSTRACT

Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.

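Research Motivation and Objectives

Establish the need for a Vietnamese multimodal chest X-ray dataset with detailed expert annotations for clinical use.
Provide a new dataset (ViX-Ray) containing images, patient metadata, findings, and impressions written by Vietnamese radiologists.
Benchmark a range of open-source Vietnamese and multilingual VLMs against proprietary models on findings and impression generation.
Analyze linguistic patterns in Vietnamese radiology reports (body parts and diagnoses mentioned).
Evaluate model capability in the Vietnamese medical context under three-stage prompting and fine-tuning.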

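Proposed Method

Curate ViX-Ray, comprising 5,400 chest X-ray images from a Vietnamese hospital, each paired with expert-written findings and impressions.
Apply syntactic analysis to the findings and impressions to extract body-part mentions and diagnoses.
Fine-tune a set of open-source Vietnamese and multilingual VLMs (7B parameters or fewer) on ViX-Ray and compare them against GPT-4V and Gemini.
Use a three-stage evaluation pipeline: Stage 1 is findings generation, Stage 2 is impression generation, and Stage 3 is multi-turn generation (findings first, then impression).
Evaluate outputs with lexical metrics (ROUGE, BLEU) combined with precision/recall-based factual evaluation, using GPT-4o to decompose outputs into atomic facts.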
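
The fact-based precision/recall step in the evaluation above can be illustrated with a minimal sketch. It assumes the atomic facts have already been extracted (the paper uses GPT-4o for the decomposition) and, as a simplification that is not taken from the paper, matches facts by normalized exact string equality. The names `FactScore`, `normalize`, and `score_facts` are hypothetical, and the Vietnamese example facts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FactScore:
    precision: float
    recall: float
    f1: float


def normalize(fact: str) -> str:
    # Lowercase and collapse whitespace so trivially different strings match.
    return " ".join(fact.lower().split())


def score_facts(predicted: list[str], reference: list[str]) -> FactScore:
    # Assumption: exact match after normalization stands in for whatever
    # fact-matching procedure the paper actually uses.
    pred = {normalize(f) for f in predicted}
    ref = {normalize(f) for f in reference}
    matched = pred & ref

    # Precision: share of generated facts supported by the reference.
    precision = len(matched) / len(pred) if pred else 0.0
    # Recall: share of reference facts recovered by the model.
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return FactScore(precision, recall, f1)


if __name__ == "__main__":
    # Illustrative facts only; not drawn from ViX-Ray.
    predicted = ["Tim không to", "Tràn dịch màng phổi trái", "Gãy xương sườn"]
    reference = ["tim không to", "Tràn dịch màng phổi  trái"]
    print(score_facts(predicted, reference))
```

In this example two of the three generated facts are supported by the reference, so precision is 2/3 while recall is 1.0. The unsupported third fact is the kind of hallucination that lowers precision without affecting recall.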

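Experimental Results

Research Questions

RQ1: When trained on ViX-Ray, can Vietnamese and multilingual VLMs generate clinically relevant chest X-ray findings?
RQ2: In the Vietnamese medical setting, how accurate are model-generated impressions compared with expert diagnoses?
RQ3: Does multi-turn (findings, then impression) fine-tuning improve the factual accuracy and lexical quality of clinical outputs?
RQ4: How do open-source Vietnamese VLMs compare with proprietary models (GPT-4V, Gemini) on Vietnamese radiology tasks?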

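Key Findings

Qwen2.5-VL-7B achieves the best overall performance across all stages of the evaluation pipeline.
Multilingual models vary in performance; Qwen2.5-VL-7B tends to outperform the others, while InternVL2.5 performs worse.
In multi-turn generation, larger models such as Qwen2.5-VL-7B and MiniCPM-V improve in both lexical quality and factual accuracy.
GPT-4V and Gemini produce outputs with limited precision that are prone to hallucination, and they sometimes refuse to generate content for clinical tasks.
Results on ViX-Ray reveal significant challenges in precision and the need for population-specific medical VLM benchmarks.
Stage-wise and multi-turn fine-tuning improve the clinical utility of open-source Vietnamese VLMs relative to the baseline.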

This review was generated by AI and reviewed by human editors.