QUICK REVIEW

[Paper Review] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

Duy Vu Minh Nguyen, Chinh Thanh Truong|arXiv (Cornell University)|2026. 03. 16.

Multimodal Machine Learning Applications | Citations: 0

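One-Line Summary

ViX-Ray introduces a Vietnamese chest X-ray dataset of 5,400 samples with expert-written findings and impressions, benchmarks open-source VLMs against GPT-4V and Gemini, and analyzes linguistic patterns in Vietnamese radiology reports.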

ABSTRACT

Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.

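Motivation and Objectives

Address the need for a Vietnamese multimodal chest X-ray dataset with detailed expert annotations suitable for clinical use.
Provide a new dataset (ViX-Ray) containing images, patient metadata, and findings and impressions written by Vietnamese radiologists.
Benchmark open-source Vietnamese and multilingual VLMs against proprietary models on findings generation and impression generation.
Analyze linguistic patterns (body parts and diagnoses) in Vietnamese radiology reports.
Evaluate three-stage prompting and fine-tuning to assess model capability in the Vietnamese medical context.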

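Proposed Method

Build ViX-Ray by collecting 5,400 chest X-ray images from a Vietnamese hospital and annotating them with expert-written findings and impressions.
Run a linguistic analysis that uses syntactic parsing of the findings and impressions to extract body-part mentions and diagnoses.
Fine-tune a set of open-source Vietnamese and multilingual VLMs no larger than 7B parameters on ViX-Ray, and evaluate them against GPT-4V and Gemini.
Use a three-stage evaluation pipeline: Stage 1 findings generation, Stage 2 impression generation, Stage 3 multi-turn generation (findings → impression).
Evaluate outputs with lexical metrics (ROUGE, BLEU) and with a fact-based evaluation in which GPT-4o decomposes the text into atomic facts and precision/recall are computed over those facts.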
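The fact-based precision/recall step can be illustrated with a minimal sketch. It assumes the atomic facts have already been extracted (the paper uses GPT-4o for this) and, purely for illustration, matches facts by normalized exact string equality; the function name `fact_precision_recall` and the sample facts are hypothetical and are not taken from the paper.

```python
def fact_precision_recall(generated_facts, reference_facts):
    """Compute precision and recall over two lists of atomic facts.

    Assumption: facts match when they are equal after lowercasing and
    whitespace normalization. The paper relies on GPT-4o for the
    decomposition, so this exact-match rule is only a stand-in.
    """
    def normalize(fact):
        # Collapse runs of whitespace and ignore letter case.
        return " ".join(fact.lower().split())

    gen = {normalize(f) for f in generated_facts if f.strip()}
    ref = {normalize(f) for f in reference_facts if f.strip()}
    matched = gen & ref

    # Precision: share of generated facts supported by the reference.
    precision = len(matched) / len(gen) if gen else 0.0
    # Recall: share of reference facts recovered by the model.
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example facts (not from the dataset).
generated = ["Heart size is normal", "left pleural effusion", "rib fracture"]
reference = [
    "heart size is normal",
    "Left pleural  effusion",
    "right apical fibrosis",
    "no pneumothorax",
]
precision, recall = fact_precision_recall(generated, reference)
```

Under this rule, an unsupported generated fact (a hallucination) lowers precision, while a missed reference fact lowers recall, which is how the two failure modes reported for impression generation can be told apart.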

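Experimental Results

Research Questions

RQ1: How well can Vietnamese and multilingual VLMs generate clinically relevant findings from chest X-ray images when trained on ViX-Ray?
RQ2: In the Vietnamese medical context, how accurate are model-generated impressions compared with expert diagnoses?
RQ3: Does multi-turn fine-tuning (findings followed by impression) improve the factual accuracy and lexical quality of clinical outputs?
RQ4: How do open-source Vietnamese VLMs differ from proprietary models (GPT-4V, Gemini) on Vietnamese radiology tasks?

Key Results

Qwen2.5-VL-7B achieves the best overall performance at every stage of the evaluation pipeline.
Multilingual models vary in performance: Qwen2.5-VL-7B often outperforms the others, while InternVL2.5 underperforms.
In multi-turn generation, larger models such as Qwen2.5-VL-7B and MiniCPM-V improve both lexical quality and factual accuracy.
GPT-4V and Gemini show limited precision and high hallucination rates, and they sometimes refuse to generate content for clinical tasks.
Results on ViX-Ray reveal how hard accurate generation remains and point to the need for population-specific benchmarking of medical VLMs.
Stage-wise and multi-turn fine-tuning improve the clinical usefulness of open-source Vietnamese VLMs relative to their baselines.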


This review was generated by AI and reviewed by a human editor.