QUICK REVIEW

[Paper Review] MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

Mengmeng Zhang, Xiaoping Wu|arXiv (Cornell University)|2026. 01. 11.

Multimodal Machine Learning Applications | Citations: 0

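One-line summary

MedGround introduces a mask-guided synthesis and verification pipeline that converts segmentation masks into image-text-box triplets, and uses it to build the MedGround-35K dataset, which improves medical referring grounding and the generalization of vision-language models.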

ABSTRACT

Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.

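Motivation and objectives

Address the grounding gap in medical VLMs, where fluent language is not backed by precise visual localization.
Propose a scalable pipeline that turns expert segmentation masks into high-quality image-text-box grounding triplets.
Enable training that aligns clinical morphology and location with explicit visual evidence.
Evaluate how MedGround-35K improves referring grounding, semantic disambiguation, and cross-dataset transfer.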

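Proposed method

Convert segmentation masks from eight public datasets into tight bounding boxes that serve as grounding anchors.
Compute mask-derived geometry, spatial cues, and metadata for use in prompt construction.
Synthesize referring queries with a vision-language model conditioned on anatomy, modality, and geometry.
Apply multi-stage verification (format/schema checks, geometry and medical-prior rules, and VLM-based visual judging) to filter out ambiguous samples.
Conduct a human audit of the test set to estimate reliability and report the audit results.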
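
The first two steps above (mask to tight box, then mask-derived spatial cues) can be illustrated with a minimal sketch. This is not the authors' code: the function names `mask_to_box` and `spatial_cues`, the list-of-lists mask representation, and the three-by-three position grid are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Mask = List[List[int]]  # rows of 0/1 values; a stand-in for a real mask array
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in [0, 1]


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tight bounding box of the foreground, normalized to [0, 1]."""
    height, width = len(mask), len(mask[0])
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows:
        return None  # empty mask: nothing to ground
    # +1 so the box encloses the last foreground pixel.
    return (cols[0] / width, rows[0] / height,
            (cols[-1] + 1) / width, (rows[-1] + 1) / height)


def spatial_cues(box: Box) -> Dict[str, object]:
    """Derive coarse image-space cues (hypothetical choices) from a box."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    horizontal = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    vertical = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    return {
        "position": f"{vertical}-{horizontal}",   # image space, not patient space
        "area_fraction": (x1 - x0) * (y1 - y0),   # share of the image covered
        "aspect_ratio": (x1 - x0) / (y1 - y0),    # width over height
    }


mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
box = mask_to_box(mask)
cues = spatial_cues(box)
```

Cues like these could be inserted into a prompt so that a synthesized query mentions location and shape consistent with the mask. Note that "left" here means the left of the image, which in many radiological views is the patient's right.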

Figure 1: Motivation of MedGround. (a) Models trained on image-text pairs fail to "speak with substance" due to lack of grounding. (b) Segmentation-only training fails to achieve semantic understanding. (c) MedGround (Image-text-box triplets) activates the full potential of medical VLMs by bridging

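Experimental results

Research questions

RQ1: Can MedGround-35K improve fine-grained medical referring grounding in VLMs across multiple modalities?
RQ2: Does introducing clinically grounded, morphology- and location-aware language improve semantic disambiguation?
RQ3: To what extent does MedGround training transfer zero-shot to unseen medical grounding tasks and datasets?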

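Key findings

Fine-tuning on MedGround-35K yields consistent gains in medical referring grounding for both general-purpose and specialized VLMs.
Fine-grained clinical semantics enable better morphology- and location-aware grounding in multi-object images than coarse label supervision.
MedGround-35K improves semantic alignment, making co-occurring findings easier to distinguish.
Models trained on MedGround-35K show improved zero-shot generalization to unseen datasets.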
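
This review does not state which metric the grounding results are measured with. As an assumption, the sketch below shows a measure commonly used for box-level referring grounding: intersection over union (IoU) between predicted and ground-truth boxes, and the fraction of samples whose IoU reaches a threshold. The names `box_iou` and `accuracy_at_iou` and the 0.5 default are illustrative, not taken from the paper.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap width and height, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], targets: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if not targets:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


predictions = [(0.0, 0.0, 0.5, 1.0), (0.0, 0.0, 0.1, 0.1)]
ground_truth = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)]
score = accuracy_at_iou(predictions, ground_truth)
```

In this toy example the first prediction covers half of its target and counts as a hit at the 0.5 threshold, while the second is far too small and counts as a miss.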

Figure 2: MedGround pipeline. (A) Convert segmentation masks into normalized ground-truth bounding box lists. (B) Use dataset-aware, mask-guided prompts to synthesize medically meaningful referring queries and select target box(es) as answers. (C) Perform multi-stage verification and cleaning (forma
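
The rule-based part of verification stage (C) can be sketched as a format check followed by a geometry check. This is a minimal illustration under stated assumptions: the sample schema (`query`, `boxes`), the function names, and the left/right consistency rule are hypothetical, and the image-based visual judging stage is not shown.

```python
import re
from typing import Any, Dict, List

Sample = Dict[str, Any]  # hypothetical schema: {"query": str, "boxes": [[x0, y0, x1, y1], ...]}


def check_format(sample: Sample) -> List[str]:
    """Schema check: non-empty query and well-formed normalized boxes."""
    errors = []
    query = sample.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append("query must be a non-empty string")
    boxes = sample.get("boxes")
    if not isinstance(boxes, list) or not boxes:
        errors.append("boxes must be a non-empty list")
        return errors
    for box in boxes:
        ok = (isinstance(box, (list, tuple)) and len(box) == 4
              and all(isinstance(v, (int, float)) for v in box)
              and 0 <= box[0] < box[2] <= 1 and 0 <= box[1] < box[3] <= 1)
        if not ok:
            errors.append(f"invalid normalized box: {box!r}")
    return errors


def check_geometry(sample: Sample) -> List[str]:
    """Illustrative rule: an image-space side word must match the box center."""
    errors = []
    words = set(re.findall(r"[a-z]+", sample["query"].lower()))
    for x0, _, x1, _ in sample["boxes"]:
        center_x = (x0 + x1) / 2
        if "left" in words and center_x > 0.5:
            errors.append("query says left but box lies in the right half")
        if "right" in words and center_x < 0.5:
            errors.append("query says right but box lies in the left half")
    return errors


def verify(sample: Sample) -> List[str]:
    """Run the format check first; run the geometry check only if it passes."""
    errors = check_format(sample)
    return errors if errors else check_geometry(sample)


good = {"query": "oval lesion on the left side of the image",
        "boxes": [[0.1, 0.2, 0.3, 0.4]]}
wrong_side = {"query": "mass on the right", "boxes": [[0.1, 0.2, 0.3, 0.4]]}
malformed = {"query": "", "boxes": [[0.5, 0.5, 0.4, 0.9]]}
```

A sample is kept only when `verify` returns an empty list. A real medical-prior rule would need to distinguish image-left from patient-left, which depends on modality and view, so the side rule here is only a placeholder for that kind of check.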

This review was generated by AI and reviewed by a human editor.