QUICK REVIEW

[Paper Review] MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

Mengmeng Zhang, Xiaoping Wu|arXiv (Cornell University)|Jan 11, 2026

Multimodal Machine Learning Applications | Citations: 0

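One-line summary

MedGround introduces a mask-guided synthesis and verification pipeline that converts segmentation masks into image-text-box triplets, and uses it to build the MedGround-35K dataset, which improves medical referring grounding and the generalization of vision-language models.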

ABSTRACT

Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.

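Motivation and objectives

Motivate the work by the cognitive-perceptual grounding gap in medical VLMs between fluent language and accurate visual localization.
Propose a scalable pipeline that converts expert segmentation masks into high-quality image-text-box grounding triplets.
Enable training that aligns clinical morphology and location with explicit visual evidence.
Evaluate how MedGround-35K improves referring grounding, semantic disambiguation, and cross-dataset transfer.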

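Proposed method

Convert segmentation masks from eight public datasets into tight bounding boxes that serve as grounding anchors.
Compute mask-derived geometry, spatial cues, and metadata to guide prompt construction.
Synthesize referring queries with a vision-language model, conditioned on anatomy, modality, and geometry.
Apply multi-stage verification (format/schema checks, geometric and medical priors, VLM-based grounding judgment) to filter out ambiguous samples.
Conduct a human audit of the test set to estimate reliability, and report the audit results.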
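
The first two steps and the geometric part of the verification step can be illustrated with a minimal sketch. This is not the authors' code: the function names, the thirds-based position labels, and the `min_area` threshold are assumptions made for illustration, and the review does not state which geometric rules MedGround actually applies.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max), normalized to [0, 1] by image width/height.
Box = Tuple[float, float, float, float]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight normalized box around all nonzero pixels of a binary mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None  # empty mask: no localization target
    # +1 so the box covers the far edge of the last foreground pixel.
    return (cols[0] / width, rows[0] / height,
            (cols[-1] + 1) / width, (rows[-1] + 1) / height)


def spatial_cues(box: Box) -> dict:
    """Derive simple cues that could be fed into a prompt (hypothetical scheme)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # "left"/"right" refer to image coordinates, not anatomical laterality.
    horizontal = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    vertical = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    return {
        "area_fraction": (x1 - x0) * (y1 - y0),
        "aspect_ratio": (x1 - x0) / (y1 - y0),
        "position": f"{vertical}-{horizontal}",
    }


def passes_geometry_check(box: Box, min_area: float = 0.001) -> bool:
    """Reject boxes that are out of bounds, degenerate, or very small (assumed rule)."""
    x0, y0, x1, y1 = box
    in_bounds = 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1
    return in_bounds and (x1 - x0) * (y1 - y0) >= min_area


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
example_box = mask_to_box(example_mask)
```

In a real pipeline the mask would come from an image array and one box would be extracted per connected component or per labeled structure; the sketch treats the whole mask as a single target to keep the idea visible.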

Figure 1: Motivation of MedGround. (a) Models trained on image-text pairs fail to "speak with substance" due to lack of grounding. (b) Segmentation-only training fails to achieve semantic understanding. (c) MedGround (Image-text-box triplets) activates the full potential of medical VLMs by bridging

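Experimental results

Research questions

RQ1: Can MedGround-35K improve VLMs on fine-grained medical referring grounding across multiple modalities?
RQ2: Does incorporating clinically grounded language that is aware of morphology and location improve semantic disambiguation?
RQ3: How well does MedGround training transfer, in zero-shot settings, to unseen medical grounding tasks and datasets?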

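Key findings

Fine-tuning on MedGround-35K yields consistent gains in medical referring grounding for both general-purpose base VLMs and specialized VLMs.
Compared with coarse label supervision, fine-grained clinical semantics improve morphology- and location-aware localization in multi-target images.
MedGround-35K improves semantic alignment and better disambiguates co-occurring findings.
Models trained on MedGround-35K show improved zero-shot performance on unseen datasets.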

Figure 2: MedGround pipeline. (A) Convert segmentation masks into normalized ground-truth bounding box lists. (B) Use dataset-aware, mask-guided prompts to synthesize medically meaningful referring queries and select target box(es) as answers. (C) Perform multi-stage verification and cleaning (forma


This review was generated by AI and checked by a human editor.