QUICK REVIEW

[Paper Review] Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Tongrui Zhang, Chenhui Wang|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Cited by 0

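One-Sentence Summary

SEER stabilizes free-text promptable 3D medical image segmentation through a grounded reasoning framework, the dynamic skill-evolving SEER-Loop, and the SEER-Trace dataset, improving robustness to linguistic variation and reducing performance fluctuation.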

ABSTRACT

Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations. We propose Skill-Evolving grounded Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions. Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%.

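Motivation and Objectives

Address the instability that linguistic variation in free-text prompts causes in 3D medical image segmentation.
Curate SEER-Trace, a dataset that pairs clinical requests with image-grounded, skill-tagged reasoning traces.
Formalize grounded vision-language reasoning as executable skills aligned with anatomical evidence.
Distill high-reward reasoning into reusable skills via SEER-Loop to enable continual self-improvement.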

Proposed Method

Build SEER-Trace by combining standard 3D segmentation benchmarks with diverse clinical requests and skill-tagged traces.
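Implement a vision-language reasoning chain that produces evidence e, reasoning r, and an executable answer a, which the frozen segmentation system S uses to produce the prediction Ĝ.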
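Optimize a stability-aware objective over clinically equivalent paraphrases to improve both accuracy and consistency: $J(\theta) = \mathbb{E}\big[\,\mathbb{E}_{q' \sim \Omega(q)}\,\mathrm{Dice}\big(S(V, a_\theta(V, q')), G\big) - \lambda\,\mathrm{Var}\big(\mathrm{Dice}(\ldots)\big)\big]$, where Ω(q) is the set of clinically equivalent paraphrases of the request q and λ weights the variance penalty.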
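Pretrain with supervised fine-tuning to align the VLM with SEER-Trace operations, then apply Group Relative Policy Optimization (GRPO) with a composite reward.
Introduce SEER-Loop, backed by SEER-Bank, to store, retrieve, and distill high-reward reasoning artifacts, so that skills keep evolving and robustness to unseen linguistic variation improves.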
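
The stability-aware objective above can be illustrated for a single case as mean Dice across paraphrases minus a weighted variance penalty. This is a minimal sketch, not the authors' implementation: the function names `dice` and `stability_objective`, the flattened-list mask format, the use of population variance, and the default `lam=1.0` are all assumptions made for illustration.

```python
from statistics import mean, pvariance
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int], eps: float = 1e-8) -> float:
    """Dice overlap between two flattened binary masks (1 = foreground)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def stability_objective(
    preds_per_paraphrase: Sequence[Sequence[int]],
    target: Sequence[int],
    lam: float = 1.0,  # hypothetical weight of the variance penalty
) -> float:
    """Mean Dice over paraphrases minus lam times the variance of those Dice scores.

    Each entry of preds_per_paraphrase stands for the mask obtained from one
    clinically equivalent rewording of the same request.
    Population variance is an assumption of this sketch.
    """
    scores = [dice(pred, target) for pred in preds_per_paraphrase]
    return mean(scores) - lam * pvariance(scores)


# Toy example: one ground-truth mask, predictions for two paraphrases.
ground_truth = [1, 1, 0, 0]
predictions = [[1, 1, 0, 0], [1, 0, 0, 0]]
objective_value = stability_objective(predictions, ground_truth, lam=1.0)
```

With a larger `lam`, two models with the same mean Dice are separated by how much their Dice scores swing between paraphrases, which is the consistency property the objective is meant to reward.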

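Experimental Results

Research Questions

RQ1: How can free-text clinical requests be grounded in anatomical evidence to produce consistent segmentations?
RQ2: Does explicit, executable skill-based reasoning improve robustness to linguistic variability in 3D medical image segmentation?
RQ3: Does a dynamic skill-evolving memory (SEER-Bank) continually improve reasoning quality and segmentation robustness on unseen prompts?
RQ4: How well do grounded reasoning and skill evolution transfer across segmentation backbones?
RQ5: Under linguistic perturbations, how does free-text prompt robustness show up in mean Dice, worst-case Dice, and result dispersion?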
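
The three quantities named in RQ5 can be computed from the Dice scores that one case receives under different phrasings of the same request. The sketch below is only an illustration: `robustness_summary` is a hypothetical helper, and the use of the population standard deviation as the dispersion measure is an assumption, since this summary does not state how the paper computes it.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def robustness_summary(dice_scores: Sequence[float]) -> Dict[str, float]:
    """Summarize Dice scores obtained under perturbed prompts for one case.

    Returns the mean Dice, the worst-case (minimum) Dice, and the
    population standard deviation as a dispersion measure.
    """
    if not dice_scores:
        raise ValueError("need at least one Dice score")
    return {
        "mean": mean(dice_scores),
        "worst": min(dice_scores),
        "std": pstdev(dice_scores),
    }


# Toy scores for three paraphrases of one request (illustrative values only).
summary = robustness_summary([0.9, 0.8, 1.0])
```

Reporting the worst case alongside the mean matters here because a method can have a high average while still failing badly on a single unlucky phrasing.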

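Key Findings

SEER outperforms the baselines in segmentation performance under both label prompts and free-text prompts.
Under linguistic perturbations of free-text prompts, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60% (as stated in the abstract).
On the strictly out-of-distribution PENGWIN dataset, SEER-Loop with SEER-Bank achieves the highest mean Dice (97.39) and the lowest Std (0.98).
Ablations on PENGWIN show that an off-the-shelf VLM degrades performance, whereas supervised grounded tuning of the reasoning raises Dice to 95.92 and lowers Std to 3.84; adding SEER-Loop improves this further to Dice 97.39 and Std 0.98.
Cross-backbone generalization to MedSAM3 shows that, compared with the baseline without SEER reasoning, SEER clearly raises mean performance and reduces dispersion.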

This review was generated by AI and checked by human editors.