QUICK REVIEW

[Paper Review] Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Tongrui Zhang, Chenhui Wang | arXiv (Cornell University) | 2026-03-09

Multimodal Machine Learning Applications | Citations: 0

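One-Line Summary

SEER introduces a grounded, skill-based reasoning framework, built on the dynamic SEER-Loop and the SEER-Trace dataset, that stabilizes free-text promptable 3D medical image segmentation, improves robustness to linguistic variability, and reduces performance variance.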

ABSTRACT

Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations. We propose Skill-Evolving grounded Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions. Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%.

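Motivation and Objectives

Address the instability of free-text prompts in 3D medical image segmentation caused by linguistic variability.
Curate the SEER-Trace dataset, which pairs clinical requests with image-grounded, skill-tagged reasoning traces.
Formalize grounded vision-language reasoning as executable skills aligned with anatomical evidence.
Introduce SEER-Loop, which distills high-reward reasoning into reusable skills for continual self-improvement.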

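Proposed Method

Build SEER-Trace by aggregating standard 3D segmentation benchmarks and pairing them with diverse clinical requests and skill-tagged traces.
Implement a vision-language reasoning chain that produces evidence e, a rationale r, and an executable answer a, which a fixed segmentation system S then uses to generate Ĝ.
Optimize a stability-aware objective over medically equivalent paraphrases to improve both accuracy and consistency: $J(\theta) = \mathbb{E}\big[\,\mathbb{E}_{q' \sim \Omega(q)}\,\mathrm{Dice}(S(V, a_\theta(V, q')), G) - \lambda\,\mathrm{Var}(\mathrm{Dice}(\dots))\,\big]$.
First adapt the VLM to SEER-Trace through supervised fine-tuning, then apply GRPO (Group Relative Policy Optimization) with a composite reward.
Introduce SEER-Loop, backed by SEER-Bank, which stores, retrieves, and distills high-reward reasoning artifacts, enabling continual skill evolution and robustness to unseen linguistic variants.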
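The stability-aware objective above can be illustrated for a single case as mean Dice over paraphrases minus λ times the Dice variance. The following is a minimal sketch rather than the authors' implementation: it assumes flattened binary masks, population variance, and a hypothetical `segment` callable that stands in for the full pipeline S(V, a_θ(V, q′)). The names `dice`, `stability_aware_score`, and `lam` are illustrative and do not come from the paper.

```python
from statistics import mean, pvariance
from typing import Callable, Sequence

Mask = Sequence[int]  # flattened binary mask, 1 = foreground voxel


def dice(pred: Mask, target: Mask, eps: float = 1e-8) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def stability_aware_score(
    segment: Callable[[str], Mask],  # hypothetical stand-in for S(V, a_theta(V, q'))
    paraphrases: Sequence[str],      # samples q' from the paraphrase set Omega(q)
    target: Mask,                    # ground-truth mask G
    lam: float = 1.0,                # variance penalty weight (lambda)
) -> float:
    """Mean Dice over paraphrases minus lam times the Dice variance."""
    scores = [dice(segment(q), target) for q in paraphrases]
    # Assumption: population variance; the review does not say which estimator is used.
    return mean(scores) - lam * pvariance(scores)


if __name__ == "__main__":
    # Toy example: two paraphrases of the same intent, one of them segmented worse.
    target = [1, 1, 0, 0]
    toy_outputs = {
        "segment the liver": [1, 1, 0, 0],
        "outline the hepatic region": [1, 0, 0, 0],
    }
    score = stability_aware_score(toy_outputs.__getitem__, list(toy_outputs), target, lam=0.5)
    print(f"stability-aware score: {score:.4f}")
```

A larger `lam` penalizes a model whose Dice swings between paraphrases, even when its average Dice is high, which is the trade-off between accuracy and consistency that the objective is meant to capture.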

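Experimental Results

Research Questions

RQ1: Given linguistic variability, how can free-text clinical requests on 3D medical images be grounded in anatomical evidence to yield consistent segmentation results?
RQ2: Does explicit, executable skill-based reasoning improve robustness to linguistic variability in 3D medical image segmentation?
RQ3: Does a dynamic skill-evolving memory (SEER-Bank) continually improve reasoning quality and segmentation robustness across unseen prompts?
RQ4: To what extent do reasoning grounding and skill evolution transfer across different segmentation backbones?
RQ5: How do linguistic perturbations of free-text prompts affect Dice, worst-case Dice, and the variance of results?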

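Key Results

SEER achieves better segmentation performance than the baselines in both label-prompting and free-text-prompting modes.
Under free-text prompting, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60% (as stated in the abstract).
On PENGWIN, a strictly out-of-distribution dataset, SEER with SEER-Loop and SEER-Bank attains the highest mean Dice (97.39) and the lowest standard deviation (0.98).
In the ablation on PENGWIN, the vanilla VLM degrades performance; fine-tuned grounded reasoning raises Dice to 95.92 and lowers Std to 3.84; SEER-Loop further improves Dice to 97.39 and Std to 0.98.
In cross-backbone generalization to MedSAM3, adding SEER substantially raises mean performance and reduces variance compared with the same backbone without SEER reasoning.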
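The robustness figures above (mean Dice, Std, worst-case Dice, and variance reduction) can be computed from per-paraphrase Dice scores roughly as follows. This is a sketch under stated assumptions: the review does not specify whether the paper uses population or sample variance, or whether the reported percentages are relative or absolute changes, so the function names and the choice of population statistics here are illustrative only.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, Sequence


def robustness_summary(dice_scores: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation, and worst-case Dice over prompt perturbations."""
    if not dice_scores:
        raise ValueError("need at least one Dice score")
    return {
        "mean": mean(dice_scores),
        "std": pstdev(dice_scores),   # assumption: population standard deviation
        "worst": min(dice_scores),    # worst-case Dice across perturbed prompts
    }


def relative_variance_reduction(baseline: Sequence[float], model: Sequence[float]) -> float:
    """Percentage drop in Dice variance of `model` relative to `baseline`."""
    base_var = pvariance(baseline)
    if base_var == 0:
        raise ValueError("baseline variance is zero; reduction is undefined")
    return 100.0 * (base_var - pvariance(model)) / base_var


if __name__ == "__main__":
    # Made-up Dice scores for three paraphrases of one request (not from the paper).
    baseline_scores = [0.60, 0.80, 1.00]
    model_scores = [0.85, 0.90, 0.95]
    print(robustness_summary(model_scores))
    print(relative_variance_reduction(baseline_scores, model_scores))
```

Reporting the worst case alongside the mean matters here because a method can have a high average Dice while still failing badly on a few phrasings of the same clinical request.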

This review was generated by AI and reviewed by a human editor.