QUICK REVIEW

[Paper Review] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Alejandra Beatriz Pérez, Anita Rau|arXiv (Cornell University)|2026. 03. 06.

Multimodal Machine Learning Applications | Citations: 0

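One-Line Summary

SUREON introduces a large-scale dataset of expert-narrated surgical videos for learning surgical reasoning, and presents two VLM-based models that give interpretable reasoning for complex surgical questions: SureonVLM, trained with supervised fine-tuning, and SureonVLM-R1, trained with reinforcement learning.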

ABSTRACT

Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

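Motivation and Objectives

Motivate the need for surgical reasoning that goes beyond surgical recognition and open-vocabulary recognition.
Build a large-scale, narration-based dataset that supervises higher-level surgical reasoning.
Develop vision-language models that can reason in surgical contexts and generate interpretable explanations.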

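Proposed Method

Define Semantic Grounding Moments (SGMs) in the expert narration and use them as anchors that tie the narration to the visual content.
Construct a 12-category question/answer taxonomy spanning perception, reasoning, temporal understanding, and safety.
Automatically generate and verify Q&A pairs with a multi-agent pipeline built on SGMs and transcript-based generators/verifiers.
Assemble training data from SUREON clips, standard datasets, and public sources, including 1.5M labeled frames and 460k labeled clips.
Train SureonVLM through three stages of supervised fine-tuning (SFT), progressively updating the model's modules.
Strengthen reasoning ability with Group Relative Policy Optimization (GRPO), using <think> tokens and a composite reward.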
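
The GRPO step above can be illustrated with a minimal Python sketch. The group-relative advantage (each sampled answer's reward, normalized by the mean and standard deviation of its group) is the standard GRPO formulation. The composite reward shown here is only an assumption: the review does not list its components, so this sketch combines a hypothetical format term (a well-formed <think>...</think> block followed by an answer) with a hypothetical answer-correctness term. The weights and the names `extract_answer`, `composite_reward`, and `group_relative_advantages` are illustrative and are not taken from the paper.

```python
import re
import statistics

# Hypothetical output format: "<think>reasoning</think> ANSWER"
_COMPLETION = re.compile(r"\s*<think>(.+?)</think>\s*(.+?)\s*", re.DOTALL)


def extract_answer(completion: str):
    """Return the text after the <think> block, or None if the format is violated."""
    match = _COMPLETION.fullmatch(completion)
    return match.group(2) if match else None


def composite_reward(completion: str, gold: str,
                     w_format: float = 0.2, w_answer: float = 0.8) -> float:
    """Assumed reward: weighted sum of a format term and an answer-correctness term."""
    answer = extract_answer(completion)
    format_ok = 1.0 if answer is not None else 0.0
    answer_ok = 1.0 if answer is not None and answer.upper() == gold.upper() else 0.0
    return w_format * format_ok + w_answer * answer_ok


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each reward against the mean and std of its own sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Toy group of sampled completions for one multiple-choice question (made-up text).
    group = [
        "<think>The grasper retracts before the clip is applied.</think> B",
        "<think>The clip appears first.</think> A",
        "B",  # correct letter, but missing the <think> block
    ]
    rewards = [composite_reward(c, gold="B") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed within a group, a completion is rewarded only for being better than its siblings sampled for the same question, which is why no separate value network is needed in this formulation.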

Figure 2: Example of SureonVLM-R1 on a Temporal Ordering question. Thinking tokens reveal reasoning connecting visual observations to the posed question.

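Experimental Results

Research Questions

RQ1: Can a vision-language model trained on narrated surgical lectures perform open-vocabulary recognition and higher-order surgical reasoning?
RQ2: Do reasoning-focused supervision and GRPO improve interpretable, multi-step surgical explanations compared with the base VLM?
RQ3: How does SureonVLM-R1 compare with general-domain models on standard surgical perception tasks and on the dedicated SUREON benchmark?
RQ4: Do the model outputs empirically exhibit intent-inference behavior (e.g., inferring operative intent from visual context)?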

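Key Results

SureonVLM and SureonVLM-R1 achieve high accuracy on the SUREON benchmark and outperform larger general-domain models in many categories.
In the multiple-choice setting, SureonVLM and SureonVLM-R1 reach an average accuracy of roughly 0.84–0.85, surpassing Qwen3-VL and the other baselines.
SureonVLM achieves strong safety-action identification and decision reasoning, with a clear advantage over GPT-5.1 and Gemini 3.1 Pro.
SureonVLM-R1's reasoning traces show explicit thinking tokens and alignment with the expert narration, supporting interpretable reasoning.
Progressive surgical adaptation through the reasoning-oriented configuration (T+S) and open-ended training (O) substantially improves performance, and CoT helps stabilize GRPO.
SureonVLM outperforms general-domain models on standard surgical perception benchmarks, suggesting that reasoning training does not degrade perception ability.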
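
To make the "average accuracy" figure concrete, here is a minimal Python sketch of how per-category and category-averaged accuracy could be computed for a multiple-choice benchmark. It assumes a macro average (an unweighted mean over question categories); the review does not say whether the reported figure is macro- or micro-averaged. The records below are toy values, not SUREON data, and the function names are illustrative.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, predicted_choice, gold_choice) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    return {category: correct[category] / total[category] for category in total}


def macro_average(per_category):
    """Unweighted mean over categories, so small categories count as much as large ones."""
    return sum(per_category.values()) / len(per_category)


if __name__ == "__main__":
    # Toy predictions (made up for illustration only).
    toy_records = [
        ("safety", "A", "A"),
        ("safety", "B", "A"),
        ("forecasting", "C", "C"),
    ]
    per_category = accuracy_by_category(toy_records)
    print(per_category)
    print(macro_average(per_category))
```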


This review was generated by AI and reviewed by a human editor.