QUICK REVIEW

[Paper Review] SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Alejandra Beatriz Pérez, Anita Rau | arXiv (Cornell University) | Mar 6, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

tldr: SUREON introduces a large-scale, expert-narrated surgical video dataset for learning surgical reasoning, and two VLM-based models (SureonVLM and SureonVLM-R1) trained with supervised fine-tuning and reinforcement learning to answer complex surgical questions with interpretable reasoning.

ABSTRACT

Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

Research Motivation and Objectives

Motivate the need for surgical reasoning beyond perception and open-vocabulary recognition.
Create a large-scale, narration-grounded dataset to supervise higher-level surgical reasoning.
Develop vision-language models that can perform reasoning and generate interpretable explanations in surgical contexts.

Proposed Method

Define Semantic Grounding Moments (SGMs) from expert narration to anchor visual content.
Construct a 12-category question taxonomy covering perception, reasoning, temporal understanding, and safety.
Automatically generate and validate Q&A pairs via a multi-agent pipeline with SGMs and transcript-based generators/validators.
Assemble training data including SUREON clips, standard datasets, and 1.5M labeled frames / 460k labeled clips from public sources.
Train SureonVLM via three-stage supervised fine-tuning (SFT), progressively updating modules from stage to stage.
Enhance reasoning via reinforcement learning using Group Relative Policy Optimization (GRPO) with <think> tokens and a composite reward.
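
The GRPO step above can be illustrated with a minimal sketch. GRPO samples a group of completions for the same prompt, scores each one, and normalizes every reward against the group's mean and standard deviation to obtain an advantage. The review does not say how the paper's composite reward is built, so the format-plus-correctness reward below, its weights (`w_format`, `w_answer`), and the function names are all hypothetical, chosen only to show the mechanics.

```python
import re
import statistics

# Assumed output format: "<think>reasoning</think> ANSWER".
# This layout is an illustration, not the paper's specification.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def composite_reward(completion, correct_choice, w_format=0.2, w_answer=0.8):
    """Hypothetical composite reward: a format term plus a correctness term."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # no well-formed <think> block, so no reward at all
    answer = match.group(2).strip()
    is_correct = answer.upper() == correct_choice.upper()
    return w_format + (w_answer if is_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (the GRPO idea)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every completion scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled completions for one multiple-choice question whose answer is "B".
group = [
    "<think>The grasper retracts before the clip is applied.</think> B",
    "<think>The clip seems to come first.</think> A",
    "B",  # correct letter, but the reasoning block is missing
]
rewards = [composite_reward(c, "B") for c in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the well-formed correct completion receives a positive advantage and the other two receive negative ones. In a full training loop these advantages would weight the policy-gradient update for each completion's tokens.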

Figure 2: Example of SureonVLM-R1 on a Temporal Ordering question. Thinking tokens reveal reasoning connecting visual observations to the posed question.

Experimental Results

Research Questions

RQ1: Can a vision-language model trained on narrated surgical lectures perform open-vocabulary perception and higher-level surgical reasoning?
RQ2: Do reasoning-focused supervision and GRPO improve interpretable, multi-step surgical explanations compared with base VLMs?
RQ3: How does SureonVLM-R1 compare to general-domain models on standard surgical perception tasks and on the specialized SUREON benchmark?
RQ4: Is there tangible evidence of reasoning-like behavior (e.g., inferring intent from visual context) in model outputs?

Key Findings

SureonVLM and SureonVLM-R1 achieve high accuracy on the SUREON benchmark, outperforming larger general-domain models in many categories.
In multiple-choice settings, SureonVLM and SureonVLM-R1 reach average accuracies around 0.84–0.85, surpassing Qwen3-VL and other baselines.
SureonVLM achieves strong safety action identification and decision reasoning, with notable gains over GPT-5.1 and Gemini 3.1 Pro.
Reasoning traces from SureonVLM-R1 demonstrate explicit thinking tokens and alignment with expert narration, supporting interpretable reasoning.
Ablation shows progressive surgical adaptation (T+S) and open-ended training (O) significantly boost performance, with CoT aiding GRPO stability.
SureonVLM outperforms general-domain models on standard surgical perception benchmarks, indicating no perceptual trade-off from reasoning training.
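
As a rough illustration of how the multiple-choice figures above could be tabulated, the sketch below computes accuracy per question category and then averages across categories. Whether the reported average is taken over categories (macro) or over individual examples is not stated in this review, so the macro average here is an assumption, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """records: iterable of (category, predicted_choice, gold_choice) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        # Compare answer letters case-insensitively, ignoring stray whitespace.
        correct[category] += int(predicted.strip().upper() == gold.strip().upper())
    return {category: correct[category] / total[category] for category in total}


def macro_average(accuracy_by_category):
    """Unweighted mean over categories (assumed averaging scheme)."""
    return sum(accuracy_by_category.values()) / len(accuracy_by_category)


# Made-up predictions, only to exercise the functions.
sample = [
    ("Temporal Ordering", "A", "A"),
    ("Temporal Ordering", "B", "C"),
    ("Safety", "d", "D"),
]
accuracy = per_category_accuracy(sample)
average = macro_average(accuracy)
```

A macro average gives every category equal weight, so a category with few examples influences the headline number as much as a large one. An example-level average would weight categories by size instead.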

This review was generated by AI and checked by a human editor.