[Paper Review] StAR: Segment Anything Reasoner
tldr: StAR refines reinforcement learning for reasoning segmentation to improve segmentation from implicit queries, introduces the ReasonSeg-X and ReasonSeg-R benchmarks, and applies parallel test-time scaling to boost performance further.
As AI systems are integrated more rapidly into diverse and complex real-world environments, the ability to reason holistically over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base model. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives, including the parameter-tuning scheme, reward functions, learning strategies, and answer format, and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmarks, we also construct ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.
Research Motivation and Objectives
- Objective 1: Address the bottlenecks in reinforcement learning with verifiable rewards (RLVR) for reasoning segmentation.
- Objective 2: Preserve base MLLM capabilities while enhancing visual reasoning for segmentation tasks.
- Objective 3: Introduce ReasonSeg-X/R benchmarks to evaluate diverse reasoning types and depths.
- Objective 4: Develop training and test-time strategies (REST, mask-level voting, LP) to maximize reasoning performance with limited data.
Proposed Method
- Method 1: Adopts a decoupled reasoning-segmentation pipeline where an MLLM generates a chain-of-thought and predicts bounding boxes and points, which SAM uses to produce masks.
- Method 2: Uses Group Relative Policy Optimization (GRPO) as the core RLVR algorithm with a minibatch of rollouts and group-wise advantage normalization.
- Method 3: Implements a multi-faceted reward design that includes SAM-level mask-IoU rewards and MLLM-level accuracy rewards, with batched Hungarian matching to assign predictions to targets.
- Method 4: Employs parameter-efficient tuning (LoRA) with adjusted learning rates to preserve base model knowledge while enhancing reasoning.
- Method 5: Introduces Rollout-Expanded Selective-Tuning (REST) to increase exploration during Stage-2 training by selecting extreme advantages for updates.
- Method 6: Adds Label Prediction (LP) to require semantic labels alongside geometry to improve grounding and faithfulness.
- Method 7: Develops a mask-level majority voting (MV) strategy that clusters the masks from multiple parallel samples by IoU and selects the final masks by voting across clusters.
- Method 8: Constructs ReasonSeg-X as a four-type, depth-extended reasoning benchmark and ReasonSeg-R as a refined version of ReasonSeg to ensure mask-query correspondence and boundary accuracy.
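The group-wise advantage normalization in Method 2 and the selection of extreme advantages in Method 5 can be illustrated with a minimal sketch. The normalization is the standard GRPO form (reward minus group mean, divided by group standard deviation). The selection rule is an assumption: this review does not specify how REST picks rollouts, so the hypothetical `select_extreme` helper below simply keeps the rollouts with the largest absolute advantage. The reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_extreme(advantages, keep):
    """Hypothetical selection rule (assumption): keep the `keep` rollouts
    with the largest |advantage|, returned in their original order."""
    ranked = sorted(
        range(len(advantages)),
        key=lambda i: abs(advantages[i]),
        reverse=True,
    )
    return sorted(ranked[:keep])


# Illustrative rewards for six rollouts of one query (made-up numbers).
rewards = [0.9, 0.1, 0.5, 0.5, 0.8, 0.2]
adv = group_relative_advantages(rewards)
kept = select_extreme(adv, keep=2)
print(kept)
```

Rollouts whose reward sits at the group mean receive an advantage near zero and contribute little gradient signal, which is one plausible reason to sample more rollouts and update only on the most informative ones.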
Experimental Results
Research Questions
- RQ1: How can RLVR bottlenecks be identified and mitigated to fully elicit a base model's latent visual reasoning capabilities?
- RQ2: What combination of parameter tuning, reward design, learning strategy, and answer format yields the best segmentation performance from implicit queries?
- RQ3: Does test-time scaling with parallel sampling improve segmentation accuracy across complex reasoning tasks?
- RQ4: Can ReasonSeg-X/R provide a comprehensive evaluation of reasoning depth and types for segmentation methods?
- RQ5: What is the impact of simple semantic labeling (LP) on grounding and faithfulness of segmentation results?
Key Findings
- Key Finding 1: StAR outperforms base VisionReasoner and many baselines on ReasonSeg-X/R after Stage-2 training.
- Key Finding 2: Stage-1 StAR leverages base model reasoning capabilities without reasoning data, surpassing methods using the same base model.
- Key Finding 3: REST (Rollout-Expanded Selective-Tuning) enhances Stage-2 training efficiency and improves performance on complex reasoning tasks.
- Key Finding 4: Mask-level majority voting substantially improves final segmentation by aggregating across parallel responses.
- Key Finding 5: StAR with larger base models and test-time voting approaches or matches the performance of much larger models (e.g., SAM 3 Agent with 72B) on ReasonSeg-X.
- Key Finding 6: On MMR, StAR shows strong zero-shot performance, outperforming VisionReasoner and models trained on MMR on several metrics.
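The mask-level majority voting in Method 7 and Key Finding 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: masks are represented as sets of pixel coordinates, clustering is greedy against each cluster's first member, the IoU threshold of 0.5 is arbitrary, and the final mask is taken as the member of the largest cluster with the highest total IoU to its cluster mates. The names `mask_iou`, `majority_vote`, and `rect` are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def majority_vote(masks, threshold=0.5):
    """Cluster masks by IoU and return (index of chosen mask, winning cluster)."""
    clusters = []  # each cluster is a list of indices; first entry is its anchor
    for i, mask in enumerate(masks):
        for cluster in clusters:
            if mask_iou(masks[cluster[0]], mask) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    # The cluster with the most members wins the vote.
    winner = max(clusters, key=len)
    # Pick the member that agrees most with the rest of its cluster.
    best = max(
        winner,
        key=lambda i: sum(mask_iou(masks[i], masks[j]) for j in winner),
    )
    return best, winner


def rect(r0, c0, r1, c1):
    """Rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Four parallel samples: three roughly agree, one is an outlier.
masks = [rect(0, 0, 4, 4), rect(0, 0, 4, 5), rect(1, 0, 4, 4), rect(6, 6, 9, 9)]
best, cluster = majority_vote(masks)
print(best, cluster)
```

Voting over masks rather than over text answers means two responses with different boxes or points still count as agreeing whenever SAM turns them into nearly the same region.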
This review was generated by AI and checked by a human editor.