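[Paper Explained] Segment Anything in High Quality
HQ-SAM adds a lightweight High-Quality (HQ) Output Token and global-local feature fusion to SAM, achieving high-quality zero-shot segmentation at almost no extra cost. It is trained on a dataset of 44K masks.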
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44K masks, which takes only 4 hours on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 10 diverse segmentation datasets across different downstream tasks, where 8 out of them are evaluated in a zero-shot transfer protocol. Our code and pretrained models are at https://github.com/SysCV/SAM-HQ.
Motivation and Objectives
- Improve mask quality for diverse objects, where SAM often produces coarse boundaries.
- Preserve SAM’s zero-shot generalization and promptable design while adding minimal adapters.
- Demonstrate data-efficient training on a compact, highly-annotated dataset.
- Show robust performance across multiple image and video segmentation benchmarks in zero-shot settings.
Proposed Method
- Introduce a learnable HQ-Output Token injected into SAM’s mask decoder.
- Fuse early- and final-layer ViT encoder features with the mask decoder features into HQ-Features for finer detail.
- Train only the HQ-Output Token, its three-layer MLPs, and a fusion block while freezing SAM.
- Use a three-layer MLP to generate dynamic kernels for high-quality mask prediction.
- Combine HQ-Output Token predictions with SAM’s output via element-wise summation for final masks.
- Develop HQSeg-44K, a 44k-mask dataset from six sources to enable data-efficient training.
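The prediction path described in the list above can be illustrated with a minimal pure-Python sketch. It is not the official SAM-HQ implementation: the function names (`mlp3`, `fuse_features`, `dynamic_mask_logits`, `combine_logits`, `binarize`) and all numbers are hypothetical. It also assumes the three feature maps have already been brought to the same shape, and it skips the convolutional processing a real model would apply before fusion.

```python
def mlp3(x, layers):
    """Apply a three-layer MLP; `layers` is a list of (weight_rows, biases)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:  # ReLU on hidden layers only
            x = [max(0.0, v) for v in x]
    return x


def fuse_features(*maps):
    """Element-wise sum of feature maps that all share the shape [C][H][W]."""
    return [[[sum(vals) for vals in zip(*rows)] for rows in zip(*chans)]
            for chans in zip(*maps)]


def dynamic_mask_logits(kernel, features):
    """Dot the C-dim kernel with the C-dim feature vector at every pixel."""
    height, width = len(features[0]), len(features[0][0])
    return [[sum(k * chan[y][x] for k, chan in zip(kernel, features))
             for x in range(width)] for y in range(height)]


def combine_logits(sam_logits, hq_logits):
    """Element-wise sum of SAM's mask logits and the HQ mask logits."""
    return [[s + h for s, h in zip(rs, rh)] for rs, rh in zip(sam_logits, hq_logits)]


def binarize(logits, threshold=0.0):
    """Turn logits into a binary mask."""
    return [[1 if v > threshold else 0 for v in row] for row in logits]


# Toy example: C = 2 channels, 2x2 feature maps (all values are made up).
early = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
final = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 0.0]]]
mask_feat = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]]]
hq_features = fuse_features(early, final, mask_feat)

# Stand-in for the updated HQ-Output Token and its MLP (identity weights here).
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
hq_token = [1.0, 2.0]
kernel = mlp3(hq_token, [identity, identity, identity])

hq_logits = dynamic_mask_logits(kernel, hq_features)
sam_logits = [[-3.0, 0.5], [-1.5, -2.0]]  # pretend output of the frozen SAM decoder
combined = combine_logits(sam_logits, hq_logits)
final_mask = binarize(combined)
```

In this sketch only the MLP weights and the token would be trainable. The `sam_logits` input stands in for the frozen SAM branch, which is why the added parameter count can stay small.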
Experimental Results
Research Questions
- RQ1: Can HQ-SAM improve mask detail and boundary accuracy without harming SAM’s zero-shot performance?
- RQ2: Is HQ-SAM training data-efficient, achieving high-quality masks with minimal additional parameters?
- RQ3: Do global-local feature fusion and the HQ-Output Token provide measurable gains across diverse datasets and prompts?
- RQ4: How does HQ-SAM compare to full fine-tuning or post-refinement approaches in zero-shot scenarios?
Key Findings
- HQ-SAM yields higher-quality masks than SAM while preserving zero-shot capabilities across 10 diverse datasets.
- Training HQ-SAM on HQSeg-44K requires only 4 hours on 8 RTX 3090 GPUs with less than 0.5% parameter overhead.
- HQ-SAM achieves substantial gains in boundary-focused metrics (e.g., mBIoU improvements on several fine-grained datasets).
- Global-local fusion of early and final ViT encoder features plus mask features improves segmentation detail over using SAM features alone.
- Compared to fine-tuning or post-refinement baselines, HQ-SAM provides better zero-shot performance with much smaller parameter updates.
- Light HQ-SAM, built on MobileSAM’s tiny encoder, runs at 41.2 FPS with modest overhead while improving COCO open-set metrics.
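To make the boundary-focused metric mentioned in the findings more concrete, here is a rough pure-Python sketch of a boundary IoU computation. It is an illustration under stated assumptions, not the evaluation code used in the paper. The boundary band is approximated as the mask minus its erosion with a 3x3 neighbourhood, pixels outside the image count as background, and the band width is given in pixels. All function names are hypothetical.

```python
def erode(mask, iterations=1):
    """Binary erosion with a 3x3 neighbourhood; outside the image is background."""
    height, width = len(mask), len(mask[0])
    for _ in range(iterations):
        out = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                out[y][x] = int(all(
                    0 <= y + dy < height and 0 <= x + dx < width and mask[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                ))
        mask = out
    return mask


def boundary_band(mask, width=1):
    """Pixels of the mask that lie within `width` erosion steps of its contour."""
    eroded = erode(mask, width)
    return [[int(bool(m) and not e) for m, e in zip(rm, re)] for rm, re in zip(mask, eroded)]


def iou(a, b):
    """Intersection over union of two binary masks of equal size."""
    pairs = [(p, q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    inter = sum(1 for p, q in pairs if p and q)
    union = sum(1 for p, q in pairs if p or q)
    return inter / union if union else 1.0


def boundary_iou(pred, gt, width=1):
    """IoU restricted to the boundary bands of prediction and ground truth."""
    return iou(boundary_band(pred, width), boundary_band(gt, width))


def square(size, top, left, side):
    """Helper for the toy example: a filled square inside a size x size image."""
    return [[int(top <= y < top + side and left <= x < left + side)
             for x in range(size)] for y in range(size)]


gt = square(6, 1, 1, 4)
shifted = square(6, 1, 2, 4)  # same square moved one pixel to the right
mask_score = iou(shifted, gt)
boundary_score = boundary_iou(shifted, gt)
```

In this toy case the one-pixel shift leaves the ordinary mask IoU at 0.6, while the boundary IoU drops to 1/3. That gap illustrates why boundary-focused metrics are more sensitive to the kind of detail HQ-SAM targets. Averaging `boundary_iou` over images would give a mean score in the spirit of mBIoU, though the exact averaging protocol is an assumption here.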
This explainer was generated by AI and reviewed by human editors.