QUICK REVIEW

[Paper Walkthrough] SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Haoxiang Wang, Pavan Kumar Anasosalu Vasu|arXiv (Cornell University)|Oct 23, 2023

Domain Adaptation and Few-Shot Learning | Cited by 13

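One-Sentence Summary

SAM-CLIP merges SAM and CLIP into a single ViT backbone through multi-task distillation with replay, enabling zero-shot classification, instance segmentation, and state-of-the-art zero-shot semantic segmentation at reduced memory and compute cost.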

ABSTRACT

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.

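Motivation and Goals

Motivate merging vision foundation models (VFMs) to combine semantic and spatial understanding.
Propose an efficient, replay-based distillation method that merges VFMs while minimizing forgetting.
Show that SAM-CLIP, as a single backbone, supports zero-shot classification, instance segmentation, and semantic segmentation.
Show that the merged model yields richer representations and gains new zero-shot capabilities.
Assess suitability for edge devices through reduced storage and compute requirements.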

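Proposed Method

Use SAM as the base VFM and merge CLIP into its backbone through a multi-head architecture.
Train in two stages: head probing for the CLIP head, followed by multi-task distillation on replay data.
Distill knowledge from CLIP and SAM on the replay data with a cosine distillation loss for CLIP and a SAM-specific distillation loss.
Freeze the encoders for non-image modalities, and let the image backbone and heads learn at a small learning rate to prevent forgetting.
Use a two-dataset replay strategy: D_CLIP for CLIP distillation and D_SAM for SAM distillation, with the combined objective L_CLIP + λ L_SAM.
Use a two-resolution strategy with resolution adaptation to reconcile CLIP training (lower resolution) with SAM training (1024px).
Provide an inference pipeline in which a single backbone supports classification, instance segmentation, and semantic segmentation.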
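
The combined objective L_CLIP + λ L_SAM can be sketched in a few lines. This is only an illustration, not the authors' implementation: it assumes the CLIP term is one minus the cosine similarity between student and teacher embeddings, and it treats the SAM distillation loss as a per-sample number computed elsewhere. The function names (`cosine_distill_loss`, `combined_loss`) and the parameter `lam` are hypothetical.

```python
import math

def cosine_distill_loss(student, teacher):
    """1 - cosine similarity between a student and a teacher embedding."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def combined_loss(clip_pairs, sam_losses, lam=1.0):
    """L_CLIP + lam * L_SAM over one pair of replay mini-batches.

    clip_pairs: (student_embedding, teacher_embedding) tuples drawn from D_CLIP.
    sam_losses: per-sample SAM distillation losses drawn from D_SAM; the SAM
                loss itself is not modeled here (assumption for brevity).
    """
    l_clip = mean([cosine_distill_loss(s, t) for s, t in clip_pairs])
    l_sam = mean(sam_losses)
    return l_clip + lam * l_sam

# Toy batches: one perfectly aligned pair and one orthogonal pair.
clip_batch = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
sam_batch = [0.2, 0.4]
loss = combined_loss(clip_batch, sam_batch, lam=0.5)
```

Keeping the two terms on separate replay batches mirrors the two-dataset strategy described above: each teacher is only queried on data resembling its own pre-training distribution, and λ balances how strongly the SAM term pulls on the shared backbone.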

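Experimental Results

Research Questions

RQ1: Can two different VFMs (SAM and CLIP) be merged into a single backbone without catastrophic forgetting?
RQ2: Does replay-based multi-task distillation transfer knowledge effectively while preserving the original capabilities?
RQ3: Can SAM-CLIP perform zero-shot semantic segmentation and outperform task-specific models on multiple benchmarks?
RQ4: Compared with deploying SAM and CLIP separately, does the merged model save storage and compute on edge devices?
RQ5: What representations does the merged model produce, and how do they support downstream tasks?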

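Key Findings

SAM-CLIP retains the core zero-shot capabilities of SAM and CLIP with minimal forgetting.
The merged model is on par with the baseline VFMs on zero-shot classification and instance segmentation.
SAM-CLIP achieves state-of-the-art zero-shot semantic segmentation on five datasets.
Head probing shows that SAM-CLIP's representations are richer than those of SAM or CLIP alone on both semantic and spatial tasks.
Resolution-adaptive training lets CLIP-style tasks run at 224/336/448px while SAM tasks run at 1024px.
Combining the CLIP and SAM heads within SAM-CLIP further improves zero-shot segmentation quality.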
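
To see why the resolution gap between CLIP-style tasks and SAM tasks matters for compute, it helps to count the patch tokens a ViT processes at each input size. The sketch below assumes a patch size of 16 (typical of ViT-B/16 backbones, but an assumption here, as this summary does not state it) and ignores any class token; `vit_token_count` is a hypothetical helper.

```python
def vit_token_count(image_size, patch_size=16):
    """Number of patch tokens for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side

# Resolutions mentioned in the findings above.
for size in (224, 336, 448, 1024):
    print(f"{size}px -> {vit_token_count(size)} tokens")
```

Under this assumption, a 1024px input produces 4096 tokens against 196 at 224px, roughly 21 times as many, which is one reason to keep CLIP-style tasks at lower resolutions when they do not need SAM's spatial detail.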


This walkthrough was generated by AI and reviewed by human editors.