QUICK REVIEW

[Paper Review] Side Adapter Network for Open-Vocabulary Semantic Segmentation

Mengde Xu, Zheng Zhang|arXiv (Cornell University)|Feb 23, 2023
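Multimodal Machine Learning Applications | Cited by 14
One-Sentence Summary

SAN attaches a lightweight side network to a frozen CLIP model to jointly produce mask proposals and CLIP-aware attention bias, enabling end-to-end open-vocabulary semantic segmentation with substantial gains in efficiency and accuracy.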

ABSTRACT

This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN.

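Motivation and Objectives

  • Advance open-vocabulary semantic segmentation by building on vision-language pre-training (CLIP).
  • Introduce a lightweight side network that keeps the CLIP backbone frozen yet can be trained end-to-end.
  • Decouple mask proposal generation from CLIP-based recognition by means of attention bias.
  • Achieve CLIP-aware mask prediction with minimal additional parameters and computation.
  • Demonstrate state-of-the-art performance, with an efficiency advantage, on multiple benchmarks.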

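Proposed Method

  • Attach a side adapter network (SAN) to a frozen CLIP model, with two branches: one generates mask proposals, the other predicts the attention bias used for mask recognition.
  • Use asymmetric input resolutions: low-resolution CLIP features for CLIP-based recognition, and a high-resolution SAN input for mask proposals.
  • Fuse visual tokens from CLIP into the SAN and apply a decoupled head to produce mask proposals and recognition bias.
  • Compute the segmentation as S = M P^T, where M denotes the mask proposals and P the class scores obtained with the attention bias.
  • Train end-to-end with a mask prediction loss (Dice and BCE) and a mask classification loss (cross-entropy).
  • Optionally fine-tune the CLIP position embeddings and use prompt engineering to improve zero-shot recognition.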
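
The composition step S = M P^T can be illustrated with a small framework-free sketch. It assumes M is stored as one row per pixel with one soft mask value per proposal, and P as one row per class with one score per proposal, so that the product has one row per pixel and one column per class. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def compose_segmentation(masks, class_scores):
    """Compute S = M P^T.

    masks:        M, shape (num_pixels, num_proposals), soft mask values in [0, 1]
    class_scores: P, shape (num_classes, num_proposals), per-proposal class scores
    returns:      S, shape (num_pixels, num_classes)
    """
    return [
        [sum(m * p for m, p in zip(pixel_row, class_row)) for class_row in class_scores]
        for pixel_row in masks
    ]


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Toy example: 4 pixels, 2 mask proposals, 3 classes (all values are made up).
M = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
]
P = [
    [0.7, 0.1],  # class 0 scores for proposals 0 and 1
    [0.2, 0.1],  # class 1
    [0.1, 0.8],  # class 2
]

S = compose_segmentation(M, P)
labels = [argmax(row) for row in S]  # per-pixel class decision
print(labels)
```

Each pixel's class score is a sum over proposals, weighted by how strongly the proposal covers that pixel, so a pixel inherits the class of the proposals that claim it.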
Figure 2: Overview of our SAN. The red dotted lines indicate the gradient flow during training. In our framework, the frozen CLIP model still serves as a classifier, and the side adapter network generates mask proposals and attention bias to guide the deeper layers of the CLIP model to predict pro
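
To make the idea of an attention bias concrete, here is a minimal single-head, single-query sketch in plain Python. It assumes the bias is added to the attention logits before the softmax, which is the usual form of an additive attention bias; the review does not spell out the exact formulation, and the function names and values below are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-query attention with an additive bias on the logits.

    query:  vector of length d (e.g. a token that summarizes one mask proposal)
    keys:   one length-d vector per visual token
    values: one vector per visual token
    bias:   one scalar per visual token; large negative values suppress a token
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy example: 3 visual tokens; the bias suppresses token 2 (values are made up).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
bias = [0.0, 0.0, -1e4]  # token 2 is treated as lying outside the proposed region

output, weights = biased_attention(query, keys, values, bias)
print([round(w, 3) for w in weights])
```

Tokens 0 and 2 have identical keys, yet token 2 receives essentially zero weight: the bias alone steers which image regions contribute to the output, without changing any weights of the attention layer itself.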

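Experimental Results

Research Questions

  • RQ1: How can open-vocabulary semantic segmentation be achieved without fine-tuning the large CLIP model on segmentation data?
  • RQ2: Can a lightweight side network use frozen CLIP features to produce CLIP-aware mask proposals and recognition bias in an end-to-end manner?
  • RQ3: How do feature fusion depth, input resolution, and the decoupled head affect performance and efficiency?
  • RQ4: How does SAN compare with two-stage or fully fine-tuned CLIP-based methods in accuracy and efficiency across benchmarks?
  • RQ5: How does prompt engineering affect open-vocabulary segmentation performance?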

Key Findings

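| Method | VL-Model | Training Dataset | Ensemble | ADE-847 | PC-459 | ADE-150 | PC-59 | VOC |
|---|---|---|---|---|---|---|---|---|
| SimSeg | CLIP ViT-B/16 | COCO | no | 7.0 | 8.7 | 20.5 | 47.7 | 88.4 |
| MaskCLIP | CLIP ViT-L/14 | COCO | no | 8.2 | 10.0 | 23.7 | 45.9 | - |
| OvSeg* | CLIP ViT-B/16 | COCO | yes | 7.1 | 11.0 | 24.8 | 53.3 | 92.6 |
| SAN (ours) | CLIP ViT-B/16 | COCO | no | 10.1 ±0.23 | 12.6 ±0.44 | 27.5 ±0.34 | 53.8 ±0.57 | 94.0 ±0.21 |
| SAN ensemble | CLIP ViT-B/16 | COCO | yes | 10.7 ±0.22 | 13.7 ±0.34 | 28.9 ±0.42 | 55.4 ±0.11 | 94.6 ±0.11 |
| SAN (ours) | CLIP ViT-L/14 | COCO | no | 12.4 ±0.27 | 15.7 ±0.26 | 32.1 ±0.42 | 57.7 ±0.34 | 94.6 ±0.42 |
| SAN ensemble | CLIP ViT-L/14 | COCO | yes | 13.7 ±0.12 | 17.1 ±0.18 | 33.3 ±0.29 | 60.2 ±0.31 | 95.5 ±0.16 |
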
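
The benchmark scores reported here are mIoU values in percent. As a reminder of what that metric measures, the following sketch computes mean intersection-over-union for a single toy label map, assuming the usual definition. Benchmark implementations typically accumulate intersections and unions over a whole dataset rather than per image; the function name and labels below are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target.

    pred, target: flat sequences of integer class labels, one per pixel
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 6 pixels and 3 classes (labels are made up).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.1f}")
```

Because every class counts equally regardless of how many pixels it covers, large vocabularies with many rare classes (such as ADE-847) tend to yield much lower mIoU than small ones (such as VOC).
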
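  • With CLIP ViT-L/14, SAN reaches state-of-the-art mIoU on ADE-847 (12.4), PC-459 (15.7), ADE-150 (32.1), PC-59 (57.7), and VOC (94.6), surpassing prior methods.
  • With CLIP ViT-B/16 and no full CLIP fine-tuning, SAN reaches 10.1 mIoU on ADE-847, 12.6 on PC-459, 27.5 on ADE-150, 53.8 on PC-59, and 94.0 on VOC.
  • Ensembling SAN with a COCO-tuned model raises the ViT-L/14 results to 13.7 on ADE-847, 17.1 on PC-459, 33.3 on ADE-150, 60.2 on PC-59, and 95.5 on VOC.
  • SAN needs only 8.4M trainable parameters and 64.3 GFLOPs, far fewer than competing methods.
  • Ablations show that deeper CLIP feature fusion and the decoupled head both improve performance, and that end-to-end, CLIP-aware mask prediction is essential for strong results.
  • Prompt engineering brings a measurable gain (about 1.2 mIoU on ADE-150 and ADE-847).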
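
The review does not describe which prompts are used. One common form of prompt engineering with CLIP-style models is to fill each class name into several text templates and average the resulting text embeddings; the sketch below illustrates that pattern under this assumption. The templates are hypothetical, and `toy_text_embedding` is a deterministic stand-in for a real text encoder, not CLIP.

```python
import math

# Hypothetical templates; the source does not list the prompts it uses.
TEMPLATES = [
    "a photo of a {}.",
    "a photo of the {} in the scene.",
    "there is a {} in the image.",
]


def toy_text_embedding(text, dim=4):
    """Stand-in for a real text encoder: a deterministic toy vector, NOT CLIP."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return vec


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_embedding(class_name, embed=toy_text_embedding):
    """Average the normalized embeddings of all templated prompts for one class."""
    prompts = [template.format(class_name) for template in TEMPLATES]
    embeddings = [l2_normalize(embed(prompt)) for prompt in prompts]
    mean = [sum(column) / len(embeddings) for column in zip(*embeddings)]
    return l2_normalize(mean)


# An open vocabulary is just a list of class names supplied at inference time.
vocabulary = ["cat", "traffic light"]
text_classifier = {name: class_embedding(name) for name in vocabulary}
print(TEMPLATES[0].format("cat"))
```

Swapping `toy_text_embedding` for a real text encoder would turn `text_classifier` into the set of class vectors against which region features are compared.
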
Figure 3: The architecture of the side adapter network. The side adapter network projects the input image to visual tokens and appends query tokens to them at the beginning. Further, it fuses the immediate features of the CLIP model in the middle of transformer layers. The query and visual features
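
The token layout described in the Figure 3 caption can be sketched as follows. This is a toy illustration, assuming the fusion is an element-wise addition applied to the visual tokens only and that the CLIP features already match those tokens in count and width (a real model would need a projection and a resize first). The function names and values are hypothetical.

```python
def prepend_queries(query_tokens, visual_tokens):
    """Build the token sequence: query tokens first, then visual tokens."""
    return query_tokens + visual_tokens


def fuse_clip_features(tokens, clip_features, num_queries):
    """Add CLIP features to the visual tokens only; query tokens pass through.

    Assumes clip_features has one vector per visual token, with the same width.
    """
    queries = tokens[:num_queries]
    visual = tokens[num_queries:]
    fused = [
        [t + c for t, c in zip(token, feature)]
        for token, feature in zip(visual, clip_features)
    ]
    return queries + fused


# Toy example: 2 query tokens, 3 visual tokens, width 2 (values are made up).
query_tokens = [[0.0, 0.0], [1.0, 1.0]]
visual_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
clip_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tokens = prepend_queries(query_tokens, visual_tokens)
tokens = fuse_clip_features(tokens, clip_features, num_queries=len(query_tokens))
print(len(tokens))
```

In a full model, transformer layers would run between fusion steps so that the query tokens can attend to the CLIP-enriched visual tokens.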


This review was generated by AI and reviewed by human editors.