QUICK REVIEW

[Paper Review] Language-driven Semantic Segmentation

Boyi Li, Kilian Q. Weinberger|arXiv (Cornell University)|Jan 10, 2022

Advanced Neural Network Applications|Cited by 163

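One-sentence summary

LSeg uses a text encoder (e.g., CLIP) to embed arbitrary label descriptions and trains a dense image encoder to align per-pixel embeddings with those text embeddings, enabling zero-shot semantic segmentation and flexible label sets without retraining.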

ABSTRACT

We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided. Code and demo are available at https://github.com/isl-org/lang-seg.

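Motivation and objectives

Overcome the fixed-label-set limitation of semantic segmentation by replacing it with a flexible, language-driven label representation.
Embed descriptive labels with a text encoder and train an image encoder to align pixel embeddings with those label embeddings.
Demonstrate zero-shot segmentation of unseen categories without additional training samples, evaluated against existing zero- and few-shot methods.
Show that semantic similarity in the language space transfers to the visual domain for unseen classes.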

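Proposed method

Embed the labels with a pretrained text encoder (CLIP), producing one embedding per label for a label set of any size and in any order.
Use a dense prediction transformer image encoder to produce per-pixel embeddings of the input image.
Compute a pixel-level correlation tensor as the inner product between each pixel embedding and each label embedding, and apply a per-pixel softmax cross-entropy loss to align each pixel with its ground-truth label.
Add a spatial regularization module (DepthwiseBlock or BottleneckBlock) that upsamples and refines the predictions while preserving equivariance to label order.
Freeze the text encoder during training and update only the image encoder, so that zero-shot segmentation maps can be synthesized flexibly for any label set.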
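
The alignment step described in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: nested Python lists stand in for tensors, and the function names (`correlation`, `softmax`, `pixel_cross_entropy`) and the toy embeddings are hypothetical choices made for this example.

```python
import math

def correlation(pixel_emb, label_emb):
    """Inner product of every pixel embedding with every label embedding.

    pixel_emb: H x W x C nested lists (stand-in for the image encoder output).
    label_emb: N x C nested lists (stand-in for the frozen text encoder output).
    Returns an H x W x N correlation tensor.
    """
    return [[[sum(p * t for p, t in zip(pixel, label)) for label in label_emb]
             for pixel in row]
            for row in pixel_emb]

def softmax(scores):
    """Numerically stable softmax over one pixel's label scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_cross_entropy(corr, target):
    """Mean per-pixel softmax cross-entropy.

    target[i][j] is the index of the ground-truth label of pixel (i, j).
    """
    losses = [-math.log(softmax(scores)[target[i][j]])
              for i, row in enumerate(corr)
              for j, scores in enumerate(row)]
    return sum(losses) / len(losses)

# Toy data (assumed): a 1 x 2 image with C = 2 channels and N = 2 labels.
label_emb = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "grass", "building"
pixel_emb = [[[4.0, 0.0], [0.0, 4.0]]]    # pixel 0 points at label 0, pixel 1 at label 1
corr = correlation(pixel_emb, label_emb)

aligned_loss = pixel_cross_entropy(corr, [[0, 1]])     # targets agree with the embeddings
misaligned_loss = pixel_cross_entropy(corr, [[1, 0]])  # targets disagree
```

In this toy setup `aligned_loss` is smaller than `misaligned_loss`, which is the pressure that pulls each pixel embedding toward the text embedding of its class. During training only the encoder that produces `pixel_emb` would be updated; `label_emb` stays fixed because the text encoder is frozen.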

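Experimental results

Research questions

RQ1: Can a language-embedded label space deliver accurate zero-shot semantic segmentation of new categories without retraining?
RQ2: How does replacing or extending the label set at test time affect segmentation quality and flexibility?
RQ3: To what extent do language-driven label embeddings align the way semantically related concepts (e.g., "dog" and "pet") guide pixel labeling?
RQ4: How do different text encoders and backbones affect zero-shot segmentation performance?
RQ5: How does LSeg compare with fixed-label and few-shot segmentation baselines on standard benchmarks?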

Key findings

Model	Backbone	Method	5^0	5^1	5^2	5^3	Mean	FB-IoU
OSLSM		1-shot	33.6	55.2	40.9	33.5	40.8	61.3
co-FCN	VGG16	1-shot	36.7	50.6	44.9	32.4	41.1	60.1
AMP-2		1-shot	41.9	50.2	46.7	34.7	43.4	61.9
PANet	ResNet50	1-shot	44.0	57.5	50.8	44.0	49.1	-
PGNet		1-shot	56.0	66.9	50.6	50.4	56.0	69.9
FWB	ResNet101	1-shot	51.3	64.5	56.7	52.2	56.2	-
PPNet		1-shot	52.7	62.8	57.4	47.7	55.2	70.9
DAN		1-shot	54.7	68.6	57.8	51.6	58.2	71.9
PFENet		1-shot	60.5	69.4	54.4	55.9	60.1	72.9
RePRI		1-shot	59.6	68.6	62.2	47.2	59.4	-
HSNet		1-shot	67.3	72.3	62.0	63.1	66.2	77.6
SPNet	ResNet101	zero-shot	23.8	17.0	14.1	18.3	18.3	44.3
ZS3Net		zero-shot	39.1?	39.4	39.3	33.6	38.3	57.7
LSeg	ResNet101	zero-shot	52.8	53.8	44.4	38.5	47.4	64.1
LSeg	ViT-L/16	zero-shot	61.3	63.6	43.1	41.0	52.3	67.0
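
As a quick consistency check on the table, the column reporting the mean can be recomputed as the average of the four fold scores. The sketch below copies a few rows from the table; the helper name `fold_mean` and the rounding tolerance are choices made here for illustration. The ZS3Net row is left out because its first fold score is marked as uncertain in the table.

```python
def fold_mean(scores):
    """Average of the four per-fold scores."""
    return sum(scores) / len(scores)

# (fold scores, reported mean), copied from the table above.
rows = {
    "HSNet (1-shot)": ([67.3, 72.3, 62.0, 63.1], 66.2),
    "SPNet (zero-shot)": ([23.8, 17.0, 14.1, 18.3], 18.3),
    "LSeg ResNet101 (zero-shot)": ([52.8, 53.8, 44.4, 38.5], 47.4),
    "LSeg ViT-L/16 (zero-shot)": ([61.3, 63.6, 43.1, 41.0], 52.3),
}

# A reported mean counts as consistent if it lies within one-decimal rounding
# distance of the recomputed value (small slack added for float error).
consistent = {name: abs(fold_mean(scores) - reported) <= 0.05 + 1e-9
              for name, (scores, reported) in rows.items()}
```

Every entry of `consistent` is true for these four rows, so the reported means agree with the per-fold scores to one decimal place.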

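LSeg achieves competitive zero-shot performance in benchmark comparisons with existing zero-shot and few-shot methods.
With the larger ViT-L/16 backbone, LSeg reaches strong zero-shot results (mean 52.3, FB-IoU 67.0) that rival some 1-shot methods in the table, such as PANet (mean 49.1).
Using text embeddings instead of a fixed label set costs only a minor drop in performance.
LSeg can synthesize a zero-shot segmentation model on the fly by changing the input label set, with no retraining.
A spatial regularization module improves per-pixel predictions without compromising the flexible-label framework.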
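
To illustrate how changing the input label set yields a different segmentation without retraining, here is a minimal sketch. The 2-D vectors stand in for real text and pixel embeddings, and `TOY_TEXT_EMBEDDINGS`, `embed_label`, and `segment` are hypothetical names introduced for this example; a real system would obtain the label vectors from a text encoder such as CLIP, and the spatial regularization module is omitted.

```python
# Hypothetical stand-in for a frozen text encoder: a lookup of toy 2-D vectors.
TOY_TEXT_EMBEDDINGS = {
    "dog": [1.0, 0.1],
    "pet": [0.9, 0.2],
    "grass": [0.0, 1.0],
    "other": [0.3, 0.3],
}

def embed_label(label):
    """Return the toy embedding of a label (assumed, not a real encoder call)."""
    return TOY_TEXT_EMBEDDINGS[label]

def segment(pixel_emb, labels):
    """Assign each pixel the label whose embedding has the largest inner product."""
    label_emb = [embed_label(label) for label in labels]

    def best(pixel):
        scores = [sum(p * t for p, t in zip(pixel, vec)) for vec in label_emb]
        return labels[scores.index(max(scores))]

    return [[best(pixel) for pixel in row] for row in pixel_emb]

# Toy image-encoder output (assumed): one row with two pixels.
pixel_emb = [[[2.0, 0.2], [0.1, 3.0]]]

seg_a = segment(pixel_emb, ["dog", "grass", "other"])
seg_b = segment(pixel_emb, ["grass", "other", "dog"])  # same labels, new order
seg_c = segment(pixel_emb, ["pet", "grass", "other"])  # "dog" swapped for "pet"
```

Reordering the labels leaves the named output unchanged (`seg_a` equals `seg_b`), and swapping "dog" for "pet" relabels the same pixel because the toy vectors were chosen to lie close together. Whether real text embeddings behave this way is what the paper's experiments examine.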

This review was generated by AI and reviewed by a human editor.