QUICK REVIEW

[Paper Review] Towards Open Vocabulary Learning: A Survey

Jianzong Wu, Xiangtai Li|arXiv (Cornell University)|Jun 28, 2023

Advanced Image and Video Retrieval Techniques|Cited by 8

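One-Sentence Summary

This survey comprehensively reviews open vocabulary learning in computer vision, focusing on object detection, segmentation, video understanding, and 3D scene understanding. It frames open vocabulary learning as a more general setting than zero-shot and weakly supervised learning, one that uses vision-language pre-training to recognize novel categories without additional annotation, and it compares state-of-the-art methods on benchmarks such as COCO and ADE20K.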

ABSTRACT

In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the close-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective compared to weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks in the case of segmentation and detection, including long-tail problems, few-shot, and zero-shot settings. For the method survey, we first present the basic knowledge of detection and segmentation in close-set as the preliminary knowledge. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare the recent detection and segmentation approaches in commonly used datasets and benchmarks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We keep tracing related works at https://github.com/jianzongwu/Awesome-Open-Vocabulary.

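Motivation and Objectives

Address the limitation of closed-set learning in real-world applications, where new object categories frequently appear outside the training set.
Clarify the distinctions among open vocabulary learning, zero-shot learning, open-set recognition, and out-of-distribution detection.
Systematically survey and analyze recent progress in open vocabulary detection and segmentation across multiple benchmarks and datasets.
Assess the role of vision-language models (VLMs) and auxiliary language supervision (such as image captions) in achieving scalable, annotation-free generalization to novel categories.
Identify open challenges and future research directions in open vocabulary learning, particularly under long-tail, few-shot, and generalized zero-shot settings.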

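Proposed Approach

Categorize open vocabulary learning and compare it with related paradigms such as zero-shot learning (ZSL), open-set recognition (OSR), and out-of-distribution (OOD) detection.
Survey state-of-the-art methods for open vocabulary detection and instance segmentation built on vision-language models (VLMs) such as CLIP and ALBEF.
Analyze image captions and text embeddings as weak supervision that reduces reliance on costly bounding-box and mask annotations.
Evaluate methods under constrained and generalized settings on standard benchmarks such as COCO, LVIS, ADE20K, and ScanNet.
Compare backbone architectures (e.g., ResNeXt, Swin, ViT) and VLMs (e.g., CLIP, Stable Diffusion) in terms of performance and generalization ability.
Consolidate insights on design patterns, such as prompt learning, contrastive pre-training, and mask-free training, as applied across tasks and datasets.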
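The core idea shared by many VLM-based methods above, using class-name text embeddings as the classifier so that new categories can be named at inference time, can be sketched as follows. This is a minimal illustration under stated assumptions, not any surveyed method's implementation: the 3-dimensional embeddings are made up and stand in for the outputs of a real VLM text and image encoder, and `classify_region`, `TEXT_EMBEDDINGS`, and the temperature value are hypothetical names and settings chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Score one region embedding against every class-name text embedding.

    The classifier weights are just text embeddings, so adding a class only
    requires adding its name's embedding; no retraining is involved.
    """
    region = l2_normalize(region_embedding)
    names = list(text_embeddings)
    sims = [
        sum(r * t for r, t in zip(region, l2_normalize(text_embeddings[name])))
        for name in names
    ]
    probs = softmax([s / temperature for s in sims])
    return dict(zip(names, probs))


# Toy embeddings standing in for a VLM text encoder's output (assumed values).
TEXT_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],    # base class: annotated during training
    "dog": [0.1, 0.9, 0.0],    # base class
    "zebra": [0.0, 0.1, 0.9],  # novel class: only named at inference time
}

region = [0.05, 0.2, 0.8]  # hypothetical visual embedding of one detected region
probs = classify_region(region, TEXT_EMBEDDINGS)
prediction = max(probs, key=probs.get)
print(prediction)
```

In this toy setup the region is assigned to the novel class because its embedding lies closest to that class's text embedding in cosine similarity. In a real system the quality of this match depends on how well the region features are aligned with the VLM's text space.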

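Experimental Results

Research Questions

RQ1: In terms of assumptions and capabilities, how does open vocabulary learning differ from zero-shot learning, open-set recognition, and out-of-distribution detection?
RQ2: What are the key technical components and design patterns behind state-of-the-art open vocabulary detection and segmentation?
RQ3: Compared with traditional ZSL, how much do vision-language models and auxiliary language supervision (such as captions) improve generalization to novel categories?
RQ4: How do different backbone architectures and VLMs affect performance in open vocabulary detection, segmentation, and 3D understanding tasks?
RQ5: What are the main challenges and open problems in achieving robust, scalable, and well-generalizing open vocabulary learning in real-world applications?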

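Key Findings

CGG reaches 46.8 AP on base classes and 29.5 AP on novel classes for COCO instance segmentation without using a pre-trained VLM or extra data, outperforming methods that rely on external supervision.
The mask-free OVIS method reaches 27.4 AP on novel COCO classes without mask annotations, relying only on image captions, which indicates strong generalization to novel categories.
ODISE-cap attains the highest PQ on ADE20K panoptic segmentation, 23.4, which is 0.8 points ahead of the runner-up.
PADing reaches 41.5 PQ on seen classes in COCO panoptic segmentation, while FreeSeg attains the highest PQ on unseen classes, 29.8.
Open-VCLIP performs best on all three video classification benchmarks (UCF, HMDB, and Kinetics-400), highlighting the effectiveness of VLMs for video recognition.
RegionPLC achieves strong results on novel categories in 3D semantic segmentation (an hIoU of 65.1 on nuScenes), indicating good generalization to unseen categories in 3D scenes.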
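The hIoU figure above is easier to interpret with its definition in hand. In open vocabulary 3D segmentation work, hIoU usually denotes the harmonic mean of the mIoU on base classes and the mIoU on novel classes; assuming that definition applies here, a minimal sketch looks like this. The function name `harmonic_iou` and the input numbers are illustrative only and are not taken from the survey.

```python
def harmonic_iou(miou_base, miou_novel):
    """Harmonic mean of base-class and novel-class mIoU (same scale for both)."""
    if miou_base < 0 or miou_novel < 0:
        raise ValueError("mIoU values must be non-negative")
    total = miou_base + miou_novel
    if total == 0:
        return 0.0  # both scores are zero, so the harmonic mean is zero
    return 2.0 * miou_base * miou_novel / total


# Illustrative inputs (assumed, not reported results).
balanced = harmonic_iou(60.0, 60.0)  # equal base and novel performance
skewed = harmonic_iou(90.0, 30.0)    # strong on base, weak on novel
print(balanced, skewed)
```

Both example pairs have the same arithmetic mean, but the harmonic mean is lower for the skewed pair. This is why the metric suits generalized settings: a model cannot score well by excelling on base classes alone.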

This review was generated by AI and reviewed by human editors.