QUICK REVIEW

[Paper Review] The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Wei-Yun Wang, Min Shi | arXiv (Cornell University) | Aug 3, 2023

Multimodal Machine Learning Applications | Cited by 13

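One-Sentence Summary

Introduces AS-1B, an open-world dataset of about one billion annotated regions, and the All-Seeing Model (ASM), a location-aware vision-language foundation model for panoptic recognition and understanding with strong zero-shot capabilities.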

ABSTRACT

We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.

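Motivation and Goals

Advance panoptic visual recognition and understanding of the open world by building a large-scale region-level dataset with rich semantics and descriptions.
Create a unified vision-language model (ASM) that can reason over region-level information and supports both discriminative and generative tasks.
Demonstrate improved zero-shot and fine-tuned performance on standard vision and vision-language benchmarks.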

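Proposed Method

Build AS-1B, a dataset with over 1B region annotations, 3.5M concepts, 132.2B tokens, and 3.3B VQA pairs, created through a data-human-model loop.
Propose a location-aware image tokenizer that uses bounding boxes, masks, and point sets to extract region-based features.
Adopt an LLM-based decoder with shared weights to handle discriminative and generative vision-language tasks in a unified way.
Introduce a training objective that combines a generative loss with a CLIP-like region-text alignment (contrastive) loss for discriminative tasks.
Implement a region-text alignment refinement pipeline that uses CLIP and CLIPSeg, and later ASM itself, to produce accurate region annotations.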
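
The training objective described above, a generative loss combined with a CLIP-like region-text contrastive loss, can be sketched as follows. This is a minimal pure-Python illustration and not the authors' implementation: the function names, the temperature of 0.07, and the `align_weight` factor are assumptions made for this example, and this review does not state how the two terms are actually weighted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def generation_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over one caption or answer."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric CLIP-style loss; region i is assumed to match text i.

    The temperature value is an assumption for this sketch.
    """
    r = [normalize(v) for v in region_embs]
    t = [normalize(v) for v in text_embs]
    n = len(r)
    # Scaled cosine similarity between every region and every text.
    sim = [[sum(a * b for a, b in zip(r[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    region_to_text = sum(-log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_region = sum(
        -log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (region_to_text + text_to_region)

def total_loss(token_logits, target_ids, region_embs, text_embs, align_weight=1.0):
    """Generative loss plus weighted alignment loss (align_weight is hypothetical)."""
    gen = generation_loss(token_logits, target_ids)
    align = contrastive_loss(region_embs, text_embs, temperature=0.07)
    return gen + align_weight * align
```

With correctly paired region and text embeddings the contrastive term is close to zero, and it grows when the pairs are shuffled, which is the signal that pulls matching regions and descriptions together.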

Figure (a): Large Language Models (LLMs) possess extensive world knowledge and demonstrate impressive reasoning capabilities, but lack the ability to receive and comprehend visual information.

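Experimental Results

Research Questions

RQ1: Can a region-level, open-world panoptic dataset enable robust, region-aware understanding and generation?
RQ2: Does a unified, location-aware vision-language model generalize across diverse vision-language tasks in both zero-shot and fine-tuned settings?
RQ3: How does the iterative data-human-model loop affect data quality and model performance?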

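Key Findings

AS-1B contains 1.2B regions, 3.5M concepts, 132.2B tokens, and 3.3B VQA pairs, giving broad coverage of open-world semantics.
ASM improves on prior models on standard benchmarks in both zero-shot and fine-tuned settings, including region-level recognition.
On zero-shot region recognition, ASM outperforms CLIP by 10.4 mAP on COCO and 14.3 mAP on LVIS.
The data engine iteratively improves data quality by feeding the improved model back into data generation and annotation.
The framework supports a range of tasks, from region-text retrieval to captioning and VQA, within a single architecture.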
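
Zero-shot region recognition of the kind compared against CLIP above is commonly framed as nearest-text retrieval: embed the region, embed each candidate class name, and pick the most similar one. The sketch below illustrates that general idea only. The `classify_region` helper and the three-dimensional toy embeddings are hypothetical, and real embeddings would come from a trained model such as ASM or CLIP rather than being written by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_region(region_emb, label_embs):
    """Return the label whose text embedding is closest to the region embedding.

    label_embs maps a class name to its (hypothetical) text embedding.
    """
    scores = {label: cosine(region_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy, hand-written embeddings used only to make the example runnable.
label_embs = {
    "dog": [1.0, 0.1, 0.0],
    "bicycle": [0.0, 1.0, 0.2],
    "traffic light": [0.1, 0.0, 1.0],
}
region_emb = [0.9, 0.2, 0.1]

best, scores = classify_region(region_emb, label_embs)
print(best)  # prints "dog"
```

Because no class-specific training is involved, swapping in a different label vocabulary only requires new text embeddings, which is what makes the setting zero-shot.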

Figure (b): Visual Large Language Models (VLLMs) can process both text and images, but they can only capture the holistic visual information of the whole image and understand it based on LLMs.


This review was generated by AI and reviewed by human editors.