QUICK REVIEW

[Paper Review] The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Wei‐Yun Wang, Min Shi | arXiv (Cornell University) | Aug 3, 2023

Multimodal Machine Learning Applications | Cited by 13

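One-line summary

The paper introduces AS-1B, a billion-scale region-level dataset, and the All-Seeing Model (ASM), a location-aware vision-language foundation model that delivers strong panoptic recognition and understanding, including zero-shot capabilities.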

ABSTRACT

We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.

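Motivation and objectives

Advance open-world panoptic visual recognition and understanding by building a large-scale, region-level dataset with rich semantics and descriptions.
Create a unified vision-language model (ASM) that reasons over region-level information and supports both discriminative and generative tasks.
Demonstrate improved zero-shot and fine-tuned performance on standard vision and vision-language benchmarks.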

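Proposed method

Builds the AS-1B dataset through a data-human-model loop; it contains over 1 billion region annotations, 3.5 million concepts, 132.2B tokens, and 3.3B VQA pairs.
Proposes a location-aware image tokenizer that extracts region-based features from bounding boxes, masks, and point sets.
Adopts an LLM-based decoder with shared weights to handle discriminative and generative vision-language tasks in a unified way.
Introduces a training objective that combines a generative loss with a region-text alignment (contrastive) loss; the alignment loss handles discriminative tasks in a CLIP-like manner.
Implements a region-text alignment refinement pipeline for accurate region tagging, using CLIP and CLIPSeg at first and ASM in later rounds.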
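The combined objective in the method list can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetric InfoNCE form of the CLIP-like loss, the temperature of 0.07, and the loss weight of 1.0 are all assumptions chosen for illustration, and plain Python lists stand in for model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def generative_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for one caption.

    token_logits: one list of vocabulary logits per position.
    target_ids:   the ground-truth token id at each position.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; region i and text i form the positive pair.

    The temperature value is an assumption, not a value from the paper.
    """
    r = [normalize(v) for v in region_embs]
    t = [normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(ri, tj)) / temperature for tj in t]
           for ri in r]
    n = len(sim)
    # Region-to-text direction: each row should peak on its diagonal entry.
    r2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-region direction: same check on the columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2r = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (r2t + t2r)

def total_loss(gen, con, weight=1.0):
    """Weighted sum of the two terms; the weight is a hypothetical knob."""
    return gen + weight * con
```

With matched region and text embeddings the contrastive term is close to zero, and it grows when the pairs are shuffled. This is the property that lets one set of shared weights serve retrieval-style discriminative tasks alongside caption generation.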

(a) Large Language Models (LLMs) possess extensive world knowledge and demonstrate impressive reasoning capabilities, but lack the ability to receive and comprehend visual information.

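Experimental results

Research questions

RQ1: Does a region-level, open-world panoptic dataset enable robust, region-aware understanding and generation?
RQ2: Can a unified, location-aware vision-language model generalize to multiple vision-language tasks in zero-shot and fine-tuned settings?
RQ3: How does the iterative data-human-model loop affect data quality and model performance?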

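Key findings

AS-1B contains 1.2B regions, 3.5M concepts, 132.2B tokens, and 3.3B VQA pairs, giving broad coverage of open-world semantics.
ASM improves on prior models in both zero-shot and fine-tuned settings on standard benchmarks, including region-level recognition.
On zero-shot region recognition, ASM outperforms CLIP by 10.4 mAP on COCO and 14.3 mAP on LVIS.
The data engine raises data quality iteratively by feeding the improved model back into data generation and labeling.
The framework supports a range of tasks within a single architecture, from region-text retrieval to captioning and VQA.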
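To make the zero-shot region recognition setting concrete, here is a minimal sketch of CLIP-style scoring: a region embedding is compared against text embeddings of the candidate class names, and the most similar class wins. The function names and the three-dimensional toy vectors are hypothetical, and this shows only the general technique, not the evaluation protocol used in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_region(region_emb, class_text_embs):
    """Return the best-matching label and the score for every label.

    region_emb:      embedding of one image region (toy values here).
    class_text_embs: dict mapping a class name to its text embedding.
    """
    scores = {label: cosine(region_emb, emb)
              for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, invented for illustration only.
class_text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
region_emb = [0.8, 0.3, 0.1]

label, scores = classify_region(region_emb, class_text_embs)
print(label)
```

No classifier is trained on the target classes in this setup, so adding a new class only requires embedding its name. That is why this kind of scoring can extend to large open vocabularies such as LVIS.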

(b) Visual Large Language Models (VLLMs) can process both text and images, but they can only capture the holistic visual information of the whole image and understand it based on LLMs.

This review was generated by AI and checked by a human editor.