QUICK REVIEW

[Paper Review] Vision-Language Models for Vision Tasks: A Survey

Jingyi Zhang, Jiaxing Huang|arXiv (Cornell University)|Apr 3, 2023

Multimodal Machine Learning Applications | Cited by 34

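One-Line Summary

A systematic review of visual recognition with Vision-Language Models (VLMs). It covers network architectures, pre-training objectives, datasets, transfer learning, and knowledge distillation, and presents benchmarks and future research directions.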

ABSTRACT

Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks (DNNs) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently, which learns rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enables zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of visual language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.

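Research Motivation and Objectives

Explain how visual recognition evolved from traditional paradigms to vision-language pre-training.
Summarize VLM architectures, pre-training objectives, and the downstream tasks used to evaluate zero-shot capability.
Review the large-scale image-text datasets and evaluation benchmarks used for VLMs.
Categorize methods for VLM pre-training, transfer learning, and knowledge distillation.
Highlight open challenges and future directions for VLM research in visual recognition.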

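Approach

Describe how image features are extracted with CNN-based and Transformer-based image encoders.
Describe how text features are extracted with standard Transformer-based language encoders.
Organize pre-training objectives into contrastive, generative, and alignment categories, each with its formal loss function (e.g., InfoNCE, L_IT, L_RW, L_MIM, L_MLM, L_MCM).
Explain how VLMs are evaluated through zero-shot prediction and linear probing on downstream tasks (classification, detection, segmentation, retrieval, action recognition).
Catalog pre-training datasets (e.g., the image-text corpora used for CLIP and ALIGN, and LAION) and evaluation datasets (e.g., ImageNet, COCO, PASCAL VOC).
Survey transfer learning and knowledge distillation approaches that adapt VLMs to downstream vision tasks.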
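
To make the contrastive objective concrete, here is a minimal NumPy sketch of a symmetric image-text InfoNCE loss. It is an illustration under assumptions, not the survey's reference implementation: the function names, the temperature of 0.07, and the choice to average the two directions (some formulations sum them) are all ours, and the embeddings are assumed to come from already-computed image and text encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each array is a matched pair.

    image_emb, text_emb: arrays of shape (batch, dim).
    temperature: assumed hyperparameter; real models often learn it.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])             # positives sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check with random stand-in embeddings (no real encoder involved).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_aligned = image_text_contrastive_loss(emb, emb)
loss_shuffled = image_text_contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(loss_aligned < loss_shuffled)
```

Correctly paired embeddings place the largest similarities on the diagonal, so the loss is lower than when the text rows are shifted out of alignment.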

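Results

Research Questions

RQ1: Can VLMs learn vision-language correlations from large-scale image-text data and achieve zero-shot prediction across visual recognition tasks?
RQ2: Which network architectures and pre-training objectives are most effective for learning cross-modal representations in VLMs?
RQ3: Which datasets are used for pre-training and evaluation, and how do they affect performance in zero-shot and linear-probe settings?
RQ4: Which transfer learning and knowledge distillation methods are most effective when applying VLMs to downstream tasks such as detection and segmentation?
RQ5: What are the main challenges and future directions for Vision-Language Model research in visual recognition?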

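Key Findings

Vision-Language Models enable zero-shot prediction on multiple visual recognition tasks by learning image-text correlations from web-scale data.
VLM pre-training pairs an image encoder with a text encoder and learns cross-modal representations using losses that span contrastive, generative, and alignment objectives.
Large-scale image-text datasets (e.g., the corpora used for CLIP and ALIGN, and LAION), together with auxiliary data, support VLM training and evaluation across diverse tasks.
Transfer learning and knowledge distillation are important directions for applying VLMs to downstream tasks beyond zero-shot use.
The survey benchmarks the reviewed methods across datasets and discusses challenges and future research directions for VLM-based visual recognition.
Prominent VLMs such as CLIP show strong zero-shot performance on 36 visual recognition tasks.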
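
The zero-shot prediction described above can be sketched as follows: each class name is turned into a text prompt, both modalities are embedded, and the class whose text embedding is most similar to the image embedding is chosen. This is a hedged toy example. The random vectors stand in for encoder outputs, and `zero_shot_predict`, the prompt template, and the temperature value are illustrative assumptions rather than any specific model's API.

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_emb, temperature=0.01):
    """Pick, for each image, the class whose text embedding is most similar.

    image_emb: (num_images, dim); class_text_emb: (num_classes, dim).
    Returns predicted class indices and softmax probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # scaled cosine similarities
    logits -= logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

class_names = ["cat", "dog", "car"]
# Hypothetical prompt template; a real pipeline would pass these to a text encoder.
prompts = [f"a photo of a {name}" for name in class_names]

rng = np.random.default_rng(0)
# Stand-in for text_encoder(prompts): one random vector per class.
class_text_emb = rng.normal(size=(len(class_names), 32))
# Stand-in for image_encoder(images): noisy copies of the "car" and "cat" vectors.
image_emb = class_text_emb[[2, 0]] + 0.1 * rng.normal(size=(2, 32))

pred, probs = zero_shot_predict(image_emb, class_text_emb)
print([class_names[i] for i in pred])
```

No classifier weights are trained here. Adding a new class only requires embedding one more prompt, which is what lets a single VLM serve many recognition tasks.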

This review was generated by AI and checked by a human editor.