QUICK REVIEW

[Paper Review] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang | arXiv (Cornell University) | Oct 2, 2023

Multimodal Machine Learning Applications | Citations: 12

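One-line summary

CLIPSelf fine-tunes CLIP Vision Transformers through self-distillation of dense features, aligning region representations with image-level representations, and improves open-vocabulary object detection and segmentation without requiring any region-text pairs.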

ABSTRACT

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.

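Motivation and objectives

Analyze region-language alignment in CLIP ViTs, which underpins open-vocabulary dense prediction tasks such as detection and segmentation.
Investigate why the dense features of CLIP ViTs perform poorly at region recognition, and propose a region-to-image self-distillation solution.
Develop CLIPSelf, which aligns region representations taken from the dense feature map with the representations of the corresponding image crops, without using region-text pairs.
Demonstrate state-of-the-art results on OV-COCO, OV-LVIS, and open-vocabulary segmentation benchmarks.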

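Proposed method

Extract a dense feature map from the last ViT block by discarding that block's self-attention, yielding a spatial feature map from which region features can be taken.
Split the image into a grid of m x n patches, with m and n chosen at random, and use these patches as the regions for self-distillation.
Keep a frozen teacher CLIP ViT and fine-tune a student CLIP ViT, training the student to maximize the cosine similarity between its region embedding and the teacher's image-level representation of the corresponding patch.
Compute each region embedding by pooling from the dense map (RoIAlign), and align it with the teacher's image representation using a cosine-similarity loss.
Update all self-attention layers of the ViT to maximize region-to-image alignment; a larger input size for the student improves region recognition.
Apply the fine-tuned CLIP ViTs to open-vocabulary detection (a two-stage detector on a frozen backbone), semantic segmentation (CAT-Seg initialization), and panoptic segmentation (at the ODISE inference stage).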
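The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the upper bound `max_cells=6` on the grid size is an illustrative default, simple average pooling stands in for RoIAlign, and the loss is written as the mean of 1 minus the cosine similarity, which is one natural way to "maximize cosine similarity". In a real setup, the student region embeddings would come from the dense feature map of the fine-tuned ViT and the teacher embeddings from the frozen ViT applied to each image crop.

```python
import math
import random


def sample_grid_boxes(width, height, max_cells=6, rng=random):
    """Split a width x height image into an m x n grid with random m and n.

    max_cells is an assumed upper bound on m and n (illustrative default).
    Returns boxes as (x0, y0, x1, y1) tuples in pixel coordinates.
    """
    m = rng.randint(1, max_cells)  # number of grid rows
    n = rng.randint(1, max_cells)  # number of grid columns
    boxes = []
    for i in range(m):
        for j in range(n):
            x0, x1 = j * width / n, (j + 1) * width / n
            y0, y1 = i * height / m, (i + 1) * height / m
            boxes.append((x0, y0, x1, y1))
    return boxes


def pool_region(feature_map, box, stride):
    """Average the feature vectors whose cell centre falls inside the box.

    A simplified stand-in for RoIAlign (no bilinear interpolation).
    feature_map is a nested list indexed as [row][col][channel].
    """
    x0, y0, x1, y1 = box
    dim = len(feature_map[0][0])
    total, count = [0.0] * dim, 0
    for r, row in enumerate(feature_map):
        for c, vec in enumerate(row):
            cx, cy = (c + 0.5) * stride, (r + 0.5) * stride
            if x0 <= cx < x1 and y0 <= cy < y1:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError("box covers no feature cell")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def self_distillation_loss(student_regions, teacher_crops):
    """Mean of (1 - cosine similarity) over matched region/crop pairs."""
    sims = [cosine_similarity(s, t) for s, t in zip(student_regions, teacher_crops)]
    return sum(1.0 - s for s in sims) / len(sims)
```

Minimizing this loss with respect to the student's parameters only, while the teacher stays frozen, corresponds to the region-to-image alignment described in the list above.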

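Experimental results

Research questions

RQ1: How well do ViT-based CLIP models align local region representations with language in open-vocabulary dense tasks?
RQ2: Can self-distillation from image-level CLIP representations improve dense region representations without region-text pairs?
RQ3: Does aligning dense region embeddings with image crops improve open-vocabulary object detection and segmentation performance?
RQ4: Is CLIPSelf robust across ViT sizes and training data (e.g., CC3M), and is it compatible with window-attention variants?
RQ5: What are the relative benefits of using region proposals versus patch-based regions for open-vocabulary detection and segmentation?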

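Key findings

The dense representations of CLIP ViTs perform worse at region-level recognition than the image-level representations of the corresponding crops, which motivates region-to-image alignment.
CLIPSelf, a self-distillation method that uses random m x n image patches, substantially improves region and panoptic-mask classification accuracy over baseline CLIP ViT models.
Guided by the teacher's image representations, the student ViT learns to produce region embeddings that align with the corresponding image crops, improving open-vocabulary detection and segmentation performance.
Replacing the backbone with a CLIPSelf-enhanced ViT achieves state-of-the-art open-vocabulary object detection results on OV-COCO and OV-LVIS and improves results on open-vocabulary semantic and panoptic segmentation benchmarks.
CLIPSelf outperforms approaches based on region-text pairs (which rely on noisy region-to-text matching), remains effective with local window-attention variants, and is still effective when trained on CC3M data.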

This review was generated by AI and checked by a human editor.