QUICK REVIEW

[Paper Review] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin|arXiv (Cornell University)|Apr 28, 2021

Multimodal Machine Learning Applications|References: 39|Citations: 280

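One-line summary

ViLD distills knowledge from a pretrained open-vocabulary image classifier into a two-stage detector, enabling open-vocabulary object detection. The resulting detector is accurate on novel categories and transfers across datasets.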

ABSTRACT

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.

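Motivation and objectives

Address the lack of abundant detection annotations for novel categories when detecting objects described by arbitrary text inputs.
Use a pretrained open-vocabulary image classifier as a teacher to supervise a two-stage detector.
Develop the two components of ViLD (ViLD-text and ViLD-image), which align region embeddings with the teacher's text embeddings and image embeddings respectively.
Demonstrate open-vocabulary detection performance on LVIS and transferability to other detection datasets.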

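Proposed method

ViLD-text: replace the standard classifier of the two-stage detector with text embeddings from the pretrained open-vocabulary model.
ViLD-image: distill image embeddings from the pretrained image encoder into the region embeddings of Mask R-CNN using an L1 loss.
Train with a joint objective that combines the two: L_ViLD = L_ViLD-text + w * L_ViLD-image.
At inference, embed base and novel category names with the same text encoder, which enables open-vocabulary detection over C_B ∪ C_N.
Optionally apply model ensembling (ViLD-ensemble or ViLD-text+CLIP) to improve performance on base and novel categories.
Run the distillation with different teacher models, such as CLIP and ALIGN, and show transfer without finetuning.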
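
The objective and inference rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the L2 normalization before the L1 loss, the separate background embedding at index 0, and the placeholder values temperature=0.01 and w=0.5 are all choices made for this example and are not taken from this review. Real embeddings would come from the detector and the teacher; here they are plain arrays.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def category_logits(region_emb, text_emb, background_emb, temperature):
    """Cosine similarity of each region to [background, categories] / temperature."""
    classifiers = np.vstack([background_emb[None, :], text_emb])  # (1 + C, D)
    return l2_normalize(region_emb) @ l2_normalize(classifiers).T / temperature


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def vild_text_loss(region_emb, text_emb, background_emb, labels, temperature=0.01):
    """Cross-entropy against text-embedding classifiers (label 0 = background)."""
    logits = category_logits(region_emb, text_emb, background_emb, temperature)
    log_probs = log_softmax(logits)
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def vild_image_loss(region_emb, teacher_image_emb):
    """Mean L1 distance between student region and teacher image embeddings.

    Normalizing both sides first is an assumption of this sketch.
    """
    diff = l2_normalize(region_emb) - l2_normalize(teacher_image_emb)
    return float(np.abs(diff).sum(axis=1).mean())


def vild_loss(region_emb, text_emb, background_emb, labels, teacher_image_emb,
              w=0.5, temperature=0.01):
    """L_ViLD = L_ViLD-text + w * L_ViLD-image (w is a placeholder value)."""
    text_term = vild_text_loss(region_emb, text_emb, background_emb, labels, temperature)
    image_term = vild_image_loss(region_emb, teacher_image_emb)
    return text_term + w * image_term


def predict(region_emb, base_text_emb, novel_text_emb, background_emb, temperature=0.01):
    """Inference over C_B ∪ C_N: stack base and novel text embeddings, take argmax.

    Returned index 0 is background, then base categories, then novel categories.
    """
    all_text = np.vstack([base_text_emb, novel_text_emb])
    logits = category_logits(region_emb, all_text, background_emb, temperature)
    return logits.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base_text = rng.normal(size=(3, 8))    # hypothetical base-category embeddings
    novel_text = rng.normal(size=(2, 8))   # hypothetical novel-category embeddings
    background = rng.normal(size=8)
    regions = rng.normal(size=(4, 8))      # hypothetical student region embeddings
    teacher = rng.normal(size=(4, 8))      # hypothetical teacher image embeddings
    labels = np.array([0, 1, 2, 3])
    print(vild_loss(regions, base_text, background, labels, teacher))
    print(predict(regions, base_text, novel_text, background))
```

Only base-category text embeddings enter the training loss in this sketch, while `predict` stacks novel-category embeddings next to them. That is the point of replacing a fixed classifier layer with text embeddings: the set of categories can change at inference without retraining.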

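Experimental results

Research questions

RQ1: Can knowledge distillation from an open-vocabulary image classifier yield effective open-vocabulary object detection?
RQ2: How do the text-based and image-based distillation signals complement each other for detecting novel categories?
RQ3: How does a stronger teacher model (e.g., ALIGN) affect open-vocabulary detection performance?
RQ4: How well does a ViLD-trained detector transfer to other detection datasets without finetuning?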

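Key findings

With a ResNet-50 backbone, ViLD reaches 16.1 mask AP_r on the novel (rare) categories of LVIS, 3.8 above its supervised counterpart.
With the stronger teacher model ALIGN, ViLD reaches 26.3 AP_r on the LVIS novel categories.
ViLD transfers directly, without finetuning, to PASCAL VOC (72.2 AP50), COCO (36.6 AP), and Objects365 (11.8 AP).
On COCO, ViLD outperforms the previous state-of-the-art open-vocabulary detector by 4.8 on novel AP and 11.4 on overall AP.
ViLD-text with CLIP text embeddings markedly improves novel-category AP_r over GloVe embeddings (10.1 vs. 3.0).
Combining text-based and image-based distillation (ViLD-text + ViLD-image) improves performance on novel categories.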

This review was generated by AI and checked by a human editor.