QUICK REVIEW

[Paper Review] Learning Object-Language Alignments for Open-Vocabulary Object Detection

Chuang Lin, Peize Sun|arXiv (Cornell University)|Nov 27, 2022

Multimodal Machine Learning Applications | Cited by 36

One-line summary

VLDet learns region-word alignments directly from image-text pairs, enabling open-vocabulary object detection without grounding annotations. It reports state-of-the-art novel-class detection on the open-vocabulary COCO and LVIS benchmarks.

ABSTRACT

Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.

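Motivation and objectives

Motivate open-vocabulary detection that can recognize novel categories without bounding-box annotations for them.
Propose a language-supervised learning paradigm that avoids expensive grounding data.
Formulate region-word alignment as a set-matching problem that can be solved with bipartite matching.
Incorporate image-text supervision into a two-stage detector so that it generalizes to unseen classes.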

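Proposed method

Represent the image regions as a set of region features and the nouns in the caption as a set of word embeddings.
Compute region-word alignment scores as the dot product between region features and word embeddings.
Solve the region-word assignment with the Hungarian algorithm, yielding a one-to-one region-word matching for each image-caption pair.
Train with a region-word cross-entropy loss conditioned on the bipartite matching result.
For additional supervision, include an image-text alignment loss that treats the whole image and the whole caption as a special region-word pair.
Use CLIP as a frozen text encoder to embed the caption words; these text embeddings replace the classifier head of the Faster R-CNN baseline.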
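
The matching and loss steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D features, and the softmax form of the cross-entropy are all assumptions made for illustration, and the one-to-one assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (which returns the same optimum but scales to real proposal counts).

```python
import itertools
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_regions_to_words(region_feats, word_embs):
    """Return (regions, scores), where regions[j] is the region matched to word j.

    Brute-force stand-in for the Hungarian algorithm: it tries every
    one-to-one assignment and keeps the one with the highest total score.
    Only practical for tiny inputs.
    """
    n_regions, n_words = len(region_feats), len(word_embs)
    if n_words > n_regions:
        raise ValueError("need at least as many regions as words")
    # Alignment score = dot product of region feature and word embedding.
    scores = [[dot(r, w) for w in word_embs] for r in region_feats]
    best_total, best_regions = -math.inf, None
    for regions in itertools.permutations(range(n_regions), n_words):
        total = sum(scores[r][j] for j, r in enumerate(regions))
        if total > best_total:
            best_total, best_regions = total, regions
    return list(best_regions), scores


def region_word_loss(scores, regions):
    """Mean softmax cross-entropy of each matched region over the caption words.

    Assumption: a softmax over the words of one caption; the exact loss
    form used in the paper is not specified in this review.
    """
    losses = []
    for j, r in enumerate(regions):
        row = scores[r]
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[j])
    return sum(losses) / len(losses)


# Toy example: three region features and two caption nouns (hypothetical values).
region_feats = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
word_embs = [[1.0, 0.0], [0.0, 1.0]]

regions, scores = match_regions_to_words(region_feats, word_embs)
loss = region_word_loss(scores, regions)
print(regions, round(loss, 4))
```

In this toy case the first word is matched to region 0 and the second to region 1, while region 2 stays unmatched and contributes nothing to the loss. That is the practical effect of a one-to-one assignment: only the best-aligned region per word receives a training signal.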

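Experimental results

Research questions

RQ1 Can open-vocabulary detection be learned directly from image-text pairs without grounding annotations?
RQ2 Does formulating region-word alignment as a set-matching problem improve novel-category detection performance?
RQ3 How do the size of the object vocabulary and the matching strategy affect open-vocabulary generalization?
RQ4 What happens when an image-text alignment loss is added on top of region-word alignment?
RQ5 How well does the proposed method transfer to other datasets and domains without retraining?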

Key findings

Method	Novel AP	Base AP	Overall AP
Base-only	1.3	52.8	39.3
OVR-CNN (Zareian et al., 2021)	22.8	46.0	39.9
Detic (Zhou et al., 2022)	27.8	47.1	42.0
RegionCLIP (Zhong et al., 2022)	26.8	54.8	47.5
ViLD (Gu et al., 2021)	27.6	59.5	51.3
PB-OVD (Gao et al., 2021)	30.8	46.1	42.1
Ours (VLDet)	32.0	50.6	45.8

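VLDet reports state-of-the-art novel-class results: 32.0 mAP on open-vocabulary COCO and 21.7 mask mAP on open-vocabulary LVIS.
On COCO, it outperforms PB-OVD and related methods on novel-class detection while training only on COCO Caption data.
On LVIS with CC3M data, novel-class mask AP reaches 21.7 with an RN50 backbone and 26.3 with Swin-B, exceeding the other baselines.
One-to-one region-word assignment (Hungarian) outperforms one-to-many assignment (Sinkhorn) on novel-class AP (OV-COCO: 32.0 vs. 29.1; OV-LVIS: 21.7 vs. 18.5).
Using a larger open vocabulary (all nouns from the captions) improves generalization to unseen classes compared with restricting the vocabulary to predefined category names.
Jointly optimizing the region-word alignment loss and the image-text alignment loss performs better than using either one alone.
The model transfers to VOC and LVIS without retraining, indicating robustness across domains.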
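
To make the one-to-one versus one-to-many distinction in the findings above concrete, here is a minimal Sinkhorn sketch on the same kind of toy score matrix. It is an illustration under stated assumptions rather than the paper's configuration: uniform marginals over regions and words, the temperature `eps`, the iteration count, and the 0.05 threshold used to read off assignments are all hypothetical choices.

```python
import math


def sinkhorn(scores, eps=0.1, n_iters=200):
    """Entropy-regularized soft assignment between regions (rows) and words (columns).

    Assumes uniform marginals: every region carries mass 1/n and every
    word receives mass 1/m. Returns the n x m transport plan.
    """
    n, m = len(scores), len(scores[0])
    # Gibbs kernel: higher alignment score means cheaper transport.
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy region-word scores: three regions, two words (hypothetical values).
scores = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
plan = sinkhorn(scores)

# Regions that send a non-trivial share of their mass to each word.
regions_per_word = [[i for i in range(3) if plan[i][j] > 0.05] for j in range(2)]
print(regions_per_word)
```

With these toy scores each word ends up associated with two regions, because the ambiguous third region splits its mass between both words. A one-to-one matching would instead leave that region unassigned, which is one plausible reading of why the stricter assignment produces cleaner supervision in the reported ablation.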


This review was generated by AI and checked by a human editor.