QUICK REVIEW

[Paper Review] CountGD: Multi-Modal Open-World Counting

Niki Amini-Naieni, Tengda Han|arXiv (Cornell University)|Jul 5, 2024

Data Management and Algorithms|Citations: 5

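One-Line Summary

CountGD is a single-stage open-world object counting model that accepts text, visual exemplars, or both to specify the target object, and it achieves state-of-the-art accuracy when both modalities are used.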

ABSTRACT

The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and also extend its capabilities by introducing modules to enable specifying the target object to count by visual exemplars. In turn, these new capabilities - being able to specify the target object by multi-modalities (text and exemplars) - lead to an improvement in counting accuracy. We make three contributions: First, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both; Second, we show that the performance of the model significantly improves the state of the art on multiple counting benchmarks - when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, we outperform all previous models; Third, we carry out a preliminary study into different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app to test the model are available at https://www.robots.ox.ac.uk/~vgg/research/countgd/.

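Motivation and Objectives

Enable open-world counting in which the target can be specified flexibly at inference time, without retraining.
Extend a vision-language foundation model so that it supports counting rather than detection.
Support multi-modal prompting with text, visual exemplars, or both to improve counting accuracy.
Investigate how text and exemplar prompts interact, including cases where one prompt restricts the count to a subset of objects.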

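Proposed Method

Build CountGD on top of GroundingDINO, adding modules that embed visual exemplars as text-like tokens and that enable single-stage counting.
Encode the image with a Swin Transformer to produce multi-scale feature maps.
Encode the text with a BERT-based text encoder and fuse it with the visual exemplar tokens through a feature enhancer that uses self-attention and cross-attention.
Represent each visual exemplar as a token pooled with RoIAlign, and fuse these tokens with the text tokens inside the feature enhancer.
Form the cross-modality queries by selecting the set of image patch tokens that align best with the fused visual exemplar and text tokens.
Use the cross-modality decoder to compute similarity scores between the queries and the fused features, then threshold these scores to output the final count.
Freeze the image and text encoders and train only the remaining modules (visual exemplar token extraction, feature enhancer, cross-modality decoder) with a loss that combines a localization term and a classification term.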
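
As a rough illustration of the query-selection and thresholding steps above, here is a minimal Python sketch. It is not the authors' implementation: the function names (`select_queries`, `count_objects`), the use of cosine similarity, the toy 2-D vectors, and the 0.5 threshold are all assumptions made for illustration, and the sketch leaves out the decoder that refines the queries in the real model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_queries(image_tokens, prompt_tokens, num_queries):
    """Return indices of the image tokens that best match any prompt token.

    Each image token is scored by its highest similarity to a prompt token,
    and the `num_queries` top-scoring indices are kept (ties go to the lower index).
    """
    scored = [
        (max(cosine(tok, p) for p in prompt_tokens), idx)
        for idx, tok in enumerate(image_tokens)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:num_queries]]


def count_objects(query_scores, threshold):
    """Count the queries whose score is strictly above the threshold."""
    return sum(1 for s in query_scores if s > threshold)


# Toy data: four image patch tokens and one fused text + exemplar token (assumed values).
image_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
prompt_tokens = [[1.0, 0.0]]

selected = select_queries(image_tokens, prompt_tokens, num_queries=3)
scores = [max(cosine(image_tokens[i], p) for p in prompt_tokens) for i in selected]
print(selected, count_objects(scores, threshold=0.5))
```

In this toy setup the third token is orthogonal to the prompt, so it is never selected and the three remaining tokens are counted. The threshold acts as the single knob that turns per-query scores into an integer count.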

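Experimental Results

Research Questions

RQ1: Can open-world counting be performed effectively with text prompts, visual exemplars, or a combination of the two?
RQ2: Does integrating visual exemplars with text in a single-stage architecture improve counting accuracy over single-modality approaches?
RQ3: How do text and exemplar prompts interact when refining or constraining the count, for example by filtering on color or location?
RQ4: How well does CountGD transfer across datasets without fine-tuning (e.g., zero-shot generalization to CARPK and CountBench)?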

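Key Findings

Using both visual exemplars and text, CountGD sets state-of-the-art counting accuracy among open-world methods on the FSC-147 benchmark.
Using text only, CountGD performs competitively with recent text-only open-world counting methods.
In zero-shot evaluation on CARPK, CountGD achieved the best performance without any fine-tuning on CARPK.
Ablations show that multi-modal training and inference outperform their uni-modal counterparts, and that on FSC-147 visual exemplars often provide a stronger signal than text alone.
Qualitative analysis shows interactions in which text refines or reinforces the exemplar-specified target (e.g., color or location constraints).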
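
The findings above speak of "counting accuracy" without naming a metric. Counting benchmarks such as FSC-147 are conventionally scored with mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts. The sketch below assumes those two metrics; the function names and the example counts are made up for illustration and are not results from the paper.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large per-image misses more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical per-image counts (not taken from any benchmark).
predicted_counts = [10, 20, 30]
true_counts = [12, 20, 27]
print(mae(predicted_counts, true_counts), rmse(predicted_counts, true_counts))
```

Lower is better for both metrics. Because RMSE squares each error, a few badly miscounted dense images can dominate it even when MAE looks small.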

This review was generated by AI and checked by a human editor.