QUICK REVIEW

[Paper Review] Text-to-Image Diffusion Models are Zero-Shot Classifiers

Kevin Clark, Priyank Jaini|arXiv (Cornell University)|Mar 27, 2023

Multimodal Machine Learning Applications | Cited by 16

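In Brief

This paper shows that text-to-image diffusion models (e.g., Imagen and Stable Diffusion) can be used as zero-shot image classifiers. By treating how well the model denoises an image given a label's text prompt as a proxy for that label's likelihood, the models reach accuracy competitive with CLIP, are strongly robust to misleading texture cues, and can perform attribute binding. The paper also introduces efficiency techniques that make the approach practical.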

ABSTRACT

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.

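Motivation and Objectives

Investigate whether text-to-image diffusion models learn transferable representations suitable for zero-shot classification.
Quantitatively compare diffusion models with CLIP across a diverse set of zero-shot image classification tasks.
Examine the robustness of diffusion models to texture-shape conflicts and their ability to perform attribute binding.
Develop efficiency improvements that make zero-shot classification with diffusion models practical.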

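Proposed Method

Convert each class label into a text prompt and score the image with the diffusion model, using a reweighted variational lower bound as a proxy for the log-likelihood.
Define the zero-shot classifier's decision as choosing the class that minimizes the diffusion loss L_Diffusion over stochastic denoising steps.
Estimate the expected diffusion loss by Monte Carlo sampling over timesteps and forward-noise samples.
Share noise across classes, so that the same noised image is scored under every candidate class, which reduces variance and improves efficiency.
Use paired t-tests to eliminate implausible classes online, forming a sequential elimination procedure that allocates more samples to the plausible classes.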
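
The scoring and pruning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser`, `PROTOTYPES`, `PRIOR_STD`, and the fixed `t_threshold` are hypothetical stand-ins. A real system would call a text-conditioned diffusion network where the toy denoiser is used, would apply a timestep weighting rather than the uniform weighting assumed here, and would likely convert the t statistic into a p-value.

```python
import math
import random
import statistics

# Hypothetical toy setup: each class "prompt" maps to a prototype vector, and the
# stand-in denoiser is the optimal noise predictor for data ~ N(prototype, PRIOR_STD^2 I).
PROTOTYPES = {
    "cat": [1.0, 0.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0, 0.0],
    "car": [0.0, 0.0, 1.0, 0.0],
}
PRIOR_STD = 0.3


def toy_denoiser(x_t, alpha_bar, label):
    """Predict the noise in x_t given a class label (stand-in for the real network)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    proto = PROTOTYPES[label]
    gain = a * PRIOR_STD**2 / (a * a * PRIOR_STD**2 + b * b)
    # Posterior mean of the clean image under the class-conditional Gaussian prior.
    x0_hat = [p + gain * (xt - a * p) for xt, p in zip(x_t, proto)]
    return [(xt - a * x0) / b for xt, x0 in zip(x_t, x0_hat)]


def denoising_loss(image, label, alpha_bar, eps):
    """Squared error between the true noise and the class-conditioned prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(image, eps)]  # forward noising
    eps_hat = toy_denoiser(x_t, alpha_bar, label)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


def paired_t(diffs):
    """t statistic of a one-sample test on paired differences."""
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        return math.inf if mean > 0 else 0.0
    return mean / (sd / math.sqrt(len(diffs)))


def classify(image, labels, rounds=8, batch=8, t_threshold=3.0, seed=0):
    """Pick the label with the lowest Monte Carlo estimate of the denoising loss."""
    rng = random.Random(seed)
    losses = {c: [] for c in labels}  # per-class losses on the shared samples
    for _ in range(rounds):
        for _ in range(batch):
            # One (noise level, noise) draw is shared by every surviving class.
            alpha_bar = rng.uniform(0.1, 0.9)
            eps = [rng.gauss(0.0, 1.0) for _ in image]
            for c in losses:
                losses[c].append(denoising_loss(image, c, alpha_bar, eps))
        best = min(losses, key=lambda c: statistics.mean(losses[c]))
        for c in list(losses):
            if c == best:
                continue
            diffs = [lc - lb for lc, lb in zip(losses[c], losses[best])]
            if paired_t(diffs) > t_threshold:  # clearly worse than the leader
                del losses[c]
        if len(losses) == 1:
            break
    return min(losses, key=lambda c: statistics.mean(losses[c]))


if __name__ == "__main__":
    print(classify([0.9, 0.1, 0.0, 0.0], list(PROTOTYPES)))
```

Because every surviving class is scored on the same noise-level and noise draws, the per-sample loss differences are paired. That pairing is what allows the t-test to discard unlikely classes after only a few samples, so the remaining budget goes to the classes that are still plausible.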

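Experimental Results

Research Questions

RQ1: Do text-to-image diffusion models work as effective zero-shot classifiers across diverse datasets?
RQ2: Across tasks, how do diffusion-based zero-shot classifiers compare with CLIP in accuracy and robustness?
RQ3: Are diffusion models robust to the texture and style cues that mislead conventional discriminative models?
RQ4: Do diffusion models show attribute binding and compositional reasoning abilities beyond those of CLIP?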

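Key Findings

Diffusion models achieve zero-shot classification accuracy competitive with CLIP across a wide range of datasets.
Imagen and Stable Diffusion are strongly robust to texture cues and achieve state-of-the-art performance on the Cue-Conflict dataset.
Diffusion models can, in some cases, perform attribute binding on synthetic data, whereas CLIP cannot.
The proposed efficiency techniques (shared noise and pruning) substantially reduce computation and speed up zero-shot evaluation, although the approach remains slower than standard discriminative models.
The results support the view that generative pre-training produces strong vision-language representations suited to discriminative tasks.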

This review was generated by AI and checked by a human editor.