QUICK REVIEW

[Paper Review] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu|arXiv (Cornell University)|Aug 20, 2024

Brain Tumor Detection and Classification|Citations: 6

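One-line summary

Transfusion trains a single transformer that handles both discrete text data and continuous image data, jointly optimizing next-token prediction for text and diffusion for images. Compared with baselines that discretize images, it achieves stronger multimodal scaling and better efficiency.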

ABSTRACT

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.

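Motivation and objectives

Motivate a unified model that can process and generate both discrete (text) and continuous (image) modalities without information loss.
Demonstrate that combining the language modeling loss and the diffusion loss in a single Transformer scales better than discretizing images.
Show that modality-specific encoding/decoding layers and patch-based image compression can improve performance and efficiency.
Provide scaling laws and ablations that identify the main factors driving multimodal performance.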

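Proposed method

Represent text as discrete tokens and images as latent patches derived from a VAE.
Train a single Transformer with two losses, the LM loss on text and the DDPM diffusion loss on image patches, combined as L_Transfusion = L_LM + λ·L_DDPM.
Use modality-specific embedding/decoding layers; for images, either a linear encoder/decoder or U-Net blocks.
Apply causal attention across the whole sequence, with bidirectional attention within each image so that its patches can attend to one another.
At inference, switch between text generation (LM mode) and image diffusion (diffusion mode) when BOI/EOI tokens are encountered.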
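
The combined objective L_Transfusion = L_LM + λ·L_DDPM can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names (`lm_loss`, `ddpm_loss`, `transfusion_loss`), the list-based data layout, and the value passed for `lam` are all assumptions made for the example.

```python
import math

def lm_loss(logits, targets):
    """Mean cross-entropy of next-token predictions over text positions.

    logits: list of per-position score lists (one score per vocabulary entry).
    targets: list of correct token ids, same length as logits.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-softmax: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)

def ddpm_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise over image patches."""
    squared = [
        (p - t) ** 2
        for pred_patch, true_patch in zip(predicted_noise, true_noise)
        for p, t in zip(pred_patch, true_patch)
    ]
    return sum(squared) / len(squared)

def transfusion_loss(logits, targets, predicted_noise, true_noise, lam):
    """Combine both objectives: L_LM + lam * L_DDPM (lam is a hypothetical weight)."""
    return lm_loss(logits, targets) + lam * ddpm_loss(predicted_noise, true_noise)
```

In this sketch the text positions and the image patches of one mixed sequence each contribute to their own term, and `lam` balances the two; the weight used in practice is a tuning choice that this review does not specify.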

Figure 1 : A high-level illustration of Transfusion. A single transformer perceives, processes, and produces data of every modality. Discrete (text) tokens are processed autoregressively and trained on the next token prediction objective. Continuous (image) vectors are processed together in parallel
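
The attention pattern described above (text handled autoregressively, image vectors handled together) can be expressed as a boolean mask. The following is a minimal sketch under assumptions: the function name `transfusion_mask` and the encoding of the sequence as a list of `None` (text token) or an image id (image patch) are hypothetical, chosen only for illustration.

```python
def transfusion_mask(modalities):
    """Build an attention mask for a mixed text/image sequence.

    modalities: one entry per position, None for a text token or an
                image id (e.g. 0, 1, ...) for a patch of that image.
    Returns mask where mask[i][j] is True if position i may attend to j.
    """
    n = len(modalities)
    # Start from a causal mask: each position sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Patches of the same image may also attend to later patches
            # of that image, making attention bidirectional inside it.
            if modalities[i] is not None and modalities[i] == modalities[j]:
                mask[i][j] = True
    return mask

# Two text tokens, one three-patch image, then one more text token.
example = transfusion_mask([None, None, 0, 0, 0, None])
```

Here the first image patch (position 2) can see the last patch of the same image (position 4), while text positions never see anything to their right.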

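Experimental results

Research questions

RQ1: Can a single Transformer model and generate both text and images without discretely quantizing the images?
RQ2: How do the LM and diffusion objectives interact in a unified multimodal model, and how does performance scale with model size?
RQ3: Which design choices, such as patch encoding, intra-image bidirectional attention, and image noising, most affect multimodal performance?
RQ4: How does Transfusion compare with a Chameleon-style discretization baseline in efficiency and quality on text and image tasks?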

Key findings

Model	C4 PPL	Wiki PPL	Llama Eval Acc	MS-COCO CIDEr	MS-COCO FID	CLIP
Transfusion (7B)	7.72	4.28	61.5	27.2	16.8	25.5
Chameleon (7B)	8.41	4.69	59.1	18.0	29.6	24.3

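With comparable data and compute, Transfusion scales better than Chameleon on both text-only and image-related tasks.
In text-to-image generation, Transfusion reaches roughly Chameleon's performance with about one third of the compute, and achieves approximately 2x lower FID when FLOPs are controlled.
In image-to-text and text-to-text tasks, Transfusion delivers strong results, matching the baseline's performance with substantially fewer FLOPs (for example, 21.8% of the FLOPs for image-to-text).
Ablations show that bidirectional attention within images is beneficial, and that U-Net down/up blocks for image encoding/decoding allow images to be compressed into fewer, larger patches with relatively little loss in performance.
Scaling to 7B parameters and 2T multi-modal tokens yields image and text generation on par with contemporary diffusion models and language models of similar scale.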

Figure 3 : We convert images to and from latent representations using a pretrained VAE, and then into patch representations with either a simple linear layer or U-Net down blocks.
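
The step from a VAE latent grid to a sequence of patch vectors can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name `patchify`, the nested-list layout, and the 32×32 grid with 8 channels used in the usage note are assumptions. The learned linear layer or U-Net down blocks that would map each patch to the model dimension are not shown.

```python
def patchify(latent, patch):
    """Split an H x W grid of channel vectors into flattened patch vectors.

    latent: list of rows, each row a list of per-cell channel lists.
    patch:  side length of each square patch, in latent cells.
    Returns the patches in row-major order, each flattened to one list.
    """
    height, width = len(latent), len(latent[0])
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            # Concatenate the channels of every cell inside this patch.
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    vector.extend(latent[row][col])
            patches.append(vector)
    return patches

# Hypothetical latent: 32 x 32 cells with 8 channels each.
latent = [[[0.0] * 8 for _ in range(32)] for _ in range(32)]
print(len(patchify(latent, 2)), len(patchify(latent, 8)))
```

Under these assumed sizes, 2×2 patches give 256 sequence elements per image, while 8×8 patches give 16, which is the kind of trade-off between sequence length and per-patch information that the larger-patch setting in the abstract refers to.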


This review was generated by AI and checked by a human editor.