QUICK REVIEW

[Paper Review] Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team|arXiv (Cornell University)|May 16, 2024

Multimodal Machine Learning Applications | Citations: 14

One-Line Summary

Chameleon is a family of early-fusion, token-based multimodal models trained from scratch to process interleaved sequences of images and text. It achieves state-of-the-art vision-language results and strong text-only performance, and performs notably well on long-form mixed-modal generation.

ABSTRACT

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

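Motivation and Objectives

Develop a unified early-fusion multimodal model that can reason over and generate arbitrary interleaved sequences of images and text.
Train and align token-based mixed-modal models stably at scale.
Demonstrate state-of-the-art performance on vision-language tasks while remaining competitive on text-only benchmarks.
Demonstrate long-form mixed-modal generation and open-ended task capabilities through human evaluation.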

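Proposed Method

Represent images as discrete tokens via a new image tokenizer that encodes a 512x512 image into 1024 tokens drawn from a codebook of size 8192.
Train a unified transformer from scratch on large-scale interleaved text, image, and code data (about 10T tokens).
Apply architectural changes such as QK-Norm (query-key normalization) and norm reordering to stabilize training in the mixed-modal setting.
Use a two-stage pre-training data mix with stage-specific emphasis (large-scale unsupervised data in the first stage, higher-quality data in the second), together with z-loss regularization to control the softmax logits.
Fine-tune on supervised fine-tuning (SFT) data spanning all categories (Text, Code, Visual Chat, Image Generation, Interleaved Generation, Safety) to align capabilities and safety.
Develop a streaming inference pipeline for interactive mixed-modal generation, with modality-aware masking and token-level control.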
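Two of the stabilization techniques named in this list, QK-Norm and z-loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`layer_norm`, `attention_logit`, `z_loss`), the parameter-free layer norm, and the default coefficient `coeff=1e-5` are assumptions chosen for illustration. The sketch shows that normalizing the query and key before the dot product bounds the attention logit, and that z-loss penalizes a large softmax log-partition value.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attention_logit(q, k, use_qk_norm=True):
    """Scaled dot-product attention logit for a single query/key pair.

    With use_qk_norm=True, both vectors are layer-normalized first, so the
    logit magnitude cannot exceed sqrt(len(q)) regardless of input scale.
    """
    if use_qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def log_sum_exp(logits):
    """Numerically stable log of the softmax partition function Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def z_loss(logits, coeff=1e-5):
    """Auxiliary penalty coeff * (log Z)^2; coeff is an assumed value."""
    return coeff * log_sum_exp(logits) ** 2


if __name__ == "__main__":
    q = [100.0, -50.0, 25.0, 75.0]
    k = [80.0, -40.0, 20.0, 60.0]
    print("raw logit:   ", attention_logit(q, k, use_qk_norm=False))
    print("QK-Norm logit:", attention_logit(q, k))
    print("z-loss:      ", z_loss([50.0, 0.0]))
```

In a real model the layer norm would carry learned parameters and the z-loss term would be added to the cross-entropy loss; both are omitted here to keep the sketch short.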

Figure 1 : Chameleon represents all modalities — images, text, and code, as discrete tokens and uses a uniform transformer-based architecture that is trained from scratch in an end-to-end fashion on $\sim$ 10T tokens of interleaved mixed-modal data. As a result, Chameleon can both reason over, as we
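The idea of representing every modality as discrete tokens in one sequence can be sketched as follows. The codebook size (8192) and tokens per image (1024) come from this review; everything else is hypothetical, including `TEXT_VOCAB_SIZE`, the `BOI`/`EOI` sentinel ids, the offset scheme that places image codes after the text vocabulary, and the function names. The actual vocabulary layout used by Chameleon may differ.

```python
TEXT_VOCAB_SIZE = 50_000     # hypothetical size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8192   # codebook size stated in the review
TOKENS_PER_IMAGE = 1024      # tokens per 512x512 image stated in the review

# Hypothetical sentinel ids placed after the image-code range.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image
EOI = BOI + 1                                # end-of-image


def flatten_document(segments):
    """Flatten [("text", ids) | ("image", codes), ...] into one token-id list."""
    sequence = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= t < TEXT_VOCAB_SIZE for t in ids):
                raise ValueError("text token id out of range")
            sequence.extend(ids)
        elif kind == "image":
            if len(ids) != TOKENS_PER_IMAGE:
                raise ValueError("an image must contribute exactly 1024 codes")
            if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in ids):
                raise ValueError("image code out of range")
            sequence.append(BOI)
            # Shift codebook indices so they do not collide with text ids.
            sequence.extend(TEXT_VOCAB_SIZE + c for c in ids)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


def is_image_token(token_id):
    """True if the id falls inside the (hypothetical) image-code range."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE


if __name__ == "__main__":
    doc = [("text", [1, 2, 3]), ("image", [7] * TOKENS_PER_IMAGE), ("text", [4])]
    seq = flatten_document(doc)
    print(len(seq), sum(is_image_token(t) for t in seq))
```

A range check like `is_image_token` is also one simple way a decoder could restrict sampling to a single modality, which is the kind of modality-aware masking the method list mentions; that connection is an illustration, not a description of the authors' pipeline.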

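Experimental Results

Research Questions

RQ1: Can a single early-fusion token-based model jointly reason over and generate mixed sequences of images and text?
RQ2: What training and architectural adjustments are needed to stabilize large-scale mixed-modal training?
RQ3: How does a mixed-modal model compare with specialized multimodal models and text-only models on vision-language tasks and long-form generation?
RQ4: What are the safety and alignment outcomes when supervised fine-tuning is scaled across diverse modalities?

Key Findings

Chameleon-34B achieves state-of-the-art performance on visual question answering and image captioning benchmarks.
On text-only benchmarks, Chameleon is competitive with Mixtral 8x7B and Gemini-Pro on commonsense reasoning and reading comprehension tasks.
In human evaluation of long-form mixed-modal generation, Chameleon-34B was preferred over Gemini-Pro and GPT-4V in pairwise comparisons (60.4% preference rate against Gemini-Pro, 51.6% against GPT-4V).
Chameleon performs non-trivial image generation within a single model, enabling open-ended multimodal document modeling.
With a stable training strategy, Chameleon scales to 34B parameters and is competitive with larger models on several benchmarks.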

Figure 2 : Sample interleaved image and text generation from Chameleon. The corresponding images are generated in locations marked by <img> .


This review was generated by AI and checked by a human editor.