QUICK REVIEW

[Paper Review] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark|arXiv (Cornell University)|Sep 25, 2024

Semantic Web and Ontologies|Citations: 8

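One-Line Summary

Molmo is an open family of vision-language models trained without proprietary data or synthetic data from proprietary VLMs. It relies on PixMo dense captions collected through speech and, with open weights and open data, achieves results competitive with the state of the art.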

ABSTRACT

Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.

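Motivation and Objectives

Promote open scientific progress by releasing the weights, data, and code of a state-of-the-art VLM built without proprietary data or synthetic data from proprietary VLMs.
Introduce the PixMo data, whose high-quality dense captions are collected through speech, to avoid distilling from proprietary VLMs.
Show that an end-to-end open training pipeline can reach competitive performance on academic benchmarks and in human preference.
Provide a diverse fine-tuning data mixture, including in-the-wild QA, 2D pointing data, and document-based tasks, to extend VLM capabilities.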

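Proposed Method

Assemble a simple architecture that combines a pre-trained vision encoder with a decoder-only LLM through a projection connector.
Train end to end on PixMo-Cap for dense caption generation, without relying on synthetic VLM data.
Fine-tune on a mixture of supervised datasets, including PixMo-AskModelAnything, PixMo-Points, PixMo-CapQA, PixMo-Docs, and PixMo-Clocks, together with various academic datasets.
Skip RLHF and rely on standard supervised fine-tuning after caption training.
Evaluate on 11 academic benchmarks and a large-scale human preference Elo study.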
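
The first step of the method, joining a vision encoder to a decoder-only LLM through a projection connector, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the tiny dimensions, the random features standing in for real encoder outputs, and the choice of a single linear projection (the review does not specify the connector's internals). It only shows the data flow, in which projected image patch features are placed in front of the text token embeddings.

```python
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def project_patches(patch_features, weight):
    """Hypothetical linear connector: map each patch feature to the LLM embedding width."""
    return matmul(patch_features, weight)

def build_llm_input(patch_features, text_embeddings, weight):
    """Prepend the projected image tokens to the text token embeddings."""
    image_tokens = project_patches(patch_features, weight)
    return image_tokens + text_embeddings

def random_matrix(rows, cols, rng):
    """Stand-in for real encoder outputs or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16                       # toy sizes, not the real model widths
patches = random_matrix(4, VISION_DIM, rng)       # 4 image patches from the vision encoder
text = random_matrix(3, LLM_DIM, rng)             # 3 text token embeddings
weight = random_matrix(VISION_DIM, LLM_DIM, rng)  # connector weights (random here)

sequence = build_llm_input(patches, text, weight)
print(len(sequence), len(sequence[0]))  # 7 16
```

The resulting sequence of 7 vectors (4 image tokens followed by 3 text tokens) is what a decoder-only LLM would consume in this simplified picture.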

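Experimental Results

Research Questions

RQ1: Can an open-weight VLM achieve competitive performance without relying on synthetic data from proprietary VLMs?
RQ2: Does a speech-based strategy for collecting dense-caption data produce a high-quality multimodal model suited to diverse downstream tasks?
RQ3: How do open VLMs compare with leading proprietary systems on academic benchmarks and in human preference?
RQ4: What effect does the diverse PixMo data mixture, including pointing data, have on multimodal capabilities such as counting and grounding?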

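Key Findings

MolmoE-1B (OLMoE-1B-7B MoE) nearly matches GPT-4V on both academic benchmarks and Elo-based human preference.
Molmo-7B-O and Molmo-7B-D perform between GPT-4V and GPT-4o on both benchmarks and human rankings.
Molmo-72B (Qwen2-72B backbone) achieves the highest academic benchmark score and ranks second on Elo, behind GPT-4o.
The Molmo family outperforms many proprietary systems, including Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet.
Molmo-72B shows strong potential for real-world action, reaching 88.7% low-level and 69.0% high-level accuracy on the AndroidControl task.
The evaluation includes a large-scale human preference study with 325k pairwise comparisons across 27 models, and its results broadly agree with the academic benchmarks.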
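
To make the Elo-based human preference ranking concrete, here is a minimal sketch of how pairwise comparisons can be turned into ratings. It uses the standard online Elo update. The review does not describe the paper's exact rating procedure, so the update rule, the K-factor of 32, the starting rating of 1000, and the model names and comparison outcomes are all assumptions invented for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Shift rating points from the loser to the winner after one comparison."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models and outcomes; each tuple is (preferred, not preferred).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['model_a', 'model_b', 'model_c']
```

Online Elo depends on the order in which comparisons are processed, so large studies often fit ratings over all comparisons at once instead. This sketch only conveys the basic idea of deriving a ranking from pairwise votes.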


This review was generated by AI and checked by a human editor.