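[Paper Explained] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo introduces an open family of vision-language models trained without proprietary data, using PixMo dense captions collected via speech, and achieves results competitive with state-of-the-art methods under open weights and open data.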
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open weight and data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini 1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and on a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.
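Research Motivation and Goals
- Advance open science by releasing the weights, data, and code of state-of-the-art vision-language models, without relying on proprietary data or synthetic intermediaries.
- Introduce PixMo, whose high-quality dense captions are collected via speech, to avoid distilling from proprietary VLMs.
- Show that an end-to-end open training pipeline can reach competitive performance on academic benchmarks and in human preference.
- Provide a diverse fine-tuning data mix, including in-the-wild Q&A, 2D pointing data, and document-based tasks, to broaden VLM capabilities.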
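Proposed Method
- Assemble a simple architecture that joins a pre-trained vision encoder to a decoder-only language model through a projection connector.
- Train end to end on PixMo-Cap for dense caption generation, without relying on synthetic VLM data.
- Fine-tune on a mixture of supervised datasets, including PixMo-AskModelAnything, PixMo-Points, PixMo-CapQA, PixMo-Docs, PixMo-Clocks, and various academic datasets.
- Avoid RLHF, relying on standard supervised fine-tuning after caption training.
- Evaluate with 11 academic benchmarks and a large-scale Elo-based human preference study.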
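To make the connector idea concrete, here is a minimal sketch of how projected vision features could be placed in front of text token embeddings before they reach the decoder. It assumes a single linear projection and toy dimensions; the function names, the sizes, and the random inputs are all hypothetical and are not taken from the Molmo code, whose actual connector may differ.

```python
import random


def linear_project(features, weight, bias):
    """Map each vision feature vector (length d_vision) to length d_model.

    `weight` is stored output-major: one row of length d_vision per output
    dimension. This is an illustrative layout, not the paper's.
    """
    return [
        [sum(x * w for x, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def build_decoder_input(patch_features, text_embeddings, weight, bias):
    """Project the image patches, then prepend them to the text embeddings."""
    vision_tokens = linear_project(patch_features, weight, bias)
    return vision_tokens + text_embeddings


# Toy sizes chosen only for illustration (assumptions, not Molmo's values).
D_VISION, D_MODEL = 4, 6
N_PATCHES, N_TEXT = 3, 2

rng = random.Random(0)
patch_features = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(N_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_TEXT)]
weight = [[rng.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

sequence = build_decoder_input(patch_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # sequence length, embedding width
```

The point of the sketch is only that the connector brings vision features to the language model's embedding width, so that image and text tokens can share one input sequence.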
Experimental Results
Research Questions
- RQ1: Can open-weight VLMs achieve competitive performance without relying on synthetic data from proprietary VLMs?
- RQ2: Does a speech-based, dense-caption data collection strategy yield high-quality multimodal models suitable for diverse downstream tasks?
- RQ3: How do open VLMs compare to leading proprietary systems on academic benchmarks and human preferences?
- RQ4: What is the effect of a diverse PixMo data mix (including pointing data) on multimodal capabilities like counting and grounding?
Key Findings
- MolmoE-1B (OLMoE-1B-7B MoE) nearly matches GPT-4V on academic benchmarks and Elo-based human preference.
- Molmo-7B-O and Molmo-7B-D perform between GPT-4V and GPT-4o on benchmarks and human rankings.
- Molmo-72B (Qwen2-72B backbone) achieves the highest academic benchmark score and ranks second in Elo, behind GPT-4o.
- The Molmo family outperforms many proprietary systems, such as Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet, in the reported evaluations.
- Molmo-72B shows strong potential for real-world actions, achieving 88.7% low-level and 69.0% high-level accuracy on the AndroidControl task.
- The evaluation includes a large-scale human preference study with 325k pairwise comparisons across 27 models, whose rankings broadly agree with the academic benchmarks.
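As an illustration of how pairwise human preferences can be turned into a ranking, the following is a minimal sketch of the classic online Elo update. It is an assumption for exposition only: this summary does not state the exact rating procedure the authors used, and the starting rating of 1000, the K-factor of 32, and the model names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Shift ratings in place after one comparison won by `winner`."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


def rank_models(comparisons, initial=1000.0, k=32.0):
    """Process (winner, loser) pairs in order; return models sorted by rating."""
    ratings = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical comparisons between placeholder model names.
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]
for name, rating in rank_models(comparisons):
    print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which comparisons arrive, so large preference studies often fit all comparisons jointly instead; the sketch is meant only to show what an Elo-style score expresses.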
This summary was generated by AI and reviewed by a human editor.