QUICK REVIEW

[Paper Review] DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu|arXiv (Cornell University)|Mar 8, 2024

Multimodal Machine Learning Applications | Cited by 43

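One-Sentence Summary

DeepSeek-VL is an open-source vision-language model with a hybrid high-resolution vision encoder and a three-stage training pipeline, released in 1.3B and 7B variants and aimed at real-world VL tasks and real user interactions.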

ABSTRACT

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.

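Motivation and Objectives

Build a versatile open-source vision-language model for real-world scenarios (web pages, PDFs, charts, OCR, knowledge-based content).
Design an architecture that processes high-resolution images within a fixed token budget for efficient inference.
Develop a training strategy that achieves robust multimodal understanding while preserving strong language capabilities.
Release the 1.3B and 7B variants publicly to support further research and practical applications.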

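Proposed Method

Use a hybrid vision encoder (SigLIP-L at 384x384 and SAM-B at 1024x1024) to produce 576 tokens for the language model.
Introduce a vision-language adapter that connects visual features to the language model through a two-layer MLP, followed by a final embedding stage.
Pretrain with multimodal objectives while preserving language ability by keeping the share of language data at 70% or more and applying a modality warm-up strategy.
Three-stage training pipeline: Stage 1 trains the VL adapter with the encoder and the LLM frozen; Stage 2 performs joint VL pretraining with a balanced modality ratio; Stage 3 applies supervised fine-tuning to improve conversational ability.
Scale the experiments from the 1.3B model to the 7B model, including instruction data to stabilize training and improve instruction following.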
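The modality warm-up idea can be illustrated with a small scheduling sketch. Only the 70% floor on language data comes from this summary. The linear decay, the starting ratio of 1.0, and the function names `language_ratio` and `sample_modality` are hypothetical choices made for illustration; the paper's actual schedule may differ.

```python
import random


def language_ratio(step, warmup_steps, start=1.0, floor=0.7):
    """Share of language-only samples at `step`.

    Assumption: a linear decay from `start` to `floor`. The summary only
    states that a warm-up is used and that language data stays at 70% or more.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return floor
    progress = max(step, 0) / warmup_steps
    return start + (floor - start) * progress


def sample_modality(step, warmup_steps, rng):
    """Pick the modality of the next training sample under the schedule."""
    if rng.random() < language_ratio(step, warmup_steps):
        return "language"
    return "multimodal"


rng = random.Random(0)  # seeded so the sketch is reproducible
for step in (0, 500, 1000, 2000):
    ratio = round(language_ratio(step, 1000), 3)
    print(step, ratio, sample_modality(step, 1000, rng))
```

Under these assumptions, training starts on language data only and mixes in multimodal samples gradually, so the language share never drops below the stated floor.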

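Experimental Results

Research Questions

RQ1: How can a high-resolution, real-world-oriented VL model be built from open-source components?
RQ2: Which training strategy preserves language ability while improving robust multimodal understanding?
RQ3: Does a hybrid vision encoder improve performance on fine-grained tasks such as OCR and charts compared with a single-encoder design?
RQ4: Do experiments at the 1.3B scale transfer effectively to the 7B model on real-world benchmarks?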

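Key Findings

The DeepSeek-VL family achieves state-of-the-art or competitive performance across a wide range of vision-language benchmarks at the same model size.
The hybrid vision encoder handles 1024x1024 images while keeping a fixed token budget (576 tokens) for efficient inference.
Modality warm-up and a balanced ratio of language to multimodal training data mitigate language forgetting while improving multimodal capability.
The public release of the 1.3B and 7B variants is intended to support research and practical deployment on real-world VL tasks.
The training pipeline emphasizes preserving language skills during multimodal pretraining and relies on a mix of data sources covering web pages, documents, and charts.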
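A short arithmetic sketch shows why a fixed 576-token budget matters for 1024x1024 inputs. The resolutions and the 576-token figure come from this summary. The patch size of 16 and the helper `patch_tokens` are assumptions introduced for illustration, and the comparison is against a hypothetical naive patching of the full-resolution image, not against any specific baseline from the paper.

```python
def patch_tokens(image_size, patch_size=16):
    """Number of tokens when a square image is cut into square patches.

    Assumption: a ViT-style patch size of 16, which this summary does not state.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


budget = 576                           # fixed visual token budget from the summary
low_res = patch_tokens(384)            # 384x384 input, 24 x 24 patches
naive_high_res = patch_tokens(1024)    # 1024x1024 input patched directly, 64 x 64 patches

print(low_res, naive_high_res, round(naive_high_res / budget, 2))
```

Under the assumed patch size, the 384x384 path matches the 576-token budget exactly, while patching a 1024x1024 image directly would hand the language model roughly seven times as many visual tokens.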


This review was generated by AI and reviewed by human editors.