QUICK REVIEW

[Paper Review] DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu | arXiv (Cornell University) | 2024-03-08

Multimodal Machine Learning Applications | Citations: 43

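One-Line Summary

DeepSeek-VL is an open-source vision-language model with a hybrid high-resolution encoder, a three-stage training pipeline, and 1.3B and 7B variants, aimed at real-world VL tasks and real user interactions.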

ABSTRACT

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.

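Motivation and Objectives

Build a versatile open-source vision-language model suited to real-world scenarios (web pages, PDFs, charts, OCR, knowledge-based content).
Design an architecture that processes high-resolution images within a fixed token budget for efficient inference.
Develop a training strategy that preserves language ability while enabling strong multimodal understanding.
Release publicly available 1.3B and 7B model variants to encourage further research and practical use.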

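Proposed Method

Use a hybrid vision encoder (SigLIP-L for 384x384 inputs, SAM-B for 1024x1024 inputs) that produces 576 visual tokens for the language model.
Introduce a vision-language adapter, built from a two-layer MLP followed by a final embedding step, to connect the visual features to the language model.
Pretrain with a multimodal objective while keeping a substantial share of language data (at least 70%) and applying a modality warm-up strategy, so that language ability is preserved.
Train in three stages: Stage 1 trains the VL adapter with the vision encoder and LLM frozen; Stage 2 performs joint VL pretraining with a balanced modality ratio; Stage 3 applies supervised fine-tuning for conversational ability.
Run scaling experiments from the 1.3B model to the 7B model, and include instruction-tuning data to stabilize training and improve instruction following.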
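
The 576-token figure can be checked with simple arithmetic. The sketch below is not the authors' code. It assumes a patch size of 16 for both encoders, and it assumes the SAM-B feature map is resized to a 96x96 grid and then halved twice by stride-2 convolutions, which is how the paper describes the high-resolution branch. The function names are hypothetical.

```python
def grid_side(image_size: int, patch_size: int) -> int:
    """Side length of the patch grid produced by a ViT-style encoder."""
    return image_size // patch_size


def halve(side: int, times: int) -> int:
    """Apply `times` stride-2 downsampling steps to a square grid side."""
    for _ in range(times):
        side //= 2
    return side


def hybrid_token_budget() -> dict:
    # Low-resolution branch: SigLIP-L on a 384x384 input.
    # ASSUMPTION: patch size 16, giving a 24x24 grid.
    siglip_side = grid_side(384, 16)

    # High-resolution branch: SAM-B on a 1024x1024 input.
    # ASSUMPTION: patch size 16, giving a 64x64 feature map.
    sam_raw_side = grid_side(1024, 16)

    # ASSUMPTION: the map is interpolated to 96x96, then two stride-2
    # convolutions reduce it to 24x24 so it aligns with the SigLIP grid.
    sam_side = halve(96, 2)

    return {
        "siglip_tokens": siglip_side * siglip_side,
        "sam_raw_positions": sam_raw_side * sam_raw_side,
        "sam_tokens": sam_side * sam_side,
    }


budget = hybrid_token_budget()
print(budget)
```

Because both branches end on the same 24x24 grid, their features can be combined position by position, so the language model sees 576 visual tokens instead of the 4096 positions in the raw SAM-B map.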

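Experimental Results

Research Questions

RQ1: How can a high-resolution, real-world-friendly VL model be built from open-source components?
RQ2: Which training strategy preserves language ability while enabling strong multimodal understanding?
RQ3: Does a hybrid vision encoder outperform single-encoder designs on fine-grained tasks such as OCR and charts?
RQ4: Do experiments at the 1.3B scale transfer effectively to the 7B model on real benchmarks?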

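Key Findings

The DeepSeek-VL family achieves state-of-the-art or competitive performance across a wide range of vision-language benchmarks among models of the same size.
The hybrid vision encoder handles 1024x1024 images within a fixed token budget (576 tokens), which keeps inference efficient.
Modality warm-up and a balanced ratio of language to multimodal training data mitigate language forgetting and improve multimodal ability.
The public release of the 1.3B and 7B variants is intended to encourage research and practical deployment on real-world VL tasks.
The training pipeline focuses on preserving language ability during multimodal pretraining and relies on a diverse data mix that includes web content, documents, and charts.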
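
One way to picture modality warm-up is as a schedule over the share of language-only data in each batch. The sketch below is illustrative and is not the schedule used in the paper. It assumes training starts with language data only and moves linearly toward the 70% language floor mentioned above. The linear shape, the `warmup_steps` value, and all function names are hypothetical.

```python
import random


def language_ratio(step: int, warmup_steps: int,
                   start: float = 1.0, floor: float = 0.7) -> float:
    """Share of language-only samples at a given training step.

    ASSUMPTION: a linear ramp from `start` down to `floor` over
    `warmup_steps`, then constant. The source only states that the
    language share stays at or above 70%.
    """
    if warmup_steps <= 0:
        return floor
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return start + (floor - start) * progress


def sample_modality(step: int, warmup_steps: int, rng: random.Random) -> str:
    """Pick which data stream the next sample is drawn from."""
    if rng.random() < language_ratio(step, warmup_steps):
        return "language"
    return "vision-language"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 500, 1000, 5000):
        ratio = language_ratio(step, warmup_steps=1000)
        print(step, round(ratio, 3), sample_modality(step, 1000, rng))
```

Under this schedule the model sees vision-language data gradually, which is the intuition behind the reduced language forgetting reported above.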


This review was generated by AI and reviewed by a human editor.