QUICK REVIEW

[Paper Review] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Jiaxin Fan, Wenpo Song|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Cited by 0

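One-Sentence Summary

VisionPangu is a compact 1.7B-parameter multimodal model that produces detailed image captions by aligning a lightweight vision encoder with a language backbone and training on high-quality supervision from DOCCI and LLaVA-NeXT.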

ABSTRACT

Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.

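Motivation and Objectives

Build a capable multimodal assistant without resorting to large-scale models.
Improve fine-grained, semantically coherent image captioning.
Use high-quality, long-form supervision to guide cross-modal alignment.
Show that an efficient architecture can compete with larger models on image captioning.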

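Proposed Method

Use an InternVL-derived vision encoder, fine-tuned for dense visual representations.
Connect the vision encoder to the OpenPangu-Embedded-1B language model through a lightweight MLP projector.
Apply two-stage instruction tuning: 1) feature alignment with frozen components; 2) full-parameter fine-tuning.
Mix general multimodal instruction-following supervision from LLaVA-NeXT with dense, long-form caption supervision from DOCCI.
Train with an autoregressive multimodal objective conditioned on the projected visual features H_v.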
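
The two-stage schedule and the autoregressive objective listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the component names, the helper functions `trainable_components` and `response_nll`, and the choice to freeze the language model in stage 1 and to score only response tokens are assumptions borrowed from LLaVA-style recipes. The summary itself only confirms that the vision encoder is frozen during feature alignment.

```python
import math

# Hypothetical component names, chosen for illustration only.
COMPONENTS = ("vision_encoder", "projector", "language_model")


def trainable_components(stage):
    """Return {component: receives_gradients} for a training stage."""
    if stage == 1:
        # Stage 1, feature alignment. The summary states that the vision
        # encoder is frozen; freezing the language model as well is an
        # assumption borrowed from LLaVA-style recipes.
        return {name: name == "projector" for name in COMPONENTS}
    if stage == 2:
        # Stage 2, full-parameter fine-tuning: every component is updated.
        return {name: True for name in COMPONENTS}
    raise ValueError(f"unknown stage: {stage}")


def response_nll(target_probs, is_response):
    """Mean negative log-likelihood over response tokens.

    target_probs[i] is the probability the model assigned to the i-th
    ground-truth token given the visual features H_v and all earlier tokens.
    is_response[i] marks tokens that belong to the caption being learned;
    prompt tokens are excluded (assumed LLaVA-style masking).
    """
    if len(target_probs) != len(is_response):
        raise ValueError("target_probs and is_response must have equal length")
    losses = [-math.log(p) for p, keep in zip(target_probs, is_response) if keep]
    if not losses:
        raise ValueError("at least one response token is required")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(trainable_components(1))
    print(response_nll([0.9, 0.5, 0.25], [False, True, True]))
```

In a real training loop the per-token probabilities would come from the language model's softmax output, and the boolean flags would typically be applied by toggling gradient tracking on each component's parameters.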

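Experimental Results

Research Questions

RQ1: How can a compact 1.7B-parameter multimodal model produce detailed, long-form captions?
RQ2: Do high-quality supervision (DOCCI) and instruction tuning improve the semantic coherence of visual narratives?
RQ3: Can a lightweight projection layer combined with an improved vision encoder match or approach the caption quality of larger models?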

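Key Findings

On the detailed-captioning benchmark, VisionPangu achieves the best BLEU, METEOR, and ROUGE-L scores among compact models (BLEU 0.2859, METEOR 0.4708, ROUGE-L 0.3759).
Despite having only 1.7B parameters, the model remains competitive on standard multimodal benchmarks (MMMU, MMBench, POPE, MME).
Dense caption supervision from DOCCI improves narrative richness and overall semantic grounding compared with patch-based captions.
Two-stage training (feature alignment with a frozen vision encoder, followed by full-parameter SFT) yields effective cross-modal interaction without excessive compute.
The approach suggests that a compact backbone can rival larger models when high-quality supervision is combined with an efficient architecture design.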
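
To give a sense of what a ROUGE-L score such as the one reported above measures, here is a minimal sketch that computes ROUGE-L as the F1 of precision and recall over the longest common subsequence (LCS) of tokens. The lowercase whitespace tokenization and the balanced F1 weighting are assumptions; the summary does not state the paper's tokenizer or scoring settings, so this sketch is not expected to reproduce the reported numbers. The function names `lcs_length` and `rouge_l_f1` are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a candidate caption and one reference caption."""
    # Assumed tokenization: lowercase and split on whitespace.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "a red car parked outside"
    human = "a red car is parked outside the house"
    print(rouge_l_f1(generated, human))
```

Because LCS rewards tokens that appear in the same order without requiring them to be contiguous, a caption that preserves the reference's structure while omitting some words still scores well on precision but loses recall.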

This review was generated by AI and reviewed by a human editor.