QUICK REVIEW

[Paper Review] Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Haoran Wei, Lingyu Kong|arXiv (Cornell University)|Dec 11, 2023

Multimodal Machine Learning Applications | Cited by 8

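One-Sentence Summary

Vary introduces a two-stage method: it first trains a vocabulary network together with a tiny autoregressive (decoder-only) model to produce a new vision vocabulary, then merges that vocabulary with CLIP-VIT. This improves fine-grained perception (OCR, document and chart understanding) while preserving the model's original capabilities.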

ABSTRACT

Modern Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary -- CLIP, which can cover most common vision tasks. However, for some special vision task that needs dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may encounter low efficiency in tokenizing the vision knowledge and even suffer out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedures of Vary are naturally divided into two folds: the generation and integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the next, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs can quickly garner new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while enjoying more excellent fine-grained perception and understanding ability. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS in DocVQA and 36.2% in MMVet. Our code will be publicly available on the homepage.

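Motivation and Objectives

Identify the vision-vocabulary bottleneck that LVLMs face on dense or non-English perception tasks, and propose a way around it.
Propose a two-stage method that generates a new vision vocabulary and integrates it with the CLIP-based vocabulary.
Show that scaling up the vocabulary improves fine-grained perception while preserving core LVLM capabilities.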

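Proposed Method

Two-stage pipeline: (1) train a vocabulary network together with a tiny decoder-only autoregressive Transformer to generate a new vision vocabulary; (2) merge the new vocabulary with the original CLIP-VIT vocabulary, keeping both vocabularies frozen during LVLM training.
Build the new vocabulary network by adding convolutional layers on top of SAM-ViTDet features so that the output shape aligns with CLIP-VIT, yielding flattened tokens of shape 256×1024.
Train Vary-tiny with autoregressive image-to-text generation, using document and chart data (dense OCR and rendered content) as positive samples and natural images as negative samples.
Integrate the new vocabulary into Vary-base in parallel with the original CLIP-VIT vocabulary, and train the LVLM with both vocabularies frozen while updating the input embedding layers and the large language model (LLM).
Enrich Vary-base training with synthetic data (LaTeX-rendered documents, rendered charts) and with high-quality chart data obtained via GPT-4.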
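
The shape alignment and parallel integration described above can be sketched with plain shape arithmetic. This is a minimal illustration, not the authors' implementation. The review only gives the 256×1024 target shape, so the 64×64×256 input feature map, the two 3×3 stride-2 convolutions, and the channel-wise concatenation of the two token sets are all assumptions made here for illustration, and every name in the block is hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a square convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed input: a SAM-ViTDet feature map of 64 x 64 positions, 256 channels.
side, channels = 64, 256

# Assumed conv stack: two 3x3, stride-2, padding-1 layers that halve the
# spatial side and double the channel count each time.
for out_channels in (512, 1024):
    side = conv_out(side, kernel=3, stride=2, padding=1)
    channels = out_channels

# Flatten the spatial grid into a token sequence: (num_tokens, channels).
new_vocab_tokens = (side * side, channels)
clip_tokens = (256, 1024)  # shape the new tokens must match, per the review


def fuse_parallel(a, b):
    """Assumed fusion: same token count, channels concatenated."""
    if a[0] != b[0]:
        raise ValueError("both vocabularies must emit the same number of tokens")
    return (a[0], a[1] + b[1])


fused_tokens = fuse_parallel(new_vocab_tokens, clip_tokens)

# Which parts are updated while training Vary-base, as stated in the review:
# both vocabularies frozen, input embeddings and LLM trained.
trainable = {
    "clip_vit": False,
    "new_vocabulary_network": False,
    "input_embeddings": True,
    "llm": True,
}

print(new_vocab_tokens, fused_tokens)
```

Under these assumptions the new branch produces exactly as many tokens as CLIP-VIT, so the two streams can be combined position by position before they reach the LLM.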

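Experimental Results

Research Questions

RQ1: Can scaling up the vision vocabulary improve an LVLM's fine-grained perception beyond the limits of CLIP-VIT?
RQ2: How can a new vision vocabulary be generated and integrated effectively without overwriting existing knowledge?
RQ3: On tasks such as document OCR, Markdown conversion, and chart understanding, does an LVLM with an expanded vocabulary perform better while keeping its general capabilities?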

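Key Findings

Vary-tiny achieves dense OCR in both Chinese and English, with an edit distance of 0.266 on Chinese and 0.197 on English.
Vary-base is comparable to Nougat on English plain-document OCR and performs Markdown/LaTeX-style conversion when prompted.
With 80k SFT samples, Vary-base reaches 78.2% ANLS on DocVQA and 76.3% on the validation set; with 665k SFT samples, it reaches an average score of 66.1 on ChartQA.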
Vary-base paired with Qwen-7B reaches a top MMVet score of 36.2%; on other MMVet metrics it scores between 38.7% and 38.9%, depending on the setting.
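Under similar settings, Vary improves general MMVet performance by about 2.4 points over the LLaVA-1.5 baseline.
Overall, scaling up the vision vocabulary improves fine-grained perception while maintaining core LVLM capabilities.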
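
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a normalized edit distance (lower is better) and ANLS (higher is better) are commonly computed. The review does not state the paper's exact normalization or text preprocessing, so dividing by the longer string's length, lowercasing, and stripping whitespace are assumptions here; the 0.5 threshold follows the usual ANLS definition. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    references holds, for each question, a list of acceptable answers.
    A prediction scores 1 - NL against its closest answer, or 0 when the
    normalized distance NL reaches the threshold.
    """
    scores = []
    for pred, answers in zip(predictions, references):
        best = 0.0
        for ans in answers:
            nl = normalized_edit_distance(pred.strip().lower(),
                                          ans.strip().lower())
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(normalized_edit_distance("abcd", "abcf"))
print(anls(["paris", "london"], [["Paris"], ["paris"]]))
```

An edit distance of 0.197 under this normalization would mean roughly one character in five differs from the reference, while an ANLS of 78.2% averages per-question similarity after zeroing out answers that are too far off.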

This review was generated by AI and checked by a human editor.