QUICK REVIEW

[Paper Review] InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Xiaoyi Dong, Pan Zhang|arXiv (Cornell University)|Apr 9, 2024

Multimodal Machine Learning Applications | Cited by 7

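One-Sentence Summary

Introduces InternLM-XComposer2-4KHD, a large vision-language model that handles resolutions from 336 pixels to 4K HD. Using dynamic patch configuration and a global-local input design, it achieves competitive results with 7B parameters and surpasses some closed-source APIs on several HD-OCR benchmarks.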

ABSTRACT

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper represents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering the ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

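Research Motivation and Objectives

Extend LVLM resolution capability to 4K HD and beyond while supporting a wide range of input resolutions from 336 pixels to 4K.
Develop a dynamic patch-based image partitioning and training strategy that preserves aspect ratio and enables higher-resolution understanding.
Improve high-resolution OCR and document understanding through targeted pre-training and fine-tuning.
Demonstrate competitive performance against closed-source APIs and existing open-source LVLMs across a broad benchmark suite.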

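Proposed Method

Build on InternLM-XComposer2, which pairs a vision encoder (ViT-L/14) with a 7B LLM (InternLM2-7B), connected through Partial LoRA for efficient alignment.
Introduce Dynamic Image Partition: resize and pad the input into a grid of 336px patches, scalable up to HD-25/HD-55 and 4KHD, while preserving the image aspect ratio.
Implement a Global-Local Format: process a global 336x336 view alongside local patch-based features and merge them into a unified representation.
Append a learnable newline token at the end of each patch row to clearly delineate the 2D structure and reduce training ambiguity.
Pre-train with the LLM frozen while the vision encoder is fine-tuned, aligning visual tokens to the LLM with a mix of semantic, world-knowledge, and capability data; use low-rank Partial LoRA, with a training strategy that includes layer-wise learning rate decay (LLDR) and staged learning rates.
Fine-tune with a mixed-resolution strategy (HD-55 for high-resolution tasks; dynamic resolution for other tasks) to optimize performance on both HD-OCR and general vision-language tasks.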
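
The partition, global-local, and newline-token steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`choose_grid`, `padded_size`, `visual_token_count`) are hypothetical, the rule "grow the grid along the long side while the tile budget allows" is one plausible reading of the aspect-preserving layout, and the default of 12 visual tokens per tile side is an assumption that the summary does not state.

```python
PATCH = 336  # tile side in pixels, matching the 336 x 336 ViT input in the summary


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def choose_grid(width: int, height: int, max_patches: int) -> tuple[int, int]:
    """Return (cols, rows): the largest aspect-preserving tile grid within max_patches.

    Hypothetical helper: the selection rule is an assumption, not taken from the paper's code.
    """
    if width <= 0 or height <= 0 or max_patches < 1:
        raise ValueError("width, height and max_patches must be positive")
    long_side, short_side = max(width, height), min(width, height)
    n_long = 1
    # Grow the tile count along the long side while the total stays within budget.
    while (n_long + 1) * ceil_div((n_long + 1) * short_side, long_side) <= max_patches:
        n_long += 1
    n_short = ceil_div(n_long * short_side, long_side)
    return (n_long, n_short) if width >= height else (n_short, n_long)


def padded_size(cols: int, rows: int) -> tuple[int, int]:
    """Pixel size (width, height) of the resized-and-padded image."""
    return cols * PATCH, rows * PATCH


def visual_token_count(cols: int, rows: int, tokens_per_side: int = 12) -> int:
    """Count visual tokens for the global view plus the local tile grid.

    Assumptions: each tile yields tokens_per_side x tokens_per_side tokens, and one
    newline token closes every row of tokens in both views.
    """
    global_tokens = tokens_per_side * (tokens_per_side + 1)
    local_rows = rows * tokens_per_side
    local_tokens = local_rows * (cols * tokens_per_side + 1)
    return global_tokens + local_tokens


cols, rows = choose_grid(3840, 1600, max_patches=55)
print(cols, rows)               # 11 5
print(padded_size(cols, rows))  # (3696, 1680)
```

Under these assumptions, a 3840 x 1600 input with a 55-tile budget maps to an 11 x 5 grid, which uses exactly 55 tiles and is consistent with the HD-55 setting being the one used for 4KHD inputs.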

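Experimental Results

Research Questions

RQ1: How does increasing training and inference resolution affect performance on high-resolution tasks such as OCR, charts, and infographics?
RQ2: Can dynamic patch configuration with automatic layout extend LVLM capability from 336px to 4K while preserving aspect ratio?
RQ3: How do the global view, local patches, and newline token affect 2D image understanding in LVLMs?
RQ4: How does IXC2-4KHD compare with closed-source APIs and open-source LVLMs across a broad range of benchmarks, including HD-OCR tasks?

Key Findings

IXC2-4KHD achieves competitive results with 7B parameters, matching or surpassing GPT-4V and Gemini Pro on 10 of the 16 benchmarks.
Among open-source LVLMs, the model achieves SOTA results on 6 of the 16 benchmarks and approaches closed-source APIs on several tasks.
Training at up to 4K HD resolution yields consistent gains on HD-OCR tasks, with no saturation observed within the tested range.
Scores of 90.0 on DocVQA and 81.0 on ChartQA demonstrate strong OCR and chart-reading ability, outperforming several baselines.
InfographicVQA reaches 68.6%, far ahead of recent open-source document-level models; OCRBench reaches 67.5%.
The model supports 4KHD input (3840x1600) and remains robust when the inference resolution exceeds the training resolution.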

This review was generated by AI and reviewed by human editors.