QUICK REVIEW

[Paper Review] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang|arXiv (Cornell University)|Apr 25, 2024

Reservoir Engineering and Simulation Methods | Cited by 16

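ONE-SENTENCE SUMMARY

InternVL 1.5 is an open-source multimodal large language model that narrows the gap to commercial models such as GPT-4V by strengthening its vision encoder, supporting high-resolution input, and expanding its bilingual data, reaching state-of-the-art results on 8 of 18 benchmarks.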

ABSTRACT

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

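MOTIVATION AND GOALS

Narrow the capability gap between open-source and proprietary multimodal models in multimodal understanding.
Strengthen the vision encoder's visual feature extraction through continuous learning on InternViT-6B.
Support high-resolution, tile-based image processing up to 4K while remaining efficient.
Improve English-Chinese bilingual multimodal performance with a high-quality bilingual dataset and a translation pipeline.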

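PROPOSED METHOD

Adopt a ViT-MLP-LLM architecture that connects a strong vision encoder (InternViT-6B) to an LLM (InternLM2-20B) through an MLP projector.
Support dynamic high-resolution input by splitting each image into 448×448 tiles (1 to 12 tiles during training, up to 40 tiles at test time), which reaches 4K resolution.
Pre-train on a diverse, high-quality bilingual dataset with English and Chinese annotations, covering OCR and other multimodal tasks.
Use a data translation pipeline to convert English datasets into Chinese (and potentially other languages) to improve multilingual capability.
After the initial training of the vision encoder and projector, fine-tune the entire model (26B parameters) to optimize multimodal performance.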
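
The dynamic high-resolution step in the method list can be illustrated with a small sketch. It is not the authors' implementation: the selection rule (closest aspect ratio first, then closest pixel area) and the names `choose_grid` and `tile_boxes` are assumptions made for illustration. Only the 448-pixel tile size and the 12-tile and 40-tile budgets come from the summary.

```python
from typing import List, Tuple

TILE = 448  # tile side in pixels, as given in the method summary


def choose_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Return a hypothetical (cols, rows) tile grid for an image.

    Assumed rule: minimise the aspect-ratio mismatch first, then the
    difference between the grid's pixel area and the image's pixel area.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    image_ratio = width / height
    image_area = width * height
    # Every grid whose tile count fits inside the budget.
    grids = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]

    def mismatch(grid: Tuple[int, int]) -> Tuple[float, int]:
        cols, rows = grid
        ratio_gap = abs(cols / rows - image_ratio)
        area_gap = abs(cols * rows * TILE * TILE - image_area)
        return (ratio_gap, area_gap)

    return min(grids, key=mismatch)


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom), row by row, for an image
    that has already been resized to (cols * TILE, rows * TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160, max_tiles=40)  # test-time budget
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under these assumed rules, a 3840×2160 input with the 40-tile test budget maps to a 7×4 grid (28 tiles), because no grid of at most 40 tiles matches 16:9 more closely. The released model may choose differently.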

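EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How close can an open-source MLLM come to leading commercial models on OCR, general multimodal, math, and multi-turn conversation benchmarks?
RQ2: Which combination of vision encoder strength, dynamic high-resolution input, and bilingual data quality yields the largest gains in multimodal understanding and bilingual capability?
RQ3: Can an open-source model with these improvements outperform competitors on document- and OCR-heavy tasks while keeping strong multilingual performance?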

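KEY FINDINGS

Across 18 multimodal benchmarks, InternVL 1.5 performs competitively against both open-source and proprietary models.
The model achieves state-of-the-art results on 8 of the 18 benchmarks, including OCR-related tasks such as ChartQA and OCRBench.
On OCR and document-oriented tasks, InternVL 1.5 surpasses leading commercial models on several datasets and shows strong Chinese-language capability.
Dynamic high-resolution processing handles inputs close to 4K and stays robust at lower tile counts, without excessive compute cost.
Continuous learning of the vision encoder (InternViT-6B) strengthens its visual representations and allows it to be transferred to and reused with different LLMs.
In mathematical reasoning, InternVL 1.5 outperforms several competitors, including GPT-4V, on MathVista.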
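
The "close to 4K" wording in the findings can be sanity-checked with quick arithmetic on the tile budget. The sketch below assumes that "4K" means 3840×2160 pixels, which the summary does not specify. The helper names `pixel_budget` and `coverage` are hypothetical.

```python
TILE = 448             # tile side in pixels (from the summary)
MAX_TRAIN_TILES = 12   # training budget (from the summary)
MAX_TEST_TILES = 40    # test-time budget (from the summary)
UHD_4K = (3840, 2160)  # assumption: "4K" taken as 3840x2160


def pixel_budget(n_tiles: int, tile: int = TILE) -> int:
    """Total number of pixels covered by n_tiles square tiles."""
    return n_tiles * tile * tile


def coverage(n_tiles: int, target: tuple = UHD_4K) -> float:
    """Fraction of the target frame's pixel count that the tile budget covers."""
    return pixel_budget(n_tiles) / (target[0] * target[1])


if __name__ == "__main__":
    print(pixel_budget(MAX_TRAIN_TILES))        # 2408448
    print(pixel_budget(MAX_TEST_TILES))         # 8028160
    print(round(coverage(MAX_TEST_TILES), 3))   # 0.968
```

Forty tiles therefore cover about 97% of the pixel count of a 3840×2160 frame, which is consistent with describing the input as near 4K and not as a full 4K frame at native resolution.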


This review was generated by AI and reviewed by a human editor.