QUICK REVIEW

[Paper Review] VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang, Zhe Chen|arXiv (Cornell University)|May 18, 2023

Multimodal Machine Learning Applications|Cited by 131

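One-Sentence Summary

VisionLLM treats images as a foreign language and uses an LLM-based open-ended decoder, together with a language-guided image tokenizer and unified language instructions, to perform open-ended vision-centric tasks. It generalizes well, achieving over 60% mAP on COCO and competitive results on vision-language tasks.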

ABSTRACT

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.

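Research Motivation and Goals

Motivate the need for open-ended handling of vision-centric tasks, analogous to what LLMs provide in natural language processing.
Propose a unified framework that aligns vision tasks with language instructions to allow flexible customization.
Develop a language-guided image tokenizer that produces language-aware visual tokens.
Introduce an LLM-based open-ended task decoder that carries out tasks according to instructions.
Demonstrate generalization across a variety of vision-centric tasks at configurable granularity.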

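Proposed Method

Introduce unified language instructions that cover both vision-only and vision-language tasks.
Design a language-guided image tokenizer that fuses visual features with the language prompt through cross-attention and a multi-scale transformer, producing M image tokens.
Extend an LLM (Alpaca-7B with LoRA) with vision-oriented tokens and an output-format-as-query decoding scheme so that it can handle a variety of tasks.
Add discrete location tokens and semantics-agnostic class tokens to enable open-ended prediction within a unified token-generation framework.
Train in two stages: (i) pretrain the vision backbone and tokenizer with the LLM frozen, focusing on detection with random categories; (ii) train jointly with unified supervision across tasks.
Use LoRA for efficient fine-tuning and supervise both vision and language outputs with a cross-entropy loss.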
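
The idea of discrete location tokens and semantics-agnostic class tokens can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the token spellings (`<c1>`, `<p51>`), the bin count of 512, the instruction wording, and the helper names `build_instruction`, `encode_box`, and `decode_box` are hypothetical and are not taken from the paper's code. The sketch only shows how a box could be turned into a short token sequence and back, with the class token meaning nothing until the instruction assigns it a category name.

```python
NUM_BINS = 512  # assumed number of location bins (illustrative, not from the summary)


def build_instruction(class_names):
    """Build a hypothetical detection instruction that binds names to class tokens."""
    mapping = ", ".join(f"'{name}': <c{i}>" for i, name in enumerate(class_names))
    return (
        "Detect the objects in <image> that belong to {" + mapping + "}. "
        "Answer with (<class>, <x1>, <y1>, <x2>, <y2>) for each object."
    )


def quantize(value, extent, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(bin_index, extent, num_bins=NUM_BINS):
    """Map a bin index back to the pixel coordinate at the centre of that bin."""
    return (bin_index + 0.5) / num_bins * extent


def encode_box(class_index, box, width, height):
    """Turn (x1, y1, x2, y2) plus a class index into one class and four location tokens."""
    x1, y1, x2, y2 = box
    return [
        f"<c{class_index}>",
        f"<p{quantize(x1, width)}>",
        f"<p{quantize(y1, height)}>",
        f"<p{quantize(x2, width)}>",
        f"<p{quantize(y2, height)}>",
    ]


def decode_box(tokens, class_names, width, height):
    """Invert encode_box; the class token is resolved through the instruction's name list."""
    class_index = int(tokens[0][2:-1])          # "<c1>" -> 1
    bins = [int(tok[2:-1]) for tok in tokens[1:]]  # "<p51>" -> 51
    extents = (width, height, width, height)
    box = tuple(dequantize(b, e) for b, e in zip(bins, extents))
    return class_names[class_index], box


if __name__ == "__main__":
    names = ["cat", "dog"]
    tokens = encode_box(1, (64, 48, 320, 240), width=640, height=480)
    print(build_instruction(names))
    print(tokens)
    print(decode_box(tokens, names, 640, 480))
```

Because the class token is only an index into the list given in the instruction, swapping the list changes what `<c1>` means without touching the vocabulary. The round trip through `quantize` and `dequantize` is lossy, with an error of at most one bin width per coordinate.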

(a) Vision generalist models [59, 61, 83] are constrained by the format of pre-defined tasks.

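Experimental Results

Research Questions

RQ1: Can an LLM-based open-ended decoder be used effectively, via language instructions, to handle diverse vision-centric tasks?
RQ2: Without task-specific heads, to what extent can language prompts control task customization (the target objects and the output format)?
RQ3: How does the language-guided image tokenizer affect cross-modal alignment and performance on tasks such as detection, segmentation, grounding, captioning, and VQA?
RQ4: What are the trade-offs between single-task and multi-task training in a unified vision-language framework?
RQ5: How does output-format-as-query decoding affect efficiency and performance on vision tasks?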

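Key Findings

VisionLLM uses language instructions to achieve strong performance across a range of vision-centric tasks, including object detection, instance segmentation, visual grounding, image captioning, and VQA.
With a ResNet-50 backbone, VisionLLM reaches 44.6 mAP, 64.0 AP50, and 48.1 AP75 on detection, among other reported metrics; with the stronger InternImage-H backbone it reaches 60.2 mAP on COCO, close to state-of-the-art detection models.
The model performs well on visual grounding: on the RefCOCO validation set, P@0.5 is 80.6 with ResNet-50 and 86.7 with InternImage-H.
On image captioning, VisionLLM scores roughly 31.0–32.1 BLEU-4 and 112–114 CIDEr across backbones, which is competitive with vision-language baselines.
The framework supports fine-grained customization: the set of target categories (up to 80) and the number of output points (8–24) can be changed while AP remains reasonable.
A language-guided image tokenizer with a text encoder (BERT) and cross-attention outperforms alternative approaches in alignment and tokenization.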
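
For readers unfamiliar with the grounding metric quoted above, the following is a minimal sketch of how P@0.5 is commonly computed: the percentage of referring expressions whose single predicted box overlaps the ground-truth box with IoU of at least 0.5. The function names and the `>=` comparison are assumptions for illustration; this is not the evaluation script used in the paper, and some implementations use a strict `>` instead.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(predictions, ground_truths, threshold=0.5):
    """Percentage of paired boxes whose IoU meets the threshold.

    Assumes one predicted box per expression, listed in the same order
    as the ground-truth boxes.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if box_iou(pred, gt) >= threshold
    )
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (5, 0, 15, 10)]
    gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
    print(precision_at_iou(preds, gts))
```

In the usage example the first prediction matches exactly, while the second overlaps only a third of the union, so half of the expressions count as correct at the 0.5 threshold.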

(b) Visual prompt tuning [26, 64, 62] are inconsistent with the format of LLMs.


This review was generated by AI and reviewed by human editors.