QUICK REVIEW

[Paper Review] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin|arXiv (Cornell University)|Mar 8, 2023

Multimodal Machine Learning Applications|Cited by 187

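One-Sentence Summary

Visual ChatGPT integrates ChatGPT with multiple visual foundation models through a Prompt Manager, so that image understanding, image generation, and multi-step visual tasks can all be driven through language. This enables iterative, multi-model visual reasoning without retraining a multimodal agent.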

ABSTRACT

ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained with languages, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, show great visual understanding and generation capabilities, but they are experts only on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only languages but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps; 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models of multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.

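Motivation and Objectives

Extend ChatGPT's visual understanding and generation capabilities with existing Visual Foundation Models (VFMs) through a programmable Prompt Manager.
Enable ChatGPT to issue multi-round, multi-model workflows that invoke VFMs through natural-language prompts to carry out complex visual tasks.
Improve the robustness and coherence of multimodal tasks by converting visual signals into language and by managing model inputs/outputs and history.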

Proposed Method

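Build a Prompt Manager that:
(a) tells ChatGPT which VFMs are available and what their input/output formats are;
(b) converts visual data into language;
(c) handles the histories, priorities, and conflicts of the different models.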
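Introduce a pipeline in which ChatGPT calls several VFMs in sequence (for example, depth estimation first, then depth-to-image, and finally style transfer) to complete a complex user request.
Define system principles and a strict reasoning format to prevent fabrication and to guide tool use, including filename sensitivity and CoT-like multi-step reasoning.
Represent intermediate VFM outputs as chained filenames to preserve provenance and simplify later steps.
Support a broad set of 22 VFMs whose inputs and outputs are described through a structured prompt scheme, which makes zero-shot, multi-round collaboration possible.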
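
The sequential pipeline and the chained-filename idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the naming scheme `<new-id>_<operation>_<prev-id>_<original-id>`, the helper names `chained_name` and `run_pipeline`, and the stand-in tool `fake_vfm` are all hypothetical, and no real model is called.

```python
import uuid
from pathlib import Path


def chained_name(prev_path, operation):
    """Return a new file path that records the operation and its ancestry.

    Hypothetical scheme: <new-id>_<operation>_<prev-id>_<original-id>.<ext>
    The operation label must not contain an underscore.
    """
    prev = Path(prev_path)
    parts = prev.stem.split("_")
    # For a root image the stem has one part, so it is both "previous" and "original".
    prev_id, orig_id = parts[0], parts[-1]
    new_id = uuid.uuid4().hex[:8]
    return str(prev.with_name(f"{new_id}_{operation}_{prev_id}_{orig_id}{prev.suffix}"))


def fake_vfm(src_path, dst_path):
    """Stand-in for a visual foundation model; a real tool would write dst_path."""
    return None


def run_pipeline(image_path, steps):
    """Apply (operation, tool) pairs in order, feeding each output path to the next step."""
    history = [image_path]
    for operation, tool in steps:
        out_path = chained_name(history[-1], operation)
        tool(history[-1], out_path)
        history.append(out_path)
    return history


# Example: depth estimation, then depth-to-image, then style transfer.
example_history = run_pipeline(
    "image/abcd1234.png",
    [("depth", fake_vfm), ("depth2image", fake_vfm), ("style", fake_vfm)],
)
for path in example_history:
    print(path)
```

Because every filename carries the identifier of the image it was derived from and of the original upload, a later step (or the language model itself) can tell which intermediate result a file is without any extra bookkeeping.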

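Experimental Results

Research Questions

RQ1: How can a language model (ChatGPT) effectively orchestrate diverse VFMs to perform multi-step visual tasks without retraining a multimodal model?
RQ2: Which prompt-engineering strategies (system prompts, VFM prompts, query prompts) enable reliable, interpretable, and scalable multi-model visual workflows?
RQ3: Can a dispatcher-like Prompt Manager ensure correct VFM usage, data formats, and provenance tracking in complex image editing and generation tasks?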
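
To make RQ3 concrete, the sketch below shows one way a dispatcher could check a model reply before running any tool. It is a hypothetical illustration: the `Action:` / `Action Input:` line format, the `TOOLS` registry, and the functions `parse_action` and `dispatch` are assumptions made for this example and are not taken from the released system.

```python
# Hypothetical registry: tool name -> callable that takes an image path.
TOOLS = {
    "Depth Estimation": lambda path: f"depth map for {path}",
    "Style Transfer": lambda path: f"stylized {path}",
}


def parse_action(model_output):
    """Extract the tool name and its argument from an assumed line-based reply format."""
    fields = {}
    for line in model_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Action", "Action Input"):
            fields[key.strip()] = value.strip()
    return fields.get("Action"), fields.get("Action Input")


def dispatch(model_output, known_files):
    """Run the requested tool only if both the tool and the file are known."""
    tool, arg = parse_action(model_output)
    if tool is None or arg is None:
        raise ValueError("reply does not contain an Action / Action Input pair")
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if arg not in known_files:
        # Guards against file names the model made up.
        raise ValueError(f"unknown file: {arg!r}")
    return TOOLS[tool](arg)


reply = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Depth Estimation\n"
    "Action Input: image/abcd1234.png"
)
print(dispatch(reply, {"image/abcd1234.png"}))
```

Rejecting unknown tool names and unknown file names at this point is a simple way to enforce correct tool usage and filename sensitivity without trusting the model's reply.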

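Key Findings

Visual ChatGPT makes language-driven image understanding and generation possible through iterative VFM coordination.
The Prompt Manager effectively maps non-language signals to language, defines VFM capabilities, and manages inputs/outputs and history across models.
Case studies and qualitative analysis show how system principles, VFM prompts, and query prompts affect the success of multi-round, multi-model visual tasks.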
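
The three prompt layers named above (system principles, VFM prompts, query prompts) can be pictured as pieces of one text prompt. The following is a minimal sketch under stated assumptions: the `ToolSpec` fields, the wording in `SYSTEM_PRINCIPLES`, and the `build_prompt` layout are hypothetical and chosen for illustration, not copied from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one visual foundation model."""
    name: str
    usage: str
    inputs: str
    outputs: str


# Illustrative system principles; the real wording may differ.
SYSTEM_PRINCIPLES = (
    "You can call the tools listed below to work with images.",
    "Refer to images only by their exact file names and never invent one.",
    "Reason step by step and call one tool at a time.",
)


def render_tool(spec):
    """Turn a tool description into one line of the VFM prompt."""
    return f"> {spec.name}: {spec.usage} Input: {spec.inputs}. Output: {spec.outputs}."


def build_prompt(tools, history, user_query):
    """Concatenate system principles, tool descriptions, dialogue history, and the query."""
    lines = list(SYSTEM_PRINCIPLES)
    lines.append("Tools:")
    lines.extend(render_tool(t) for t in tools)
    lines.append("History:")
    lines.extend(history)
    lines.append(f"New input: {user_query}")
    return "\n".join(lines)


depth = ToolSpec(
    name="Depth Estimation",
    usage="Useful when you want a depth map of an image.",
    inputs="an image path",
    outputs="an image path",
)
print(build_prompt([depth], ["Human: provided image/abcd1234.png"], "Show me its depth map."))
```

Keeping the tool descriptions as data rather than hand-written prose means a new model can be added by registering one more `ToolSpec`, which matches the summary's point that no retraining is required.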

This review was generated by AI and checked by human editors.