QUICK REVIEW

[Paper Review] A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyou Fu|arXiv (Cornell University)|Jun 23, 2023

Topic Modeling | Cited by 85

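One-Sentence Summary

A survey that organizes and summarizes progress on Multimodal Large Language Models (MLLMs), details core techniques such as M-IT, M-ICL, M-CoT, and LAVR, and outlines open challenges and future directions.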

ABSTRACT

Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLM, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even better than GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First of all, we present the basic formulation of MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude the paper, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub link collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

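Motivation and Objectives

Define and formalize Multimodal Large Language Models (MLLMs) and related concepts.
Provide a comprehensive taxonomy of MLLM work, covering four categories: Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).
Summarize the key techniques, data strategies, bridging methods, and evaluation methods used in MLLMs.
Highlight open challenges and propose promising research directions for the field.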

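Proposed Approach

Gives a formal formulation of MLLMs and their instruction/interaction paradigms.
Groups existing work into four categories (M-IT, M-ICL, M-CoT, LAVR) and discusses their architectures and data requirements.
Describes data collection methods (benchmark adaptation, self-instruction, hybrid composition) and modality bridging (learnable interfaces and expert models).
Explains alignment pre-training and multimodal data construction for M-IT, including instruction templates and evaluation methods.
Summarizes the learning paradigms (fine-tuning, few-shot, zero-shot) and generation patterns (infilling-based vs. predicting-based) in M-CoT and LAVR, and discusses evaluation frameworks (closed-set vs. open-set).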
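The benchmark-adaptation and instruction-template ideas above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the survey's own code: the `InstructionSample` class, the `<image>` placeholder token, the template string, and the `adapt_vqa` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical placeholder marking where image features would be injected.
IMAGE_PLACEHOLDER = "<image>"

# Hypothetical instruction template; real systems each define their own.
DEFAULT_TEMPLATE = "Instruction: {instruction}\nInput: {image}\nResponse:"


@dataclass
class InstructionSample:
    """One multimodal instruction-tuning example (illustrative only)."""
    instruction: str
    image_path: str
    response: str


def adapt_vqa(question: str, answer: str, image_path: str) -> InstructionSample:
    """Benchmark adaptation: recast a VQA triple as an instruction sample."""
    return InstructionSample(instruction=question,
                             image_path=image_path,
                             response=answer)


def build_prompt(sample: InstructionSample,
                 template: str = DEFAULT_TEMPLATE) -> str:
    """Fill the template; the model is trained to produce sample.response."""
    return template.format(instruction=sample.instruction,
                           image=IMAGE_PLACEHOLDER)


sample = adapt_vqa("What color is the car?", "red", "car.jpg")
prompt = build_prompt(sample)
```

In this sketch the prompt carries only a placeholder for the image; how visual features actually replace that placeholder depends on the modality-bridging method a given model uses.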

Figure 1: Comparisons of three typical learning paradigms. The image is from [16].

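Results

Research Questions

RQ1: Which core paradigms enable multimodal reasoning in LLM-based systems?
RQ2: How do data construction and modality bridging affect MLLM performance across M-IT, M-ICL, M-CoT, and LAVR?
RQ3: Which evaluation strategies are suitable for multimodal instruction tuning and visual reasoning systems?
RQ4: What are the main challenges and potential directions in advancing MLLMs toward more general capabilities?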

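Key Findings

MLLMs draw on four main techniques: Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain-of-Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).
Data construction for M-IT uses benchmark adaptation, self-instruction, and hybrid composition to create multimodal instruction data.
Modality bridging is typically achieved through either a learnable interface or expert models, which convert visual content into a form LLMs can consume, such as text.
Evaluation distinguishes closed-set from open-set tasks; for open-ended multimodal tasks, additional benchmarks and human/AI scoring methods are available.
The survey identifies several future directions and ongoing challenges for MLLMs, including scalability, alignment, robustness, and multimodal reasoning capability.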
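To make the closed-set vs. open-set distinction concrete, here is a minimal sketch. It assumes closed-set answers are scored by normalized exact match and open-set responses are scored by an external judge (a human rater or an AI scorer) that returns a number; the function names and the `judge` callable are hypothetical, not an API from the survey.

```python
from typing import Callable, Sequence


def closed_set_accuracy(predictions: Sequence[str],
                        references: Sequence[str]) -> float:
    """Closed-set: answers come from a fixed set, so exact match suffices."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def open_set_mean_score(responses: Sequence[str],
                        judge: Callable[[str], float]) -> float:
    """Open-set: free-form answers are rated by a judge, then averaged."""
    scores = [judge(r) for r in responses]
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage with a stand-in judge; a real judge would be a human or a model.
acc = closed_set_accuracy(["A", "b", "C"], ["a", "B", "D"])
mean = open_set_mean_score(["good", "bad"],
                           lambda r: 1.0 if r == "good" else 0.0)
```

The judge is passed in as a plain callable so the same averaging code works whether ratings come from people or from another model.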

This review was generated by AI and reviewed by human editors.