QUICK REVIEW

[Paper Review] Efficient Multimodal Large Language Models: A Survey

Yizhang Jin, Jian Li|arXiv (Cornell University)|May 17, 2024

Natural Language Processing Techniques | Cited by 14

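One-sentence summary

A comprehensive survey of efficient multimodal LLMs that covers architectures, efficient vision and language components, training, data and benchmarks, and applications, with a taxonomy and future directions.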

ABSTRACT

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, research state of efficient structures and strategies, and the applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.

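Motivation and goals

High training and inference costs create demand for resource-efficient MLLMs.
Provide a systematic taxonomy of efficient MLLMs covering architecture, vision, LLMs, training, data, benchmarks, and applications.
Summarize representative efficient MLLMs and their components to guide research and deployment.
Highlight limitations and future directions to advance edge-friendly MLLMs.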

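Proposed approach

Organizes the existing literature into six categories: architecture, efficient vision, efficient LLMs, training, data and benchmarks, and applications.
Describes each component of an efficient MLLM: the vision encoder, the vision-language projector, and the small language model.
Reviews techniques for token compression, compact architectures, and efficient structures (such as MoE, Mamba, and inference acceleration).
Compares variants of vision encoders, projection methods, and lightweight LLM backbones used in efficient MLLMs.
Discusses the data and benchmarks used for pre-training and evaluation, and lists practical applications.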
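
The three components listed above (vision encoder, vision-language projector, small language model) can be illustrated with a toy sketch. This is not the survey's code or any specific model's implementation: the dimensions, function names, and random weights are all hypothetical, and the "encoder" simply emits random features. It only shows how visual tokens are projected to the LLM embedding width and prepended to the text tokens.

```python
import random

random.seed(0)

VISION_DIM = 8   # hypothetical vision feature width
LLM_DIM = 16     # hypothetical LLM embedding width


def encode_image(num_patches):
    """Toy stand-in for a vision encoder: one random feature vector per patch."""
    return [[random.gauss(0, 1) for _ in range(VISION_DIM)]
            for _ in range(num_patches)]


def make_projector():
    """Weights of a single linear layer mapping VISION_DIM to LLM_DIM."""
    return [[random.gauss(0, 0.1) for _ in range(LLM_DIM)]
            for _ in range(VISION_DIM)]


def project(tokens, weight):
    """Apply the linear projector to every visual token."""
    return [[sum(tok[i] * weight[i][j] for i in range(VISION_DIM))
             for j in range(LLM_DIM)]
            for tok in tokens]


def embed_text(token_ids):
    """Toy embedding lookup: a deterministic vector per token id."""
    return [[((tid + 1) * (j + 1) % 7) / 7.0 for j in range(LLM_DIM)]
            for tid in token_ids]


def build_llm_input(num_patches, token_ids):
    """Concatenate projected visual tokens with text embeddings."""
    visual = project(encode_image(num_patches), make_projector())
    text = embed_text(token_ids)
    return visual + text  # visual tokens come first in this sketch


sequence = build_llm_input(num_patches=4, token_ids=[5, 9, 2])
print(len(sequence), len(sequence[0]))  # 7 16
```

The language model then processes this combined sequence, so its cost depends on how many visual tokens the projector passes through.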

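Results

Research questions

RQ1: Which architectures and components enable resource-efficient MLLMs without substantially hurting performance?
RQ2: Which vision encoders, projection strategies, and compact LLMs give the best efficiency-accuracy trade-off?
RQ3: Which training strategies, data, and benchmarks support efficient MLLMs, and how do they scale?
RQ4: What are the practical applications and limitations of current efficient MLLMs in edge and resource-constrained environments?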

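Key findings

Efficient MLLMs reduce resource consumption by using compact LLM backbones (typically under 3 billion parameters) and lightweight vision-language projectors.
Several vision encoders and cross-modal fusion strategies produce competitive results; no single encoder consistently dominates across all tasks.
Visual token compression, multi-view input, and multi-scale information fusion significantly reduce computation while preserving performance.
Efficient structures such as MoE and Mamba, together with inference acceleration techniques, make multimodal inference more scalable and faster.
The comprehensive taxonomy and GitHub repository organize state-of-the-art methods and support ongoing updates and reproducibility.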
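
To make the visual token compression finding concrete, here is a minimal sketch that uses average pooling as one simple compression scheme. It is an illustration under stated assumptions, not a method taken from the survey: the 576-token input (a 24x24 patch grid), the window size, and the helper names are hypothetical, and pooling runs over consecutive tokens in a flat sequence rather than over 2D neighborhoods.

```python
def pool_tokens(tokens, window):
    """Average each run of `window` consecutive tokens into one token."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(tok[d] for tok in group) / len(group)
                       for d in range(dim)])
    return pooled


def attention_pairs(num_tokens):
    """Token pairs scored by full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


# Assumed input: 576 two-dimensional toy tokens.
tokens = [[float(i), float(2 * i)] for i in range(576)]
compressed = pool_tokens(tokens, window=4)

print(len(tokens), len(compressed))  # 576 144
print(attention_pairs(len(tokens)) / attention_pairs(len(compressed)))  # 16.0
```

Cutting the visual token count by 4x reduces the pairwise attention work over those tokens by 16x in this toy setting. Whether accuracy is preserved depends on the model and task, which this sketch does not measure.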

This review was generated by AI and checked by a human editor.