QUICK REVIEW

[Paper Review] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang|arXiv (Cornell University)|Jul 30, 2023

Topic Modeling | Cited by 52

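One-Sentence Summary

SEED-Bench introduces a large-scale benchmark of 19K multiple-choice questions with ground-truth answers, used to evaluate the generative comprehension of multimodal large language models (MLLMs) across 12 image and video dimensions, and built with a pipeline that combines automatic question generation and human verification.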

ABSTRACT

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple choice questions with accurate human annotations (6× larger than existing benchmarks), which spans 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding. By revealing the limitations of existing MLLMs through evaluation results, we aim for SEED-Bench to provide insights for motivating future research. We will launch and consistently maintain a leaderboard to provide a platform for the community to assess and investigate model capability.

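Motivation and Objectives

Provide a scalable, objective evaluation of generative comprehension in multimodal LLMs, covering both the image and video modalities.
Quantify performance across 12 distinct dimensions of spatial and temporal understanding.
Provide a leaderboard for comparing 18 models and guiding future research.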

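Proposed Method

Build 19K multiple-choice questions across 12 evaluation dimensions, each with a ground-truth answer derived from human annotation.
Automatically extract visual information from images (captions, instance-level descriptions, text) and use prompting to generate questions with four candidate options, one of which is the ground-truth answer.
Use multiple LLMs to filter out questions that can be answered without visual input.
Have human annotators select the correct option and assign each question to an evaluation dimension.
Evaluate a model by computing the likelihood of each candidate option given the question, ranking the options, and selecting the one with the highest likelihood.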
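The likelihood-based answer selection in the last step can be illustrated with a minimal sketch. It assumes a scoring callable, here the hypothetical `log_likelihood(question, option)`, that returns the model's log-likelihood of an option given the question (and, inside a real model, the image or video). The toy scores below are made up for illustration and are not results from the paper. The summary does not say whether likelihoods are length-normalized, so this sketch does no normalization.

```python
from typing import Callable, Sequence


def pick_answer(
    question: str,
    options: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> int:
    """Return the index of the option with the highest log-likelihood."""
    scores = [log_likelihood(question, option) for option in options]
    # Rank by score and keep the top option; ties resolve to the earliest index.
    return max(range(len(options)), key=scores.__getitem__)


# Hypothetical stand-in for a real MLLM scorer: a fixed lookup of made-up scores.
toy_scores = {"a dog": -4.2, "a cat": -1.3, "a bird": -6.0, "a horse": -5.1}


def toy_log_likelihood(question: str, option: str) -> float:
    return toy_scores[option]


options = list(toy_scores)
idx = pick_answer("What animal is in the image?", options, toy_log_likelihood)
print(options[idx])
```

Because the prediction is an argmax over a fixed option set, no free-form output has to be parsed or judged, which is what lets the evaluation run without human or GPT intervention.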

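Experimental Results

Research Questions

RQ1: How capable are current MLLMs across a comprehensive set of spatial and temporal understanding tasks?
RQ2: How do image-only, video, and mixed multimodal models compare across the 12 SEED-Bench dimensions?
RQ3: Can a large-scale multiple-choice benchmark built on ground-truth annotations provide stable, objective evaluation at test time without human or GPT intervention?
RQ4: What insights emerge about the strengths and weaknesses of different model families (ImageLLMs, VideoLLMs, LLMs) in visual and temporal reasoning?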

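Key Findings

SEED-Bench shows that most MLLMs achieve only limited performance across the 12 dimensions, with a particularly large gap in fine-grained temporal understanding.
InstructBLIP leads in average performance on the spatial dimensions and also outperforms some VideoLLMs on the temporal dimensions.
VideoLLMs do not consistently outperform ImageLLMs on temporal understanding, which indicates room for improvement in fine-grained video reasoning.
Most models struggle with text recognition and spatial relation understanding, which highlights gaps in OCR-heavy and relational reasoning tasks.
Some models (e.g., InstructBLIP, VPGTrans) perform well on specific dimensions such as visual reasoning or action recognition, but their overall performance across multiple tasks remains below the peak of the LLM baselines.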
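The per-dimension comparisons above rest on tallying multiple-choice predictions by evaluation dimension. A minimal sketch of that bookkeeping follows, assuming plain accuracy is the metric and that each record is a hypothetical `(dimension, predicted_index, correct_index)` tuple. The toy records are invented for illustration and do not reproduce any reported numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (dimension, predicted_index, correct_index)


def accuracy_by_dimension(records: Iterable[Record]) -> Dict[str, float]:
    """Compute the fraction of correct predictions within each dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, gold in records:
        total[dimension] += 1
        correct[dimension] += int(predicted == gold)
    return {dimension: correct[dimension] / total[dimension] for dimension in total}


# Made-up records; dimension names are taken from the findings above.
toy_records = [
    ("text recognition", 0, 2),
    ("text recognition", 1, 1),
    ("action recognition", 3, 3),
    ("action recognition", 2, 2),
]

per_dimension = accuracy_by_dimension(toy_records)
# Unweighted mean over dimensions, one possible way to average a dimension group.
mean_accuracy = sum(per_dimension.values()) / len(per_dimension)
print(per_dimension, mean_accuracy)
```

Whether a group average is taken over dimensions or over all questions changes the number when dimensions differ in size; the summary does not specify which is used, so the unweighted mean here is only one option.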


This review was generated by AI and reviewed by a human editor.