QUICK REVIEW

[Paper Review] VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

Yonatan Bitton, Hritik Bansal|arXiv (Cornell University)|Aug 12, 2023

Multimodal Machine Learning Applications | Cited by 11

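One-Sentence Summary

VisIT-Bench is a dynamic vision-language instruction-following benchmark of 592 test queries drawn from 70 instruction families, with human-verified references and an Elo-based leaderboard, for evaluating multimodal chatbots in real-world scenarios.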

ABSTRACT

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's response on the project website. Data, code, and leaderboard are available at visit-bench.github.io.

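Research Motivation and Goals

Create a benchmark that reflects how instruction-following models are used for real-world vision-language tasks.
Cover a broad range of tasks through 70 instruction families, spanning basic recognition to open-ended generation.
Provide human-verified references and an automatic evaluation that aligns with human judgment.
Maintain a dynamic leaderboard that tracks the progress of multimodal chatbots over time.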

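Proposed Method

Curate 70 instruction families into 592 test queries, each paired with an instruction-conditioned caption.
Write the instruction-conditioned captions so that they surface the task-specific details an evaluator needs.
Collect GPT-4 reference outputs and keep only those that pass human verification.
Evaluate model outputs with pairwise human judgments and Elo ratings.
Develop a GPT-4-based automatic evaluator (GPT4-no-ref) that correlates with human preferences.
Release the data, code, and a dynamic leaderboard for community benchmarking.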
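
The Elo step above can be illustrated with a minimal sketch. It applies the standard Elo update to a list of pairwise judgments; the K-factor of 32, the starting rating of 1000, and the names `expected_score` and `elo_ratings` are assumptions made for illustration, not settings or code taken from the paper. Ties are not modeled.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(matches, k=32.0, initial=1000.0):
    """Compute Elo ratings from (system_a, system_b, winner) triples.

    k and initial are illustrative defaults (assumptions), not the
    benchmark's actual configuration.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, winner in matches:
        if winner not in (a, b):
            raise ValueError("winner must be one of the two systems")
        rating_a, rating_b = ratings[a], ratings[b]
        exp_a = expected_score(rating_a, rating_b)
        score_a = 1.0 if winner == a else 0.0
        # Zero-sum update: what A gains, B loses.
        ratings[a] = rating_a + k * (score_a - exp_a)
        ratings[b] = rating_b - k * (score_a - exp_a)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical judgments; the system names are placeholders.
    judgments = [
        ("model_x", "reference", "reference"),
        ("model_x", "model_y", "model_x"),
        ("model_y", "reference", "reference"),
    ]
    for name, rating in sorted(elo_ratings(judgments).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because sequential Elo depends on the order of the matches, a leaderboard would typically fix or shuffle that order in a documented way; this sketch simply processes the judgments as given.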

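Experimental Results

Research Questions

RQ1: How well do current vision-language instruction-following models perform on real-world, open-ended tasks?
RQ2: Can instruction-conditioned captions enable reliable automatic evaluation that agrees with human judgment?
RQ3: How do state-of-the-art models compare on single-image versus multi-image tasks in VisIT-Bench?
RQ4: How does the GPT-4-based automatic evaluation compare with human judgment when ranking model outputs?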

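Key Findings

VisIT-Bench shows a substantial gap between models and the human-verified references: the best model wins only 27.4% of its comparisons against the reference (single-image results).
Elo rankings derived from 5K pairwise human judgments separate the models (for example, LLaMA-Adapter-v2 leads the reference in some comparisons).
Instruction-conditioned captions are essential: using detailed descriptions rather than BLIP-2 captions markedly raises the rate of correct instruction following (91% vs. 31%).
The GPT-4-based automatic evaluator (GPT4-no-ref) correlates most strongly with human judgment and reproduces the human majority vote with high accuracy (for example, 93% when all annotators agree).
VisIT-Bench provides a dynamic leaderboard that is updated as new models and instances are evaluated, tracking progress in multimodal instruction following.
The public release of the dataset and leaderboard enables community-driven benchmarking and method development.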
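
The agreement figure reported for the automatic evaluator can be read as the fraction of comparisons in which the evaluator picks the same winner as the human majority vote. The following is a minimal sketch of that computation, under the assumption that each comparison stores a list of human votes and one automatic choice; the function names and the toy data are hypothetical and do not come from the paper's code release.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common vote, or None on an empty list or a tie."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def agreement_with_majority(items, unanimous_only=False):
    """Fraction of items where the automatic choice matches the human majority.

    items: iterable of (human_votes, auto_choice) pairs.
    unanimous_only: if True, keep only items where every annotator agrees.
    """
    hits = 0
    total = 0
    for human_votes, auto_choice in items:
        if unanimous_only and len(set(human_votes)) != 1:
            continue
        winner = majority_vote(human_votes)
        if winner is None:  # skip ties and empty vote lists
            continue
        total += 1
        hits += int(auto_choice == winner)
    return hits / total if total else float("nan")


if __name__ == "__main__":
    # Toy data (made up for illustration): "A" and "B" label the two candidates.
    toy = [
        (["A", "A", "A"], "A"),
        (["A", "B", "A"], "B"),
        (["B", "B", "B"], "B"),
    ]
    print(agreement_with_majority(toy))
    print(agreement_with_majority(toy, unanimous_only=True))
```

Restricting the computation to unanimous items, as `unanimous_only=True` does, mirrors the "when all annotators agree" condition in the finding above.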


This review was generated by AI and checked by a human editor.