QUICK REVIEW

[Paper Review] Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang|arXiv (Cornell University)|Oct 17, 2023

Multimodal Machine Learning Applications|Cited by 23

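One-Sentence Summary

Set-of-Mark (SoM) prompting overlays interpretable marks on semantically meaningful image regions, which substantially improves GPT-4V's visual grounding on fine-grained tasks in the zero-shot setting.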

ABSTRACT

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.

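Motivation and Objectives

Identify and close the gap in GPT-4V's fine-grained visual grounding ability.
Introduce Set-of-Mark prompting to unlock region-level grounding without fine-tuning.
Demonstrate the effectiveness of SoM across a range of vision tasks and benchmarks.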

Proposed Method

Partition images into semantically meaningful regions using off-the-shelf segmentation models (e.g., MaskDINO, SEEM, SAM, Semantic-SAM).
Overlay each region with distinguishable marks (numbers, letters, boxes, masks) to produce a marked image I^m.
Generate marks and allocate their locations using a conflict-aware algorithm that favors smaller regions first and uses distance transforms to place marks.
Prompt GPT-4V with either plain text or interleaved prompts that reference the marked regions to ground visual content textually and spatially.
Optionally allow user-driven or auto-generated marks and leverage new chat windows to prevent context leakage in zero-shot evaluation.
Evaluate SoM across vision tasks such as open-vocabulary segmentation, referring segmentation, phrase grounding, video object segmentation, and related grounding benchmarks.
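
The conflict-aware mark allocation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes masks are small binary grids given as lists of lists, it substitutes a breadth-first city-block distance transform for whatever distance transform the released code uses, and the names distance_transform, find_center, and allocate_marks, as well as the fallback for fully covered regions, are hypothetical.

```python
from collections import deque


def distance_transform(mask):
    """City-block distance from each foreground cell to the nearest background cell.

    The image border counts as background, so the mask is padded with zeros.
    """
    h, w = len(mask), len(mask[0])
    padded = [[0] * (w + 2)] + [[0, *row, 0] for row in mask] + [[0] * (w + 2)]
    dist = [[0 if v == 0 else None for v in row] for row in padded]
    queue = deque((y, x) for y in range(h + 2) for x in range(w + 2)
                  if padded[y][x] == 0)
    while queue:  # multi-source breadth-first search outward from the background
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h + 2 and 0 <= nx < w + 2 and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return [row[1:-1] for row in dist[1:-1]]  # drop the padding ring


def find_center(mask):
    """Return the (row, col) lying deepest inside the mask, or None if it is empty."""
    best, center = 0, None
    for y, row in enumerate(distance_transform(mask)):
        for x, d in enumerate(row):
            if d > best:  # ties keep the first cell in row-major order
                best, center = d, (y, x)
    return center


def allocate_marks(masks):
    """Pick one mark location per mask, processing smaller regions first."""
    h, w = len(masks[0]), len(masks[0][0])
    order = sorted(range(len(masks)), key=lambda k: sum(map(sum, masks[k])))
    occupied = [[0] * w for _ in range(h)]
    centers = [None] * len(masks)
    for k in order:
        # Exclude pixels already claimed by smaller regions before placing the mark.
        visible = [[int(bool(m) and not o) for m, o in zip(mrow, orow)]
                   for mrow, orow in zip(masks[k], occupied)]
        center = find_center(visible)
        if center is None:
            # Assumption (not from the source): fall back to the full mask
            # when a region is completely covered by smaller ones.
            center = find_center(masks[k])
        centers[k] = center
        for y in range(h):
            for x in range(w):
                if masks[k][y][x]:
                    occupied[y][x] = 1
    return centers


# A 5x5 region that fully contains a 3x3 region in its top-left corner.
big = [[1] * 5 for _ in range(5)]
small = [[1 if y < 3 and x < 3 else 0 for x in range(5)] for y in range(5)]
centers = allocate_marks([big, small])
print(centers)  # [(3, 3), (1, 1)]
```

In this toy case the naive center of the large region, (2, 2), falls inside the small region, whereas the conflict-aware pass moves the large region's mark to (3, 3), so the two marks do not compete for the same pixels.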

Experimental Results

Research Questions

RQ1: Can SoM prompts enable GPT-4V to ground visual content location-by-location without model fine-tuning?
RQ2: How do different mark types (numbers, boxes, masks) and mark allocation strategies affect grounding performance across tasks?
RQ3: What is the impact of using ground-truth segmentation masks versus predicted masks on grounding accuracy?
RQ4: To what extent does SoM bridge GPT-4V performance gaps with specialist models on fine-grained grounding tasks?

Key Findings

SoM substantially enhances GPT-4V grounding: in the zero-shot setting, GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg.
Using a set of visually interpretable marks enables GPT-4V to produce region-grounded text and map marks to corresponding image regions (r_k ↔ m_k ↔ text_k).
Adding boxes to marks further improves performance on phrase grounding tasks, and using ground-truth masks markedly boosts referring segmentation results (e.g., +14.5 mIoU on RefCOCOg).
SoM enables zero-shot performance approaching or surpassing some fully-finetuned specialist models on selected tasks, and yields best tracking performance on DAVIS2017 when combining multiple frames.
Qualitative analyses reveal dataset annotation noise and non-centered mark placement can influence grounding, highlighting areas for improved mark allocation and prompting strategy.
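
The mark-to-region mapping (r_k ↔ m_k ↔ text_k) and the IoU-style scoring behind the mIoU figures above might be illustrated as follows. This is only a sketch under stated assumptions: the reply format with bracketed numeric marks such as "[2]", the sample reply text, and the helper names parse_marks, ground_answer, and mask_iou are all hypothetical and are not taken from the paper or its code.

```python
import re


def parse_marks(answer):
    """Return the numeric marks mentioned in a reply, in order of first appearance.

    Assumption: marks are written as bracketed integers, e.g. "[3]".
    """
    seen = []
    for token in re.findall(r"\[(\d+)\]", answer):
        mark = int(token)
        if mark not in seen:
            seen.append(mark)
    return seen


def ground_answer(answer, regions):
    """Map each mark mentioned in the reply back to its region mask."""
    return {k: regions[k] for k in parse_marks(answer) if k in regions}


def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as lists of lists."""
    inter = sum(p & t for prow, trow in zip(pred, truth) for p, t in zip(prow, trow))
    union = sum(p | t for prow, trow in zip(pred, truth) for p, t in zip(prow, trow))
    return inter / union if union else 0.0


# Toy example: two marked regions on a 2x2 image and a made-up reply.
regions = {1: [[1, 1], [0, 0]], 2: [[0, 0], [1, 1]]}
reply = "The laptop is [2]; the cup next to it is [1], and [2] sits on the desk."
grounded = ground_answer(reply, regions)

ground_truth = [[0, 0], [1, 0]]
print(list(grounded))                       # [2, 1]
print(mask_iou(regions[2], ground_truth))   # 0.5
```

Averaging such per-example IoU values over a benchmark gives an mIoU-style score; the exact evaluation protocol used in the paper is not reproduced here.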

This review was generated by AI and checked by a human editor.