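[Paper Summary] Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark (SoM) prompting overlays semantically meaningful image regions with interpretable marks, substantially improving GPT-4V's visual grounding on fine-grained tasks in the zero-shot setting.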
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
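Motivation and Objectives
- Identify and address the gap in GPT-4V's fine-grained visual grounding ability.
- Introduce Set-of-Mark prompting to unlock region-level grounding without fine-tuning.
- Demonstrate the effectiveness of SoM across a range of vision tasks and benchmarks.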
Proposed Method
- Partition images into semantically meaningful regions using off-the-shelf segmentation models (e.g., MaskDINO, SEEM, SAM, Semantic-SAM).
- Overlay each region with distinguishable marks (numbers, letters, boxes, masks) to produce a marked image I^m.
- Generate marks and allocate their locations using a conflict-aware algorithm that favors smaller regions first and uses distance transforms to place marks.
- Prompt GPT-4V with either plain text or interleaved prompts that reference the marked regions to ground visual content textually and spatially.
- Support both user-provided and automatically generated marks, and use fresh chat windows to prevent context leakage during zero-shot evaluation.
- Evaluate SoM across vision tasks such as open-vocabulary segmentation, referring segmentation, phrase grounding, video object segmentation, and related grounding benchmarks.
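The mark-allocation step described above (smaller regions first, conflicts resolved, marks placed via a distance transform) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are plain nested lists of 0/1, the function names `distance_to_background` and `allocate_marks` are hypothetical, and a city-block distance computed by breadth-first search stands in for a Euclidean distance transform to keep the example dependency-free.

```python
from collections import deque

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def distance_to_background(mask):
    """City-block distance from each region pixel to the nearest non-region pixel.

    Pixels outside the image count as non-region, so region pixels on the
    image border get distance 1. Non-region pixels keep distance 0.
    """
    h, w = len(mask), len(mask[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            on_edge = any(
                not (0 <= y + dy < h and 0 <= x + dx < w) or not mask[y + dy][x + dx]
                for dy, dx in NEIGHBORS
            )
            if on_edge:
                dist[y][x] = 1
                queue.append((y, x))
    # Multi-source BFS inward from the region boundary.
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and dist[ny][nx] == 0:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def allocate_marks(masks):
    """Return {mask_index: (row, col)} mark positions, placing small regions first."""
    h, w = len(masks[0]), len(masks[0][0])
    # Smaller regions are handled first so they keep their most central pixel.
    order = sorted(range(len(masks)), key=lambda k: sum(map(sum, masks[k])))
    occupied = [[False] * w for _ in range(h)]
    centers = {}
    for k in order:
        # Conflict handling: ignore pixels already covered by earlier regions.
        free = [
            [bool(masks[k][y][x]) and not occupied[y][x] for x in range(w)]
            for y in range(h)
        ]
        if any(map(any, free)):
            dist = distance_to_background(free)
            # Deepest interior pixel; ties go to the top-most, then left-most pixel.
            _, neg_y, neg_x = max(
                (dist[y][x], -y, -x) for y in range(h) for x in range(w) if free[y][x]
            )
            centers[k] = (-neg_y, -neg_x)
        # Regions that are fully covered get no mark in this simplified sketch.
        for y in range(h):
            for x in range(w):
                occupied[y][x] = occupied[y][x] or bool(masks[k][y][x])
    return centers
```

For a 5x5 image with one mask covering everything and a second one-pixel mask at the center, the small mask claims the center pixel first and the large mask's mark is pushed to a free interior pixel, so the two marks do not overlap.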
Experimental Results
Research Questions
- RQ1: Can SoM prompts enable GPT-4V to ground visual content location-by-location without model fine-tuning?
- RQ2: How do different mark types (numbers, boxes, masks) and mark allocation strategies affect grounding performance across tasks?
- RQ3: What is the impact of using ground-truth segmentation masks versus predicted masks on grounding accuracy?
- RQ4: To what extent does SoM bridge GPT-4V performance gaps with specialist models on fine-grained grounding tasks?
Key Findings
- SoM substantially enhances GPT-4V grounding, outperforming several state-of-the-art specialists on certain zero-shot tasks (e.g., RefCOCOg).
- Using a set of visually interpretable marks enables GPT-4V to produce region-grounded text and map marks to corresponding image regions (r_k ↔ m_k ↔ text_k).
- Adding boxes to marks further improves performance on phrase grounding tasks, and utilizing ground-truth masks markedly boosts referring segmentation results (e.g., +14.5 mIoU in RefCOCOg).
- SoM enables zero-shot performance approaching or surpassing some fully-finetuned specialist models on selected tasks, and yields best tracking performance on DAVIS2017 when combining multiple frames.
- Qualitative analyses reveal dataset annotation noise and non-centered mark placement can influence grounding, highlighting areas for improved mark allocation and prompting strategy.
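The region, mark, and text correspondence noted above (r_k ↔ m_k ↔ text_k) suggests a simple post-processing step: find the marks cited in the model's answer and look up the regions they label. The sketch below is illustrative only. It assumes numeric marks that appear in the answer as bracketed numbers such as `[3]`, which is a hypothetical convention rather than a documented output format, and `ground_answer` is a made-up helper name.

```python
import re


def ground_answer(answer, regions):
    """Map marks cited in a text answer back to their image regions.

    `regions` maps a numeric mark id to whatever represents the region
    (a mask, a box, ...). Returns (mark_id, region) pairs in order of
    first mention; cited ids with no known region are skipped.
    """
    cited = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        mark_id = int(match.group(1))
        if mark_id in regions and mark_id not in cited:
            cited.append(mark_id)
    return [(mark_id, regions[mark_id]) for mark_id in cited]


# Example with placeholder region objects.
regions = {1: "mask_a", 2: "mask_b"}
reply = "The dog [2] sits beside the chair [1]; [2] is asleep and [7] is unclear."
print(ground_answer(reply, regions))  # [(2, 'mask_b'), (1, 'mask_a')]
```

Skipping unknown ids such as `[7]` is a design choice for this sketch; a stricter pipeline might flag them as hallucinated marks instead.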
This summary was generated by AI and reviewed by a human editor.