QUICK REVIEW

[Paper Review] Fine-Grained Visual Prompting

Lingfeng Yang, Yueze Wang|arXiv (Cornell University)|Jun 7, 2023

Multimodal Machine Learning Applications | Cited by 8

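One-Sentence Summary

FGVP introduces pixel-accurate visual prompting: it uses semantic masks (obtained from SAM) together with the Blur Reverse Mask strategy to improve zero-shot referring expression comprehension and part detection with off-the-shelf vision-language models.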

ABSTRACT

Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
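
As a rough illustration of the Blur Reverse Mask idea described in the abstract, the NumPy sketch below keeps the pixels inside a target mask untouched and blurs everything outside it. It is not the authors' implementation: the names `box_blur` and `blur_reverse_mask`, the `radius` parameter, and the use of a simple mean filter are assumptions made for this example. In practice the mask would come from a segmentation model such as SAM.

```python
import numpy as np


def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Blur an H x W x C image with a (2*radius+1) x (2*radius+1) mean filter.

    Assumption: a plain box filter stands in for whatever blur the paper uses.
    """
    k = 2 * radius + 1
    # Replicate edge pixels so the output keeps the input height and width.
    padded = np.pad(
        image.astype(np.float64),
        ((radius, radius), (radius, radius), (0, 0)),
        mode="edge",
    )
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def blur_reverse_mask(image: np.ndarray, mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Keep pixels where `mask` is True sharp; blur all pixels outside the mask."""
    blurred = box_blur(image, radius)
    keep = mask.astype(bool)[..., None]  # H x W x 1, broadcasts over channels
    return np.where(keep, image.astype(np.float64), blurred)
```

The target stays in its original position and the background is still visible in softened form, which matches the abstract's point about retaining spatial coherence between the target and its surroundings.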

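Motivation and Objectives

Refine visual prompting for off-the-shelf VLMs to enable zero-shot instance-level understanding.
Systematically compare prompt formats (crop, box, circle, mask) and their variants (blur, grayscale, color, line).
Propose the Blur Reverse Mask prompt, a robust background-blurring strategy that reduces noise.
Use SAM to generate fine-grained masks (with or without a detector) for zero-shot tasks.
Demonstrate state-of-the-art zero-shot performance on referring expression benchmarks and improve part detection on PACO.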

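Proposed Method

Define a zero-shot framework in which the image prompt I_Phi is produced by visual prompting VP(I, Phi) and matched against the text T by a VLM such as CLIP.
Use SAM to generate semantic masks M from the box proposals Phi, yielding the fine-grained prompt I_Phi = FGVP(I, M).
Explore a detector-free zero-shot pipeline: prompt SAM with a grid of keypoints G, apply NMS to obtain masks, then derive each mask's minimum enclosing box.
Evaluate multiple prompt formats (crop, box, circle, mask) and their variants (line, color, grayscale, blur), including Blur Reverse Mask.
Apply post-processing options (Relations, Subtraction) for referring expression tasks, and Hungarian matching for part detection.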
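
The detector-free step above (filter candidate masks with NMS, then derive a minimum enclosing box per mask) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's code: it assumes NMS is computed on mask IoU, and the names `mask_iou`, `mask_nms`, `min_enclosing_box` and the `iou_threshold` value are hypothetical. The candidate masks and their scores are taken as given, since producing them would require running SAM on the keypoint grid.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean H x W masks."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def mask_nms(masks, scores, iou_threshold=0.7):
    """Greedy NMS over masks; returns indices of kept masks, best score first.

    Assumption: suppression uses mask IoU with a hypothetical threshold.
    """
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a mask only if it does not overlap too much with any kept mask.
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


def min_enclosing_box(mask: np.ndarray):
    """Tightest (x_min, y_min, x_max, y_max) box around True pixels, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Each surviving mask can then serve both as the fine-grained prompt and, through its enclosing box, as the region proposal that a detector would otherwise supply.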

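Experimental Results

Research Questions

RQ1: Do fine-grained semantic masks improve zero-shot localization and recognition more than coarse visual prompts?
RQ2: Which prompt design (for example, Blur Reverse Mask) achieves the best zero-shot performance on referring expression and part detection tasks?
RQ3: How does FGVP compare with previous zero-shot methods on the RefCOCO, RefCOCO+, RefCOCOg, and PACO datasets?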

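Key Findings

The Blur Reverse Mask prompt achieves the best overall zero-shot performance across all evaluated datasets.
FGVP outperforms methods such as RedCircle and CPT/ReCLIP overall, with an average gain of about 3.0% to 4.6% and a maximum gain of 12.5% on RefCOCO+ testA.
FGVP reaches state-of-the-art zero-shot results on the referring expression benchmarks (RefCOCO, RefCOCO+, RefCOCOg).
On PACO, FGVP shows higher part detection accuracy than previous visual prompting methods.
In the zero-shot setting without box proposals, the Blur Reverse Mask prompt still outperforms some coarse prompts.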
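
For the part-detection result above, the method relies on a one-to-one matching between prompted regions and part labels. The sketch below shows why such matching can differ from picking the best region per label independently. It is only an illustration: it uses a brute-force search over assignments as a stand-in for the Hungarian algorithm (both return an optimal assignment, but brute force is practical only for tiny inputs), and the function name `best_one_to_one` and the toy similarity scores are hypothetical.

```python
from itertools import permutations


def best_one_to_one(scores):
    """Find the region-to-label assignment with the highest total similarity.

    scores[i][j] is the similarity between prompted region i and label j.
    Every label gets exactly one distinct region, so this requires at least
    as many regions as labels. Brute force stands in for Hungarian matching.
    """
    n_regions = len(scores)
    n_labels = len(scores[0])
    best_total = float("-inf")
    best_pairs = []
    # Try every ordered choice of distinct regions, one region per label.
    for regions in permutations(range(n_regions), n_labels):
        total = sum(scores[r][j] for j, r in enumerate(regions))
        if total > best_total:
            best_total = total
            best_pairs = [(r, j) for j, r in enumerate(regions)]
    return best_pairs, best_total


# Toy example: region 0 scores highest for both labels, so independent
# per-label argmax would assign it twice; one-to-one matching avoids that.
toy_scores = [
    [0.90, 0.80],
    [0.85, 0.10],
    [0.20, 0.30],
]
pairs, total = best_one_to_one(toy_scores)
```

In the toy example, region 1 is assigned to label 0 and region 0 to label 1, since that pairing has the highest combined score even though region 0 is the single best match for label 0.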


This review was generated by AI and reviewed by human editors.