[Paper Review] Fine-Grained Visual Prompting
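tldr: FGVP improves zero-shot referring expression comprehension and part detection with off-the-shelf vision-language models by combining pixel-precise visual prompts, built from semantic masks obtained via SAM, with the Blur Reverse Mask strategy.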
Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
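The Blur Reverse Mask idea described in the abstract (blur everything outside the target mask, keep the target sharp) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a grayscale image stored as nested lists, uses a simple box blur as a stand-in for whatever blur kernel the paper uses, and the names `box_blur`, `blur_reverse_mask`, and `radius` are hypothetical.

```python
def box_blur(image, radius):
    """Mean filter over a (2*radius+1)^2 window, clipped at the image border.

    Assumption: a box blur stands in for the paper's (unspecified here) blur.
    """
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            rows = range(max(0, r - radius), min(height, r + radius + 1))
            cols = range(max(0, c - radius), min(width, c + radius + 1))
            window = [image[i][j] for i in rows for j in cols]
            blurred[r][c] = sum(window) / len(window)
    return blurred


def blur_reverse_mask(image, mask, radius=1):
    """Keep pixels inside the mask sharp; use blurred values everywhere else."""
    blurred = box_blur(image, radius)
    return [
        [image[r][c] if mask[r][c] else blurred[r][c]
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


# Toy 4x4 grayscale image with a bright 2x2 "object" in the centre.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 8.0, 0.0],
    [0.0, 8.0, 8.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Hypothetical target mask covering the object (True = target pixel).
mask = [[1 <= r <= 2 and 1 <= c <= 2 for c in range(4)] for r in range(4)]

prompted = blur_reverse_mask(image, mask, radius=1)
```

In the toy example the masked pixels keep their original value while the background is smoothed, so the surrounding context stays visible but carries less detail than the target.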
Research Motivation and Objectives
- Improve zero-shot instance-level understanding with off-the-shelf VLMs by refining visual prompts.
- Systematically compare prompt formats (crop, box, circle, mask) and their variants (blur, grayscale, color, lines).
- Propose Blur Reverse Mask prompting as a robust, background-blurring strategy to reduce noise.
- Leverage SAM to generate fine-grained masks with or without detectors for zero-shot tasks.
- Demonstrate SOTA zero-shot performance on referring expression benchmarks and improved part detection on PACO.
Proposed Method
- Define a zero-shot framework in which image prompts I_Phi, produced by visual prompting VP(I, Phi), are matched against texts T by a VLM such as CLIP.
- Use SAM to generate semantic masks M from box proposals Phi and obtain fine-grained prompts I_Phi = FGVP(I, M).
- Explore a zero-shot pipeline without detectors by prompting SAM with grid-wise keypoints G and applying NMS to obtain masks, then derive the smallest enclosing boxes.
- Evaluate multiple prompt formats (crop, box, circle, mask) and their variants (line, color, grayscale, blur) including Blur Reverse Mask.
- Incorporate post-processing options (Relations, Subtraction) for referring expression tasks and Hungarian matching for part detection.
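The detector-free pipeline listed above (grid-wise keypoints, NMS over the resulting masks, smallest enclosing boxes) can be sketched as follows. This is a sketch under stated assumptions, not the paper's code: the SAM call is replaced by hand-written stand-in masks, masks are stored as sets of (row, col) pixels, NMS is assumed to be greedy and IoU-based, and the function names and the 0.5 threshold are hypothetical.

```python
def grid_keypoints(width, height, points_per_side):
    """Evenly spaced (x, y) point prompts placed at grid-cell centres."""
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [
        ((i + 0.5) * step_x, (j + 0.5) * step_y)
        for j in range(points_per_side)
        for i in range(points_per_side)
    ]


def mask_iou(a, b):
    """IoU of two masks stored as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS: visit masks by descending score, drop heavy overlaps."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


def enclosing_box(mask):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) of a non-empty mask."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return (min(cols), min(rows), max(cols), max(rows))


def rect(r0, c0, r1, c1):
    """Helper that builds a rectangular toy mask (inclusive bounds)."""
    return {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


# Point prompts G for an 8x8 image; in the real pipeline these would be fed to SAM.
points = grid_keypoints(width=8, height=8, points_per_side=2)

# Stand-in for SAM output: candidate masks with confidence scores (made-up values).
masks = [rect(0, 0, 3, 3), rect(0, 0, 3, 2), rect(5, 5, 7, 7)]
scores = [0.9, 0.8, 0.7]

kept = mask_nms(masks, scores)                    # indices of surviving masks
boxes = [enclosing_box(masks[i]) for i in kept]   # one box per surviving mask
```

Here the second mask overlaps the first with IoU 0.75 and is suppressed, leaving two masks whose enclosing boxes can serve as proposals in place of a detector's output.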
Experimental Results
Research Questions
- RQ1: Can fine-grained semantic masks improve zero-shot localization and recognition compared to coarse visual prompts?
- RQ2: What prompting design (e.g., Blur Reverse Mask) yields the best zero-shot performance across referring expression and part detection tasks?
- RQ3: How does FGVP perform relative to prior zero-shot methods on RefCOCO, RefCOCO+, RefCOCOg and PACO datasets?
Key Findings
- Blur Reverse Mask prompting achieves the best overall zero-shot performance across evaluated datasets.
- FGVP surpasses prior methods such as RedCircle and CPT/ReCLIP by an average of 3.0% to 4.6%, with a maximum improvement of 12.5% on RefCOCO+ testA.
- FGVP achieves state-of-the-art zero-shot performance on referring expression benchmarks (RefCOCO, RefCOCO+, RefCOCOg).
- On PACO, FGVP demonstrates stronger part detection accuracy than previous visual prompting methods.
- In zero-shot settings without box proposals, Blur Reverse Mask prompting can still outperform certain coarse prompts.
This review was generated by AI and checked by a human editor.