QUICK REVIEW

[Paper Review] Visual Prompting via Image Inpainting

Amir Bar, Yossi Gandelsman|arXiv (Cornell University)|Sep 1, 2022

Multimodal Machine Learning Applications | Cited by 50

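One-Sentence Summary

The paper shows that casting vision tasks as inpainting of a grid-like image, using an MAE-VQGAN model trained on a large unlabeled dataset of figures, lets test-time visual prompts perform a variety of image-to-image tasks without any fine-tuning.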

ABSTRACT

How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting - literally just filling in a hole in a concatenated visual prompt image - turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated - 88k unlabeled figures from academic papers sources on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.

Research Motivation and Objectives

Motivate prompt-based adaptation of pre-trained visual models to downstream tasks without fine-tuning or architecture changes.
Propose framing downstream tasks as image inpainting on a visual prompt grid that contains examples and a query.
Create a large unlabeled figures dataset to train robust inpainting models for prompting tasks.
Evaluate prompting on multiple vision tasks to assess generalization and task coverage.
Investigate how prompting design choices affect performance and how data distribution impacts results.

Proposed Method

Build an inpainting model called MAE-VQGAN by combining masked auto-encoding (MAE) with a VQGAN codebook to predict visual tokens for masked regions.
Train MAE-VQGAN on a curated Computer Vision Figures dataset (88k unlabeled figures from arXiv) and on ImageNet data so that it learns representations suited to inpainting grid-like images.
Form visual prompts by concatenating one or more input-output task examples with a new query image into a grid-like image, and mask the region to be inpainted.
Define a simple, hard-coded function g that constructs the visual prompt x_vp from the examples S and query x_q; inpainting then fills the masked region to yield the target output.
Optionally apply prompt ensembling by generating multiple prompts and averaging predictions to improve robustness.
Analyze prompting design choices (layout, colors, masking style) and demonstrate performance gains with more examples and ensembling.
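
The prompt-construction step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `make_visual_prompt` and `read_prediction`, the 2x2 layout, the white background, the padding width, and the use of float images in [0, 1] are all assumptions made for this example. It plays the role of the hard-coded function g for a single example pair and returns the mask that an inpainting model would be asked to fill.

```python
import numpy as np


def make_visual_prompt(example_in, example_out, query, pad=2):
    """Hypothetical g(S, x_q) for one example pair, laid out as a 2x2 grid.

    Top row:    example input | example output
    Bottom row: query image   | blank cell to be inpainted
    Returns the grid image and a boolean mask that is True over the blank cell.
    """
    assert example_in.shape == example_out.shape == query.shape
    h, w, c = query.shape
    height, width = 2 * h + 3 * pad, 2 * w + 3 * pad

    # White canvas (assumed background colour; images are floats in [0, 1]).
    canvas = np.ones((height, width, c), dtype=query.dtype)
    mask = np.zeros((height, width), dtype=bool)

    top, bottom = pad, 2 * pad + h   # row offsets of the two grid rows
    left, right = pad, 2 * pad + w   # column offsets of the two grid columns

    canvas[top:top + h, left:left + w] = example_in
    canvas[top:top + h, right:right + w] = example_out
    canvas[bottom:bottom + h, left:left + w] = query
    canvas[bottom:bottom + h, right:right + w] = 0   # placeholder for the hole
    mask[bottom:bottom + h, right:right + w] = True
    return canvas, mask


def read_prediction(completed, mask):
    """Crop the inpainted cell back out of a completed grid image."""
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return completed[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

In use, the grid and mask would be passed to whatever inpainting model is available, and `read_prediction` would crop the filled cell from its output. With more than one example pair, the same idea extends to a larger grid.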

Experimental Results

Research Questions

RQ1: Can a single, pre-trained visual model adapt to multiple downstream image-to-image tasks via test-time visual prompting without fine-tuning?
RQ2: Is image inpainting a viable core mechanism for visual prompting when models are trained on task-agnostic grid data?
RQ3: How does the choice of training data (Figures dataset vs. ImageNet) affect prompting performance across tasks?
RQ4: What is the impact of prompt design (layout, colors, number of examples) on prompting quality and robustness?
RQ5: How does visual prompting compare to traditional fine-tuning and few-shot baselines for segmentation and detection tasks?

Key Findings

MAE-VQGAN trained on the Figures dataset yields strong performance on foreground segmentation and single-object detection when used with visual prompts.
Prompting models trained on Figures significantly outperform those pretrained only on ImageNet and surpass several unspecialized baselines in multiple tasks.
Adding more examples to the visual prompt generally improves segmentation mIoU and detection accuracy on the Pascal-5i and Pascal VOC datasets.
Prompt design choices (e.g., vertical vs. horizontal layout, black/white masks) impact prompting quality, with some layouts yielding higher attention to relevant regions.
Prompt ensembling (averaging predictions from multiple prompts) further improves results and stabilizes performance across tasks.
MAE-VQGAN trained on Figures produces sharper completions and better task performance than VQGAN or BEiT baselines, particularly for detection and segmentation.
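
The ensembling and mIoU findings above can be made concrete with a short sketch. It assumes each prompt yields a foreground-probability map of the same shape for the same query; the function names, the 0.5 threshold, and the empty-union convention are illustrative choices, not details taken from the paper. Averaging follows the description in this review, and IoU is the standard per-mask quantity that mIoU averages.

```python
import numpy as np


def ensemble_predictions(predictions, threshold=0.5):
    """Average per-prompt predictions and binarise the result.

    `predictions` is a list of equally shaped arrays, one per visual prompt.
    The threshold is an assumed default for foreground segmentation.
    """
    stacked = np.stack(predictions, axis=0).astype(np.float64)
    mean = stacked.mean(axis=0)
    return mean, mean >= threshold


def iou(pred_mask, target_mask):
    """Intersection over union of two boolean masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    target_mask = np.asarray(target_mask, dtype=bool)
    union = np.logical_or(pred_mask, target_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match here
    return np.logical_and(pred_mask, target_mask).sum() / union
```

Averaging before thresholding lets prompts that disagree on a pixel settle it by majority, which is one plausible reason ensembling stabilises results.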

This review was generated by AI and reviewed by human editors.