QUICK REVIEW

[Paper Review] A Unified Sequence Interface for Vision Tasks

Ting Chen, Saurabh Saxena|arXiv (Cornell University)|Jun 15, 2022

Multimodal Machine Learning Applications | Cited by 49

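One-Sentence Summary

This paper proposes a single encoder-decoder model that unifies four core vision tasks (object detection, instance segmentation, keypoint detection, and image captioning) as pixel-to-sequence problems through a shared token-based interface and task prompts, achieving competitive results without task-specific heads.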

ABSTRACT

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.

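Motivation and Objectives

Motivate and demonstrate that diverse vision tasks can be expressed through a unified pixel-to-sequence interface.
Develop a single model architecture and loss function that serve multiple tasks without task-specific heads.
Show that task prompts can adapt the same output sequence to the requirements of different tasks.
Evaluate whether multi-task training preserves competitive per-task performance on COCO.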

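Proposed Method

Represent each task's output (bounding boxes, polygons, keypoints, or captions) as a sequence of discrete tokens drawn from a shared vocabulary.
Use an encoder-decoder architecture with a vision backbone, in which the Transformer decoder is conditioned on the task prompt.
Train by concatenating the prompt and the output into a single sequence, with the loss weight of the prompt tokens set to zero.
Obtain outputs by autoregressive generation followed by task-specific detokenization.
Combine tasks through data mixing or batch mixing; tune the task weights greedily, subject to their summing to one.
At inference, generate output tokens with nucleus sampling; detokenization recovers boxes, masks, keypoints, or captions.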
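
The tokenization and prompt-masking steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin count, the token id layout (`COORD_OFFSET`, `CLASS_OFFSET`, `DETECT_PROMPT`), the `(ymin, xmin, ymax, xmax)` coordinate order, and all function names are hypothetical choices made for the example.

```python
# Hypothetical constants: illustrative values, not those used in the paper.
NUM_BINS = 1000                          # quantization bins per coordinate (assumed)
COORD_OFFSET = 100                       # first token id for coordinates (assumed)
CLASS_OFFSET = COORD_OFFSET + NUM_BINS   # first token id for class labels (assumed)
DETECT_PROMPT = 10                       # token id standing in for a "detect" prompt (assumed)


def quantize(value, num_bins=NUM_BINS):
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    value = min(max(value, 0.0), 1.0)
    return int(round(value * (num_bins - 1)))


def dequantize(bin_index, num_bins=NUM_BINS):
    """Map an integer bin index back to a normalized coordinate."""
    return bin_index / (num_bins - 1)


def box_to_tokens(box, class_id):
    """Encode a normalized box (ymin, xmin, ymax, xmax) and a class as five tokens."""
    return [COORD_OFFSET + quantize(c) for c in box] + [CLASS_OFFSET + class_id]


def tokens_to_box(tokens):
    """Invert box_to_tokens: return the dequantized box and the class id."""
    box = tuple(dequantize(t - COORD_OFFSET) for t in tokens[:4])
    return box, tokens[4] - CLASS_OFFSET


def build_training_sequence(prompt_tokens, output_tokens):
    """Concatenate prompt and output; prompt positions get zero loss weight."""
    sequence = list(prompt_tokens) + list(output_tokens)
    weights = [0.0] * len(prompt_tokens) + [1.0] * len(output_tokens)
    return sequence, weights


def weighted_loss(per_token_losses, weights):
    """Average per-token losses over the positions with non-zero weight."""
    total = sum(loss * w for loss, w in zip(per_token_losses, weights))
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0
```

Because the prompt positions carry zero weight, the model is trained only to predict the task output given the prompt, never the prompt itself. Detokenization is lossy by at most one quantization bin per coordinate.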

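Experimental Results

Research Questions

RQ1: Can a single pixel-to-sequence model solve object detection, instance segmentation, keypoint detection, and image captioning without task-specific heads?
RQ2: On COCO, how does the unified model's performance across tasks compare with specialized baselines?
RQ3: How do task prompts and the training mixing strategy affect multi-task learning?
RQ4: Does increasing the image size or changing the training weights improve multi-task performance?

Key Findings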

Model	Object detection	Instance segmentation	Keypoint detection	Image captioning
Faster R-CNN	-	-	-
Faster R-CNN+	-	-	-
DETR	-	-	-
Mask R-CNN	39.8	37.1	63.1	-
Mask R-CNN (non-local)	45.0	40.3	66.5	-
Transformer-based captioner	-	-	-	34.3
Pix2Seq v2 single task (640×640)	43.8	37.3	68.0	33.9
Pix2Seq v2 single task (1024×1024)	45.6	38.7	67.4	34.0
Pix2Seq v2 multi-tasks (640×640)	44.2	36.9	65.0	34.3
Pix2Seq v2 multi-tasks (1024×1024)	46.5	38.2	64.8	34.9

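The multi-task model achieves results competitive with task-specific baselines on all four tasks, without specialized architectures.
Larger input sizes generally improve performance on every task except keypoint detection, which benefits from task-specific cropping.
For instance segmentation, sampling multiple sequences and averaging the resulting masks improves predictions.
A single model trained on all tasks (with suitable task weights) handles each of them, with performance close to the single-task variants.
The architecture uses one shared vocabulary (35k tokens) and a single decoder, which produces task-specific outputs in response to prompts.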
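
The finding about sampling several sequences and averaging their masks can be illustrated with a small sketch. It assumes each sampled polygon sequence has already been rasterized into a binary mask (rasterization is omitted), and the 0.5 threshold, the function names, and the list-of-lists mask representation are hypothetical choices for this example rather than details taken from the paper. A plain nucleus (top-p) sampler is included to show how individual tokens could be drawn.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Draw within the kept mass, which is equivalent to renormalizing.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]


def average_masks(masks, threshold=0.5):
    """Average several binary masks pixel-wise, then re-binarize (threshold assumed)."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[y][x] for m in masks) / n >= threshold else 0
         for x in range(width)]
        for y in range(height)
    ]


# Usage: three hypothetical 1x2 masks from three sampled sequences.
sampled_masks = [[[1, 0]], [[1, 1]], [[0, 0]]]
merged = average_masks(sampled_masks)
```

Averaging acts as a vote across samples: a pixel survives only if enough of the independently sampled masks agree on it, which smooths out errors made by any single sampled sequence.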


This review was generated by AI and checked by a human editor.