QUICK REVIEW

[Paper Review] Siamese Masked Autoencoders

Agrim Gupta, Jiajun Wu | arXiv (Cornell University) | May 23, 2023

Domain Adaptation and Few-Shot Learning | Cited by 17

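One-Sentence Summary

SiamMAE extends Masked Autoencoders to video: using asymmetric masking and a siamese encoder, it reaches state-of-the-art results among self-supervised methods on zero-shot visual correspondence tasks (video object segmentation, pose keypoint propagation, and semantic part propagation), without heavy data augmentation or tracking-based pretext tasks.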

ABSTRACT

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction ($95\%$) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.

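Motivation and Goals

Learn visual correspondence from videos in a self-supervised manner.
Propose a simple yet effective extension of MAE to video that focuses on motion and object boundaries.
Achieve strong downstream performance without relying on data augmentation or tracking-based pretext tasks.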

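Proposed Method

Sample two frames from a video; leave the past frame unmasked and mask 95% of the patches in the future frame (asymmetric masking).
Process the two frames independently with a siamese ViT encoder.
Decode with a cross-attention-based decoder that predicts the missing patches in the future frame.
Train with an L2 pixel-reconstruction loss on the masked patches; no temporal position embeddings are used.
Explore encoder/decoder variants; a siamese encoder with a cross-self decoder, combined with asymmetric masking, performs best.
Show that asymmetric masking and a cross-attention decoder can learn robust dense correspondence without heavy data augmentation.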
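
The masking and loss steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `asymmetric_mask` and `masked_l2_loss`, the `seed` parameter, the rounding rule for the number of masked patches, and the representation of a frame as a list of flattened patches are all assumptions made for the example.

```python
import random


def asymmetric_mask(num_patches, future_mask_ratio=0.95, seed=0):
    """Pick visible/masked patch indices for a (past, future) frame pair.

    The past frame stays fully visible; a random subset of the future
    frame's patches is hidden. The rounding rule is an assumption.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * future_mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    future_masked = sorted(order[:num_masked])
    future_visible = sorted(order[num_masked:])
    past_visible = list(range(num_patches))  # nothing is masked in the past frame
    return past_visible, future_visible, future_masked


def masked_l2_loss(pred, target, masked_idx):
    """Mean squared pixel error computed over the masked patches only.

    pred and target are lists of patches; each patch is a list of floats.
    """
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

For a hypothetical frame split into 196 patches, `asymmetric_mask(196)` leaves all 196 past-frame patches visible, hides 186 future-frame patches, and keeps 10 visible; `masked_l2_loss` then scores a reconstruction only on those 186 hidden patches.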

Figure 1: Siamese Masked Autoencoders. During pre-training we randomly sample a pair of video frames and randomly mask a huge fraction ($95\%$) of patches of the future frame while leaving the past frame unchanged. The two frames are processed independently by a siamese encoder parametrized by a

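Experimental Results

Research Questions

RQ1: Can predictive, asymmetrically masked autoencoding trained on video frames learn fine-grained visual correspondence without contrastive augmentations?
RQ2: How do encoder and decoder design choices affect the learning of object-centric temporal correspondence in video?
RQ3: How much do SiamMAE representations improve downstream video object segmentation, pose keypoint propagation, and semantic part propagation?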

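Key Findings

SiamMAE outperforms state-of-the-art self-supervised methods on three downstream tasks: video object segmentation, pose keypoint propagation, and semantic part propagation.
A smaller patch size (ViT-S/8) combined with SiamMAE improves results substantially, in some cases surpassing larger models trained on ImageNet.
Asymmetric masking (past frame fully visible, future frame heavily masked) with a siamese encoder and a cross-self decoder effectively learns object motion and boundaries, in a manner resembling an affinity mechanism.
SiamMAE is competitive in the zero-shot setting without data augmentation or tracking-based pretext tasks.
Clear object boundaries emerge in the attention maps even without guidance from a CLS loss.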
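
To make the "affinity mechanism" analogy concrete, the sketch below propagates per-patch labels from a past frame to a future frame by weighting past-frame labels with a softmax over feature similarities. It illustrates the general idea of affinity-based label propagation only; it is not the evaluation protocol used in the paper, and the function name `propagate_labels`, the `temperature` value, and the dot-product similarity are assumptions chosen for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def propagate_labels(past_feats, past_labels, future_feats, temperature=0.1):
    """Propagate label distributions from past-frame to future-frame patches.

    past_feats / future_feats: one feature vector (list of floats) per patch.
    past_labels: one class distribution (list of floats) per past patch.
    Each future patch receives an affinity-weighted average of past labels.
    """
    num_classes = len(past_labels[0])
    out = []
    for q in future_feats:
        # Affinity of this future patch to every past patch (scaled dot product).
        sims = [sum(a * b for a, b in zip(q, k)) / temperature for k in past_feats]
        weights = softmax(sims)
        out.append([
            sum(w * lab[c] for w, lab in zip(weights, past_labels))
            for c in range(num_classes)
        ])
    return out
```

Because the weights for each future patch sum to one, each propagated row is again a distribution over classes, and a future patch whose features resemble one past patch inherits mostly that patch's label.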

Figure 2: Visualizations on the Kinetics-400 [93] validation set (masking ratio $90\%$). For each video sequence, we sample a clip of $8$ frames with a frame gap of $4$ and show the original video (top), SiamMAE output (middle), and masked future frames (bottom). Reconstructions are shown with $

This review was generated by AI and checked by a human editor.