QUICK REVIEW

[Paper Review] Siamese Masked Autoencoders

Agrim Gupta, Jiajun Wu|arXiv (Cornell University)|2023. 05. 23.

Domain Adaptation and Few-Shot Learning | Citations: 17

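One-line summary

SiamMAE extends Masked Autoencoders to video using asymmetric masking and a siamese encoder, and achieves state-of-the-art zero-shot visual correspondence on video object segmentation, pose keypoint propagation, and semantic part propagation without heavy augmentation or tracking-based pretext tasks.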

ABSTRACT

Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. These frames are processed independently by an encoder network, and a decoder composed of a sequence of cross-attention layers is tasked with predicting the missing patches in the future frame. By masking a large fraction ($95\%$) of patches in the future frame while leaving the past frame unchanged, SiamMAE encourages the network to focus on object motion and learn object-centric representations. Despite its conceptual simplicity, features learned via SiamMAE outperform state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks. SiamMAE achieves competitive results without relying on data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse.

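Motivation and goals

Learn visual correspondence from video in a self-supervised way.
Propose a simple, effective extension of MAE to video that focuses on motion and object boundaries.
Achieve strong downstream performance without relying on data augmentation or tracking-based pretext tasks.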

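Proposed method

Sample two frames, leave the past frame unmasked, and mask 95% of the patches in the future frame (asymmetric masking).
Process each frame independently with a siamese ViT encoder, meaning the same encoder weights are applied to both frames.
Decode the missing patches of the future frame with a cross-attention-based decoder.
Train with an L2 loss on the pixel reconstruction of the masked patches; no temporal positional embeddings are used.
Among the encoder/decoder variants explored, a siamese encoder with a cross-self decoder and asymmetric masking performed best.
The results show that asymmetric masking and a cross-attention decoder learn strong dense correspondence without heavy data augmentation.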
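The data flow described in the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: raw random vectors stand in for encoder outputs, the attention has no learned weights, the mask token is a zero vector, and the names `asymmetric_mask`, `cross_attention`, and `masked_l2_loss` are hypothetical. The grid of 196 patches (a 224-pixel image with 16-pixel patches) and the rounding convention for the mask count are also assumptions.

```python
import math
import random


def asymmetric_mask(num_patches, mask_ratio=0.95, seed=0):
    """Pick which future-frame patches are hidden; the past frame is never masked."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)  # rounding convention is an assumption
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])  # (visible, masked)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return out


def masked_l2_loss(pred, target, masked):
    """Mean squared error computed over the masked patches only."""
    errs = [(p - t) ** 2 for i in masked for p, t in zip(pred[i], target[i])]
    return sum(errs) / len(errs)


# Toy data: 196 patches per frame, 8 values per patch (both sizes are assumptions).
num_patches, dim = 196, 8
rng = random.Random(1)
past = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
future = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in past]

visible, masked = asymmetric_mask(num_patches)
visible_set = set(visible)

# Future-frame queries: visible patches keep their content, hidden ones get a
# zero "mask token". A real model would add positional embeddings here.
mask_token = [0.0] * dim
queries = [future[i] if i in visible_set else mask_token for i in range(num_patches)]

# Future-frame tokens attend to the fully visible past frame.
pred = cross_attention(queries, past, past)
loss = masked_l2_loss(pred, future, masked)
print(len(visible), len(masked), loss)
```

With so few visible future patches, the only way to reconstruct the hidden ones well is to pull information from the past frame, which is the intuition behind pairing high-ratio asymmetric masking with a cross-attention decoder.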

Figure 1 : Siamese Masked Autoencoders. During pre-training we randomly sample a pair of video frames and randomly mask a huge fraction ( $95\%$ ) of patches of the future frame while leaving the past frame unchanged. The two frames are processed independently by a siamese encoder parametrized by a

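Experimental results

Research questions

RQ1: Can predictive masked autoencoding with asymmetric masking, trained on video frames without contrastive-style augmentations, learn fine-grained visual correspondence?
RQ2: How do encoder/decoder design choices affect the learning of object-centric temporal correspondence in video?
RQ3: What downstream benefits do SiamMAE representations bring to video object segmentation, pose keypoint propagation, and semantic part propagation?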

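Key results

SiamMAE outperforms state-of-the-art self-supervised methods on three downstream tasks: video object segmentation, pose keypoint propagation, and semantic part propagation.
A smaller patch size (ViT-S/8) substantially improves SiamMAE's performance, in some cases beating larger models trained on ImageNet.
Asymmetric masking with a siamese encoder and a cross-self decoder effectively teaches the model object motion and boundaries, and behaves like an affinity mechanism.
SiamMAE reaches competitive zero-shot performance without data augmentation or tracking-based pretext tasks.
Object boundaries emerge in the attention maps even without a CLS loss.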
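Zero-shot correspondence tasks like the three above are commonly evaluated by propagating reference-frame labels to a target frame through a feature affinity, with no fine-tuning. The sketch below shows that general idea only. This review does not state the paper's exact protocol, so the function name `propagate_labels`, the `top_k` and `temperature` values, and the tiny two-dimensional features are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def propagate_labels(ref_feats, ref_labels, tgt_feats, top_k=2, temperature=0.1):
    """Give each target patch a softmax-weighted average of the labels of its
    top_k most similar reference patches."""
    num_classes = len(ref_labels[0])
    out = []
    for t in tgt_feats:
        sims = [(cosine(t, r), j) for j, r in enumerate(ref_feats)]
        nearest = sorted(sims, reverse=True)[:top_k]
        weights = [math.exp(s / temperature) for s, _ in nearest]
        total = sum(weights)
        out.append([
            sum(w * ref_labels[j][c] for w, (_, j) in zip(weights, nearest)) / total
            for c in range(num_classes)
        ])
    return out


# Three reference patches with one-hot labels (class 0 = object, class 1 = background).
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Two target-frame patches whose labels are unknown.
tgt_feats = [[0.95, 0.05], [0.1, 0.9]]

soft = propagate_labels(ref_feats, ref_labels, tgt_feats)
predicted = [max(range(len(p)), key=p.__getitem__) for p in soft]
print(predicted)
```

Because the labels travel purely through feature similarity, the quality of the propagated mask or keypoints directly reflects how well the pretrained features encode correspondence.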

Figure 2 : Visualizations on the Kinetics-400 [ 93 ] validation set (masking ratio $90\%$ ). For each video sequence, we sample a clip of $8$ frames with a frame gap of $4$ and show the original video (top), SiamMAE output (middle), and masked future frames (bottom). Reconstructions are shown with $


This review was generated by AI and reviewed by a human editor.