QUICK REVIEW

[Paper Review] VideoMaMa: Mask-Guided Video Matting via Generative Prior

Sangbeom Lim, Seoung Wug Oh | arXiv (Cornell University) | 2026-01-20

Image Enhancement Techniques · Citations: 0

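One-Line Summary

VideoMaMa uses pretrained video diffusion priors to convert input binary masks into high-quality alpha mattes, enabling zero-shot generalization to real-world footage and scalable pseudo-labeling. It also introduces MA-V, a large-scale real-world video matting dataset built from the resulting pseudo-labeled annotations.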

ABSTRACT

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

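Motivation and Goals

Use the generative prior of diffusion models to narrow the domain gap between synthetic and real data in video matting.
Develop a mask-guided matting model that produces pixel-level alpha mattes from coarse segmentation masks.
Build a scalable pipeline that generates large-scale video matting annotations from segmentation masks.
Demonstrate that large-scale pseudo-labeled data improves matting robustness on in-the-wild videos.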

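Proposed Method

VideoMaMa is built on Stable Video Diffusion with mask-conditioned latent inputs, so that it produces the alpha matte in a single forward pass.
A VAE encodes the video frames, input mask, and alpha matte into a shared latent space, enabling efficient spatiotemporal processing.
Mask augmentation (polygon degradation and downsampling degradation) prevents the model from simply copying the input mask and encourages appearance-driven matting.
Training follows a two-stage strategy: (i) train the spatial layers at high resolution to capture fine detail, then (ii) train the temporal layers at low resolution to secure temporal consistency.
Diffusion features are aligned with DINOv3 representations to inject semantic knowledge, improving boundary localization and tracking.
A pixel-level matting loss is trained jointly with a Laplacian edge-preserving term to encourage sharp boundaries.
At inference, a two-tower scheme concatenates the frame, mask, and noise latents to predict the alpha latent, which is then decoded by the VAE.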
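The review mentions a pixel-level matting loss combined with a Laplacian edge-preserving term but does not give its form. The following is a minimal NumPy sketch of one plausible reading, not the paper's actual loss: an L1 pixel term plus an L1 term between Laplacian responses. The 3x3 kernel, the function names `laplacian` and `matting_loss`, and the `edge_weight` parameter are all assumptions made for illustration.

```python
import numpy as np

# 3x3 discrete Laplacian (assumed; the paper's exact edge operator is not given here).
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)


def laplacian(img):
    """Apply the 3x3 Laplacian to a 2D array using edge-replicated padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def matting_loss(pred, target, edge_weight=1.0):
    """L1 pixel loss plus an L1 loss between Laplacian responses.

    `edge_weight` is a hypothetical balancing factor, not a value from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    pixel_term = np.abs(pred - target).mean()
    edge_term = np.abs(laplacian(pred) - laplacian(target)).mean()
    return pixel_term + edge_weight * edge_term
```

Because the Laplacian responds only to local intensity changes, a uniform offset between prediction and target is penalized by the pixel term alone, while a blurred boundary is penalized by both terms.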

Figure 2 : Overview of VideoMaMa architecture. RGB frames and guide masks are processed through video diffusion U-Net layers to generate high-quality video mattes. Semantic injection with DINO features is applied during training.
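To make the conditioning concrete, here is a small shape-level sketch of how frame, mask, and noise latents could be concatenated before entering the U-Net. It assumes latents laid out as (T, C, H, W) and joined along the channel axis; the function name `build_model_input`, the concatenation order, and the sizes used below are illustrative assumptions, not details confirmed by the review.

```python
import numpy as np


def build_model_input(frame_latent, mask_latent, noise_latent):
    """Concatenate per-frame latents along the channel axis.

    Three (T, C, H, W) arrays become one (T, 3C, H, W) array.
    The ordering of the three inputs is an assumption.
    """
    shapes = {frame_latent.shape, mask_latent.shape, noise_latent.shape}
    if len(shapes) != 1:
        raise ValueError("all latents must share the same (T, C, H, W) shape")
    return np.concatenate([frame_latent, mask_latent, noise_latent], axis=1)


# Hypothetical sizes: 8 frames, 4 latent channels, 32x32 latent grid.
T, C, H, W = 8, 4, 32, 32
rng = np.random.default_rng(0)
frame_latent = rng.standard_normal((T, C, H, W))
mask_latent = rng.standard_normal((T, C, H, W))
noise_latent = rng.standard_normal((T, C, H, W))

x = build_model_input(frame_latent, mask_latent, noise_latent)
print(x.shape)  # (8, 12, 32, 32)
```

In this reading, the network sees the appearance, the coarse guide, and the noise at every spatial location, and its output would be an alpha latent of shape (T, C, H, W) that the VAE decoder turns back into a matte.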

Experimental Results

Research Questions

RQ1: How can pretrained diffusion priors be used to produce high-quality video mattes from coarse masks in a zero-shot real-world setting?
RQ2: Can a mask-conditioned diffusion model be trained in two stages to achieve both high spatial detail and temporal coherence in video matting?
RQ3: Does semantic feature alignment (e.g., with DINOv3) improve matte quality and boundary handling in video matting?
RQ4: Can large-scale pseudo-labeled data (MA-V) improve downstream video matting models when fine-tuned on real-world footage?

Key Findings

VideoMaMa achieves strong zero-shot generalization to real-world videos despite being trained only on synthetic data.
MA-V provides over 50k real-world videos with high-quality matting annotations, enabling effective training of matting models.
SAM2-Matte trained on MA-V outperforms the same model trained on existing matting datasets in robustness on in-the-wild videos.
Large-scale pseudo-labeling with VideoMaMa substantially boosts matting performance, and MA-V improves both matting quality and tracking robustness when used for fine-tuning.
VideoMaMa demonstrates robustness across diverse mask sources, including synthetic degradations and model-generated masks (e.g., SAM2).
Two-stage training and semantic injection (DINO features) are both beneficial, improving boundary quality and temporal consistency.

Figure 3 : Examples of mask augmentation methods. Polygon and Downsampling degradation are applied at weak and strong augmentation levels.
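As a rough illustration of the downsampling degradation named in the caption, the sketch below coarsens a binary mask by block-averaging, thresholding, and upsampling by repetition. This is one assumed way to implement such a degradation, not the paper's procedure; the function name `downsample_degrade`, the 0.5 threshold, and the mapping of "weak" and "strong" to factors 2 and 4 are hypothetical. Polygon degradation is omitted.

```python
import numpy as np


def downsample_degrade(mask, factor):
    """Coarsen a binary mask.

    Block-average by `factor`, threshold at 0.5, then upsample by repetition,
    so the output has the input's shape but blocky boundaries.
    """
    mask = np.asarray(mask, dtype=np.float64)
    h, w = mask.shape
    if h % factor or w % factor:
        raise ValueError("mask height and width must be divisible by factor")
    blocks = mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    coarse = (blocks >= 0.5).astype(np.uint8)
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)


# An 8x8 mask with a 6x6 foreground square.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:7, 1:7] = 1

weak = downsample_degrade(mask, factor=2)    # assumed "weak" level
strong = downsample_degrade(mask, factor=4)  # assumed "strong" level
print(int(mask.sum()), int(weak.sum()), int(strong.sum()))  # 36 48 64
```

The larger the factor, the further the degraded mask drifts from the true boundary, which is what forces the model to rely on image appearance rather than copying the guide mask.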


This review was generated by AI and reviewed by a human editor.