QUICK REVIEW

[Paper Review] Adaptive 1D Video Diffusion Autoencoder

Yao Teng, Minxuan Lin|arXiv (Cornell University)|2026. 02. 04.

Generative Adversarial Networks and Image Synthesis | Citations: 0

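One-Line Summary

The One-Dimensional Diffusion Video Autoencoder (One-DVA) pairs a transformer-based encoder that produces variable-length 1D latent tokens with a diffusion-based pixel-space decoder, enabling adaptive video compression and high-quality reconstruction suited to downstream latent diffusion models.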

ABSTRACT

Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.

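Motivation and Objectives

Motivate adaptive, token-efficient video compression that moves beyond fixed-rate encoders.
Develop a transformer-based encoder that produces variable-length 1D latents through a query mechanism.
Introduce a pixel-space diffusion decoder to improve reconstruction quality.
Train in two stages to balance encoder learning against diffusion-based reconstruction.
Align the latent representations so that a latent diffusion model can model them for generation.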

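Proposed Method

The encoder is a vision transformer with learnable 1D queries that extracts structural representations and 1D latent representations from spatiotemporal embeddings.
A matryoshka-inspired variable-length dropout mechanism dynamically adjusts the 1D latent length during training.
The decoder is a pixel-space diffusion transformer that reconstructs the video conditioned on both the structural latents and the 1D latents.
Diffusion-based training uses a flow matching diffusion loss to optimize generation quality.
Latent space alignment regularizes the 1D latents to match the structural latent space, enabling joint LDM modeling.
Fine-tuning the decoder on LDM-sampled latents mitigates generation artifacts.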
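
The flow matching objective mentioned in this list can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common linear (rectified-flow style) path between Gaussian noise and the clean video, flattens the video to a list of floats, and uses a hypothetical `predict_velocity(x_t, t, latents)` callable to stand in for the latent-conditioned diffusion decoder.

```python
import random


def flow_matching_loss(video, latents, predict_velocity, rng):
    """Single-sample estimate of a flow matching loss (illustrative only).

    Assumptions not taken from the paper: a linear noise-to-data path
    x_t = (1 - t) * noise + t * video, and a mean-squared-error velocity target.
    """
    t = rng.random()                                  # time in [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in video]      # Gaussian starting point
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, video)]
    target = [x - n for n, x in zip(noise, video)]    # d x_t / d t on the linear path
    pred = predict_velocity(x_t, t, latents)          # hypothetical decoder call
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(video)


# Toy usage: a stand-in "decoder" that always predicts zero velocity.
toy_video = [0.5, -0.25, 1.0, 0.0]
toy_latents = [0.1, 0.2]
loss = flow_matching_loss(
    toy_video,
    toy_latents,
    lambda x_t, t, z: [0.0] * len(x_t),
    random.Random(0),
)
```

In a real model the loss would be averaged over a batch and many sampled times; the single-sample form here only shows which quantities the decoder is asked to predict.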

Figure 2 : Overview: our One-DVA consists of an encoder, a diffusion decoder and a latent dropout module. The encoder utilizes a vision transformer with 1D queries to extract input video features and outputs low-dimensional latents. The latent dropout module dynamically adjusts the length of 1D late
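
The latent dropout module in the overview can be pictured as prefix truncation of the 1D tokens. The sketch below is a hedged illustration under two assumptions that this review does not confirm: truncation is nested (matryoshka-style, so a shorter latent is always a prefix of a longer one), and the kept length is sampled uniformly. The function name and the `min_keep` parameter are hypothetical.

```python
import random


def variable_length_dropout(latent_tokens, min_keep, rng):
    """Keep a random-length prefix of the 1D latent tokens (illustrative only).

    Assumption: nested, matryoshka-style truncation. The uniform sampling of
    the kept length is hypothetical and not taken from the paper.
    """
    total = len(latent_tokens)
    keep = rng.randint(min_keep, total)       # inclusive on both ends
    kept = latent_tokens[:keep]               # tokens passed on to the decoder
    mask = [i < keep for i in range(total)]   # True where a token survives
    return kept, mask


tokens = [f"z{i}" for i in range(8)]          # stand-ins for latent vectors
kept, mask = variable_length_dropout(tokens, min_keep=2, rng=random.Random(0))
```

Because every shortened latent is a prefix of the full one, a decoder trained this way can in principle be run at any token budget at inference time, which is the property that adaptive compression relies on.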

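Experimental Results

Research Questions

RQ1: At the same compression ratio, does adaptive 1D latent encoding match the reconstruction quality of fixed-rate video autoencoders?
RQ2: Does variable-length 1D latent encoding improve token efficiency while preserving fidelity across videos with different motion and texture complexity?
RQ3: Does diffusion-based decoding improve reconstruction quality and support downstream latent diffusion models for video generation?
RQ4: Do latent space alignment and decoder fine-tuning enable high-quality text-to-video and class-to-video generation with One-DVA latents?
RQ5: Which training strategy (two-stage or end-to-end) yields better reconstruction fidelity and readiness for generation?

Key Results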

Method	Compression	rFVD (↓)	PSNR (↑)
CogVideoX	4×8×8	68.17	34.97	0.94	0.033
HunyuanVideo	4×8×8	51.47	35.54	0.94	0.023
Wanx2.1	4×8×8	62.25	34.95	0.94	0.024
Wanx2.2	4×16×16	60.18	35.23	0.94	0.023
Magi1	4×8×8	70.07	36.25	0.95	0.035
Ours	4×16×16	56.96	36.48	0.95	0.025
Ours (Avg 55.8% 1D)	4×16×16 / 55.8%	70.28	35.42	0.94	0.029
Ours (Con 55.8% 1D)	4×16×16 / 55.8%	72.42	35.40	0.94	0.029
Ours (0% 1D)	/	149.97	32.80	0.91	0.057
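
To read the second column of the table: a spec such as 4×8×8 gives the temporal, height, and width downsampling factors of the latent grid. The sketch below works out how many input positions map to one latent position, and the resulting grid for a hypothetical 16-frame 256×256 clip. It assumes plain integer division and ignores details the table does not report, such as special first-frame handling or latent channel counts; the helper names are illustrative.

```python
from math import prod


def parse_spec(spec):
    """Split a 'T×H×W' compression spec into integer factors."""
    return tuple(int(part) for part in spec.split("×"))


def positions_per_latent(spec):
    """Input pixel positions covered by one latent position."""
    return prod(parse_spec(spec))


def latent_grid(frames, height, width, spec):
    """Latent grid size under plain integer division (an assumption)."""
    t, h, w = parse_spec(spec)
    return (frames // t, height // h, width // w)


for spec in ("4×8×8", "4×16×16"):
    print(spec, positions_per_latent(spec), latent_grid(16, 256, 256, spec))
# prints:
# 4×8×8 256 (4, 32, 32)
# 4×16×16 1024 (4, 16, 16)
```

Under these assumptions, a 4×16×16 grid holds a quarter as many positions as a 4×8×8 grid for the same clip (1,024 versus 4,096), which is worth keeping in mind when comparing rows with different specs.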

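One-DVA achieves reconstruction performance comparable to 3D-CNN VAEs at the same compression ratio.
Variable-length 1D latents enable adaptive compression, with longer latents capturing richer detail in motion-heavy regions.
Diffusion-based decoding improves reconstruction quality and supports downstream latent diffusion models for video generation.
Latent space alignment and decoder fine-tuning reduce artifacts when generating from LDM-sampled latents.
In the ablation analysis, the two-stage training scheme outperformed end-to-end training in reconstruction fidelity.
A scoring mechanism that sets the 1D latent length for each video outperforms a fixed length.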

This review was generated by AI and reviewed by a human editor.