QUICK REVIEW

[Paper Review] FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Yuren Cong, Mengmeng Xu|arXiv (Cornell University)|2023. 10. 09.

Video Analysis and Summarization | Citations: 10

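One-line summary

FLATTEN introduces flow-guided attention into diffusion-based text-to-video editing, enforcing patch-level consistency along optical-flow trajectories without any training, and achieves state-of-the-art results on the TGVE benchmarks.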

ABSTRACT

Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.

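Motivation and objectives

Address visual inconsistency in text-to-video editing by exploiting temporal consistency across frames.
Introduce a training-free flow-guided attention mechanism (FLATTEN) that is compatible with pretrained T2I diffusion models.
Improve cross-frame consistency through optical-flow trajectories while preserving per-frame feature distributions.
Demonstrate that FLATTEN improves editing quality and can be applied as a plug-in to existing T2V editing methods.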

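Proposed method

Inflate a pretrained text-to-image diffusion U-Net along the temporal axis to build a T2V editing framework.
Replace dense spatio-temporal attention with flow-guided attention (FLATTEN), which uses patch trajectories guided by optical flow.
Downsample the RAFT-estimated flow to latent-space resolution to compute patch trajectories, then gather the patches on each trajectory and use them as the Q/K/V of the attention.
Apply FLATTEN during both DDIM inversion and sampling to improve temporal consistency without training.
Following image-editing practice, inject diffusion features during sampling to improve per-frame consistency.
No new learnable parameters are introduced; FLATTEN reuses the existing projection layers and attention blocks.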
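The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`downsample_flow`, `patch_trajectories`, `flow_guided_attention`), the nearest-neighbor rounding, and the choice to start one trajectory at every patch of the first frame are all hypothetical simplifications. The flow is assumed to be already estimated (for example by RAFT) and stored as `(dx, dy)` per pixel.

```python
import numpy as np

def downsample_flow(flow, factor):
    """Average-pool a (H, W, 2) flow field by `factor` and rescale the vectors.
    Assumes H and W are divisible by `factor`."""
    H, W, _ = flow.shape
    pooled = flow.reshape(H // factor, factor, W // factor, factor, 2).mean(axis=(1, 3))
    return pooled / factor  # displacements shrink with the resolution

def patch_trajectories(flows):
    """flows: (T-1, H, W, 2) forward flow at latent resolution, stored as (dx, dy).
    Returns integer (y, x) positions of shape (H*W, T, 2).
    Simplification: one trajectory starts at every patch of frame 0."""
    n_steps, H, W, _ = flows.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)
    traj = [pos.copy()]
    for t in range(n_steps):
        iy = np.rint(pos[:, 0]).astype(int)
        ix = np.rint(pos[:, 1]).astype(int)
        step = flows[t, iy, ix]            # (N, 2) as (dx, dy)
        pos = pos + step[:, ::-1]          # reorder to (dy, dx) before adding
        pos[:, 0] = np.clip(pos[:, 0], 0, H - 1)
        pos[:, 1] = np.clip(pos[:, 1], 0, W - 1)
        traj.append(pos.copy())
    return np.rint(np.stack(traj, axis=1)).astype(int)

def flow_guided_attention(z, traj, Wq, Wk, Wv):
    """z: (T, H, W, C) latent features; traj: (N, T, 2) from patch_trajectories.
    Wq, Wk: (C, d); Wv: (C, C). These stand in for the reused projection layers.
    Each patch attends only to the patches on its own trajectory."""
    T = z.shape[0]
    t_idx = np.arange(T)[None, :]                      # (1, T), broadcasts over N
    feats = z[t_idx, traj[..., 0], traj[..., 1]]       # (N, T, C)
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (N, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)       # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    out = z.copy()                                     # unvisited patches keep their features
    # If two trajectories land on the same patch, the later write wins (a simplification).
    out[t_idx, traj[..., 0], traj[..., 1]] = attn @ v
    return out

# Usage with random data: 3 frames, 4x4 latent grid, 8 channels.
rng = np.random.default_rng(0)
z = rng.normal(size=(3, 4, 4, 8))
flows = rng.normal(size=(2, 4, 4, 2))
traj = patch_trajectories(flows)
edited = flow_guided_attention(z, traj, rng.normal(size=(8, 8)),
                               rng.normal(size=(8, 8)), np.eye(8))
print(traj.shape, edited.shape)
```

Restricting attention to one trajectory keeps the attention matrix at size T x T per patch, instead of letting every patch attend to every patch in every frame as dense spatio-temporal attention does.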

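Experimental results

Research questions

RQ1: How can optical-flow guidance improve cross-frame consistency in text-to-video editing without model training?
RQ2: Does integrating flow-guided attention into diffusion-based T2V editing improve visual consistency and text fidelity compared with baselines?
RQ3: Can FLATTEN be applied as a plug-in to other diffusion-based T2V methods to improve their performance?
RQ4: What is the effect of applying FLATTEN during DDIM inversion versus applying it only at sampling time?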

Main results

Method	CLIP-F ↑	PickScore ↑	CLIP-T ↑	E_warp ↓	S_edit ↑
TGVE-D - FLATTEN (ours)	92.49	20.95	28.05	4.92	57.01
TGVE-V - FLATTEN (ours)	96.75	20.63	26.70	3.16	84.49

FLATTEN achieves new state-of-the-art textual alignment and editing quality on the TGVE-D and TGVE-V benchmarks.
On TGVE-D, FLATTEN records CLIP-F 92.49, PickScore 20.95, CLIP-T 28.05, E_warp 4.92, and S_edit 57.01; on TGVE-V, it records CLIP-F 96.75, PickScore 20.63, CLIP-T 26.70, E_warp 3.16, and S_edit 84.49.
Among the reported methods, FLATTEN has the best or tied-best CLIP-T and S_edit scores, keeps CLIP-F competitive, and reduces E_warp relative to the baselines.
Applying FLATTEN as a plug-in to ControlVideo improves visual consistency, reducing E_warp from 6.81 to 4.78 and raising S_edit from 40.70 to 56.42.
The ablation study shows that combining dense spatio-temporal attention (DSTA) with FLATTEN (Approach II) yields a strong gain in S_edit and reduces temporal inconsistency compared with using either alone.
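As a quick sanity check on the reported numbers, the sketch below recomputes S_edit and the relative changes for the ControlVideo plug-in result. The assumption that S_edit equals CLIP-T divided by E_warp, scaled by 10, is inferred from the table values here and is not stated in this review; the helper names `s_edit` and `relative_change` are hypothetical.

```python
def s_edit(clip_t, e_warp, scale=10.0):
    """Assumed form: S_edit = scale * CLIP-T / E_warp (inferred from the table)."""
    return round(scale * clip_t / e_warp, 2)

def relative_change(before, after):
    """Percentage change from `before` to `after`, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)

print(s_edit(28.05, 4.92))            # TGVE-D row -> 57.01
print(s_edit(26.70, 3.16))            # TGVE-V row -> 84.49
print(relative_change(6.81, 4.78))    # ControlVideo E_warp -> -29.8
print(relative_change(40.70, 56.42))  # ControlVideo S_edit -> 38.6
```

Under this reading, plugging FLATTEN into ControlVideo cuts the warping error by roughly 30% and raises S_edit by roughly 39%.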

This review was generated by AI and reviewed by a human editor.