QUICK REVIEW

[Paper Review] Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang | arXiv (Cornell University) | 2023-03-08

Topic: Generative Adversarial Networks and Image Synthesis | Citations: 13

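One-Line Summary

Video-P2P adapts a pre-trained image diffusion model to real-world videos through cross-attention control and text-driven editing, achieving both local and global edits with improved temporal consistency.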

ABSTRACT

This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.
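To make the "small memory cost" of a shared unconditional embedding more concrete, here is a rough back-of-the-envelope sketch. It assumes one unconditional embedding is optimized per denoising step and compares sharing it across all frames with keeping a separate copy per frame. The embedding shape (77 x 768), the 50 denoising steps, and the 8-frame clip are illustrative assumptions, not figures reported in the paper.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the paper).
TOKENS = 77   # assumed text-encoder sequence length
DIM = 768     # assumed embedding width
STEPS = 50    # assumed number of denoising steps
FRAMES = 8    # assumed clip length


def learnable_values(frames: int, steps: int, shared: bool) -> int:
    """Count the optimized values of the unconditional embeddings.

    One embedding is assumed to be optimized per denoising step. If `shared`
    is True, every frame reuses it; otherwise each frame gets its own copy.
    """
    per_embedding = TOKENS * DIM
    copies = 1 if shared else frames
    return per_embedding * steps * copies


shared_count = learnable_values(FRAMES, STEPS, shared=True)
per_frame_count = learnable_values(FRAMES, STEPS, shared=False)
print(shared_count, per_frame_count, per_frame_count // shared_count)
```

Under these assumptions the shared variant optimizes a number of values that does not grow with the clip length, while the per-frame variant grows linearly with the number of frames.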

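Research Motivation and Goals

Motivate the need for a method that enables text-driven editing of real-world videos with diffusion models.
Develop an inversion and attention-control pipeline that maintains temporal consistency across frames.
Propose a mechanism for local edits (e.g., word swap) that does not alter the surrounding content.
Demonstrate the practicality and effectiveness of the approach on real videos and compare it with existing methods.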

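Proposed Method

Convert a pre-trained image diffusion model into a Text-to-Set (T2S) model that enables frame-consistent inversion.
Optimize a shared unconditional embedding so that video inversion is accurate at a low memory cost.
Introduce a decoupled-guidance attention-control strategy that applies different guidance to the source and target prompts and fuses their attention maps.
Fine-tune frame attention and temporal attention in the T2S model to support video inversion.
Swap or refine attention maps during inference to edit from one prompt to another.
Apply cross-attention control to enable word swap, prompt refinement, and attention re-weighting while preserving pose and scene.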
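The two attention operations named above can be sketched on toy data. This is a minimal illustration in the style of Prompt-to-Prompt attention control, not the authors' implementation: the attention maps are plain nested lists (rows are spatial positions, columns are prompt tokens), and the function names and the `inject_until` threshold are hypothetical.

```python
from typing import List

Attention = List[List[float]]  # rows: spatial positions, columns: prompt tokens


def swap_attention(source: Attention, target: Attention,
                   step: int, inject_until: int) -> Attention:
    """Word swap: reuse the source maps during the early denoising steps."""
    chosen = source if step < inject_until else target
    return [row[:] for row in chosen]  # copy so callers cannot mutate inputs


def reweight_attention(attn: Attention, token_index: int,
                       scale: float) -> Attention:
    """Attention re-weighting: scale one token's column, keep the others."""
    return [
        [w * scale if j == token_index else w for j, w in enumerate(row)]
        for row in attn
    ]


source_maps = [[0.7, 0.3], [0.2, 0.8]]
target_maps = [[0.5, 0.5], [0.4, 0.6]]

early = swap_attention(source_maps, target_maps, step=3, inject_until=10)
late = swap_attention(source_maps, target_maps, step=20, inject_until=10)
boosted = reweight_attention(target_maps, token_index=1, scale=2.0)
print(early, late, boosted)
```

Injecting the source maps only in early steps is what lets the layout (pose, scene) carry over while later steps follow the edited prompt. In a video setting the same operation would be applied to every frame's maps.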

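Experimental Results

Research Questions

RQ1: Can a pre-trained image diffusion model be adapted for detailed, temporally consistent video editing?
RQ2: How can inversion and attention control be designed to support both reconstruction and editability in the video setting?
RQ3: Does using decoupled guidance for the source and target prompts improve the quality of cross-attention editing in videos?
RQ4: To what extent can local edits be achieved without affecting unrelated regions or harming temporal consistency?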

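Key Results

Video-P2P enables both local and global video editing through cross-attention control.
A shared unconditional embedding for video inversion improves reconstruction quality at a small memory cost.
The decoupled-guidance strategy, which combines an optimized embedding for the source prompt with an initialized embedding for the target prompt, improves editability and stability.
Incorporating the attention maps from the two branches improves editing quality and temporal consistency.
In qualitative evaluation and a user study, Video-P2P preserves the original pose and scene better than existing methods.
Quantitative analysis shows improved cross-frame structure preservation and semantic consistency compared with the alternatives.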
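The decoupled-guidance idea can be illustrated with the standard classifier-free guidance formula, where each branch plugs in a different unconditional embedding. The sketch below is a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in for the diffusion network, all vectors are made-up two-element lists, and the guidance scale of 7.5 is only a commonly used default, not a value taken from the paper.

```python
from typing import List

Vector = List[float]


def guided_noise(eps_uncond: Vector, eps_cond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def toy_denoiser(latent: Vector, embedding: Vector) -> Vector:
    """Hypothetical stand-in for the noise predictor: an elementwise mix."""
    return [0.5 * x + 0.1 * e for x, e in zip(latent, embedding)]


latent = [1.0, -1.0]
source_prompt = [0.2, 0.4]
target_prompt = [0.6, 0.0]
optimized_uncond = [0.1, 0.3]  # assumed result of inversion (source branch)
initial_uncond = [0.0, 0.0]    # untouched null embedding (target branch)
scale = 7.5                    # assumed guidance scale

# Source branch: the optimized embedding favors faithful reconstruction.
source_eps = guided_noise(toy_denoiser(latent, optimized_uncond),
                          toy_denoiser(latent, source_prompt), scale)

# Target branch: the initial embedding keeps the prompt easy to edit.
target_eps = guided_noise(toy_denoiser(latent, initial_uncond),
                          toy_denoiser(latent, target_prompt), scale)
print(source_eps, target_eps)
```

The only difference between the two branches is which unconditional embedding feeds the guidance formula, which is the decoupling the results above refer to.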

This review was generated by AI and checked by a human editor.