QUICK REVIEW

[Paper Review] GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models

Sai Sree Harsha, Ambareesh Revanur|arXiv (Cornell University)|2024. 04. 18.

Generative Adversarial Networks and Image Synthesis | Citations: 6

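One-line summary

GenVideo uses a target image and latent correction guided by shape-aware InvEdit masks to keep edits consistent across frames, even when the target's shape differs from the source's.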

ABSTRACT

Video editing methods based on diffusion models that rely solely on a text prompt for the edit are hindered by the limited expressive power of text prompts. Thus, incorporating a reference target image as a visual guide becomes desirable for precise control over edit. Also, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from the source object. To address these challenges, we propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target and shape aware InvEdit masks. Further, we propose a novel target-image aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes, where existing approaches fail.

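Research motivation and goals

Use a target image as a visual guide for precise video editing when a text prompt alone is not expressive enough.
Enable edits when the target object's shape and size differ from those of the source object.
Maintain temporal consistency across frames during editing.
Provide a mask-guided inference framework that can be adapted to image-conditioned diffusion models.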

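Proposed method

Fine-tune an inflated SD-unCLIP model on the source video so that it accepts both target-image and text conditioning.
Generate target-image and shape aware InvEdit masks by comparing the source and target denoising predictions across DDIM steps.
Inject the target-image embedding into the masked region during UNet inference via latent fusion.
Apply a latent noise correction strategy during inference to improve temporal consistency across frames.
Apply latent blending guided by the InvEdit mask to preserve or selectively modify the background.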
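
The mask-guided latent blending step in the list above can be illustrated with a minimal NumPy sketch. It assumes the common convex-combination form of blending (edited latents inside the mask, source latents outside); the review does not state the paper's exact formula, and the function name `blend_latents`, the array shapes, and the toy values are all hypothetical.

```python
import numpy as np

def blend_latents(edited, source, mask):
    """Keep edited latents inside the mask and source latents outside it.

    edited, source: arrays of shape (frames, channels, height, width)
    mask: array of shape (frames, height, width) with values in [0, 1]
    Assumption: a plain convex combination, not necessarily the paper's rule.
    """
    m = mask[:, None, :, :]  # add a channel axis so the mask broadcasts
    return m * edited + (1.0 - m) * source

# Toy example: the "edit" replaces a 4x4 square in every frame.
frames, channels, h, w = 2, 4, 8, 8
source = np.zeros((frames, channels, h, w))
edited = np.ones((frames, channels, h, w))
mask = np.zeros((frames, h, w))
mask[:, 2:6, 2:6] = 1.0
blended = blend_latents(edited, source, mask)
```

With a binary mask the background latents pass through unchanged, which is what allows the background to be preserved; a soft mask with values between 0 and 1 would mix the two instead.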

Figure 2: Overview of GenVideo. Inflated attention layers are finetuned during source video finetuning. During inference, InvEdit predicts a region to edit and latent correction uses that mask to improve the inter-frame temporal consistency. $\mathcal{M}_{\phi}$ denotes “no mask”.

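Experimental results

Research questions

RQ1: Does target-image guidance enable accurate edits even when the target differs from the source object in shape or size?
RQ2: Does InvEdit provide accurate, shape-aware mask localization for video editing?
RQ3: Can a latent correction strategy improve inter-frame temporal consistency in shape-changing edits?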

Key results

CLIP-T	DINO	Temporal	Text	Image	Visual
0.238	0.236	0.957	3.6	3.3	4.2
0.234	0.189	0.980	4.3	4.3	3.7
0.231	0.216	0.985	3.3	3.8	2.1
0.235	0.262	0.951	3.9	3.6	3.4
0.234	0.195	0.949	4.0	4.1	5.0
0.241	0.374	0.967	1.7	1.8	2.3

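In the user study, GenVideo outperforms existing state-of-the-art baselines in alignment with both the target text and the target image.
InvEdit masks enable precise, shape-aware localization of the edit, which preserves the background appropriately.
Latent correction improves inter-frame temporal consistency by blending latents using inter-frame feature correspondences.
GenVideo demonstrates zero-shot image editing with shape-changing targets, for example staying consistent in a car-to-bus conversion.
On quantitative metrics, GenVideo achieves higher CLIP-T and DINO scores and a lower sum of user ranks for text and image alignment than the baselines.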
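
The latent-correction result above mentions blending latents using inter-frame feature correspondences. The sketch below shows one way such a step could look. It is an illustration under stated assumptions, not the paper's procedure: correspondences are taken as cosine-similarity nearest neighbours between per-position features of a reference frame and the current frame, and the blend weight `alpha`, the function name `correct_latents`, and the flattened `(positions, dims)` layout are all hypothetical.

```python
import numpy as np

def correct_latents(cur_latents, ref_latents, cur_feats, ref_feats, mask, alpha=0.5):
    """Pull masked latents of the current frame toward matched reference latents.

    cur_latents, ref_latents: (N, D) latents for N spatial positions
    cur_feats, ref_feats:     (N, C) features used to match positions
    mask:                     (N,) 1.0 inside the edit region, 0.0 outside
    Assumption: nearest-neighbour matching by cosine similarity.
    """
    cur_n = cur_feats / (np.linalg.norm(cur_feats, axis=1, keepdims=True) + 1e-8)
    ref_n = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-8)
    similarity = cur_n @ ref_n.T        # (N, N) cosine similarities
    match = similarity.argmax(axis=1)   # best reference position per current position
    matched = ref_latents[match]        # (N, D) reference latents, re-ordered
    weight = alpha * mask[:, None]      # blend only inside the mask
    return (1.0 - weight) * cur_latents + weight * matched

# Toy example: four positions whose features are a permutation of the reference.
ref_feats = np.eye(4)
cur_feats = ref_feats[[1, 0, 3, 2]]
ref_latents = np.arange(4, dtype=float)[:, None] * np.ones((4, 2))
cur_latents = np.zeros((4, 2))
mask = np.array([1.0, 1.0, 0.0, 0.0])
corrected = correct_latents(cur_latents, ref_latents, cur_feats, ref_feats, mask)
```

Positions outside the mask are left untouched, so only the edited region is nudged toward the reference frame.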

Figure 3: InvEdit approach. The mask is generated by first iteratively computing noise differences across multiple timesteps for the source denoising branch and target denoising branch. Then, these differences are averaged and binarized to obtain the InvEdit mask.
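
The average-then-binarize procedure described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a single-frame illustration under assumptions: the noise predictions are passed in as precomputed arrays, the per-pixel difference is an absolute difference averaged over channels, and the min-max normalization and the `threshold` value of 0.5 are hypothetical choices that the caption does not specify.

```python
import numpy as np

def invedit_style_mask(source_noise, target_noise, threshold=0.5):
    """Binary edit mask from source/target noise predictions.

    source_noise, target_noise: (timesteps, channels, height, width)
    Assumptions: absolute difference, min-max normalization, fixed threshold.
    """
    # Per-timestep, per-pixel difference, averaged over channels: (T, H, W)
    diff = np.abs(target_noise - source_noise).mean(axis=1)
    # Average the differences across timesteps: (H, W)
    avg = diff.mean(axis=0)
    # Normalize to [0, 1] so that the threshold is scale independent
    span = avg.max() - avg.min()
    normalized = (avg - avg.min()) / span if span > 0 else np.zeros_like(avg)
    # Binarize
    return (normalized > threshold).astype(np.float32)

# Toy example: the two branches disagree only inside a 4x4 square.
rng = np.random.default_rng(0)
T, C, H, W = 5, 4, 8, 8
source_noise = rng.normal(size=(T, C, H, W))
target_noise = source_noise.copy()
target_noise[:, :, 2:6, 2:6] += 3.0
mask = invedit_style_mask(source_noise, target_noise)
```

In the toy example the mask is 1 exactly where the two branches disagree, which mirrors the idea that the region to edit is where the source and target denoising predictions differ.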


This review was generated by AI and reviewed by a human editor.