QUICK REVIEW

[Paper Review] VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Han Lin, Abhay Zala|arXiv (Cornell University)|2023. 09. 26.

Multimodal Machine Learning Applications|Citations: 8

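One-Line Summary

VideoDirectorGPT uses an LLM-based planning stage to build a multi-scene video plan and a layout-guided video generator (Layout2Vid) to generate a temporally consistent long video from a single prompt, and it keeps training efficient by updating only a fraction of the parameters.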

ABSTRACT

Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules. This prompts an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which includes the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities. Next, guided by this video plan, our video generator, named Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities across multiple scenes, while being trained only with image-level annotations. Our experiments demonstrate that our proposed VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with consistency, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. Detailed ablation studies, including dynamic adjustment of layout control strength with an LLM and video generation with user-provided images, confirm the effectiveness of each component of our framework and its future potential.

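Motivation and Goals

Plan multi-scene video content from a single text prompt using an LLM.
Enable explicit spatial layout control and cross-scene temporal consistency in T2V generation.
Train a layout-guided video generator efficiently using only image-level annotations.
Show that layout accuracy and movement control improve while open-domain quality is preserved.
Support dynamic adjustment of the layout guidance strength and the integration of user-provided images.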

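Proposed Method

Two-stage pipeline: (i) GPT-4 generates a video plan containing scene descriptions, entities with 2D layouts, backgrounds, and consistency groupings; (ii) Layout2Vid performs grounded video generation guided by that video plan.
The video plan has four components: multi-scene descriptions, entities with 2D bounding boxes, backgrounds, and cross-scene consistency groupings.
Layout2Vid builds on ModelScopeT2V, freezes most parameters, and trains only the Guided 2D Attention (13% of the parameters), which allows layout control to be learned from image-level annotations.
Entity grounding combines image and text embeddings to preserve identity across scenes, using CLIP image and text features together with Fourier features of the bounding boxes.
Two-stage denoising: an initial layout-guided stage that uses Guided 2D Attention, followed by a standard stage; alpha controls the fraction of denoising steps that are layout-guided.
Training efficiency: Layout2Vid is trained on 0.64M image-level layout annotations and optimized for 50k steps on 8 A6000 GPUs.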
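
The plan structure and the alpha-controlled denoising split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' schema or code: the class names `Entity`, `Scene`, `VideoPlan`, the field names, the normalized `(x0, y0, x1, y1)` box format, and the helper `split_denoising_steps` with its rounding rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical containers that mirror the four plan components listed above.
# They are not the schema used by the paper.
Box = Tuple[float, float, float, float]  # assumed normalized (x0, y0, x1, y1)


@dataclass
class Entity:
    name: str
    boxes: List[Box]  # one 2D bounding box per frame


@dataclass
class Scene:
    description: str
    background: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class VideoPlan:
    scenes: List[Scene]
    # Each group lists entity names that should keep one identity across scenes.
    consistency_groups: List[List[str]] = field(default_factory=list)


def split_denoising_steps(total_steps: int, alpha: float) -> Tuple[int, int]:
    """Return (layout_guided_steps, standard_steps) for alpha in [0, 1].

    Assumption: the first alpha * total_steps steps are layout-guided and
    the remaining steps are standard; rounding is an illustrative choice.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    guided = round(alpha * total_steps)
    return guided, total_steps - guided


# Example: a one-scene plan and an even split of 50 denoising steps.
plan = VideoPlan(
    scenes=[
        Scene(
            description="a dog runs across a park",
            background="park",
            entities=[Entity(name="dog", boxes=[(0.1, 0.5, 0.3, 0.8)])],
        )
    ],
    consistency_groups=[["dog"]],
)
guided_steps, standard_steps = split_denoising_steps(total_steps=50, alpha=0.5)
```

A larger alpha spends more steps under layout guidance, which is the knob the review refers to as the layout guidance strength.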

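Experimental Results

Research Questions

RQ1: Can LLM-generated video plans improve multi-scene consistency and control in long-form video generation?
RQ2: Does a layout-guided video generator trained on image-level annotations achieve cross-scene temporal consistency while preserving visual quality?
RQ3: How does dynamic control of the layout guidance strength affect video quality and layout fidelity?
RQ4: Can user-provided example images be incorporated into layout-based video generation?
RQ5: What is the effect of image+text embeddings on preserving entity identity across scenes?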

Key Results

Model	Object	Count	Spatial	Scale	Overall Accuracy (%)	Movement Direction Accuracy (%)
ModelScopeT2V	89.8	38.8	18.0	15.8	40.8	30.5
VideoDirectorGPT (Ours)	97.1	77.4	61.1	47.0	70.6	46.5
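
To make the size of the gaps in the table explicit, the short sketch below computes the absolute difference per column. The numbers are copied from the two rows above; the metric keys and variable names are illustrative only.

```python
# Scores copied from the results table (percent).
METRICS = ["object", "count", "spatial", "scale", "overall", "movement_direction"]
modelscope = [89.8, 38.8, 18.0, 15.8, 40.8, 30.5]
videodirectorgpt = [97.1, 77.4, 61.1, 47.0, 70.6, 46.5]

# Absolute gain in percentage points for each metric.
gains = {
    metric: round(ours - base, 1)
    for metric, base, ours in zip(METRICS, modelscope, videodirectorgpt)
}

# Metric with the largest absolute gain.
largest_gain_metric = max(gains, key=gains.get)
```

With these numbers the largest gain is on the spatial skill (43.1 points), followed by count (38.6 points); overall accuracy rises by 29.8 points and movement direction accuracy by 16.0 points.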

On single-scene generation, VideoDirectorGPT outperforms a strong baseline (ModelScopeT2V) in layout control across object count, spatial relations, and scale (overall accuracy 70.6% vs. 40.8%).
The framework also handles temporal dynamics, substantially improving object movement direction accuracy (46.5% vs. 30.5%).
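On open-domain MSR-VTT, VideoDirectorGPT maintains competitive visual quality and text-video alignment while adding layout control and multi-scene consistency capabilities.
Layout2Vid enables efficient training with image-level layout annotations (updating only 13% of the parameters) while maintaining video generation quality.
Using both image and text embeddings for entity grounding preserves temporal consistency better than text-only grounding.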
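
As a rough illustration of the entity grounding this review describes (image and text embeddings combined with Fourier features of the bounding box), the sketch below builds one grounding vector per entity. It is an assumption-laden toy, not the paper's implementation: the embeddings are plain lists standing in for CLIP features, combining them by element-wise sum and appending the Fourier features by concatenation is a simplification chosen for illustration, and the function names and the `num_freqs` default are hypothetical.

```python
import math
from typing import List, Sequence


def fourier_features(box: Sequence[float], num_freqs: int = 8) -> List[float]:
    """Encode each box coordinate with sin/cos at frequencies 2^k * pi."""
    feats: List[float] = []
    for coord in box:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            feats.append(math.sin(angle))
            feats.append(math.cos(angle))
    return feats  # length = len(box) * 2 * num_freqs


def grounding_token(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    box: Sequence[float],
    num_freqs: int = 8,
) -> List[float]:
    """Combine image+text embeddings with box features (illustrative only)."""
    if len(image_emb) != len(text_emb):
        raise ValueError("image and text embeddings must have the same size")
    # Assumption: sum the two embeddings, then append the box encoding.
    entity_emb = [i + t for i, t in zip(image_emb, text_emb)]
    return entity_emb + fourier_features(box, num_freqs)


# Toy 4-dimensional stand-ins for CLIP image and text embeddings.
token = grounding_token(
    image_emb=[0.1, 0.2, 0.3, 0.4],
    text_emb=[0.4, 0.3, 0.2, 0.1],
    box=(0.1, 0.5, 0.3, 0.8),
)
```

Reusing the same image and text embeddings for an entity in every scene where it appears is what ties its identity together across scenes; only the box changes from frame to frame.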

This review was generated by AI and reviewed by a human editor.