QUICK REVIEW

[Paper Review] VideoStudio: Generating Consistent-Content and Multi-Scene Videos

Fuchen Long, Zhaofan Qiu | arXiv (Cornell University) | Jan 2, 2024

Video Analysis and Summarization | Cited by 6

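One-Sentence Summary

VideoDrafter (called VideoStudio in the title and abstract) uses an LLM-generated multi-scene script to guide diffusion-based video generation. With entity reference images and two diffusion models, one for scene reference images and one for video clips, it produces content-consistent multi-scene videos and outperforms SOTA baselines.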

ABSTRACT

The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for the given prompts. Most existing works tackle the single-scene scenario with only one video event occurring in a single background. Extending to generate multi-scene videos nevertheless is not trivial and necessitates to nicely manage the logic in between while preserving the consistent visual appearance of key content across video scenes. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages Large Language Models (LLM) to convert the input prompt into comprehensive multi-scene script that benefits from the logical knowledge learnt by LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as camera movement. VideoStudio identifies the common entities throughout the script and asks LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event and camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms the SOTA video generation models in terms of visual quality, content consistency, and user preference. Source code is available at https://github.com/FuchenUSTC/VideoStudio.

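Research Motivation and Goals

Use an LLM to convert the input prompt into a structured multi-scene video script that captures the logic across scenes.
Identify the entities shared across scenes and use them to keep visual appearance consistent.
Generate a reference image for each entity to link the scenes and guide video generation.
Generate each scene video with diffusion models conditioned on the prompt, the reference images, and the camera movement.
Demonstrate higher visual quality and content consistency than state-of-the-art methods.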

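Proposed Method

Three-stage framework: (1) Multi-scene script generation: an LLM (ChatGLM3-6B) converts the input prompt into a multi-scene script in which each scene is broken down into an event prompt, foreground/background entities, and camera movement.
(2) Entity reference image generation: Stable Diffusion generates reference images for the common entities, and U2-Net segmentation separates foreground from background to produce the final entity reference images.
(3) Video scene generation with two diffusion branches: VideoDrafter-Img creates a scene reference image from the event prompt and the entity references; VideoDrafter-Vid generates the clip from the scene reference image, the action description, and the camera movement, using temporal attention and frame warping to reflect camera motion.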
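The script structure from stage (1) and the selection of common entities that feeds stage (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Scene` class, its field names, and `common_entities` are hypothetical, and plain counting stands in for the LLM that performs this step in the paper. The rule that an entity is "common" when it appears in at least two scenes is also an assumption; the summary does not state the exact criterion.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scene:
    """One entry of the multi-scene script (field names are illustrative)."""
    event_prompt: str
    foreground: List[str] = field(default_factory=list)
    background: List[str] = field(default_factory=list)
    camera_movement: str = "static"


def common_entities(script: List[Scene], min_scenes: int = 2) -> List[str]:
    """Return entities that appear in at least `min_scenes` scenes.

    Assumption: "common" means shared by two or more scenes.
    """
    counts = Counter()
    for scene in script:
        # Count each entity once per scene, even if it is listed twice.
        counts.update(set(scene.foreground) | set(scene.background))
    return sorted(name for name, n in counts.items() if n >= min_scenes)


# A made-up three-scene script for illustration only.
script = [
    Scene("a boy walks his dog in a park", ["boy", "dog"], ["park"], "pan left"),
    Scene("the boy throws a ball to the dog", ["boy", "dog", "ball"], ["park"]),
    Scene("the dog sleeps at home", ["dog"], ["living room"], "zoom in"),
]

shared = common_entities(script)
print(shared)  # ['boy', 'dog', 'park']
```

Only the entities returned here would need a reference image, since entities that occur in a single scene (the ball, the living room) have no cross-scene appearance to keep consistent.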

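Experimental Results

Research Questions

RQ1: How does an LLM-generated multi-scene script improve logical coherence across scenes?
RQ2: Do entity-level reference images ensure cross-scene content consistency in multi-scene videos?
RQ3: Do the diffusion-based scene and video models, conditioned on the script and reference images, outperform existing single-scene and multi-scene video generation methods?
RQ4: How do temporal dynamics and camera movement affect video quality and consistency?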

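Key Findings

VideoDrafter achieves higher visual quality and content consistency than state-of-the-art models on multiple benchmarks.
Entity reference images improve cross-scene consistency and alignment with the prompt.
The two-stage diffusion approach (scene reference image generation followed by video generation) effectively keeps entities consistent across scenes.
Human evaluation shows improved visual quality, logical coherence, and content consistency when LLM-driven scripts and reference images are used.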

This review was generated by AI and reviewed by human editors.