QUICK REVIEW

[Paper Review] Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang, Yufan Feng|arXiv (Cornell University)|Sep 25, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 11

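One-Sentence Summary

Free-Bloom is a zero-shot, training-free pipeline that uses a large language model (LLM) as the director to generate a semantically coherent sequence of frame prompts and a pre-trained latent diffusion model (LDM) as the animator to generate high-quality, temporally coherent video, with training-free dual-path interpolation to reach a higher frame rate.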

ABSTRACT

Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation considering the data- and cost-efficient. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.

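Motivation and Objectives

Advance zero-shot text-to-video generation that is efficient in both data and cost.
Use LLMs to generate a semantically coherent sequence of frame prompts.
Adapt a pre-trained LDM, without any training, to produce video frames that remain temporally coherent and consistent in identity.
Introduce interpolation and attention mechanisms to raise temporal resolution and frame fidelity.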

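Proposed Method

Use an LLM as the director to generate a sequence of frame prompts from the input text prompt.
Modify the latent diffusion model so that noise is sampled jointly across frames and attention shifts in a step-aware manner during denoising, yielding coherent frames.
Apply training-free dual-path interpolation in latent space to create intermediate frames while preserving semantics and continuity.
Apply the step-aware attention shift so that, over the course of denoising, attention moves from the contextual frames (the first frame / the previous frame) to the current frame.
Optionally extend the pipeline to personalization and image-to-video generation through DDIM inversion and LDM-based extensions.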
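
To make the two LDM modifications listed above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The cosine/sine mixing weights, the parameter names lam and switch_fraction, and the hard switch between attention targets are all assumptions chosen for illustration; the summary only states that noise is sampled jointly across frames and that attention moves from contextual frames to the current frame as denoising proceeds.

```python
import math
import random


def joint_noise(num_frames, dim, lam, seed=0):
    """Mix one shared noise vector with per-frame noise (assumed scheme).

    lam in [0, 1]: 0 gives identical noise for every frame, 1 gives
    (numerically almost) independent noise per frame. Because
    cos^2 + sin^2 = 1, each entry keeps unit variance.
    """
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w_shared = math.cos(lam * math.pi / 2)
    w_own = math.sin(lam * math.pi / 2)
    frames = []
    for _ in range(num_frames):
        own = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append([w_shared * s + w_own * o for s, o in zip(shared, own)])
    return frames


def attention_context(frame_idx, step, total_steps, switch_fraction=0.5):
    """Pick which frames supply keys/values for `frame_idx` at `step`.

    `step` counts denoising steps from 0 (noisiest) to total_steps - 1.
    Early steps attend to the contextual frames (first and previous);
    later steps attend to the current frame. The hard switch at
    `switch_fraction` is a hypothetical simplification.
    """
    if step < switch_fraction * total_steps:
        return sorted({0, max(frame_idx - 1, 0)})
    return [frame_idx]
```

In a real pipeline the returned indices would select whose keys and values feed the attention layers of the denoising network; here they only illustrate the schedule.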

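Experimental Results

Research Questions

RQ1: Without video data or training, can a zero-shot pipeline driven by an LLM generate videos that are both semantically and temporally coherent?
RQ2: Given a sequence of text prompts, how can an LDM be adapted to produce video frames with identity consistency and temporal consistency?
RQ3: Does training-free dual-path interpolation raise the frame rate while preserving semantic fidelity?
RQ4: How do joint noise sampling across frames and step-aware attention affect video quality and coherence?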

Key Findings

Method	Training-free	CLIP metric ↑	Fidelity ↑	Temporal ↑	Semantic ↑	Rank ↓
VideoFusion		0.483	3.436	3.889	3.267	2.317
LVDM	-	0.480	3.289	3.650	3.242	2.567
T2V-Zero	✓	0.479	3.486	2.783	3.025	3.033
Ours	✓	0.477 / 0.482*	4.133	3.267	3.867	2.083

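Free-Bloom generates high-quality, semantically meaningful videos from a prompt without relying on video data or training.
Joint noise sampling and the step-aware attention shift improve temporal coherence and identity consistency across frames.
Dual-path interpolation raises temporal resolution while preserving context and semantic content.
Quantitatively, Free-Bloom's CLIP-based score (0.477 / 0.482*) is competitive with the zero-shot and trained baselines (0.479 to 0.483), and it receives the best rank in the user study (2.083).
The method produces coherent narrative sequences while maintaining per-frame fidelity.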
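
The dual-path interpolation mentioned in the findings above can be pictured with a small latent-space sketch. The summary does not say how the two paths are combined, so everything here is an assumption: one path linearly interpolates the two neighbouring keyframe latents (context), the other is the intermediate frame's own latent (semantics), and a hypothetical weight mix blends them.

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length latent vectors."""
    return [x + (y - x) * t for x, y in zip(a, b)]


def dual_path_latent(prev_latent, next_latent, own_latent, t=0.5, mix=0.5):
    """Blend a context path with the frame's own path (assumed scheme).

    Context path: interpolate the neighbouring keyframe latents at position t.
    Own path: the latent produced for the intermediate frame itself.
    mix = 0 keeps only the context path, mix = 1 keeps only the own path.
    """
    context = lerp(prev_latent, next_latent, t)
    return lerp(context, own_latent, mix)


def interpolate_sequence(keyframes, own_latents, mix=0.5):
    """Insert one blended frame between each pair of keyframes.

    `own_latents[i]` is the own-path latent for the frame between
    keyframes i and i + 1, so N keyframes yield 2N - 1 frames.
    """
    out = [keyframes[0]]
    for i in range(len(keyframes) - 1):
        out.append(dual_path_latent(keyframes[i], keyframes[i + 1],
                                    own_latents[i], mix=mix))
        out.append(keyframes[i + 1])
    return out
```

Inserting one frame per gap roughly doubles the frame rate; a blend closer to the context path favours smooth motion, while a blend closer to the own path favours the frame's prompt semantics.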


This review was generated by AI and reviewed by a human editor.