QUICK REVIEW

[Paper Review] Diffusion Models for Video Prediction and Infilling

Tobias Höppe, Arash Mehrjou|arXiv (Cornell University)|Jun 15, 2022

Generative Adversarial Networks and Image Synthesis | Cited by 50

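One-Sentence Summary

RaMViD extends diffusion models to video using 3D convolutions and random masking, so that a single architecture handles video prediction, infilling, and upsampling, and it achieves competitive results on several benchmarks.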

ABSTRACT

Predicting and anticipating future outcomes or reasoning about missing information in a sequence are critical skills for agents to be able to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling, and upsampling. Due to our simple conditioning scheme, we can utilize the same architecture as used for unconditional training, which allows us to train the model in a conditional and unconditional fashion at the same time. We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction.

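Motivation and Objectives

Achieve temporally coherent video generation with diffusion models for prediction and infilling.
Introduce a random-mask conditioning mechanism that unifies unconditional, conditional, and mixed training.
Demonstrate state-of-the-art performance on BAIR and strong results on prediction and infilling tasks on Kinetics-600 and UCF-101.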

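Proposed Method

Introduces Random-Mask Video Diffusion (RaMViD), a diffusion model architecture built on 3D convolutions.
Conditions on an arbitrary subset of frames by masking out the non-conditioning frames and injecting the conditioning frames into the network input.
Trains with randomized masks so that the same architecture can switch between conditional and unconditional learning.
Uses a linear diffusion schedule and a U-Net with self-attention at resolutions 16 and 8 for video modeling.
Defines the conditional diffusion objective so that only the unknown frames (those outside the conditioning set) are reconstructed, while the conditioning frames are kept unchanged.
At inference, choosing the conditioning set C and sampling the unknown frames U enables prediction, infilling, and upsampling.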
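The conditioning scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The names `p_uncond`, `max_cond`, and `model`, the schedule endpoints, and the noise-prediction form of the loss are all assumptions made for illustration; videos are assumed to be arrays of shape (frames, height, width, channels).

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (endpoint values are assumed defaults)."""
    return np.linspace(beta_start, beta_end, num_steps)

def sample_conditioning_mask(num_frames, p_uncond, max_cond, rng):
    """Boolean mask over frames; True marks a conditioning frame (set C)."""
    mask = np.zeros(num_frames, dtype=bool)
    if rng.random() < p_uncond:
        return mask  # unconditional training sample: C is empty
    k = rng.integers(1, max_cond + 1)  # assumes max_cond < num_frames
    mask[rng.choice(num_frames, size=k, replace=False)] = True
    return mask

def build_network_input(x0, x_t, cond_mask):
    """Clean frames where cond_mask is True, noised frames elsewhere."""
    return np.where(cond_mask[:, None, None, None], x0, x_t)

def training_loss(model, x0, t, alphas_cumprod, cond_mask, rng):
    """Noise-prediction loss evaluated only on the unknown frames U."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    pred = model(build_network_input(x0, x_t, cond_mask), t)
    unknown = ~cond_mask
    return np.mean((pred[unknown] - noise[unknown]) ** 2)

# Inference-time masks: the task is selected purely by the choice of C.
def prediction_mask(num_frames, num_context):
    mask = np.zeros(num_frames, dtype=bool)
    mask[:num_context] = True  # condition on the first frames
    return mask

def infilling_mask(num_frames):
    mask = np.zeros(num_frames, dtype=bool)
    mask[[0, -1]] = True  # condition on the first and last frame
    return mask

def upsampling_mask(num_frames, stride):
    mask = np.zeros(num_frames, dtype=bool)
    mask[::stride] = True  # condition on every stride-th frame
    return mask
```

In this sketch `model` is any callable taking the mixed input and the timestep `t` and returning an array of the same shape; a real system would use the 3D-convolutional U-Net in its place.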

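Experimental Results

Research Questions

RQ1: Can diffusion models be extended effectively to the video domain for prediction and infilling?
RQ2: Does random masking provide a simple, effective conditioning mechanism that unifies conditional and unconditional frames during diffusion sampling?
RQ3: How do different conditioning mask settings (for example, the number and position of conditioning frames) affect prediction and infilling performance?
RQ4: How does RaMViD compare with prior methods on standard video prediction and completion benchmarks?
RQ5: Can the model perform unconditional video generation and autoregressive long-sequence sampling?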

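Key Findings

On BAIR, RaMViD achieves state-of-the-art Fréchet Video Distance (FVD) when predicting 11 to 15 frames given the conditioning frames.
On Kinetics-600 prediction, RaMViD matches or exceeds competing methods with a competitive parameter count of about 308M.
Unconditional generation with RaMViD is feasible across datasets of varying complexity; the effect of raising the unconditional rate pU depends on dataset complexity and can either improve or degrade performance.
With the first and last frames as conditioning, RaMViD performs effective video infilling and obtains competitive FVD across a range of conditioning settings.
Autoregressive sampling can extend sequences beyond the training horizon, although quality may degrade gradually on longer sequences.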
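The autoregressive sampling mentioned in the last finding can be sketched as a simple loop that feeds the most recent frames back in as conditioning. This is an illustrative sketch under assumptions, not the paper's procedure: `sample_block` is a hypothetical conditional sampler that returns `block_len` frames whose leading frames equal the given context, and the parameter names are invented here.

```python
import numpy as np

def autoregressive_rollout(sample_block, first_frames, block_len,
                           num_context, total_frames):
    """Extend a video block by block, conditioning on the latest frames.

    sample_block(context, block_len) is a hypothetical sampler returning an
    array of block_len frames that starts with the context frames.
    """
    video = np.asarray(first_frames)
    if video.shape[0] < num_context:
        raise ValueError("need at least num_context starting frames")
    if block_len <= num_context:
        raise ValueError("block_len must exceed num_context")
    while video.shape[0] < total_frames:
        context = video[-num_context:]            # most recent frames form C
        block = sample_block(context, block_len)  # sample the unknown frames
        video = np.concatenate([video, block[num_context:]], axis=0)
    return video[:total_frames]
```

Each pass only sees the last `num_context` frames, so errors can accumulate from block to block, which is consistent with the gradual quality loss noted above.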

This review was generated by AI and reviewed by human editors.