QUICK REVIEW

[Paper Review] Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation

Huaizu Jiang, Deqing Sun | arXiv (Cornell University) | Nov 30, 2017

Advanced Vision and Imaging | References: 20 | Cited by: 43

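One-Sentence Summary

This paper proposes Super SloMo, an end-to-end convolutional neural network for high-quality, variable-length video interpolation that generates multiple intermediate frames between two input frames. It jointly models motion estimation and occlusion reasoning with two U-Net-based networks, one for flow computation and one for flow refinement, and introduces soft visibility maps. The method achieves state-of-the-art performance on several datasets, including Middlebury, UCF101, and high-frame-rate Sintel, and because none of its learned parameters depend on time, it can generate any number of intermediate frames in parallel.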

ABSTRACT

Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. We use 1,132 video clips with 240-fps, containing 300K individual video frames, to train our network. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.

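Research Motivation and Goals

Develop a method that generates multiple high-quality intermediate frames between two input video frames, enabling arbitrary frame-rate upsampling.
Jointly model motion estimation and occlusion reasoning in a single end-to-end trainable network to reduce artifacts at motion boundaries.
Design a time-independent architecture that can generate any number of intermediate frames in parallel, overcoming the limitations of recursive single-frame interpolation.
Train the model on high-frame-rate video data to improve generalization and performance across diverse video interpolation tasks.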

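Proposed Method

A U-Net-based flow computation network estimates the bi-directional optical flow between the two input frames.
The bi-directional flows are linearly combined to approximate the intermediate optical flows at each desired time step.
A second U-Net refines the approximated flows and predicts soft visibility maps to handle occlusion.
The input frames are warped with the refined flow fields, and the visibility maps are applied before linear fusion to exclude occluded pixels.
The whole network is trained end to end on 1,132 high-frame-rate (240 fps) video clips totaling about 300,000 frames.
Because the learned parameters are time-independent, the model can generate any number of intermediate frames in parallel without retraining.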
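The flow-approximation and fusion steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the bilinear warp is a simple stand-in for a differentiable warping layer, and the refinement U-Net is replaced by an assumption of all-ones visibility maps and unrefined flows. The linear-combination coefficients follow the formulation given in the Super SloMo paper.

```python
import numpy as np

def approx_intermediate_flows(f01, f10, t):
    """Linearly combine the two input flows to approximate the flows
    from time t back to frame 0 and forward to frame 1."""
    ft0 = -(1.0 - t) * t * f01 + t * t * f10
    ft1 = (1.0 - t) ** 2 * f01 - t * (1.0 - t) * f10
    return ft0, ft1

def backward_warp(img, flow):
    """Bilinearly sample img (H, W, C) at (x + u, y + v); flow is (H, W, 2) as (u, v)."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bot

def fuse(i0, i1, ft0, ft1, v0, v1, t, eps=1e-8):
    """Visibility-weighted linear fusion of the two warped input frames."""
    w0 = (1.0 - t) * v0[..., None]
    w1 = t * v1[..., None]
    num = w0 * backward_warp(i0, ft0) + w1 * backward_warp(i1, ft1)
    return num / (w0 + w1 + eps)

def interpolate(i0, i1, f01, f10, ts):
    """Produce one frame per t in ts; each t is independent, so the loop could run in parallel."""
    frames = []
    for t in ts:
        ft0, ft1 = approx_intermediate_flows(f01, f10, t)
        # Assumption: no refinement network here, so every pixel is treated as visible.
        v0 = np.ones(i0.shape[:2])
        v1 = np.ones(i0.shape[:2])
        frames.append(fuse(i0, i1, ft0, ft1, v0, v1, t))
    return frames
```

Nothing in `interpolate` depends on a previously generated frame, which is the property that lets the method produce all requested time steps at once rather than recursively.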

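Experimental Results

Research Questions

RQ1: Can a single end-to-end deep learning model effectively generate multiple intermediate frames between two input video frames while maintaining high spatial and temporal coherence?
RQ2: How can motion boundaries and occlusion be modeled effectively to reduce artifacts in video interpolation?
RQ3: Can a time-independent network architecture generate an arbitrary number of intermediate frames in parallel, avoiding the bottleneck of recursive computation?
RQ4: Does jointly optimizing optical flow estimation and visibility prediction yield better interpolation quality than independent or sequential approaches?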

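Key Findings

On the Middlebury dataset, Super SloMo outperforms all baseline methods, achieving the best PSNR and SSIM on 6 of the 8 sequences, including the synthetic Urban sequence and the stereo Teddy sequence.
On the UCF101 dataset, Super SloMo consistently outperforms both non-neural and CNN-based methods on all metrics, showing strong performance in regions with complex motion.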
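On the Slowflow dataset, Super SloMo achieves the best PSNR, and FlowNet2 is the only baseline to match or exceed it, and only on SSIM and L1 error, which points to strong overall quality.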
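On the high-frame-rate MPI Sintel dataset, Super SloMo significantly outperforms all other methods, with PSNR consistently above the baselines at every interpolated time step.
For unsupervised optical flow learning, Super SloMo achieves an average endpoint error (EPE) of 13.0 on the KITTI 2012 benchmark, an 11% improvement over DVF, the previous state-of-the-art method.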
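For readers unfamiliar with the metrics cited above, the following sketch shows the standard definitions of PSNR and average endpoint error. The function names and the default 8-bit peak value of 255 are assumptions for illustration; the exact evaluation protocol used in the paper (masking, color space, cropping) is not described in this review.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def average_epe(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors (H, W, 2)."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))
```

Higher PSNR means a closer match to the ground-truth frame, while lower EPE means more accurate flow.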

This review was generated by AI and reviewed by a human editor.