QUICK REVIEW

[Paper Review] Video Frame Interpolation via Adaptive Separable Convolution

Simon Niklaus, Long Mai, Feng Liu | arXiv (Cornell University) | Aug 5, 2017

Advanced Vision and Imaging | References: 39 | Cited by: 72

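One-Sentence Summary

A neural network densely estimates two pairs of 1D kernels for every pixel (one pair per input frame) and uses them to perform separable, spatially adaptive convolution for video frame interpolation. This lowers memory requirements enough to synthesize the whole frame at once, which in turn makes it possible to train with a perceptual loss for better visual quality.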

ABSTRACT

Standard video frame interpolation methods first estimate optical flow between input frames and then synthesize an intermediate frame guided by motion. Recent approaches merge these two steps into a single convolution process by convolving input frames with spatially adaptive kernels that account for motion and re-sampling simultaneously. These methods require large kernels to handle large motion, which limits the number of pixels whose kernels can be estimated at once due to the large memory demand. To address this problem, this paper formulates frame interpolation as local separable convolution over input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. Our method develops a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. Since our method is able to estimate kernels and synthesizes the whole video frame at once, it allows for the incorporation of perceptual loss to train the neural network to produce visually pleasing frames. This deep neural network is trained end-to-end using widely available video data without any human annotation. Both qualitative and quantitative experiments show that our method provides a practical solution to high-quality video frame interpolation.
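The formulation described in the abstract can be written compactly as follows. The notation is an assumption made for this review: P_1(x, y) and P_2(x, y) are the n x n patches of the two input frames centered at pixel (x, y), K_1 and K_2 are the per-pixel 2D kernels, k_{i,v} and k_{i,h} are the vertical and horizontal 1D kernels, and * denotes local convolution (an elementwise product of kernel and patch, summed).

```latex
\hat{I}(x, y) = K_1(x, y) * P_1(x, y) + K_2(x, y) * P_2(x, y),
\qquad
K_i(x, y) \approx k_{i,v}(x, y)\, k_{i,h}(x, y)^{\top}, \quad i \in \{1, 2\}
```

Under this approximation each n x n kernel (n^2 values) is represented by two n-vectors (2n values), which is where the parameter saving mentioned in the abstract comes from.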

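Research Motivation and Objectives

Advance end-to-end, high-quality frame interpolation without explicit optical flow estimation.
Reduce the memory and compute cost of the spatially adaptive kernels needed to handle large motion.
Propose a fully convolutional network that predicts separable 1D kernels for all pixels simultaneously.
Make it possible to incorporate a perceptual loss into training to improve the visual quality of interpolated frames.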

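Proposed Method

Replace full 2D adaptive kernels with separable 1D kernels that approximate the 2D kernel at each output pixel.
Use a fully convolutional encoder-decoder network to predict four 1D kernels per pixel (a vertical and a horizontal kernel for each of the two input frames).
Apply the predicted 1D kernels to the input frames as local convolutions, synthesizing the intermediate frame in a single pass.
Train with either an L1 loss or a perceptual loss (VGG-based feature reconstruction); the perceptual loss is the option aimed at sharpness and detail.
Handle frame boundaries with replication padding, and mitigate checkerboard artifacts by using bilinear upsampling in the decoder.
Balance motion handling against receptive field by experimenting with the kernel size (51) and the number of pooling layers (five).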
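The kernel application step described above can be sketched in NumPy. This is a minimal illustration and not the authors' implementation: the function names `separable_adaptive_conv` and `interpolate` are hypothetical, the kernels are passed in as plain arrays instead of being predicted by a network, and the explicit Python loops favor readability over speed. An odd kernel size n is assumed.

```python
import numpy as np


def separable_adaptive_conv(frame, k_vert, k_horiz):
    """Apply a different separable kernel at every pixel of one frame.

    frame:   (H, W, C) image
    k_vert:  (H, W, n) per-pixel vertical 1D kernels
    k_horiz: (H, W, n) per-pixel horizontal 1D kernels
    Returns an (H, W, C) float image.
    """
    h, w, _ = frame.shape
    n = k_vert.shape[-1]
    r = n // 2
    # Replication padding so that border pixels still see a full n x n patch.
    padded = np.pad(frame, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros(frame.shape, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + n, x:x + n, :]  # (n, n, C), centered at (y, x)
            # Same result as weighting the patch by outer(k_vert, k_horiz),
            # but the n x n kernel is never materialized.
            out[y, x] = np.einsum(
                "i,j,ijc->c", k_vert[y, x], k_horiz[y, x], patch
            )
    return out


def interpolate(frame1, frame2, kernels):
    """Sum the two locally convolved frames; kernels = (k1v, k1h, k2v, k2h)."""
    k1v, k1h, k2v, k2h = kernels
    return (separable_adaptive_conv(frame1, k1v, k1h)
            + separable_adaptive_conv(frame2, k2v, k2h))
```

As a sanity check, kernels whose only nonzero tap is the center tap reduce the operation to a per-pixel blend: if both vertical kernels have a center value of 0.5 and both horizontal kernels a center value of 1, the output is the average of the two frames. Moving the nonzero tap away from the center samples a displaced pixel, which is how such kernels can encode motion and re-sampling together.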

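Experimental Results

Research Questions

RQ1: Can separable 1D kernels approximate full 2D spatially adaptive kernels for frame interpolation while reducing memory requirements?
RQ2: Compared with a purely pixel-wise loss, does end-to-end training with a perceptual loss yield higher perceptual quality in the interpolated frames?
RQ3: How does the proposed separable convolution method compare, in quality and speed, with state-of-the-art optical-flow-based methods and with AdaConv?
RQ4: Which kernel sizes and network design choices best handle large motion while still allowing whole-frame synthesis at 1080p?
RQ5: Is the method robust in challenging scenarios such as occlusion, motion discontinuities, and brightness changes?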

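Key Findings

Separable 1D kernels reduce the storage per kernel from n^2 to 2n values, which enables full-frame 1080p interpolation in a single forward pass.
The L1-loss model is strong numerically, achieving state-of-the-art results on the Middlebury benchmark, especially in regions with motion discontinuities.
Adding the perceptual loss (L_F) improves visual sharpness and high-frequency detail, as reflected in the qualitative results and the user study.
The method is much faster than AdaConv for 1080p interpolation (more than 20x) and generally produces more visually pleasing results.
Bilinear upsampling in the decoder helps mitigate the checkerboard artifacts that some upsampling methods introduce.
Quantitative results are competitive with state-of-the-art methods in terms of MAE and SSIM, with the L1 model performing best overall in the held-out evaluation.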
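The n^2 versus 2n saving in the first finding can be made concrete with a short calculation. The sketch below counts only the kernel coefficients the network must output for one frame pair; it ignores intermediate activations, and the 32-bit float storage and the helper names are assumptions made for illustration.

```python
def kernel_values_2d(height, width, n, frames=2):
    """Coefficients needed when every pixel gets one n x n kernel per frame."""
    return height * width * frames * n * n


def kernel_values_separable(height, width, n, frames=2):
    """Coefficients needed when every pixel gets two n-tap kernels per frame."""
    return height * width * frames * 2 * n


if __name__ == "__main__":
    h, w, n = 1080, 1920, 51      # 1080p frame, 51-tap kernels
    bytes_per_value = 4           # assumption: 32-bit floats
    full = kernel_values_2d(h, w, n) * bytes_per_value
    sep = kernel_values_separable(h, w, n) * bytes_per_value
    print(f"2D kernels:        {full / 2**30:.1f} GiB")
    print(f"separable kernels: {sep / 2**30:.1f} GiB")
    print(f"reduction factor:  {full / sep:.1f}x")
```

The reduction factor is n / 2 regardless of resolution, so the benefit grows with the kernel size needed to cover large motion.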

This review was generated by AI and checked by human editors.