QUICK REVIEW

[Paper Review] Context-aware Synthesis for Video Frame Interpolation

Simon Niklaus, Feng Liu|arXiv (Cornell University)|Mar 29, 2018

Advanced Vision and Imaging | References: 31 | Cited by: 30

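One-Sentence Summary

This paper proposes a context-aware video frame interpolation method that uses bidirectional optical flow to warp both the input frames and their per-pixel context features. Unlike earlier methods that only blend the warped frames, it feeds the warped frames and context maps to a fully convolutional neural network that synthesizes the intermediate frame. The method handles occlusion, large motion, and blur better than prior work, reaching a PSNR of 34.62 dB on the DVF dataset and ranking ahead of the state-of-the-art methods it was compared with on the Middlebury benchmark.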

ABSTRACT

Video frame interpolation algorithms typically estimate optical flow or its variations and then use it to guide the synthesis of an intermediate frame between two consecutive original frames. To handle challenges like occlusion, bidirectional flow between the two input frames is often estimated and used to warp and blend the input frames. However, how to effectively blend the two warped frames still remains a challenging problem. This paper presents a context-aware synthesis approach that warps not only the input frames but also their pixel-wise contextual information and uses them to interpolate a high-quality intermediate frame. Specifically, we first use a pre-trained neural network to extract per-pixel contextual information for input frames. We then employ a state-of-the-art optical flow algorithm to estimate bidirectional flow between them and pre-warp both input frames and their context maps. Finally, unlike common approaches that blend the pre-warped frames, our method feeds them and their context maps to a video frame synthesis neural network to produce the interpolated frame in a context-aware fashion. Our neural network is fully convolutional and is trained end to end. Our experiments show that our method can handle challenging scenarios such as occlusion and large motion and outperforms representative state-of-the-art approaches.

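Research Motivation and Goals

Address the limits on frame interpolation quality caused by occlusion, large motion, and inaccurate optical flow.
Improve synthesis quality beyond simple blending of warped frames by introducing contextual information.
Develop a flexible, end-to-end trainable neural network that uses both motion and semantic context for high-quality interpolation.
Demonstrate superior performance on challenging video interpolation benchmarks, especially under motion blur and missing data.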

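Proposed Method

Extract per-pixel context features from the input frames with a pre-trained neural network.
Estimate bidirectional optical flow between the input frames with PWC-Net.
Pre-warp the input frames and their context maps with the estimated bidirectional flow.
Train a fully convolutional frame synthesis network that takes the warped frames and context maps as input and produces the intermediate frame.
Supervise training with a feature (perceptual) loss $\mathcal{L}_F$ or a Laplacian pyramid loss, and rely on the network architecture to avoid checkerboard artifacts.
Use bilinear upsampling instead of transposed convolution in the synthesis network to prevent grid-like artifacts.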
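The pre-warping step can be illustrated with a small sketch. The code below is a minimal pure-Python illustration, not the authors' implementation: it assumes forward warping by bilinear splatting, and the function name `forward_warp`, the plane-based data layout, and the hole handling are all hypothetical choices made for this example. It moves a colour plane and a context plane with the same flow, scaled by the time position $t$, so that the two stay aligned for the synthesis network.

```python
import math

def forward_warp(channels, flow, t=0.5):
    """Splat each source pixel to (x + t*u, y + t*v) with bilinear weights.

    channels: list of 2D grids (lists of rows), e.g. colour planes followed
              by context-feature planes, all with the same height and width.
    flow:     2D grid of (u, v) displacements from this frame to the other.
    Returns (warped_channels, weight); weight == 0 marks holes.
    Assumption: this is an illustrative warping operator, not the paper's.
    """
    h, w = len(flow), len(flow[0])
    out = [[[0.0] * w for _ in range(h)] for _ in channels]
    weight = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = flow[y][x]
            tx, ty = x + t * u, y + t * v
            x0, y0 = math.floor(tx), math.floor(ty)
            fx, fy = tx - x0, ty - y0
            # Distribute the pixel over its four integer neighbours.
            for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
                for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
                    wgt = wx * wy
                    if wgt == 0.0 or not (0 <= xx < w and 0 <= yy < h):
                        continue
                    weight[yy][xx] += wgt
                    for c, plane in enumerate(channels):
                        out[c][yy][xx] += wgt * plane[y][x]
    # Normalise targets that received contributions from several sources.
    for y in range(h):
        for x in range(w):
            if weight[y][x] > 0.0:
                for c in range(len(channels)):
                    out[c][y][x] /= weight[y][x]
    return out, weight

# Toy example: a 1x4 frame and a matching context plane, uniform flow of
# two pixels to the right, interpolated at the temporal midpoint t = 0.5.
frame = [[1.0, 2.0, 3.0, 4.0]]
context = [[10.0, 20.0, 30.0, 40.0]]
flow = [[(2.0, 0.0)] * 4]
(warped_frame, warped_context), weight = forward_warp([frame, context], flow, t=0.5)
```

Target pixels that receive no contribution keep a weight of 0 and are left at zero here, on the assumption that a downstream synthesis network would fill them in. Changing `t` moves the same pixels to a different temporal position without any other change to the code.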

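Experimental Results

Research Questions

RQ1: Does adding per-pixel contextual information improve video frame interpolation in the presence of occlusion and motion blur?
RQ2: Does a synthesis network operating on warped frames and context maps outperform traditional blending-based methods?
RQ3: How does the context-aware synthesis method compare with state-of-the-art methods on benchmarks such as Middlebury and DAVIS?
RQ4: Can the method interpolate at an arbitrary temporal position $t \in [0,1]$ without retraining or recursive refinement?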

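Key Findings

On the DVF dataset, the proposed method reaches a PSNR of 34.62 dB, ahead of the Deep Voxel Flow baseline (34.12 dB).
On the Middlebury benchmark, the method ranked best among all published methods at the time.
In a user study, participants preferred the results of this method trained with the feature loss $\mathcal{L}_F$ in 80% of comparisons against five competing methods.
The method handles large motion and occlusion effectively and produces fewer artifacts than blending-based baselines.
The context maps let the synthesis network make more accurate predictions, especially in regions with motion blur or missing flow.
The method supports interpolation at an arbitrary temporal position $t \in [0,1]$ without retraining or recursive steps.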
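To put the PSNR figures above in perspective, the metric can be computed as in the sketch below. This is a minimal illustration under stated assumptions: the function name `psnr` is hypothetical, intensities are assumed to lie in [0, 255], and the exact evaluation protocol used in the paper (colour space, cropping, averaging over frames) is not described in this review.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized 2D grids of intensities.

    Assumption: `peak` is the maximum possible intensity (255 for 8-bit).
    """
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy example: a 2x2 ground-truth patch and a slightly wrong estimate.
ground_truth = [[100.0, 120.0], [140.0, 160.0]]
interpolated = [[101.0, 119.0], [141.0, 159.0]]
score = psnr(ground_truth, interpolated)
```

Because PSNR is logarithmic in the mean squared error, the 0.50 dB gap between 34.62 and 34.12 corresponds to a mean squared error about 11% lower, since $10^{-0.05} \approx 0.89$.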

This review was generated by AI and checked by a human editor.