QUICK REVIEW

[Paper Review] Video Frame Interpolation via Adaptive Separable Convolution

Simon Niklaus, Long Mai, Feng Liu | arXiv (Cornell University) | Aug 5, 2017

Advanced Vision and Imaging | References: 39 | Citations: 72

One-Sentence Summary

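A neural network densely estimates a pair of 1D kernels for every pixel and interpolates video frames with separable, spatially adaptive convolution. The low memory footprint lets it synthesize the whole frame at once, which in turn makes it possible to train with a perceptual loss for better visual quality.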

ABSTRACT

Standard video frame interpolation methods first estimate optical flow between input frames and then synthesize an intermediate frame guided by motion. Recent approaches merge these two steps into a single convolution process by convolving input frames with spatially adaptive kernels that account for motion and re-sampling simultaneously. These methods require large kernels to handle large motion, which limits the number of pixels whose kernels can be estimated at once due to the large memory demand. To address this problem, this paper formulates frame interpolation as local separable convolution over input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. Our method develops a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. Since our method is able to estimate kernels and synthesizes the whole video frame at once, it allows for the incorporation of perceptual loss to train the neural network to produce visually pleasing frames. This deep neural network is trained end-to-end using widely available video data without any human annotation. Both qualitative and quantitative experiments show that our method provides a practical solution to high-quality video frame interpolation.

Motivation and Objectives

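Achieve high-quality, end-to-end frame interpolation without explicit optical flow estimation.
Reduce the memory and compute cost of the spatially adaptive kernels needed to handle large motion.
Propose a fully convolutional network that predicts separable 1D kernels for all pixels simultaneously.
Make it possible to incorporate a perceptual loss that improves the visual quality of interpolated frames.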

Proposed Method

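Replace each full 2D adaptive kernel with a pair of separable 1D kernels that approximates it for every output pixel.
Use a fully convolutional encoder-decoder network that predicts four 1D kernels per pixel (a vertical and a horizontal kernel for each of the two input frames).
Apply the predicted 1D kernels as local convolutions over the input frames and synthesize the intermediate frame in a single pass.
Train with either an L1 loss or a perceptual loss (VGG-based feature reconstruction) to improve sharpness and detail.
Handle image boundaries with replication padding, and use bilinear upsampling in the decoder to reduce checkerboard artifacts.
Experiment with the kernel size (51) and the number of pooling layers (5) to balance motion handling against the receptive field.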
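
The local convolution step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes grayscale frames and kernels that have already been predicted, the function name `separable_adaptive_interp` and the `(H, W, n)` kernel layout are hypothetical, and the explicit loops favour clarity over speed.

```python
import numpy as np

def separable_adaptive_interp(frame1, frame2, kv1, kh1, kv2, kh2):
    """Synthesize one frame from two grayscale frames with per-pixel 1D kernel pairs.

    frame1, frame2: (H, W) float arrays.
    kv*, kh*: (H, W, n) arrays holding the vertical and horizontal 1D kernel
    for every output pixel (n is assumed to be odd).
    """
    h, w = frame1.shape
    n = kv1.shape[-1]
    pad = n // 2
    # Replication padding so that border pixels also see a full n x n patch.
    p1 = np.pad(frame1, pad, mode="edge")
    p2 = np.pad(frame2, pad, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            patch1 = p1[y:y + n, x:x + n]
            patch2 = p2[y:y + n, x:x + n]
            # kv @ patch @ kh equals sum(outer(kv, kh) * patch),
            # so the n x n kernel is never materialized.
            out[y, x] = (kv1[y, x] @ patch1 @ kh1[y, x]
                         + kv2[y, x] @ patch2 @ kh2[y, x])
    return out

# Sanity check with hand-made kernels: a centred one-hot pair per frame,
# weighted 0.5 each, reduces to a plain average of the two frames.
h, w, n = 4, 5, 3
rng = np.random.default_rng(0)
f1, f2 = rng.random((h, w)), rng.random((h, w))
centre = np.zeros(n)
centre[n // 2] = 1.0
k = np.tile(centre, (h, w, 1))
out = separable_adaptive_interp(f1, f2, 0.5 * k, k, 0.5 * k, k)
```

In the method reviewed here the four kernel arrays come from the network; moving the mass of a kernel away from its centre is what lets a pixel draw its value from a displaced location in the input frame.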

Experimental Results

Research Questions

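RQ1: Can separable 1D kernels approximate full 2D spatially adaptive kernels for frame interpolation while reducing memory requirements?
RQ2: Does end-to-end training with a perceptual loss improve the perceptual quality of interpolated frames compared with a purely per-pixel loss?
RQ3: How does the proposed separable convolution approach compare with state-of-the-art optical-flow-based methods and AdaConv in quality and speed?
RQ4: Which kernel size and network architecture best handle large motion while still allowing full-frame synthesis at 1080p?
RQ5: Is the method robust to challenging conditions such as occlusion, motion discontinuities, and brightness changes?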
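
RQ1 can be made concrete with a small example. The sketch below covers only the most favourable case, a pure sub-pixel translation with bilinear resampling, where the 2D kernel is exactly the outer product of two 1D kernels. The helper `bilinear_shift_kernel_1d` is a hypothetical name introduced for illustration. General 2D kernels are not rank 1, so in practice the 1D pair is an approximation.

```python
import numpy as np

def bilinear_shift_kernel_1d(shift, n):
    """1D kernel of odd length n that samples at offset `shift` from the
    centre tap using linear interpolation weights."""
    k = np.zeros(n)
    centre = n // 2
    lo = int(np.floor(shift))
    frac = shift - lo
    k[centre + lo] += 1.0 - frac
    if frac > 0:
        k[centre + lo + 1] += frac
    return k

kv = bilinear_shift_kernel_1d(1.25, 7)    # vertical displacement
kh = bilinear_shift_kernel_1d(-2.5, 7)    # horizontal displacement
k2d = np.outer(kv, kh)                    # the 7 x 7 kernel this pair encodes

print(np.linalg.matrix_rank(k2d))         # 1: the 2D kernel is separable
print(k2d.size, kv.size + kh.size)        # 49 14
```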

Key Findings

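The separable 1D kernel approach reduces the number of values per kernel from n^2 to 2n, which makes full-frame 1080p interpolation possible in a single pass.
The L1 loss gives the strongest numerical results and achieves state-of-the-art results on Middlebury, particularly in regions with motion discontinuities.
Adding the perceptual loss (L_F) improves visual sharpness and high-frequency detail, as shown by the qualitative results and the user study.
The method is far faster than AdaConv on 1080p interpolation (more than 20x) and often produces visually preferable results.
Using bilinear upsampling in the decoder helps reduce the checkerboard artifacts associated with some other upsampling methods.
Quantitatively, MAE and SSIM are competitive with state-of-the-art methods, and the L1 model performs best overall in the held-out evaluation.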
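
The n^2 versus 2n saving can be checked with simple arithmetic. The sketch below counts kernel coefficients only. It assumes one kernel per output pixel for each of the two input frames, shared across colour channels, and it ignores network activations, so it should not be read as a measured memory figure. The function name is illustrative.

```python
def kernel_values_per_frame(height, width, n, separable):
    """Number of kernel coefficients needed to synthesize one output frame.

    Each output pixel needs one kernel per input frame (two frames).
    A full 2D kernel has n * n values; a separable pair has 2 * n.
    """
    per_kernel = 2 * n if separable else n * n
    return height * width * 2 * per_kernel

full = kernel_values_per_frame(1080, 1920, 51, separable=False)
sep = kernel_values_per_frame(1080, 1920, 51, separable=True)
print(full)        # 10786867200
print(sep)         # 423014400
print(full / sep)  # 25.5, i.e. n / 2 for n = 51
```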


This review was generated by AI and checked by a human editor.