QUICK REVIEW

[Paper Review] VRT: A Video Restoration Transformer

Jingyun Liang, Jiezhang Cao|arXiv (Cornell University)|Jan 28, 2022

Advanced Image Processing Techniques | Cited by 82

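One-line summary

VRT is a parallel, multi-scale Video Restoration Transformer that restores HQ frames from LQ sequences across multiple video restoration tasks, modeling long-range temporal dependencies with Temporal Mutual Self Attention and parallel warping.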

ABSTRACT

Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires to utilize temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle with this by exploiting a sliding window strategy or a recurrent architecture, which either is restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins ($\textbf{up to 2.16dB}$) on fourteen benchmark datasets.
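
The clip partitioning and shifting described in the abstract can be illustrated with a minimal sketch on frame indices only. The helper name `partition_clips`, the circular shift via `np.roll`, and the shift of one frame are assumptions made for illustration. The abstract states only that the sequence is divided into small clips and shifted for every other layer.

```python
import numpy as np

def partition_clips(frame_ids, clip_size=2, shift=0):
    """Group frame indices into clips of `clip_size` after a circular shift.

    Hypothetical helper: the circular shift is an assumption of this sketch.
    """
    ids = np.roll(np.asarray(frame_ids), -shift)
    if len(ids) % clip_size != 0:
        raise ValueError("number of frames must be divisible by clip_size")
    return ids.reshape(-1, clip_size).tolist()

frames = list(range(6))

# Unshifted layer: attention stays inside clips (0,1), (2,3), (4,5).
even_layer = partition_clips(frames, clip_size=2, shift=0)

# Shifted layer: clip boundaries move, so frames 1 and 2 now share a clip.
odd_layer = partition_clips(frames, clip_size=2, shift=1)

print(even_layer)  # [[0, 1], [2, 3], [4, 5]]
print(odd_layer)   # [[1, 2], [3, 4], [5, 0]]
```

Alternating the two partitions across layers lets information travel between clips that never share an attention window within a single layer.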

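Motivation and objectives

Improve video restoration by exploiting long-range temporal dependencies beyond what sliding-window and recurrent approaches can capture.
Propose a parallel, multi-scale framework that jointly extracts, aligns, and fuses features from multiple frames.
Develop mutual attention, which enables implicit motion estimation and feature fusion between frames.
Enable cross-clip interactions through sequence shifting to strengthen temporal modeling.
Demonstrate state-of-the-art performance across diverse video restoration tasks.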

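Proposed method

Introduce a multi-scale VRT in which every scale contains Temporal Mutual Self Attention (TMSA) and parallel warping modules.
Use mutual attention to jointly align and fuse reference and supporting frames, so that it acts as a soft warping mechanism.
Split the sequence into 2-frame clips that are processed in parallel, and apply TMSA with the sequence shifted in every other layer to enable cross-clip interactions.
Apply parallel warping at the end of each scale to fuse neighboring-frame information through flow-guided deformable alignment.
Train with the Charbonnier loss and reconstruct HQ frames from shallow and deep features via residual learning.
Process frames in parallel for long sequences, making long-range temporal modeling and deployment scalable.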
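
As a concrete illustration of the training objective named above, here is a minimal sketch of a Charbonnier loss in NumPy. The per-element mean reduction and the default `eps=1e-3` are assumptions of this sketch, not details stated in this review. The function name `charbonnier_loss` is likewise illustrative.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean of sqrt(diff^2 + eps^2).

    Assumption: per-element mean reduction; eps is an illustrative default.
    """
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# Identical inputs give a loss of eps rather than 0.
zero_case = charbonnier_loss([0.0, 0.0], [0.0, 0.0])

# For large errors the loss behaves like the absolute error.
large_case = charbonnier_loss([3.0], [0.0])

print(zero_case, large_case)
```

The `eps` term keeps the gradient finite at zero error, which is the usual reason for choosing this loss over a plain L1 penalty.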

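Experimental results

Research questions

RQ1: How does video restoration benefit from long-range temporal modeling that goes beyond sliding-window and recurrent architectures?
RQ2: Can a Transformer-based framework jointly extract, align, and fuse multi-frame features across multiple scales?
RQ3: Does mutual attention enable adaptive, robust motion estimation and feature warping without explicit optical-flow ground truth?
RQ4: How does VRT perform across a broad range of video restoration tasks, including super-resolution, deblurring, denoising, frame interpolation, and space-time video super-resolution?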

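Key findings

VRT achieves state-of-the-art performance on multiple video restoration tasks, with gains of up to 2.16 dB on benchmark datasets.
Unlike sliding-window and recurrent methods, VRT supports both parallel frame prediction and long-range temporal dependency modeling.
Mutual attention offers a soft, adaptive alternative to explicit motion warping for frame alignment and fusion.
VRT shows strong results on video super-resolution, deblurring, denoising, frame interpolation, and space-time video super-resolution across multiple datasets.
The model uses a multi-scale architecture with TMSA and parallel warping, and is competitive in parameter efficiency and runtime.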
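
To make the "soft warping" reading of mutual attention concrete, the following is a minimal single-head sketch in NumPy: queries come from the reference frame, keys and values from the supporting frame. The function name, the identity projection matrices, and the toy features are assumptions chosen so the result is easy to inspect. It is not the authors' implementation.

```python
import numpy as np

def mutual_attention(ref, sup, w_q, w_k, w_v):
    """Cross-frame attention: reference tokens query supporting tokens."""
    q = ref @ w_q                      # queries from the reference frame
    k = sup @ w_k                      # keys from the supporting frame
    v = sup @ w_v                      # values from the supporting frame
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ v, attn

# Toy setup (assumption): 4 tokens with well-separated features, and a
# supporting frame that is a shuffled copy of the reference frame.
ref = 10.0 * np.eye(4)
perm = np.array([2, 0, 3, 1])
sup = ref[perm]
identity = np.eye(4)

aligned, attn = mutual_attention(ref, sup, identity, identity, identity)

# Each reference token attends almost entirely to its matching supporting
# token, so the supporting features are pulled back into reference order.
print(attn.argmax(axis=1).tolist())
print(np.allclose(aligned, ref))
```

When the attention rows are nearly one-hot, as in this toy case, the operation reduces to picking one supporting token per reference token, which is what a hard warp does. Softer attention rows blend several candidates instead.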

This review was generated by AI and checked by a human editor.