QUICK REVIEW

[Paper Review] VRT: A Video Restoration Transformer

Jingyun Liang, Jiezhang Cao|arXiv (Cornell University)|Jan 28, 2022

Advanced Image Processing Techniques | Cited by 82

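One-Sentence Summary

VRT introduces a parallel, multi-scale video restoration Transformer that models long-range temporal dependencies through temporal mutual self attention (TMSA) and parallel warping, restoring high-quality frames from low-quality sequences across a range of video restoration tasks.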

ABSTRACT

Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires to utilize temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle with this by exploiting a sliding window strategy or a recurrent architecture, which either is restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins ($\textbf{up to 2.16dB}$) on fourteen benchmark datasets.
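
The clip partitioning and shifting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper name `partition_clips`, the clip size of 2, the shift of 1 frame, and the cyclic wrap-around at the sequence boundary are all assumptions made for illustration, and frames are represented by their integer indices rather than feature tensors.

```python
def partition_clips(frames, clip_size=2, shift=0):
    """Cyclically shift the frame sequence by `shift`, then split it into
    non-overlapping clips of `clip_size` frames (hypothetical helper)."""
    n = len(frames)
    if n % clip_size != 0:
        raise ValueError("sequence length must be divisible by clip_size")
    # Assumption: the shift wraps around, so the last clip pairs the final
    # frame with the first one.
    shifted = frames[shift:] + frames[:shift]
    return [shifted[i:i + clip_size] for i in range(0, n, clip_size)]


frames = list(range(6))  # frame indices 0..5

# Unshifted layer: frames interact only inside their own clip.
unshifted_clips = partition_clips(frames, clip_size=2, shift=0)

# Shifted layer: clip boundaries move by one frame, so frames that sat in
# different clips before (e.g. 1 and 2) now share a clip.
shifted_clips = partition_clips(frames, clip_size=2, shift=1)

print(unshifted_clips)  # [[0, 1], [2, 3], [4, 5]]
print(shifted_clips)    # [[1, 2], [3, 4], [5, 0]]
```

Alternating the two partitions across layers lets information travel between clips: after one unshifted and one shifted layer, frame 0 has indirectly exchanged information with frame 2 through frame 1.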

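Motivation and Objectives

Improve video restoration by exploiting long-range temporal dependencies beyond what sliding-window and recurrent methods capture.
Propose a parallel, multi-scale framework that jointly extracts, aligns, and fuses features from multiple frames.
Develop mutual attention for implicit motion estimation and feature fusion between frames.
Enable cross-clip interaction through sequence shifting to strengthen temporal modelling.
Demonstrate state-of-the-art performance on a range of video restoration tasks.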

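Proposed Method

Introduces a multi-scale VRT in which each scale contains Temporal Mutual Self Attention (TMSA) and Parallel Warping modules.
Uses mutual attention to perform joint alignment and fusion between the reference frame and the supporting frame, acting as a form of soft warping.
Applies TMSA by splitting the sequence into 2-frame clips, processing them in parallel, and shifting the sequence in alternating layers to enable cross-clip interaction.
Adds parallel warping at the end of each scale, fusing information from neighboring frames through flow-guided deformable alignment.
Trains with the Charbonnier loss and reconstructs high-quality frames from shallow and deep features via residual learning.
Processes the frames of long sequences in parallel, enabling scalable temporal modelling and deployment.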
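
Two of the ingredients listed above, mutual attention as soft warping and the Charbonnier loss, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the learned query/key/value projections are omitted, tokens are plain lists of floats, the loss is averaged over elements, and `eps=1e-3` is an assumed default.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mutual_attention(ref_tokens, sup_tokens):
    """Queries come from the reference frame; keys and values come from the
    supporting frame. Assumption: projections are identity for brevity."""
    d = len(ref_tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in ref_tokens:
        # Similarity of this reference token to every supporting token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in sup_tokens]
        weights = softmax(scores)
        # Weighted sum of supporting features: a soft, content-based warp.
        out.append([sum(w * v[j] for w, v in zip(weights, sup_tokens)) for j in range(d)])
    return out


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((pred - target)^2 + eps^2); eps is an assumed value."""
    terms = [math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)]
    return sum(terms) / len(terms)


# The reference token matches the first supporting token, so attention pulls
# that token's features across almost unchanged.
ref = [[10.0, 0.0]]
sup = [[10.0, 0.0], [0.0, 10.0]]
warped = mutual_attention(ref, sup)

loss_equal = charbonnier_loss([1.0], [1.0])            # equals eps when pred == target
loss_mixed = charbonnier_loss([0.0, 3.0], [0.0, 0.0])  # close to mean absolute error
```

Because the attention weights are computed from feature similarity rather than from an estimated flow field, a supporting pixel is "moved" to wherever it matches the reference best. The Charbonnier loss behaves like an L1 loss for large errors while staying smooth near zero.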

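Experimental Results

Research Questions

RQ1: How does video restoration benefit from long-range temporal modelling beyond sliding-window and recurrent architectures?
RQ2: Can a Transformer-based framework effectively extract, align, and fuse multi-frame features jointly across multiple scales?
RQ3: Can mutual attention achieve adaptive, robust motion estimation and feature warping without explicit optical-flow ground truth?
RQ4: How does VRT perform across video restoration tasks including super-resolution, deblurring, denoising, frame interpolation, and space-time super-resolution?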

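Key Findings

VRT achieves state-of-the-art performance on multiple video restoration tasks, with gains of up to 2.16 dB on benchmark datasets.
Unlike sliding-window and recurrent methods, VRT supports both parallel processing and long-range temporal dependency modelling.
Mutual attention provides a soft, adaptive alternative to explicit motion warping for frame alignment and fusion.
VRT shows strong results on video super-resolution, deblurring, denoising, frame interpolation, and space-time super-resolution across fourteen benchmark datasets.
The model uses a multi-scale architecture with TMSA and parallel warping and is competitive in parameter efficiency and runtime.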
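
To put the 2.16 dB figure in perspective, the short calculation below converts a gain in decibels into the implied reduction in mean squared error. It assumes the gain is measured in PSNR, which this summary does not state explicitly; the function names and the baseline MSE of 0.01 are illustrative choices, not values from the paper.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio MSE_new / MSE_old implied by a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(2.16)

# Sanity check with an arbitrary baseline error (hypothetical value).
baseline_mse = 0.01
gain = psnr(baseline_mse * ratio) - psnr(baseline_mse)

print(f"MSE ratio: {ratio:.3f}")
print(f"PSNR gain recovered: {gain:.2f} dB")
```

Under this assumption, a 2.16 dB improvement corresponds to roughly 39% lower mean squared error than the compared method on that benchmark.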

This review was generated by AI and reviewed by human editors.