QUICK REVIEW

[Paper Review] Video Super-Resolution Transformer

Jiezhang Cao, Yawei Li | arXiv (Cornell University) | Jun 12, 2021

Topic: Advanced Image Processing Techniques | References: 4 | Cited by: 132

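One-sentence summary

This paper proposes VSR-Transformer, a Transformer variant for video super-resolution that uses a spatial-temporal convolutional self-attention (STCSA) layer to capture locality and a bidirectional optical flow-based feed-forward (BOFF) layer to propagate and align features across frames.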

ABSTRACT

Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.

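Motivation and objectives

Improve VSR with a Transformer while addressing the Transformer's limitations in locality and cross-frame alignment.
Propose STCSA to exploit spatial-temporal locality in video frames.
Introduce BOFF to propagate and align features across frames using optical flow.
Demonstrate effectiveness on benchmark VSR datasets and compare against state-of-the-art methods.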

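Proposed method

Propose spatial-temporal convolutional self-attention (STCSA), which unfolds the input frames into local 3D patches and computes attention over the patches to capture locality.
Provide a theoretical analysis showing the advantage of STCSA over fully connected self-attention for learning k-pattern locality (Theorem 2).
Introduce a bidirectional optical flow-based feed-forward layer (BOFF), which uses forward and backward optical flow to warp features, then propagates and fuses them across frames in both directions.
Use a fixed 3D spatial-temporal positional encoding to retain position information in an otherwise permutation-invariant architecture.
Build a pipeline consisting of a feature extractor, a VSR-Transformer encoder, and a reconstruction network.
Train and evaluate on REDS4, Vimeo-90K-T, and Vid4 using the standard PSNR and SSIM metrics.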
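The patch-based attention idea behind STCSA can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-channel frames, non-overlapping p x p spatial patches, frame sizes divisible by p, and it reuses the raw patches as queries, keys, and values in place of the learned convolutional projections. The function names `unfold`, `fold`, and `patch_attention` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def unfold(frames, p):
    """Split every H x W frame into non-overlapping p x p patches (one flat token each)."""
    tokens = []
    for frame in frames:
        h, w = len(frame), len(frame[0])
        for i in range(0, h, p):
            for j in range(0, w, p):
                tokens.append([frame[i + a][j + b] for a in range(p) for b in range(p)])
    return tokens

def fold(tokens, t, h, w, p):
    """Inverse of unfold: write each token back to its patch location."""
    frames = [[[0.0] * w for _ in range(h)] for _ in range(t)]
    it = iter(tokens)
    for f in range(t):
        for i in range(0, h, p):
            for j in range(0, w, p):
                tok = next(it)
                for a in range(p):
                    for b in range(p):
                        frames[f][i + a][j + b] = tok[a * p + b]
    return frames

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def patch_attention(frames, p):
    """Scaled dot-product attention across the patches of all frames."""
    t, h, w = len(frames), len(frames[0]), len(frames[0][0])
    tokens = unfold(frames, p)  # assumption: stands in for Q, K and V
    d = p * p
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) for c in range(d)])
    return fold(out, t, h, w, p)
```

Because every token is a whole patch rather than a single pixel, each attention weight compares local neighbourhoods, and the tokens span all frames, so the attention is both spatial and temporal.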

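Experimental results

Research questions

RQ1: Compared with the conventional fully connected self-attention in Vision Transformer, does STCSA exploit the locality of video data effectively?
RQ2: Does the bidirectional optical flow-based feed-forward layer improve cross-frame feature propagation and alignment, and thereby VSR quality?
RQ3: What effect does adding the spatial-temporal positional encoding have on VSR performance?
RQ4: How does the proposed VSR-Transformer compare with state-of-the-art VSR methods on standard benchmarks?
RQ5: Does the model still deliver competitive VSR results while remaining scalable in parameter count?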
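The alignment step that RQ2 refers to rests on warping a neighbouring frame's features with an optical flow field. The sketch below shows that basic operation under stated assumptions: single-channel feature maps stored as nested lists, a flow stored as (dx, dy) per pixel, bilinear sampling with border clamping, and a plain average standing in for the learned fusion. The names `bilinear_sample`, `warp`, and `bidirectional_align` are hypothetical and do not come from the paper's code.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional location, clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom

def warp(feat, flow):
    """Backward warp: output[y][x] is feat sampled at (y + dy, x + dx), flow[y][x] = (dx, dy)."""
    h, w = len(feat), len(feat[0])
    return [[bilinear_sample(feat, y + flow[y][x][1], x + flow[y][x][0])
             for x in range(w)] for y in range(h)]

def bidirectional_align(prev_feat, next_feat, flow_to_prev, flow_to_next):
    """Warp both neighbours onto the current frame and average them.

    Assumption: the average is a placeholder for a learned fusion step.
    """
    from_prev = warp(prev_feat, flow_to_prev)
    from_next = warp(next_feat, flow_to_next)
    return [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(from_prev, from_next)]
```

With a zero flow the warp returns the input unchanged, which is a quick sanity check when wiring a flow estimator into such a layer.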

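Key findings

VSR-Transformer achieves the highest PSNR for 4x VSR on REDS4, with SSIM that is competitive among the compared baselines.
On Vimeo-90K-T, the method achieves strong PSNR/SSIM, surpassing several 7-frame baselines.
On Vid4 (Y channel), the method achieves the leading average performance among the reported methods.
The STCSA layer shows both theoretical and empirical advantages over fully connected self-attention (FCSA) in capturing locality.
BOFF enables effective feature propagation and cross-frame alignment, improving VSR performance.
A 64-channel model can outperform EDVR-L (128 channels), especially when few frames are available.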
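The PSNR figures behind these findings follow the usual definition, PSNR = 10 * log10(peak^2 / MSE). A minimal sketch is given below. The BT.601 luma conversion in `rgb_to_y` is a common convention for Y-channel evaluation in super-resolution work and is an assumption here, not something this summary confirms; both function names are hypothetical.

```python
import math

def rgb_to_y(r, g, b):
    """BT.601 luma for r, g, b in [0, 1]; the result lies on the 16 to 235 scale."""
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

For a Y-channel score, each RGB pixel of both the ground truth and the restored frame would first go through `rgb_to_y`, and the two resulting sequences would then be passed to `psnr`.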


This review was generated by AI and checked by a human editor.