QUICK REVIEW

[Paper Review] Recurrent Video Restoration Transformer with Guided Deformable Attention

Jingyun Liang, Yuchen Fan | arXiv (Cornell University) | Jun 5, 2022
Advanced Image Processing Techniques | Cited by 81
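One-Sentence Summary

This paper proposes RVRT, a video restoration Transformer that processes local frame clips in parallel within a globally recurrent framework and uses guided deformable attention for clip-to-clip alignment. It achieves state-of-the-art performance on video super-resolution, deblurring, and denoising while balancing model size and efficiency.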

ABSTRACT

Video restoration aims at restoring multiple high-quality frames from multiple low-quality frames. Existing video restoration methods generally fall into two extreme cases, i.e., they either restore all frames in parallel or restore the video frame by frame in a recurrent way, which would result in different merits and drawbacks. Typically, the former has the advantage of temporal information fusion. However, it suffers from large model size and intensive memory consumption; the latter has a relatively small model size as it shares parameters across frames; however, it lacks long-range dependency modeling ability and parallelizability. In this paper, we attempt to integrate the advantages of the two cases by proposing a recurrent video restoration transformer, namely RVRT. RVRT processes local neighboring frames in parallel within a globally recurrent framework which can achieve a good trade-off between model size, effectiveness, and efficiency. Specifically, RVRT divides the video into multiple clips and uses the previously inferred clip feature to estimate the subsequent clip feature. Within each clip, different frame features are jointly updated with implicit feature aggregation. Across different clips, the guided deformable attention is designed for clip-to-clip alignment, which predicts multiple relevant locations from the whole inferred clip and aggregates their features by the attention mechanism. Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime.
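The clip-wise recurrence described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `split_into_clips`, `propagate`, and `toy_update` are hypothetical names, frames are reduced to single numbers, the toy update merely stands in for the alignment and Transformer blocks, and only a single forward pass over the clips is shown.

```python
from typing import Callable, List, Optional, Sequence

Clip = List[float]


def split_into_clips(frames: Sequence[float], clip_len: int) -> List[Clip]:
    """Group consecutive frames into clips of at most `clip_len` frames."""
    if clip_len < 1:
        raise ValueError("clip_len must be positive")
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


def propagate(
    clips: Sequence[Clip],
    update_clip: Callable[[Clip, Optional[Clip]], Clip],
) -> List[Clip]:
    """Globally recurrent pass: each clip is updated from the previous clip's output."""
    outputs: List[Clip] = []
    prev: Optional[Clip] = None
    for clip in clips:
        # All frames of one clip are handled together by a single call.
        prev = update_clip(clip, prev)
        outputs.append(prev)
    return outputs


def toy_update(clip: Clip, prev: Optional[Clip]) -> Clip:
    """Hypothetical stand-in for alignment plus Transformer blocks (assumption)."""
    context = sum(prev) / len(prev) if prev else 0.0
    return [float(frame) + 0.5 * context for frame in clip]


clips = split_into_clips([1, 2, 3, 4, 5], clip_len=2)
features = propagate(clips, toy_update)
```

The loop over clips is sequential, which is the recurrent part, while each call to `update_clip` sees a whole clip at once, which is where parallel processing of neighboring frames would happen in a real model.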

Motivation and Objectives

  • Motivate a method that combines benefits of parallel and recurrent video restoration to balance performance, model size, and efficiency.
  • Develop a clip-based recurrent transformer that processes neighboring frames in parallel within a globally recurrent framework.
  • Design a one-stage clip-to-clip alignment mechanism to replace frame-by-frame or post-hoc fusion approaches.

Proposed Method

  • Introduce RVRT that divides videos into fixed-length clips and refines each clip's features using previously inferred clip features.
  • Within each clip, jointly update frame features using modified residual Swin Transformer blocks for implicit feature aggregation.
  • Propose guided deformable attention (GDA) for clip-to-clip alignment by predicting multiple relevant locations guided by optical flow and aggregating their features via dynamic attention weights.
  • Use optical-flow guided pre-alignment and a CNN to predict offsets for sampling locations, enabling one-stage video-to-video alignment.
  • Provide a multi-head/multi-group extension of GDA to balance computation and expressive power, with channel interaction through an MLP and residual connections.
  • Train with Charbonnier loss and leverage SpyNet-initialized optical flow to stabilize learning.
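The aggregation step of guided deformable attention listed above can be illustrated with a toy one-dimensional version. This is a rough sketch under several assumptions and should not be read as the paper's implementation: features live on a 1-D grid instead of a 2-D image, the offsets are passed in directly rather than predicted by a CNN, the sampled features serve as both keys and values without learned projections, and there is a single head with no MLP. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sample_linear(features: Sequence[Vector], pos: float) -> Vector:
    """Linearly interpolate a 1-D feature map at a fractional position (border-clamped)."""
    pos = min(max(pos, 0.0), len(features) - 1.0)
    left = int(math.floor(pos))
    right = min(left + 1, len(features) - 1)
    w = pos - left
    return [(1.0 - w) * a + w * b for a, b in zip(features[left], features[right])]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_deformable_attention(
    query: Vector,
    position: float,
    support_frames: Sequence[Sequence[Vector]],
    flows: Sequence[float],
    offsets: Sequence[Sequence[float]],
) -> Vector:
    """Attend over features sampled near flow-guided locations in every support frame."""
    sampled: List[Vector] = []
    for frame, flow, frame_offsets in zip(support_frames, flows, offsets):
        for off in frame_offsets:
            # Optical flow gives the coarse location; the offset refines it.
            sampled.append(sample_linear(frame, position + flow + off))
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in sampled]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, sampled)) for c in range(len(query))]


frame = [[0.0], [1.0], [2.0], [3.0]]
out = guided_deformable_attention(
    query=[0.0], position=1.0, support_frames=[frame], flows=[1.0], offsets=[[-0.5, 0.5]]
)
```

With a zero query every sampled location receives the same weight, so the output here is simply the average of the two interpolated samples. Because samples from all support frames enter one softmax, a single attention step covers the whole previous clip, which is the one-stage property the list describes.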

Experimental Results

Research Questions

  • RQ1: How can we fuse temporal information efficiently without incurring the large memory footprint of parallel transformers?
  • RQ2: Can clip-level parallel processing within a recurrent framework preserve long-range temporal dependencies?
  • RQ3: Does guided deformable attention enable effective clip-to-clip alignment for video restoration tasks?

Key Findings

  • RVRT achieves state-of-the-art performance on video restoration tasks across eight benchmark datasets for super-resolution, deblurring, and denoising.
  • Compared with a representative recurrent model BasicVSR++, RVRT improves PSNR by approximately 0.2–0.5 dB.
  • RVRT outperforms the transformer-based VRT on REDS4 and Vid4 by up to about 0.36 dB (PSNR).
  • RVRT uses less than half the parameters and memory of several parallel methods and reduces runtime by roughly 25% or more.
  • Ablation studies show clip length 2 provides a sweet spot, and GDA with optical-flow guidance and MLP channel interaction significantly boosts performance.
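To put the PSNR gains above in perspective, the short sketch below converts a gain in dB into the equivalent reduction in mean squared error, using the standard definition PSNR = 10 · log10(peak² / MSE). It assumes pixel values normalized to [0, 1] (`peak=1.0`) and does not reproduce the paper's evaluation protocol; the function names are hypothetical.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE must shrink to raise PSNR by `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)


baseline_mse = 0.001                      # illustrative value, equals 30 dB at peak=1.0
ratio = mse_ratio_for_gain(0.36)          # slightly above 0.92
improved = psnr(baseline_mse * ratio)     # baseline PSNR plus 0.36 dB
```

Under this definition a 0.36 dB gain corresponds to roughly 8% lower mean squared error, regardless of the baseline.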


This review was generated by AI and reviewed by human editors.