[Paper Review] Flow-Guided Sparse Transformer for Video Deblurring
FGST removes blur from video using a flow-guided sparse window-based Transformer and a recurrent embedding mechanism, and outperforms SOTA methods on the DVD and GOPRO datasets.
Exploiting similar and sharper scene patches in spatio-temporal neighborhoods is critical for video deblurring. However, CNN-based methods show limitations in capturing long-range dependencies and modeling non-local self-similarity. In this paper, we propose a novel framework, Flow-Guided Sparse Transformer (FGST), for video deblurring. In FGST, we customize a self-attention module, Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each $query$ element on the blurry reference frame, FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse yet highly related $key$ elements corresponding to the same scene patch in neighboring frames. Besides, we present a Recurrent Embedding (RE) mechanism to transfer information from past frames and strengthen long-range temporal dependencies. Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring. Code and pre-trained models are publicly available at https://github.com/linjing7/VR-Baseline
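Research Motivation and Objectives
- Frame video deblurring as a problem of exploiting similar, sharper scene patches in spatio-temporal neighborhoods, which requires long-range dependencies and non-local self-similarity.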
- Overcome CNN/standard Transformer limitations by introducing flow-guided attention.
- Capture long-range temporal dependencies via a recurrent embedding mechanism.
- Preserve original image information while exploiting motion cues for robust deblurring.
- Demonstrate state-of-the-art performance on DVD and GOPRO benchmarks.
Proposed Method
- Propose Flow-Guided Sparse Transformer (FGST) with Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA).
- Use optical flow to guide sampling of key elements across neighboring frames for each query, enabling globally sparse but highly relevant attention.
- Introduce Flow-Guided Sparse Multi-head Self-Attention (FGS-MSA) and its window-based extension, FGSW-MSA, which is more robust to flow inaccuracies.
- Integrate a Recurrent Embedding (RE) mechanism to propagate information from past frames and model long-range temporal dependencies.
- Adopt a U-Net-like encoder–bottleneck–decoder architecture built from Flow-Guided Attention Blocks (FGABs) with skip connections.
- Maintain computational efficiency by achieving near-linear complexity in the number of tokens via FGSW-MSA.
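The sampling idea in the list above can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names (`sample_key_positions`, `window_attention`), the nearest-pixel rounding, the clamping at frame borders, and the single-head attention are all hypothetical choices made for readability. It assumes each neighboring frame comes with a per-pixel flow field from the reference frame, and that all queries in one window share the union of the key positions their flows point to.

```python
import math

def sample_key_positions(window, flows, height, width):
    """Follow the optical flow of every query in `window` into each neighbor frame.

    window: list of (y, x) query positions on the reference frame.
    flows:  one flow field per neighbor frame; flows[t][y][x] = (dy, dx).
    Returns the shared key set as (frame_index, y, x) tuples.
    Assumption: positions are rounded to the nearest pixel and clamped to the frame.
    """
    keys = []
    for frame_idx, flow in enumerate(flows):
        for (y, x) in window:
            dy, dx = flow[y][x]
            ky = min(max(int(round(y + dy)), 0), height - 1)
            kx = min(max(int(round(x + dx)), 0), width - 1)
            keys.append((frame_idx, ky, kx))
    return keys

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def window_attention(query_feats, key_feats, value_feats):
    """Single-head scaled dot-product attention over the shared key set."""
    dim = len(query_feats[0])
    outputs = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in key_feats]
        weights = softmax(scores)
        channels = len(value_feats[0])
        outputs.append([sum(w * v[c] for w, v in zip(weights, value_feats))
                        for c in range(channels)])
    return outputs

# Usage: a 2x2 window on a 4x4 frame, one neighbor frame, constant flow (1, 2).
window = [(0, 0), (0, 1), (1, 0), (1, 1)]
flow = [[(1.0, 2.0) for _ in range(4)] for _ in range(4)]
print(sample_key_positions(window, [flow], height=4, width=4))
```

In this sketch each query attends to a fixed number of keys (window size times number of neighbor frames), independent of the frame resolution, which is the intuition behind the near-linear cost in the number of tokens.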
Experimental Results
Research Questions
- RQ1: Can a flow-guided attention mechanism effectively capture non-local self-similarity for video deblurring?
- RQ2: Does sampling key elements guided by optical flow improve robustness to motion and reduce artifacts compared to traditional pre-warping?
- RQ3: Does the recurrent embedding mechanism enhance long-range temporal dependencies in a Transformer-based deblurring model?
- RQ4: How does FGST compare to state-of-the-art methods on standard benchmarks (DVD and GOPRO) in terms of quality and efficiency?
- RQ5: What are the impacts of window size, flow estimators, and attention variants on performance?
Key Findings
- FGST outperforms prior SOTA methods on both the DVD and GOPRO datasets.
- On DVD, FGST surpasses the prior best ARVo by 0.56 dB in PSNR.
- On GOPRO, FGST exceeds Suin et al. by 0.80 dB and TSP by 1.23 dB in PSNR.
- Ablations show RE and FGSW-MSA jointly contribute large PSNR gains (up to about 1.72 dB when both are used).
- FGST with FGSW-MSA achieves stronger attention to similar but misaligned patches than baselines, improving restoration of fast motion blur.
- FGST demonstrates favorable efficiency, using fewer parameters and FLOPs while achieving higher PSNR/SSIM than several CNN-based and Transformer baselines.
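To put the dB figures above in perspective, the following sketch computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and converts a PSNR gain into the corresponding reduction in mean squared error. The helper names `psnr` and `mse_ratio_for_gain` are hypothetical, and the flat-list image representation and 8-bit peak value of 255 are assumptions for illustration; the benchmarks' exact evaluation protocol is not described in this review.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)

# Usage: relate the reported PSNR margins to relative MSE.
for gain in (0.56, 0.80, 1.23):
    print(f"+{gain:.2f} dB -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")
```

Because PSNR is logarithmic, a gain of a fraction of a dB still corresponds to a reduction in mean squared error of more than ten percent.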
This review was generated by AI and checked by a human editor.