QUICK REVIEW

[Paper Review] Video Prediction Transformers without Recurrence or Convolution

Yujin Tang, Qi Lü|arXiv (Cornell University)|Oct 7, 2024

Robotics and Automated Systems | Cited by 8

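One-Sentence Summary

The paper proposes PredFormer, a pure-Transformer framework for spatiotemporal predictive learning that is more efficient than CNN-based methods and converges faster on datasets such as Moving MNIST, TaxiBJ, and WeatherBench.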

ABSTRACT

Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released at https://github.com/yyyujintang/PredFormer.

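Motivation and Objectives

Motivate a recurrence-free, pure-Transformer approach to spatiotemporal predictive learning.
Systematically analyze factorized and interleaved spatiotemporal Transformer architectures.
Develop nine PredFormer variants and evaluate their performance across datasets.
Demonstrate state-of-the-art accuracy and efficiency relative to CNN-based models on multiple benchmarks.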

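Proposed Method

Adopt a pure Transformer architecture with patch embedding and 2D spatiotemporal sinusoidal positional encoding.
Introduce the Gated Transformer Block (GTB), which combines multi-head self-attention (MSA) with a SwiGLU-based feed-forward network for effective spatiotemporal modeling.
Explore a full-attention encoder, factorized encoders (spatial-first and temporal-first), and six interleaved architectures, for nine variants in total.
Provide multiple GTB-based PredFormer configurations at a fixed depth to allow fair parameter comparisons.
Evaluate accuracy with MSE/MAE/RMSE and SSIM, and efficiency across datasets with FPS, parameter count, and FLOPs.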
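
The Gated Transformer Block described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the pre-norm layout, the absence of biases and learnable norm parameters, the random weights, and all names (`gated_transformer_block`, `init_params`, the dimensions) are assumptions made for the example. It only shows how multi-head self-attention and a SwiGLU feed-forward network fit together with residual connections.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis; learnable scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    # x: (batch, tokens, dim)
    b, n, d = x.shape
    head_dim = d // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)

    def split_heads(t):
        # (b, n, d) -> (b, heads, n, head_dim)
        return t.reshape(b, n, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    out = softmax(scores) @ v
    out = out.transpose(0, 2, 1, 3).reshape(b, n, d)
    return out @ w_out

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: a SiLU-activated gate multiplies a linear "up" projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def gated_transformer_block(x, params, num_heads):
    # Assumed pre-norm residual layout: attention first, then the gated FFN.
    x = x + multi_head_self_attention(
        layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    x = x + swiglu_ffn(
        layer_norm(x), params["w_gate"], params["w_up"], params["w_down"])
    return x

def init_params(dim, hidden, rng, scale=0.02):
    # Hypothetical random initialization, for illustration only.
    shapes = {
        "w_qkv": (dim, 3 * dim), "w_out": (dim, dim),
        "w_gate": (dim, hidden), "w_up": (dim, hidden), "w_down": (hidden, dim),
    }
    return {name: rng.normal(scale=scale, size=shape)
            for name, shape in shapes.items()}

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16, 32))   # (batch, tokens, dim)
params = init_params(dim=32, hidden=64, rng=rng)
out = gated_transformer_block(tokens, params, num_heads=4)
print(out.shape)  # (2, 16, 32)
```

The block maps a token sequence to a sequence of the same shape, so blocks of this kind can be stacked or applied along different token axes without reshaping the feature dimension.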

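Experimental Results

Research Questions

RQ1: Can a pure Transformer architecture effectively learn spatiotemporal dependencies without recurrence or convolution?
RQ2: How do the factorization and interleaving of spatial and temporal attention affect performance on different datasets?
RQ3: What is the trade-off between accuracy and efficiency across the various PredFormer configurations?
RQ4: Do interleaved architectures offer robust improvements over full-attention and factorized encoders on both long-term and short-term prediction tasks?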
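
To make the distinction behind RQ2 and RQ4 concrete, the sketch below shows which token axis each kind of attention mixes. It is an illustration under stated assumptions, not the paper's code: the attention is parameter-free (queries, keys, and values are all the input itself), the helper names are hypothetical, and the pattern strings assume that the letters in names such as Triplet-TST denote temporal (T) and spatial (S) layers applied in order. For T frames of N patches each, spatial attention computes T score matrices of size N×N, temporal attention computes N matrices of size T×T, and full attention computes a single (T·N)×(T·N) matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Parameter-free attention over the second-to-last axis (q = k = v = x).
    # Leading axes are treated as batch dimensions by the batched matmul.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def spatial_attention(x):
    # x: (B, T, N, D). Patches attend to each other within the same frame.
    return self_attention(x)

def temporal_attention(x):
    # Swap to (B, N, T, D) so each patch location attends across frames.
    return np.swapaxes(self_attention(np.swapaxes(x, 1, 2)), 1, 2)

def full_attention(x):
    # Flatten frames and patches into one sequence of T * N tokens.
    b, t, n, d = x.shape
    return self_attention(x.reshape(b, t * n, d)).reshape(b, t, n, d)

LAYERS = {"S": spatial_attention, "T": temporal_attention}

def apply_pattern(x, pattern):
    # Apply layers in the order given by the pattern, with residuals.
    # "TS" is a factorized/binary-style order, "TST" a triplet-style order.
    for kind in pattern:
        x = x + LAYERS[kind](x)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 6, 8))   # (batch, frames, patches, dim)
print(apply_pattern(x, "TST").shape)  # (1, 4, 6, 8)
```

A single spatial layer never moves information between frames, and a single temporal layer never moves it between patch locations, which is why the order and interleaving of the two can matter.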

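Key Findings

PredFormer variants achieve state-of-the-art results on Moving MNIST, TaxiBJ, and WeatherBench compared with prior methods.
On Moving MNIST, when trained for 2000 epochs with a patch size of 4, PredFormer reduces MSE by 51.3% relative to SimVP.
On TaxiBJ, PredFormer reduces MSE by 33.1% and raises FPS from 533 to 2364.
On WeatherBench, PredFormer reduces MSE by 11.1% and raises FPS from 196 to 404.
Interleaved variants consistently outperform the full-attention and factorized encoders; across settings, Triplet-TST and Quadruplet-TSST often give the best results.
The Fac-T-S model achieves strong performance and notable efficiency gains (FPS up to 404) with far fewer parameters (5.3M), while still outperforming CNN-based baselines.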
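
The percentages and FPS figures above can be read as simple ratios. The short calculation below uses only the numbers quoted in the findings; the function and variable names are illustrative, and no absolute MSE values are assumed.

```python
def speedup(baseline_fps, new_fps):
    # Throughput ratio: how many times faster the new model runs.
    return new_fps / baseline_fps

def remaining_fraction(reduction_percent):
    # Fraction of the baseline MSE that remains after a relative reduction.
    return 1.0 - reduction_percent / 100.0

speedups = {
    "TaxiBJ": speedup(533, 2364),
    "WeatherBench": speedup(196, 404),
}
remaining_mse = {
    "Moving MNIST": remaining_fraction(51.3),
    "TaxiBJ": remaining_fraction(33.1),
    "WeatherBench": remaining_fraction(11.1),
}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x FPS")
# TaxiBJ: 4.44x FPS
# WeatherBench: 2.06x FPS

for name, value in remaining_mse.items():
    print(f"{name}: {value:.3f} of baseline MSE")
# Moving MNIST: 0.487 of baseline MSE
# TaxiBJ: 0.669 of baseline MSE
# WeatherBench: 0.889 of baseline MSE
```

In other words, the reported FPS gains correspond to roughly 4.4 times the baseline throughput on TaxiBJ and roughly 2.1 times on WeatherBench.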

This review was generated by AI and reviewed by a human editor.