QUICK REVIEW

[Paper Review] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Samuel, Dvir, Issar Tzachor|arXiv (Cornell University)|Feb 2, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

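One-line summary

This paper introduces a training-free attention framework (TempCache, AnnCA, AnnSA) that accelerates autoregressive video diffusion and world models by compressing the KV cache and sparsifying cross- and self-attention with approximate nearest neighbor (ANN) search, achieving 5–10× speedups with stable memory usage.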

ABSTRACT

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to x5--x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

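Motivation and objectives

Identify redundancy in the KV cache and attention layers of autoregressive video diffusion and world models.
Develop training-free methods that compress the KV cache and sparsify attention without degrading quality.
Enable stable long-horizon streaming video generation while reducing latency and memory usage.
Provide a plug-and-play framework compatible with existing backbones.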

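Proposed method

TempCache compresses the KV cache using temporal correspondence between frames.
AnnCA prunes the cross-attention prompt tokens for each frame using fast ANN matching.
AnnSA sparsifies self-attention by using ANN to restrict each query to semantically matched keys.
Attention is treated as an approximate nearest neighbor search problem, which reduces computation.
A redundancy-preserving grouping lemma shows that exact attention can be reproduced when duplicate keys are merged.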
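
The last two points can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names, the random-hyperplane hash standing in for an ANN index, and the specific merge rule (average the values of identical keys and add log(group size) to the logit) are assumptions chosen because they make the exactness property easy to check. The paper's own lemma and ANN backends may be formulated differently.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, bias=0.0, mask=None):
    """Softmax attention with an optional per-key logit bias and boolean mask."""
    logits = q @ k.T / np.sqrt(q.shape[1]) + bias
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)
    return softmax(logits) @ v

def merge_duplicate_keys(k, v):
    """Assumed merge rule: keep one key per group of identical keys, average
    the group's values, and return log(group size) as a logit bias."""
    uniq, inverse, counts = np.unique(
        k, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    v_sum = np.zeros((len(uniq), v.shape[1]))
    np.add.at(v_sum, inverse, v)              # sum values within each group
    return uniq, v_sum / counts[:, None], np.log(counts)

def lsh_mask(q, k, n_bits=3, seed=0):
    """Hypothetical stand-in for an ANN index: True where a query and a key
    fall into the same random-hyperplane hash bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((q.shape[1], n_bits))
    q_bits = (q @ planes) > 0
    k_bits = (k @ planes) > 0
    mask = (q_bits[:, None, :] == k_bits[None, :, :]).all(axis=-1)
    mask[~mask.any(axis=1)] = True            # empty bucket: use all keys
    return mask

rng = np.random.default_rng(0)
k = np.repeat(rng.standard_normal((4, 8)), 3, axis=0)  # each key appears 3 times
v = rng.standard_normal((12, 8))
q = rng.standard_normal((5, 8))

# Merging duplicates: 12 cached keys shrink to 4 with no change in the output.
full = attention(q, k, v)
k_m, v_m, log_n = merge_duplicate_keys(k, v)
merged = attention(q, k_m, v_m, bias=log_n)
max_merge_error = np.abs(full - merged).max()

# Bucketed sparse attention: each query only sees keys in its hash bucket.
mask = lsh_mask(q, k)
sparse = attention(q, k, v, mask=mask)
density = mask.mean()                          # fraction of query-key pairs kept
```

The merge is exact because a group of n identical keys contributes n·exp(logit) to both the numerator and the denominator of the softmax, which is the same as one key with logit + log n and the group's mean value. The bucketed variant, by contrast, is an approximation whose quality depends on how well the hash groups similar queries and keys.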

Figure 1 : Our method substantially accelerates pre-trained autoregressive video diffusion models and autoregressive world models while maintaining high visual quality, by introducing a new KV-cache compression with self- and cross-attention sparsification. On a single H100 GPU, it achieves $5\times

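Experimental results

Research questions

RQ1: Can attention in autoregressive video diffusion be accelerated without training, through KV-cache compression and sparse attention?
RQ2: How much can TempCache, AnnCA, and AnnSA reduce compute, memory, and latency without degrading visual quality?
RQ3: Do temporal redundancy and frame-specific prompt relevance generalize to long-horizon and world-model settings?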

Key findings

Method	PSNR	SSIM	LPIPS	VBench	Min Density	Max Recall	Total Speed
Dense (FlashAttention 3)	–	–	–	84.08	100%	100%	×1.0
TeaCache	16.12	0.315	0.523	84.11	93.2%	84.6%	×1.1
FlowCache	22.15	0.634	0.222	84.15	82.9%	86.9%	×2.3
TempCache-LSH (ours)	24.13	0.651	0.149	84.17	16.8%	90.2%	×6.8
TempCache-Quant (ours)	24.26	0.653	0.143	84.19	16.2%	91.4%	×6.9
AnnSA-LSH (ours)	25.73	0.688	0.142	83.25	27.6%	92.4%	×5.1
AnnSA-Quant (ours)	25.77	0.689	0.141	83.29	28.0%	92.6%	×5.2
AnnCA-LSH (ours)	25.68	0.679	0.155	83.23	33.1%	94.2%	×2.2
AnnCA-Quant (ours)	24.11	0.646	0.148	82.89	29.5%	91.1%	×2.3
All Ours-LSH	25.71	0.681	0.147	84.02	–	–	×10.7
All Ours-Quant	25.73	0.678	0.147	83.99	–	–	×10.8

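End-to-end speedups of up to 5–10× on long-horizon video generation.
Peak GPU memory stays nearly constant over long rollouts, whereas baseline memory usage keeps growing.
TempCache (LSH or Quant) stays close to the dense output (PSNR 24.13–24.26, SSIM 0.651–0.653, LPIPS 0.143–0.149) even at a cache density of only 16–17%.
AnnSA (×5.1–×5.2) and AnnCA (×2.2–×2.3) deliver substantial speedups while keeping VBench scores high (82.89–83.29, versus 84.08 for dense attention).
Combining cache compression with self- and cross-attention sparsity gives the strongest effect (×10.7–×10.8).
Qualitative results show better preservation of subject identity and temporal consistency, with fewer artifacts, than offline sparsity baselines.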

Figure 2 : Attention sparsity in autoregressive video diffusion. Attention recall vs. density on Rolling-Forcing (Liu et al., 2025b ) , averaged over transformer blocks (shaded: std). Density is induced by keeping only the highest-attention entries. This achieves high recall, e.g., $\approx$ 85% at
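
To make the recall-versus-density trade-off in the Figure 2 caption concrete, here is a minimal sketch on synthetic data. It assumes that "recall" means the share of softmax attention mass retained and that the top entries are selected per query row. Both are assumptions for illustration, as are the function name and the random logits; none of the numbers it produces come from the paper.

```python
import numpy as np

def recall_at_density(attn, density):
    """Mean fraction of attention mass kept when each query retains only
    its highest-weight keys, `density` being the share of keys kept."""
    n_keys = attn.shape[1]
    keep = max(1, int(round(density * n_keys)))   # keep at least one key
    top = np.sort(attn, axis=1)[:, -keep:]        # largest weights per row
    return top.sum(axis=1).mean()

rng = np.random.default_rng(0)
logits = 4.0 * rng.standard_normal((16, 256))     # synthetic, not model data
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)           # rows sum to 1

densities = (0.05, 0.1, 0.25, 0.5, 1.0)
curve = {d: recall_at_density(attn, d) for d in densities}
```

Peaked attention rows make the curve rise quickly at low density, which is the property that sparsification relies on. Flatter rows would need a higher density to reach the same recall.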


This review was generated by AI and checked by a human editor.