QUICK REVIEW

[Paper Review] CHAI: CacHe Attention Inference for text2video

Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj|arXiv (Cornell University)|Feb 18, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

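One-Sentence Summary

CHAI introduces Cache Attention to enable cross-inference caching for text-to-video diffusion; by reusing entity-level cached latents, it delivers substantial speedups with minimal quality loss.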

ABSTRACT

Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.

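Motivation and Objectives

Reduce text-to-video diffusion latency without retraining the model or large-scale re-engineering.
Explore cross-inference reuse at the entity level (objects/scenes) rather than at the level of the whole prompt.
Develop a training-free mechanism that injects cached information without degrading quality.
Demonstrate practical cache budgets and scalable cache management suitable for real deployments.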

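Proposed Method

Introduce Cache Attention, which uses cached latents as the attention key/value inputs while the queries remain the prompt-conditioned noise.
Use an entity extractor to identify entities in the prompt, and store their embeddings in a vector database linked to the latent cache.
Restrict cache use to denoising steps 2, 3, and 4 to balance latency against quality.
Build two diffusion modes on top of OpenSora 1.2: full (cache miss) and fast (cache hit).
Evaluate on VBench and VidProM, comparing against OpenSora 1.2, NIRVANA-VID, and AdaCache.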
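The Cache Attention idea in the method list can be sketched as cross-attention in which the queries come from the current noisy latent and the keys and values come from a cached latent. This is a minimal illustration and not the authors' implementation: the learned query/key/value projections of a real attention layer are omitted, latents are represented as plain lists of token vectors, and the names `cache_attention`, `CACHE_STEPS`, and `uses_cache` are hypothetical. The step set {2, 3, 4} is taken from the summary, with 1-based step numbering assumed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cache_attention(noisy_tokens, cached_tokens):
    """Attend from the noisy latent (queries) to a cached latent (keys and values).

    Assumption: identity projections, so each token vector is used directly
    as query, key, or value. A real layer would apply learned projections.
    """
    dim = len(noisy_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in noisy_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in cached_tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of the cached tokens.
        output.append([sum(w * value[j] for w, value in zip(weights, cached_tokens))
                       for j in range(dim)])
    return output


# Denoising steps at which the cache is consulted (1-based numbering assumed).
CACHE_STEPS = {2, 3, 4}


def uses_cache(step, cache_hit):
    """Return True when this denoising step should apply Cache Attention."""
    return cache_hit and step in CACHE_STEPS


noisy = [[1.0, 0.0], [0.0, 1.0]]    # two noisy latent tokens
cached = [[2.0, 3.0], [4.0, 1.0]]   # two cached latent tokens
mixed = cache_attention(noisy, cached)
```

Because the attention weights are non-negative and sum to one, every row of `mixed` lies between the cached token values, which is how cached content can steer the noisy latent without replacing it outright.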

Figure 1 : Feature distance between latents produced by adjacent denoising steps in a single text-to-video inference. The highlighted region indicates steps that are skipped by intra-inference caching approaches due to low degree of difference.

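Experimental Results

Research Questions

RQ1: Can Cache Attention maintain video quality while reducing denoising steps and latency?
RQ2: How does cross-inference caching perform under a limited cache budget, and how does it scale with cache size?
RQ3: How does CHAI compare with intra-inference caching baselines in latency and quality?
RQ4: Which cache management strategies achieve high hit rates under a limited memory budget?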

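Key Findings

CHAI achieves a 1.65x–3.35x end-to-end speedup over OpenSora 1.2 at cache hit rates of 52–100%, while maintaining video quality.
With 8 denoising steps, CHAI scores 0.7985 on VBench, 0.3% below the 30-step OpenSora 1.2 baseline, which is close to parity.
With a moderate storage budget (1–5 GB), CHAI reaches high cache hit rates (>80%).
With a constrained cache (only 10% of the full cache), entity-level reuse reaches a 52% hit rate and a 1.65x latency reduction on VidProM, outperforming whole-prompt reuse.
CHAI delivers better quality than NIRVANA-VID at equal or lower latency, and a higher VBench score than AdaCache at similar or lower latency.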

Figure 2 : Cache hit rate (%) vs. cache size on 2000 unseen VidProM prompts. Cached and unseen prompts show little overall similarity, but they share common entities and thus achieve a higher entity-similarity-based cache hit rate.
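The contrast in the Figure 2 caption, low whole-prompt similarity but shared entities, can be illustrated with a toy lookup. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a real text embedding model, the entity lists stand in for the output of an entity extractor, and the `THRESHOLD` value and example prompts are invented rather than taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, cached_keys):
    """Highest similarity between the query and any cached key."""
    return max(cosine(embed(query), embed(key)) for key in cached_keys)


# Hypothetical cache contents: one earlier prompt and its extracted entities.
cached_prompt = "a red fox runs through snowy woods at dawn"
cached_entities = ["red fox", "snowy woods"]

# Hypothetical new request and its extracted entities.
new_prompt = "slow aerial shot of a red fox sleeping beside a frozen lake"
new_entities = ["red fox", "frozen lake"]

THRESHOLD = 0.8  # assumed similarity cutoff for declaring a cache hit

# Whole-prompt lookup: the two prompts share only a few words, so this misses.
prompt_hit = best_match(new_prompt, [cached_prompt]) >= THRESHOLD

# Entity-level lookup: the shared entity still matches a cached entry.
entity_hits = [entity for entity in new_entities
               if best_match(entity, cached_entities) >= THRESHOLD]
```

In this toy case the whole-prompt lookup misses while the entity-level lookup still finds "red fox", which mirrors the caption's point that prompts with little overall similarity can share reusable entities.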

This review was generated by AI and reviewed by human editors.