[Paper Review] CHAI: CacHe Attention Inference for text2video
CHAI introduces Cache Attention to enable cross-inference caching for text-to-video diffusion, achieving significant speedups with minimal quality loss by reusing entity-level cached latents.
Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
Motivation & Objective
- Reduce latency in text-to-video diffusion without retraining or heavy engineering.
- Explore cross-inference reuse at the entity level (objects/scenes) rather than whole prompts.
- Develop a training-free mechanism to inject cached information without degrading quality.
- Demonstrate practical cache budgets and scalable cache management for real deployments.
Proposed method
- Introduce Cache Attention that uses cached latents as key/value inputs to attention; the query remains prompt-conditioned noise.
- Identify entities in prompts via an Entity Extractor and store embeddings in a vector DB linked to latent caches.
- Limit cache usage to the 2nd, 3rd, and 4th denoising steps to balance latency and quality.
- Build on OpenSora 1.2 with two diffusion modes: full (cache miss) and fast (cache hit).
- Evaluate across VBench and VidProM datasets with comparisons to OpenSora 1.2, NIRVANA-VID, and AdaCache.
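The core idea in the list above, cached latents serving as keys and values while the query comes from the prompt-conditioned noise, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and the function name `cache_attention` and the toy tensors are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cache_attention(query, cached_latents):
    """Scaled dot-product attention with cached latents as keys AND values.

    Assumption: no learned Q/K/V projections and a single head; the real
    model would apply these inside its transformer blocks.
    """
    d = len(query[0])
    out = []
    for q in query:
        # Similarity of this query token to every cached latent token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in cached_latents]
        weights = softmax(scores)
        # Weighted mix of cached latent tokens (the "values").
        out.append([sum(w * v[j] for w, v in zip(weights, cached_latents))
                    for j in range(d)])
    return out

# Hypothetical toy data: 2 query tokens (prompt-conditioned noise)
# attending over 3 cached latent tokens, each of dimension 2.
noisy_query = [[1.0, 0.0], [0.0, 1.0]]
cached = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cache_attention(noisy_query, cached)
```

Each output token is a convex combination of cached latent tokens, weighted toward the cached tokens most similar to the query. This is one way to read the "selective attention" described in the abstract.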

Experimental results
Research questions
- RQ1: Does Cache Attention preserve video quality while reducing denoising steps and latency?
- RQ2: How does cross-inference caching perform under constrained cache budgets and scale with cache size?
- RQ3: How does CHAI compare to intra-inference caching baselines in latency and quality?
- RQ4: What cache management strategies yield high hit rates with modest memory budgets?
Key findings
- CHAI achieves 1.65x–3.35x end-to-end speedup over OpenSora 1.2 at 52–100% cache hit rates while preserving video quality.
- With 8 denoising steps, CHAI attains a VBench score of 0.7985, 0.3% below the 30-step baseline OpenSora 1.2.
- CHAI reaches high cache hit rates (>80%) under modest storage budgets (1–5 GB).
- Under constrained caches (10% of full cache), entity-level reuse yields 52% hit rate and 1.65x latency reduction on VidProM, outperforming whole-prompt reuse.
- CHAI outperforms NIRVANA-VID in quality while maintaining lower latency, and surpasses AdaCache in VBench score at similar or lower latency.
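To see why entity-level reuse can hit more often than whole-prompt reuse with the same cached history, consider a toy sketch. It makes several assumptions not taken from the paper: entities are matched by exact string rather than by embedding similarity in a vector DB, a hit is counted when any entity of the new prompt is cached, and all prompts and entity lists are invented for illustration.

```python
def whole_prompt_hit(prompt, cached_prompts):
    # Whole-prompt reuse: hit only if the exact prompt was seen before.
    return prompt in cached_prompts

def entity_hit(entities, cached_entities):
    # Entity-level reuse (assumed rule): hit if at least one entity is cached.
    return any(e in cached_entities for e in entities)

# Hypothetical history of (prompt, extracted entities) already served.
history = [
    ("a dog running on a beach", ["dog", "beach"]),
    ("a cat sleeping on a sofa", ["cat", "sofa"]),
]
cached_prompts = {p for p, _ in history}
cached_entities = {e for _, ents in history for e in ents}

# Hypothetical incoming requests.
queries = [
    ("a dog sleeping on a sofa", ["dog", "sofa"]),
    ("a cat running on a beach", ["cat", "beach"]),
    ("a dog running on a beach", ["dog", "beach"]),
    ("a robot painting a wall", ["robot", "wall"]),
]

prompt_hits = sum(whole_prompt_hit(p, cached_prompts) for p, _ in queries)
entity_hits = sum(entity_hit(ents, cached_entities) for _, ents in queries)

prompt_hit_rate = prompt_hits / len(queries)
entity_hit_rate = entity_hits / len(queries)
print(prompt_hit_rate, entity_hit_rate)  # prints 0.25 0.75
```

In this toy setting, new prompts that recombine previously seen objects and scenes miss at the prompt level but hit at the entity level. That is the intuition behind the higher hit rate reported for entity-level reuse under a constrained cache.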

This review was created by AI and reviewed by human editors.