QUICK REVIEW

[Paper Review] CHAI: CacHe Attention Inference for text2video

Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj | arXiv (Cornell University) | Feb 18, 2026

Generative Adversarial Networks and Image Synthesis | 0 citations

TL;DR

CHAI introduces Cache Attention to enable cross-inference caching for text-to-video diffusion, achieving significant speedups with minimal quality loss by reusing entity-level cached latents.

ABSTRACT

Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.

Motivation & Objective

Reduce text-to-video diffusion latency without model retraining or heavy engineering.
Explore cross-inference reuse at the entity level (objects/scenes) rather than whole prompts.
Develop a training-free mechanism to inject cached information without degrading quality.
Demonstrate practical cache budgets and scalable cache management for real deployments.

Proposed method

Introduce Cache Attention that uses cached latents as key/value inputs to attention; the query remains prompt-conditioned noise.
Identify entities in prompts via an Entity Extractor and store embeddings in a vector DB linked to latent caches.
Limit cache usage to the 2nd, 3rd, and 4th denoising steps to balance latency and quality.
Build on OpenSora 1.2 with two diffusion modes: full (cache miss) and fast (cache hit).
Evaluate across VBench and VidProM datasets with comparisons to OpenSora 1.2, NIRVANA-VID, and AdaCache.
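
The first and third points above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes single-head scaled dot-product attention with no learned projections, 1-indexed denoising steps, and toy two-dimensional tokens. The names `cache_attention`, `uses_cache`, and `CACHE_STEPS` are hypothetical.

```python
import math

# Assumption: denoising steps are 1-indexed; the review says cached latents
# are used only at the 2nd, 3rd, and 4th steps.
CACHE_STEPS = {2, 3, 4}


def uses_cache(step):
    """Return True if this denoising step should attend to cached latents."""
    return step in CACHE_STEPS


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cache_attention(queries, cached_keys, cached_values):
    """Single-head scaled dot-product attention over cached tokens.

    queries: token vectors from the current prompt-conditioned noise.
    cached_keys / cached_values: token vectors from a cached latent.
    No learned projections are applied (illustrative simplification).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query token to every cached token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in cached_keys]
        weights = softmax(scores)
        # Weighted average of the cached value tokens.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, cached_values))
            for j in range(len(cached_values[0]))
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    cached = [[1.0, 0.0], [0.0, 1.0]]
    print(cache_attention(queries, cached, cached))
    print([step for step in range(1, 9) if uses_cache(step)])
```

In this toy setup the query is closer to the first cached token, so the output leans toward that token's value. A real model would apply this inside each attention layer with learned query, key, and value projections.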

Figure 1: Feature distance between latents produced by adjacent denoising steps in a single text-to-video inference. The highlighted region indicates steps that are skipped by intra-inference caching approaches due to low degree of difference.
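
The quantity plotted in Figure 1 can be approximated as follows. The review does not state which distance the paper uses, so this sketch assumes mean absolute difference over flattened latents. The function names and the threshold are hypothetical.

```python
def mean_abs_distance(a, b):
    """Mean absolute difference between two flattened latents (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("latents must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def adjacent_step_distances(latents):
    """Distance between each latent and the one from the previous step."""
    return [mean_abs_distance(prev, cur) for prev, cur in zip(latents, latents[1:])]


def low_change_steps(latents, threshold):
    """Indices into `latents` whose change from the previous step is below threshold.

    These are the steps an intra-inference caching scheme would be tempted to skip.
    """
    distances = adjacent_step_distances(latents)
    return [i + 1 for i, d in enumerate(distances) if d < threshold]


if __name__ == "__main__":
    toy_latents = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
    print(adjacent_step_distances(toy_latents))
    print(low_change_steps(toy_latents, threshold=0.5))
```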

Experimental results

Research questions

RQ1: Does Cache Attention preserve video quality while reducing denoising steps and latency?
RQ2: How does cross-inference caching perform under constrained cache budgets and scale with cache size?
RQ3: How does CHAI compare to intra-inference caching baselines in latency and quality?
RQ4: What cache management strategies yield high hit rates with modest memory budgets?

Key findings

CHAI achieves 1.65x–3.35x end-to-end speedup over OpenSora 1.2 at 52–100% cache hit rates while preserving video quality.
With 8 denoising steps, CHAI attains a VBench score of 0.7985, 0.3% below the 30-step baseline OpenSora 1.2.
CHAI reaches high cache hit rates (>80%) under modest storage budgets (1–5 GB).
Under a constrained cache (10% of the full cache), entity-level reuse yields a 52% hit rate and a 1.65x speedup on VidProM, outperforming whole-prompt reuse.
CHAI outperforms NIRVANA-VID in quality while maintaining lower latency, and surpasses AdaCache in VBench score at similar or lower latency.
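
A rough way to relate hit rate to speedup is to treat expected latency as a hit-rate-weighted mix of the fast (cache hit) and full (cache miss) paths. This model is an assumption for illustration, not something taken from the paper, and it ignores cache lookup overhead. It also reads the 3.35x figure as the 100% hit-rate case. Under those assumptions it predicts about 1.57x at a 52% hit rate, somewhat below the reported 1.65x, so it is only a first-order approximation.

```python
def expected_speedup(hit_rate, hit_speedup):
    """Expected speedup over the baseline under a simple two-path mixture model.

    hit_rate: fraction of requests served by the fast path (0.0 to 1.0).
    hit_speedup: speedup of the fast path alone over the full path.
    Assumption: latency = hit_rate * fast + (1 - hit_rate) * full.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    if hit_speedup <= 0:
        raise ValueError("hit_speedup must be positive")
    full_latency = 1.0  # normalized baseline latency
    fast_latency = full_latency / hit_speedup
    expected = hit_rate * fast_latency + (1.0 - hit_rate) * full_latency
    return full_latency / expected


if __name__ == "__main__":
    for rate in (0.52, 0.8, 1.0):
        print(rate, round(expected_speedup(rate, 3.35), 2))
```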

Figure 2: Cache hit rate (%) vs. cache size on 2000 unseen VidProM prompts. Cached and unseen prompts show little overall similarity, but they share common entities and thus achieve a higher entity-similarity-based cache hit rate.
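
The entity-similarity hit rate in Figure 2 can be pictured as a nearest-neighbor lookup over entity embeddings. The sketch below assumes cosine similarity with a fixed threshold of 0.9 and a plain dictionary standing in for the vector DB. The threshold, the toy embeddings, and the names `lookup` and `hit_rate` are hypothetical; the review does not describe the actual Entity Extractor or similarity rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup(query_vec, cache, threshold=0.9):
    """Return the key of the most similar cached entity, or None on a miss."""
    best_key, best_sim = None, threshold
    for key, vec in cache.items():
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best_key, best_sim = key, sim
    return best_key


def hit_rate(queries, cache, threshold=0.9):
    """Fraction of query embeddings that find a cached entity above threshold."""
    hits = sum(1 for q in queries if lookup(q, cache, threshold) is not None)
    return hits / len(queries)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for entity embeddings in a vector DB.
    cache = {"dog": [1.0, 0.0], "beach": [0.0, 1.0]}
    print(lookup([0.99, 0.05], cache))
    print(hit_rate([[1.0, 0.0], [1.0, 1.0]], cache))
```

On a hit, the returned key would index the stored latent that the fast diffusion mode reuses. On a miss, the system would fall back to the full mode.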

This review was created by AI and reviewed by human editors.