QUICK REVIEW

[Paper Review] Steering Video Diffusion Transformers with Massive Activations

Xianhang Cheng, Yujian Zheng | arXiv (Cornell University) | Mar 18, 2026

Advanced Neuroimaging Techniques and Applications | Cited by: 0

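One-Sentence Summary

This paper analyzes Massive Activations (MAs) in video diffusion transformers and introduces STAS, a training-free method that selectively steers MA values at first-frame and latent-boundary tokens to improve video quality and temporal coherence with almost no overhead.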

ABSTRACT

Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
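The positional hierarchy described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes tokens are flattened frame by frame, that the "boundary" is the first and last `edge` tokens of each latent frame, and that a dimension counts as massive when its peak magnitude exceeds `ratio` times the median magnitude. The names `massive_dims`, `classify_tokens`, `edge`, and `ratio` are hypothetical, and plain Python lists stand in for framework tensors.

```python
from statistics import median


def massive_dims(hidden, ratio=100.0):
    """Return feature dimensions whose peak |value| dwarfs the typical magnitude.

    hidden: list of token rows, each a list of floats (tokens x dims).
    ratio:  hypothetical threshold, not a value taken from the paper.
    """
    typical = median(abs(v) for row in hidden for v in row)
    num_dims = len(hidden[0])
    peaks = [max(abs(row[d]) for row in hidden) for d in range(num_dims)]
    return [d for d in range(num_dims) if peaks[d] > ratio * typical]


def classify_tokens(num_latent_frames, tokens_per_frame, edge):
    """Label every token as 'first', 'boundary', or 'interior'.

    Assumes frame-major token order; 'edge' tokens at the head and tail
    of each later latent frame are treated as boundary tokens.
    """
    labels = []
    for frame in range(num_latent_frames):
        for i in range(tokens_per_frame):
            if frame == 0:
                labels.append("first")
            elif i < edge or i >= tokens_per_frame - edge:
                labels.append("boundary")
            else:
                labels.append("interior")
    return labels


# Toy hidden states: dimension 2 carries the spikes.
toy_hidden = [[1.0, -1.0, 500.0, 1.0], [1.0, 1.0, -300.0, -1.0]]
toy_dims = massive_dims(toy_hidden)
toy_labels = classify_tokens(num_latent_frames=3, tokens_per_frame=4, edge=1)
```

In a real model the hidden states would come from a transformer block's output, and the three labels could be used to compare mean MA magnitude per group.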

Research Motivation and Objectives

Identify and characterize Massive Activations in video diffusion transformers across models and latent compression settings.
Understand how MA magnitudes relate to token positions and temporal structure in video generation.
Develop a training-free activation steering method that leverages MA structure to improve video quality and coherence.
Demonstrate STAS applicability across multiple backbones and its compatibility with existing training-free techniques.

Proposed Method

Systematically analyze MA patterns in video DiTs across WAN and CogVideo backbones and various temporal compression ratios.
Define STAS as a masked, self-guidance-like update that amplifies MA dimensions at structurally important tokens (first-frame and latent-boundary tokens) during early denoising steps.
Specify steering targets using a global-reference amplification rule based on the current layer’s MA maxima.
Apply STAS on top of CFG within a single forward pass, with negligible overhead.
Evaluate STAS with VBench metrics and frame-to-frame similarity measures (DINO/CLIP) across multiple backbones.
Perform ablations to isolate the effects of MA dimensions, target tokens, timestep window, and amplification rule.
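The steps above can be sketched as a masked update on one layer's hidden states. This is a rough illustration under stated assumptions, not the paper's exact rule: the reference is taken as `alpha` times the largest magnitude in each MA dimension across all tokens, selected tokens move toward that reference with weight `strength`, and steering is active only while `step < early_steps`. The function name `steer` and the parameters `alpha`, `strength`, and `early_steps`, including their default values, are hypothetical.

```python
import math


def steer(hidden, mask, ma_dims, step, early_steps=10, alpha=1.2, strength=1.0):
    """Return a steered copy of `hidden` (tokens x dims, lists of floats).

    mask:    one bool per token, True for first-frame and boundary tokens.
    ma_dims: indices of the massive-activation dimensions.
    All default values are illustrative assumptions.
    """
    out = [row[:] for row in hidden]
    if step >= early_steps:
        return out  # outside the early denoising window: no change

    for d in ma_dims:
        # Scaled global maximum magnitude for this dimension in this layer.
        reference = alpha * max(abs(row[d]) for row in hidden)
        for t, selected in enumerate(mask):
            if not selected:
                continue
            # Keep the original sign; move the value toward the reference.
            target = math.copysign(reference, hidden[t][d])
            out[t][d] = hidden[t][d] + strength * (target - hidden[t][d])
    return out


toy = [[1.0, 10.0], [2.0, -8.0], [3.0, 4.0]]
steered = steer(toy, mask=[True, True, False], ma_dims=[1], step=0, alpha=1.5)
```

With `strength=1.0` the selected tokens land exactly on the reference magnitude; smaller values interpolate, which is one way to read "self-guidance-like". Because the update only rewrites a few entries of an existing activation, it adds no extra model evaluation.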

Experimental Results

Research Questions

RQ1: What are the structural properties of Massive Activations in video diffusion transformers across models and compression ratios?
RQ2: Can a training-free activation steering method exploit MA structure to improve temporal coherence and visual quality without altering model parameters?
RQ3: How does STAS interact with existing training-free guidance methods (e.g., CFG) across diverse video DiT backbones?
RQ4: What is the impact of STAS on cross-chunk versus within-chunk temporal consistency and object–attribute bindings?
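One way to make the cross-chunk versus within-chunk question concrete is to split frame-to-frame similarity by whether two adjacent frames fall in the same latent chunk. The sketch below is an assumption about how such a measurement could be set up, not the paper's evaluation code: `embeddings` stands in for per-frame DINO or CLIP features, `chunk_ids` gives the latent chunk of each frame, and both names, like `consistency`, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency(embeddings, chunk_ids):
    """Mean adjacent-frame similarity, split into (within_chunk, cross_chunk)."""
    within, cross = [], []
    for i in range(len(embeddings) - 1):
        sim = cosine(embeddings[i], embeddings[i + 1])
        if chunk_ids[i] == chunk_ids[i + 1]:
            within.append(sim)
        else:
            cross.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(cross)


# Toy example: four frames in two chunks, with an abrupt change between chunks.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
within_sim, cross_sim = consistency(toy_embeddings, chunk_ids=[0, 0, 1, 1])
```

A large gap between the two means would indicate that discontinuities concentrate at chunk boundaries, which is the kind of effect RQ4 asks about.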

Key Findings

Model	Method	Subject Consistency	Background Consistency	Aesthetic Quality	Imaging Quality	Quality Score	Semantic Score	Total Score
Wan2.1-1.3B	Vanilla	94.63	95.81	61.91	68.14	81.81	79.70	81.39
Wan2.1-1.3B	+Ours	95.00	95.93	62.03	68.95	82.03	80.66	81.76
CogVideoX-5B	Vanilla	93.40	95.29	59.98	64.62	79.78	77.59	79.34
CogVideoX-5B	+Ours	93.80	95.47	60.31	65.12	79.95	78.24	79.61
Wan2.2-5B	Vanilla	95.13	96.63	61.67	69.02	81.75	81.68	81.74
Wan2.2-5B	+Ours	95.37	96.70	61.72	69.39	81.82	82.35	81.93
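To read the table at a glance, the per-model gains can be computed directly from the reported numbers. The snippet below copies the Quality, Semantic, and Total scores from the table; the dictionary layout and the `gains` helper are only an illustrative convenience.

```python
# (Quality Score, Semantic Score, Total Score) copied from the table above.
scores = {
    "Wan2.1-1.3B": {"Vanilla": (81.81, 79.70, 81.39), "+Ours": (82.03, 80.66, 81.76)},
    "CogVideoX-5B": {"Vanilla": (79.78, 77.59, 79.34), "+Ours": (79.95, 78.24, 79.61)},
    "Wan2.2-5B": {"Vanilla": (81.75, 81.68, 81.74), "+Ours": (81.82, 82.35, 81.93)},
}


def gains(table):
    """Return +Ours minus Vanilla per model, rounded to two decimals."""
    return {
        model: tuple(round(ours - base, 2)
                     for base, ours in zip(rows["Vanilla"], rows["+Ours"]))
        for model, rows in table.items()
    }


deltas = gains(scores)
for model, (quality, semantic, total) in deltas.items():
    print(f"{model}: quality +{quality:.2f}, semantic +{semantic:.2f}, total +{total:.2f}")
```

The Total Score gains come out to 0.37, 0.27, and 0.19 points for Wan2.1-1.3B, CogVideoX-5B, and Wan2.2-5B respectively, with the Semantic Score showing the largest improvement in each case (0.96, 0.65, and 0.67).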

MAs in video DiTs show a consistent positional pattern: first-frame tokens have the largest activations and latent-frame boundaries exhibit periodic spikes aligned with temporal compression.
STAS improves video quality and temporal coherence with negligible overhead by selectively steering MA values at first-frame and boundary tokens during early denoising steps.
STAS yields consistent gains across Wan2.1-1.3B, CogVideoX-5B, and Wan2.2-5B backbones, in both quality and semantic metrics.
When combined with CFG, STAS further boosts quality metrics and temporal stability across multiple baselines.
Ablations show that steering MA dimensions, targeting first-frame plus boundary tokens, and using a max-based amplification rule are all crucial for effectiveness.

This review was generated by AI and reviewed by a human editor.