QUICK REVIEW

[Paper Review] Demystifying Video Reasoning

Ruisi Wang, Zhongang Cai|arXiv (Cornell University)|Mar 17, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

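One-Sentence Summary

The paper argues that diffusion-based video models reason along the diffusion denoising steps (Chain-of-Steps, CoS) rather than across frames (Chain-of-Frames, CoF). It also documents emergent reasoning behaviors and presents a training-free latent trajectory ensembling method that improves reasoning performance.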

ABSTRACT

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

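Motivation and Objectives

Investigate the internal mechanism of reasoning in diffusion-based video generation models.
Test whether video reasoning follows a Chain-of-Frames (CoF) or a Chain-of-Steps (CoS) process.
Identify emergent reasoning behaviors and the role diffusion steps play in shaping reasoning quality.
Explore practical ways to improve video reasoning performance without additional training.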

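Proposed Method

Analyze the intermediate latent states at each diffusion step to visualize how semantic decisions form as denoising progresses.
Run noise perturbation experiments to find where reasoning is most vulnerable (perturbing a diffusion step versus perturbing a frame).
Perform a layer-wise mechanistic analysis of the Diffusion Transformer to locate where perception, reasoning, and consolidation occur.
Propose a one-shot, training-free ensemble of latent trajectories across seeds, which merges the latent representations from multiple runs to improve reasoning.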
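
The seed-ensembling idea can be illustrated with a toy sketch. The summary does not say how or at which step the paper merges trajectories, so everything here is an assumption made for illustration: the element-wise mean as the merge rule, the `merge_step` parameter, and the `toy_denoise_step` function, which is a hypothetical stand-in for a real video diffusion model.

```python
import random


def toy_denoise_step(latent, step, num_steps, rng):
    """Hypothetical stand-in for one denoising step (not a real model).

    Each value moves halfway toward a fixed target, plus seed-dependent
    noise that shrinks to zero by the final step.
    """
    target = 1.0
    noise_scale = 1.0 - (step + 1) / num_steps
    return [x + 0.5 * (target - x) + noise_scale * rng.gauss(0.0, 0.1)
            for x in latent]


def ensemble_latents(latents):
    """Element-wise mean of several equally sized latents (assumed merge rule)."""
    count = len(latents)
    return [sum(values) / count for values in zip(*latents)]


def generate_with_ensemble(seeds, size, num_steps, merge_step):
    """Run one trajectory per seed, merge once, then finish a single trajectory."""
    partial = []
    for seed in seeds:
        rng = random.Random(seed)
        latent = [rng.gauss(0.0, 1.0) for _ in range(size)]  # initial noise
        for step in range(merge_step):
            latent = toy_denoise_step(latent, step, num_steps, rng)
        partial.append(latent)

    # One-shot merge of the per-seed latents (assumption: plain averaging).
    merged = ensemble_latents(partial)

    # Continue denoising the merged latent to the end.
    rng = random.Random(seeds[0])
    for step in range(merge_step, num_steps):
        merged = toy_denoise_step(merged, step, num_steps, rng)
    return merged


final_latent = generate_with_ensemble(seeds=[0, 1, 2], size=4, num_steps=10, merge_step=3)
```

Merging early in the trajectory is a choice made for this sketch, motivated by the observation that early steps explore several candidate solutions. The paper may merge at a different point or with a different rule.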

Figure 1: Chain-of-Steps. We discover that video reasoning occurs along the diffusion steps with surprising emergent behaviors such as making multiple possible moves (e.g., paths) simultaneously at early steps, gradually pruning suboptimal choices during middle steps, and reaching a final decisio

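Experimental Results

Research Questions

RQ1: Does video reasoning in diffusion models occur primarily along the diffusion steps or across frames?
RQ2: Which emergent behaviors (memory, self-correction, perception before action, and so on) appear in the video reasoning of diffusion models?
RQ3: How are reasoning-related representations organized across the layers of the Diffusion Transformer?
RQ4: Can training-free latent trajectory ensembling improve reasoning performance?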

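Key Findings

Reasoning emerges primarily along the diffusion denoising steps (Chain-of-Steps) rather than across frames (Chain-of-Frames).
Early diffusion steps produce multiple candidate hypotheses, and later steps converge to a final solution.
Emergent behaviors include working memory, self-correction and enhancement, and perception before action.
Layer-wise analysis shows that early layers handle perception, middle layers drive reasoning, and later layers consolidate representations.
A simple training-free ensemble of latent trajectories (multiple seeds) yields a measurable performance gain on VBVR-Bench.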
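
The step-versus-frame probing behind the first finding can be set up roughly as follows. This is a minimal sketch under stated assumptions, not the paper's protocol: `toy_step` is a hypothetical deterministic stand-in for a denoising step, the noise scale is arbitrary, and "step perturbation" is assumed to mean corrupting every frame at one step while "frame perturbation" means corrupting one frame at every step. The toy dynamics do not reproduce the paper's result; the sketch only shows how the two perturbation types differ and how their effect on the final latent could be compared.

```python
import random


def toy_step(latent):
    """Hypothetical denoising step: every value moves halfway toward 1.0."""
    return [[x + 0.5 * (1.0 - x) for x in frame] for frame in latent]


def add_noise(frame, rng, scale):
    """Add Gaussian noise to one frame's latent values."""
    return [x + rng.gauss(0.0, scale) for x in frame]


def run(num_frames, size, num_steps, perturb_step=None, perturb_frame=None,
        scale=1.0, seed=0):
    """Denoise a toy latent of shape (num_frames, size) with optional perturbation."""
    init_rng = random.Random(seed)       # same initial noise for every run
    noise_rng = random.Random(seed + 1)  # separate stream for perturbations
    latent = [[init_rng.gauss(0.0, 1.0) for _ in range(size)]
              for _ in range(num_frames)]
    for step in range(num_steps):
        if perturb_step == step:
            # Step perturbation: corrupt all frames at this single step.
            latent = [add_noise(frame, noise_rng, scale) for frame in latent]
        if perturb_frame is not None:
            # Frame perturbation: corrupt one frame at every step.
            latent[perturb_frame] = add_noise(latent[perturb_frame], noise_rng, scale)
        latent = toy_step(latent)
    return latent


def deviation(a, b):
    """Mean absolute difference between two latents of the same shape."""
    diffs = [abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


baseline = run(num_frames=4, size=3, num_steps=8)
step_run = run(num_frames=4, size=3, num_steps=8, perturb_step=2)
frame_run = run(num_frames=4, size=3, num_steps=8, perturb_frame=1)
print("step perturbation deviation:", deviation(baseline, step_run))
print("frame perturbation deviation:", deviation(baseline, frame_run))
```

With a real model, the comparison would use a task-level score on the decoded video rather than a raw latent distance, which is used here only to keep the example self-contained.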

Figure 2: Chain-of-Steps elicits reasoning along the diffusion process. We observe that video reasoning models explore multiple possible solutions simultaneously in the early denoising steps before converging to a final outcome in later steps. Specifically, we observe: (a) two potential routes (cya

This review was generated by AI and reviewed by human editors.