QUICK REVIEW

[Paper Review] Demystifying Video Reasoning

Ruisi Wang, Zhongang Cai | arXiv (Cornell University) | Mar 17, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

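One-line summary

The paper shows that diffusion-based video models reason along the diffusion denoising steps (Chain-of-Steps) rather than across frames (Chain-of-Frames), that they exhibit emergent reasoning behaviors, and that a training-free latent ensembling method improves reasoning performance.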

ABSTRACT

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

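Motivation and objectives

Investigate the internal mechanism of reasoning in diffusion-based video generation models.
Test whether video reasoning follows Chain-of-Frames (CoF) or Chain-of-Steps (CoS).
Identify the emergent reasoning behaviors and the role that diffusion steps play in shaping reasoning quality.
Explore a practical, training-free strategy for improving video reasoning performance.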

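Proposed method

Analyze the intermediate latent states at each diffusion step and visualize how semantic decisions form as denoising progresses.
Run noise perturbation experiments to find where reasoning is most sensitive (perturbing a step vs. perturbing a frame).
Perform a layer-wise mechanistic analysis of the Diffusion Transformer to locate where perception, reasoning, and consolidation occur.
Propose a training-free latent trajectory ensemble across multiple random seeds, merging the latents from several runs to improve reasoning.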
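The step-versus-frame perturbation comparison can be illustrated with a small sketch. This review does not give the paper's exact protocol, so the code below rests on assumptions: "step perturbation" is taken to mean adding Gaussian noise to every frame at one denoising step, and "frame perturbation" to mean adding noise to one frame at every step. The names `inject_noise`, `toy_denoise`, and `final_error` are hypothetical, and the denoiser is a toy stand-in, so the printed numbers say nothing about the paper's results.

```python
import random

def inject_noise(latent, step, mode, target, sigma, rng):
    """Return a noised copy of `latent` (a list of frames, each a list of floats).

    mode="step":  perturb every frame, but only when step == target.
    mode="frame": perturb only frame index `target`, at every step.
    (Both definitions are assumptions made for this sketch.)
    """
    out = []
    for f, frame in enumerate(latent):
        hit = (mode == "step" and step == target) or (mode == "frame" and f == target)
        if hit:
            out.append([x + rng.gauss(0.0, sigma) for x in frame])
        else:
            out.append(list(frame))
    return out

def toy_denoise(latent, goal, rate=0.5):
    # Hypothetical stand-in for a real denoiser: move each value part-way toward `goal`.
    return [[x + rate * (g - x) for x, g in zip(fr, gfr)]
            for fr, gfr in zip(latent, goal)]

def final_error(goal, num_steps, mode=None, target=None, sigma=1.0, seed=0):
    """Run the toy denoising loop and return the summed absolute error to `goal`."""
    rng = random.Random(seed)
    latent = [[rng.gauss(0.0, 1.0) for _ in fr] for fr in goal]
    for step in range(num_steps):
        if mode is not None:
            latent = inject_noise(latent, step, mode, target, sigma, rng)
        latent = toy_denoise(latent, goal)
    return sum(abs(x - g) for fr, gfr in zip(latent, goal) for x, g in zip(fr, gfr))

goal = [[1.0, -1.0] for _ in range(4)]  # 4 frames, 2 latent values each
clean = final_error(goal, num_steps=8)
step_hit = final_error(goal, num_steps=8, mode="step", target=2)
frame_hit = final_error(goal, num_steps=8, mode="frame", target=1)
print(clean, step_hit, frame_hit)
```

With a real video model, `toy_denoise` would be replaced by one sampler step and `final_error` by a task-level score, so that the sensitivity to each perturbation type can be compared.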

Figure 1 : Chain-of-Steps. We discover that video reasoning occurs along the diffusion steps with surprising emergent behaviors such as making multiple possible moves ( e.g. , paths) simultaneously at early steps, gradually pruning suboptimal choices during middle steps, and reaching a final decisio

Experimental results

Research questions

RQ1: Does video reasoning in diffusion models arise mainly along the diffusion steps, or across frames?

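Key findings

Reasoning emerges primarily along the diffusion denoising steps (Chain-of-Steps), not across frames (Chain-of-Frames).
Early diffusion steps materialize multiple candidate hypotheses, and later steps narrow them down to a final answer.
Emergent behaviors include working memory, self-correction and enhancement, and perception before action.
Layer-wise analysis shows that early layers handle perception, middle layers carry out reasoning, and later layers consolidate the representation.
The training-free, multi-seed latent trajectory ensemble yields measurable performance gains on VBVR-Bench.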
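The multi-seed ensemble idea can be sketched as follows. The review does not state how the paper merges trajectories, so this is only one plausible reading: each seed starts from its own noise, all trajectories run the same denoiser, and at chosen steps the latents are replaced by their element-wise mean. The functions `init_latent`, `mean_latent`, and `ensemble_denoise`, the `merge_steps` parameter, and the halving denoiser in the usage lines are all hypothetical.

```python
import random

def init_latent(seed, dim):
    """Draw one flat Gaussian latent of length `dim` from a seed-specific generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def mean_latent(latents):
    """Element-wise mean over a list of equally sized flat latents."""
    n = len(latents)
    return [sum(vals) / n for vals in zip(*latents)]

def ensemble_denoise(denoise_step, seeds, dim, num_steps, merge_steps):
    """Run one trajectory per seed and average the latents at `merge_steps`.

    `denoise_step(latent, step)` stands in for one sampler step of the same model.
    Averaging as the merge rule is an assumption of this sketch.
    """
    trajs = [init_latent(s, dim) for s in seeds]
    for step in range(num_steps):
        trajs = [denoise_step(z, step) for z in trajs]
        if step in merge_steps:
            merged = mean_latent(trajs)
            # Every trajectory continues from the shared, merged latent.
            trajs = [list(merged) for _ in trajs]
    return mean_latent(trajs)

# Usage with a toy denoiser that simply halves every value.
halve = lambda z, step: [0.5 * x for x in z]
result = ensemble_denoise(halve, seeds=[0, 1, 2], dim=4, num_steps=6, merge_steps={1})
print(result)
```

Merging early matches the finding that early steps hold several candidate solutions, but where and how often to merge is a design choice this sketch leaves open.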

Figure 2 : Chain-of-Steps elicits reasoning along the diffusion process. We observe that video reasoning models explore multiple possible solutions simultaneously in the early denoising steps before converging to a final outcome in later steps. Specifically, we observe: (a) two potential routes (cya


This review was generated by AI and checked by a human editor.