QUICK REVIEW

[Paper Review] What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

Xinyu Zhang|arXiv (Cornell University)|Mar 23, 2026

Reinforcement Learning in Robotics|Cited by 0

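One-Sentence Summary

Studying two world-model architectures, IRIS and DIAMOND, on Atari Breakout and Pong, the paper uses linear/MLP probes, causal interventions, attention analysis, and token ablation to show that the learned internal representations of game state are approximately linear and functionally used, and that IRIS attention is spatially organized.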

ABSTRACT

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques--including linear and nonlinear probing, causal interventions, and attention analysis--to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions--shifting hidden states along probe-derived directions--produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

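Motivation and Goals

Investigate what latent representations world models learn when they predict future observations in reinforcement learning.
Assess whether these representations encode core game-state variables (positions, scores) in a linearly decodable form.
Assess whether the representations are causally involved in the model's predictions rather than merely correlated with them.
Examine the architectural differences between IRIS and DIAMOND in where and how linear representations emerge.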

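Proposed Method

Apply linear and MLP probes to the frozen hidden representations of IRIS and DIAMOND at each layer.
Use Ridge and MLP probes (5-fold cross-validation) to measure how linearly the game-state variables are encoded (R^2).
Perform causal interventions by perturbing hidden states along probe-derived directions and measuring the change in next-token predictions.
Analyze the spatial specialization of IRIS attention heads and run multi-baseline token ablation to assess token importance.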
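
The probing step above can be sketched as follows. This is a minimal illustration, not the authors' code: it fits a closed-form ridge probe on frozen activations and reports held-out R^2 over 5 folds. The activations `H`, the target `ball_x`, the regularization strength `alpha`, and all function names are assumptions made for the example, and the data is synthetic. The MLP probe that the paper compares against is omitted.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def probe_cv_r2(H, y, n_folds=5, alpha=1.0, seed=0):
    """Mean and std of held-out R^2 for a linear probe on frozen activations H."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        w, b = ridge_fit(H[train], y[train], alpha)
        scores.append(r2_score(y[test], H[test] @ w + b))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in for layer activations (assumption: 500 frames, 32 dims)
# with one state variable encoded linearly plus a little noise.
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 32))
ball_x = H @ rng.normal(size=32) + 0.1 * rng.normal(size=500)

mean_r2, std_r2 = probe_cv_r2(H, ball_x)
# Control in the spirit of the "Shuffled labels" baseline: break the H-to-label link.
shuffled_r2, _ = probe_cv_r2(H, rng.permutation(ball_x))
print(f"linear probe R^2 = {mean_r2:.3f} +/- {std_r2:.3f}; shuffled = {shuffled_r2:.3f}")
```

Held-out R^2 can fall below zero when a probe generalizes worse than predicting the mean, which is how control rows in a probing table can report negative values.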

Figure 1: Probe $R^{2}$ across layers (in network data-flow order) for IRIS (left) and DIAMOND (right) on Breakout (top) and Pong (bottom). Each line tracks one game-state property; shaded bands show $\pm$ 1 std over 5-fold CV. IRIS representations are flat across transformer layers, while DIAMOND s

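Experimental Results

Research Questions

RQ1: Do world models develop linearly decodable representations of game-state variables?
RQ2: Are these representations functionally used in prediction, rather than merely correlated with it?
RQ3: How do the two architectures (IRIS and DIAMOND) differ in where and how linear representations emerge?
RQ4: Which spatial regions (tokens) and attention heads matter most for tracking game objects?

Main Findings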

Representation	ball_x	ball_y	player_x	score
Random model	-1.21	-1.22	-1.14	-1.18
Shuffled labels	-0.51	-0.49	-0.53	-0.52
Raw pixels	-1.31	-0.48	0.9989±0.0006	0.9998±0.0001
IRIS (Linear)	0.85±0.006	0.58±0.03	0.9994±0.0001	1.0000±0.0000
IRIS (MLP)	0.91±0.005	0.59±0.03	0.9987±0.0002	0.9999±0.0000
Δ_IRIS	+0.06	+0.01	-0.0007	-0.0001
DIAMOND (Linear)	0.81±0.01	0.57±0.05	1.0000±0.0000	1.0000±0.0000
DIAMOND (MLP)	0.91±0.005	0.63±0.05	0.9994±0.0002	0.9998±0.0001
Δ_DIAMOND	+0.10	+0.06	-0.0006	-0.0002
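
The Δ rows in the table above are consistent with the MLP-probe R^2 minus the linear-probe R^2 for each variable. The short check below recomputes them from the table's mean values; the dictionary layout and the helper name `gap` are illustrative only.

```python
# Mean probe R^2 values copied from the table above.
columns = ["ball_x", "ball_y", "player_x", "score"]
linear = {"IRIS": [0.85, 0.58, 0.9994, 1.0000], "DIAMOND": [0.81, 0.57, 1.0000, 1.0000]}
mlp = {"IRIS": [0.91, 0.59, 0.9987, 0.9999], "DIAMOND": [0.91, 0.63, 0.9994, 0.9998]}

def gap(model):
    """MLP minus linear R^2 per variable, rounded to the table's precision."""
    return {c: round(m - l, 4) for c, m, l in zip(columns, mlp[model], linear[model])}

delta_iris = gap("IRIS")
delta_diamond = gap("DIAMOND")
print(delta_iris)
print(delta_diamond)
```

A small positive gap means the nonlinear probe recovers little beyond what the linear probe already decodes. The slightly negative gaps for player_x and score fall in the fourth decimal place.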

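Both IRIS and DIAMOND develop approximately linear representations of game-state variables (such as ball position, paddle position, and score), with a small selectivity gap (Breakout Δ ≤ 0.06; Pong Δ ≤ 0.03).
Causal interventions show that shifting hidden states along probe directions produces correlated changes in predictions (r ≥ 0.95), indicating that the representations are functionally used.
IRIS attention heads show spatial specialization, with some heads focusing on tokens that overlap game objects; token ablation consistently marks object-containing tokens as highly important across baselines (ρ > 0.9).
In DIAMOND, abstract state encoding follows an inverted-V profile across layers, peaking at the bottleneck, while MLP probes recover nonlinearly encoded ball-position information in the decoder stages; both models outperform the baselines, and raw pixels perform poorly on ball position.
Across games, Pong generally yields higher R^2 than Breakout, possibly because the scene is simpler; the characteristic pattern of each architecture (IRIS flat across layers, DIAMOND peaking at the bottleneck) appears in both games.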
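
The multi-baseline token ablation mentioned in the findings can be illustrated with a toy example. Everything here is an assumption made for illustration: `toy_model` is a stand-in rather than IRIS, the three baselines (zero, mean token, small noise) are hypothetical choices, importance is measured as the L2 change in the output, and ρ is read as Spearman rank agreement between baselines. The summary does not specify any of these details.

```python
import numpy as np

def toy_model(tokens):
    """Hypothetical stand-in for a world model: (T, D) tokens -> (D,) prediction."""
    return np.tanh(tokens).mean(axis=0)

def ablation_importance(tokens, baseline):
    """Importance of token i = L2 change in output when token i is replaced by `baseline`."""
    reference = toy_model(tokens)
    scores = np.empty(len(tokens))
    for i in range(len(tokens)):
        ablated = tokens.copy()
        ablated[i] = baseline
        scores[i] = np.linalg.norm(toy_model(ablated) - reference)
    return scores

def spearman(a, b):
    """Spearman rank correlation (assumes no tied scores)."""
    rank_a, rank_b = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Toy frame: 16 tokens of width 8; tokens 3 and 10 "contain an object".
T, D = 16, 8
object_tokens = [3, 10]
tokens = 0.05 * np.sin(np.arange(T * D, dtype=float)).reshape(T, D)
tokens[object_tokens] += 3.0

rng = np.random.default_rng(0)
baselines = {
    "zero": np.zeros(D),
    "mean": tokens.mean(axis=0),
    "noise": 0.1 * rng.normal(size=D),
}
importance = {name: ablation_importance(tokens, b) for name, b in baselines.items()}
top2 = {name: set(np.argsort(s)[-2:].tolist()) for name, s in importance.items()}
rho_zero_mean = spearman(importance["zero"], importance["mean"])
print(top2, rho_zero_mean)
```

Running the ablation under several baselines guards against conclusions that depend on one particular replacement value. If the same tokens rank highest under each baseline, the importance estimate is less likely to be an artifact of that choice.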

Figure 2: Causal intervention on Breakout: shifting IRIS layer-5 hidden states along probe directions produces correlated changes in predictions ( $r\geq 0.96$ for all properties, measured via KL divergence).
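
A causal intervention of this kind can be sketched on a toy readout. The example is hedged throughout: the hidden state, the output matrix `W_out`, the probe weights, and the shift sizes are all synthetic, and r is assumed to be the Pearson correlation between shift magnitude and the KL divergence from the unperturbed prediction, which is one plausible reading of the caption rather than a confirmed detail.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
D, V = 16, 10                      # hidden width and vocabulary size (assumed)
W_out = rng.normal(size=(D, V))    # hypothetical readout from hidden state to token logits
hidden = rng.normal(size=D)        # hypothetical hidden state at one position
probe_w = rng.normal(size=D)       # stands in for a fitted probe's weight vector
direction = probe_w / np.linalg.norm(probe_w)

# Shift the hidden state along the probe direction and compare predictions.
alphas = np.linspace(0.0, 4.0, 9)
p_ref = softmax(hidden @ W_out)
kl = np.array([
    kl_divergence(p_ref, softmax((hidden + a * direction) @ W_out))
    for a in alphas
])
r = float(np.corrcoef(alphas, kl)[0, 1])
print(f"Pearson r between shift size and KL: {r:.3f}")
```

In a real experiment the shifted hidden state would be passed through the remaining layers of the world model rather than a single linear readout. Comparing against shifts along random directions of equal norm would help separate probe-specific effects from generic sensitivity.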


This review was generated by AI and reviewed by a human editor.