QUICK REVIEW

[Paper Review] Energy-Entropy Regularization: The True Power of Minimal Looped Transformers

Wai-Lun Lam|arXiv (Cornell University)|Jan 14, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

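One-Sentence Summary

This paper proposes Energy-Entropy Regularization (EER), which trains a minimal single-head looped Transformer (d=8) to perform a long-range induction task, using Tsallis entropy and Hamiltonian dynamics to reshape the loss landscape so that training converges reliably.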

ABSTRACT

Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum point. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve induction head task with input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.

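Motivation and Objectives

Training single-head looped Transformers is difficult because the loss landscape is highly non-convex, which motivates this work.
Build an entropy-contraction framework based on Tsallis entropy to stabilize training.
Introduce a Hamiltonian view of the latent dynamics to navigate toward the global minimum in latent space.
Propose an energy-entropy regularized loss that creates a funnel-shaped landscape for reliable convergence.
Demonstrate long-range induction with a minimal model (d=8) on sequences of up to 1000 tokens.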
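
To make the induction objective concrete, here is a minimal sketch of one common way to build such a task. The exact data format used in the paper is not given in this review, so the reserved trigger token, the vocabulary size, and the function name `make_induction_example` are all assumptions made for illustration.

```python
import random

def make_induction_example(length, vocab_size=16, seed=None):
    """Build one toy induction sequence (hypothetical format, not the paper's).

    A reserved trigger token appears once inside the sequence and again as the
    final token. The label is the token that followed the first occurrence.
    """
    if length < 3:
        raise ValueError("length must be at least 3")
    rng = random.Random(seed)
    trigger = 0  # assumption: token 0 is reserved as the trigger
    # Ordinary tokens are drawn from 1..vocab_size-1, so they never equal the trigger.
    seq = [rng.randrange(1, vocab_size) for _ in range(length)]
    # Place the trigger so that its successor is not the final position.
    pos = rng.randrange(0, length - 2)
    seq[pos] = trigger
    target = seq[pos + 1]
    seq[-1] = trigger  # the query: the model should now predict `target`
    return seq, target

seq, target = make_induction_example(1000, seed=0)
print(len(seq), seq[-1], target)
```

With `length=1000` the trigger and its answer can be hundreds of positions away from the final query, which is what makes the long-sequence setting hard for a very small model.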

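Proposed Method

Apply Tsallis entropy to the attention map to obtain a contraction interval for the self-attention dynamics.
Model the evolution of the latent state as a discrete Hamiltonian system, with position denoted Z and velocity denoted V.
Define an energy-based attention operator F_tau and an attention energy E_tau that guides the trajectory.
Add three coupled regularizers to the loss (Kinetic, Potential, and Entropy), forming the Hamiltonian-Tsallis loss.
Show an optimization path driven by a phase transition, from exploration to crystallization of the latent state.
Evaluate length generalization on the induction head task at length 1000 and compare against FOP-Looped-Adaptive.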
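
The two ingredients named above can be sketched in a few lines. The Tsallis entropy formula S_q(p) = (1 - sum_i p_i^q) / (q - 1) is the standard definition. Everything else here is an assumption for illustration: the choice q=2, the symplectic Euler integrator, and the toy quadratic potential stand in for the paper's F_tau and E_tau, which this review does not specify.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    Falls back to Shannon entropy in the limit q -> 1.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0.0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

def hamiltonian_step(z, v, grad_potential, dt=0.01):
    """One symplectic Euler step: update velocity V from the force, then position Z."""
    g = grad_potential(z)
    v_new = [vi - dt * gi for vi, gi in zip(v, g)]
    z_new = [zi + dt * vi for zi, vi in zip(z, v_new)]
    return z_new, v_new

def total_energy(z, v):
    """Kinetic plus potential energy for the toy potential U(z) = |z|^2 / 2."""
    return 0.5 * sum(vi * vi for vi in v) + 0.5 * sum(zi * zi for zi in z)

# Entropy of one attention row: lower values mean sharper attention.
weights = softmax([2.0, 0.5, 0.1, -1.0])
print(tsallis_entropy(weights, q=2.0))

# Latent state (Z, V) evolving under the toy potential, whose gradient is z itself.
z, v = [1.0], [0.0]
for _ in range(1000):
    z, v = hamiltonian_step(z, v, lambda x: list(x))
print(total_energy(z, v))
```

A symplectic integrator is used in this sketch because it keeps the total energy close to its initial value over many steps, which makes the kinetic and potential terms easy to interpret separately.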

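Experimental Results

Research Questions

RQ1: Can Tsallis-entropy-based contraction ensure stable fixed-point convergence in a single-head looped Transformer?
RQ2: Does energy-entropy regularization reshape the loss landscape into a funnel that favors global optimization?
RQ3: Can a minimal single-head looped Transformer solve long-range induction tasks (up to 1000 tokens)?
RQ4: How well does the model generalize across sequence lengths under the proposed energy-entropy framework?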
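
As background for RQ1, the sketch below shows what fixed-point convergence of a looped map means: the same update is applied repeatedly, and if it is a contraction the residual between successive states shrinks geometrically. The scalar map `toy_step` is a hypothetical stand-in and not the paper's attention operator. The default depth of 25 matches the recurrence depth T listed in the configuration table of this review.

```python
def iterate_to_fixed_point(step, z0, depth=25, tol=1e-6):
    """Apply `step` up to `depth` times and record the residual |z_next - z|."""
    z = z0
    residuals = []
    for _ in range(depth):
        z_next = step(z)
        residuals.append(abs(z_next - z))
        z = z_next
        if residuals[-1] < tol:
            break  # successive states agree to within the tolerance
    return z, residuals

def toy_step(z):
    """Hypothetical contraction with factor 0.5; its fixed point is z = 2."""
    return 0.5 * z + 1.0

z_star, residuals = iterate_to_fixed_point(toy_step, z0=0.0)
print(z_star, len(residuals))
```

Because each application halves the distance to the fixed point, the loop stops well before the 25-step budget. A map that is not a contraction would give residuals that stall or grow instead.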

Key Findings

Model	Latent Dimension (d)	Attention Heads (h)	Position Encoding	Recurrence Depth (T)	Training Steps	Learning Rate	Weight Decay	Batch Size	Training Range (L)	Loss Objective
FOP-Looped-Adaptive	64	4	0.15× Sinusoidal	25	100k	1e-4	0.05	64	16–64	Cross-Entropy (CE)
EER (Ours)	8	1	0.15× Sinusoidal	25	20k	1e-3	0.10	32	16–64	L_Task + L_Kinetic + L_Potential + L_Entropy

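The EER framework trains a single-head looped Transformer with d=8 that solves the induction head task on sequences of up to 1000 tokens.
EER generalizes to length L=1000 with far fewer parameters than the FOP-Looped-Adaptive baseline (reportedly less than 0.02% of its parameter count).
A pronounced phase transition is observed at around epoch 500, where accuracy at L=1000 jumps from 33.5% to 79.2%.
In the intermediate stage, accuracy plateaus (for example, 96.7% at L=100) and then stabilizes, reflecting a shift from kinetic-energy-driven exploration to energy-dominated crystallization.
The method combines kinetic, potential, and entropy regularization, turning the loss landscape into a funnel-like geometry that reduces optimization noise and enables reliable convergence.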


This review was generated by AI and checked by human editors.