QUICK REVIEW

[Paper Review] Geometric and Dynamic Scaling in Deep Transformers

Haoran Su, Chenyu You | arXiv (Cornell University) | Jan 3, 2026

Topic: Generative Adversarial Networks and Image Synthesis | Citations: 0

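One-Sentence Summary

Proposes the Manifold-Geometric Transformer (MGT), which combines manifold-constrained hyper-connections (mHC) with Deep Delta Learning (DDL) to give ultra-deep Transformers stable training and updates that can erase features.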

ABSTRACT

Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.

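Motivation and Goals

Explain why standard residual connections lead to rank collapse as depth increases.
Introduce a geometric-dynamic framework that constrains updates to the data manifold and enables erasure.
Propose two orthogonal components (mHC and DDL) and show their synergistic effect on stability.
Provide a rigorous evaluation protocol for testing ultra-deep scaling beyond 100 layers.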

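Proposed Method

Define Manifold-Constrained Hyper-Connections (mHC) as a soft projection of updates onto the tangent space of the data manifold.
Introduce Deep Delta Learning (DDL), which adds a data-dependent gate $\beta$ to enable erasure and controlled writes.
Formulate the Delta operator $\mathbf{A}(\beta, \mathbf{k}) = \mathbf{I} - \beta\,\mathbf{k}\mathbf{k}^\top$ and derive its spectral properties and three geometric regimes.
Integrate mHC and DDL into the MGT block, whose forward pass has three phases: generation, geometric rectification, and Delta dynamics.
Give the explicit MGT update rule $\mathbf{X}_{l+1} = \mathbf{X}_l + \beta\,(\mathbf{V}_{mHC} - \alpha\,\mathbf{X}_l)$ and discuss its erase-write semantics.
Outline an evaluation framework covering rank evolution, ablation studies, analysis of the $\beta$ distribution, depth scaling, and language-modeling tests.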
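
The spectral behavior of the Delta operator can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes k is normalized to unit length, in which case A(beta, k) has eigenvalue 1 - beta along k and eigenvalue 1 on every direction orthogonal to k. Reading the three regimes as beta = 0 (identity), beta = 1 (orthogonal projection that erases the k component), and beta = 2 (Householder reflection) is likewise an assumption; the summary does not spell the regimes out. The function name `delta_operator` is hypothetical.

```python
import numpy as np

def delta_operator(beta: float, k: np.ndarray) -> np.ndarray:
    """Return A(beta, k) = I - beta * k k^T.

    Assumption: k is rescaled to unit length first, so the eigenvalue
    along k is exactly 1 - beta.
    """
    k = k / np.linalg.norm(k)
    return np.eye(k.shape[0]) - beta * np.outer(k, k)

rng = np.random.default_rng(0)
k = rng.normal(size=4)

for beta in (0.0, 1.0, 2.0):
    A = delta_operator(beta, k)
    # A is symmetric, so eigvalsh applies; the smallest eigenvalue is 1 - beta.
    eigvals = np.sort(np.linalg.eigvalsh(A))
    print(f"beta={beta}: min eigenvalue ~ {eigvals[0]:.3f}, others ~ {eigvals[-1]:.3f}")
```

With unit k, beta = 1 removes the component of the input along k while leaving the rest untouched, and beta = 2 flips the sign of that component, which is what makes a non-monotonic (erasing or reflecting) update possible.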

Figure 1: Architecture of the Manifold-Geometric Transformer (MGT) Block. The pipeline explicitly separates three phases: (1) Generation via LayerNorm and MHSA/FFN producing raw updates $\mathbf{V}_{raw}$ , (2) Geometric Rectification via mHC projection $\Psi$ constraining updates to the tangent spa

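Experimental Results

Research Questions

RQ1: In practice, does depth in standard Transformers lead to rank collapse in very deep models?
RQ2: Can manifold constraints combined with dynamic erasure mitigate representational drift and support ultra-deep scaling?
RQ3: How do mHC and DDL, individually and jointly, contribute to stability and performance?
RQ4: Can the proposed three-phase MGT block preserve gradient flow and achieve erase-write behavior beyond 100 layers?
RQ5: In large-scale language modeling, how does MGT affect perplexity and training stability?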
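
Rank collapse, the subject of RQ1, needs a concrete measurement. One common choice is the entropy-based effective rank: the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that metric as an assumption; the summary does not say which rank measure the paper's protocol adopts, and the function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(X: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x features) matrix.

    Computes exp(H(p)), where p is the singular-value spectrum of X
    normalized to sum to 1 and H is the Shannon entropy.
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(64, 32))                              # spread-out spectrum
collapsed = np.outer(rng.normal(size=64), rng.normal(size=32))   # rank-1 features

print("healthy  :", effective_rank(healthy))
print("collapsed:", effective_rank(collapsed))
```

Tracking this value on the hidden states of each layer would give a per-depth rank curve; a collapsed representation shows up as a value close to 1.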

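Key Findings

The paper lays out a theoretical basis for the claim that combining geometric constraints with dynamic erasure is essential for scaling beyond current depth limits.
MGT introduces a modular block that restricts updates to tangent-space directions and modulates their magnitude with a data-dependent gate.
The Delta residual block restores erase-write dynamics, making updates non-monotonic.
An experimental framework is proposed to test, and potentially falsify, the hypothesis that geometry is the fundamental bottleneck in scaling Transformers.
The method aims to preserve gradient flow while allowing controlled feature erasure to prevent rank collapse.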
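
The erase-write behavior follows directly from the update rule X_{l+1} = X_l + beta * (V_mHC - alpha * X_l), which can be rewritten as (1 - alpha * beta) * X_l + beta * V_mHC. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: V stands in for an already rectified update V_mHC, and `sigmoid_gate` is a hypothetical per-token gate whose form and (0, 1) range are chosen only for the example.

```python
import numpy as np

def mgt_step(X, V, beta, alpha=1.0):
    """One update X_{l+1} = X_l + beta * (V - alpha * X_l).

    X, V: (tokens, features) arrays; V is assumed to be the rectified update.
    beta: scalar or (tokens, 1) array, broadcast across features.
    """
    return X + beta * (V - alpha * X)

def sigmoid_gate(X, w):
    """Hypothetical data-dependent gate in (0, 1), one value per token."""
    logits = X @ w
    return (1.0 / (1.0 + np.exp(-logits)))[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
w = rng.normal(size=8)  # illustrative gate weights

beta = sigmoid_gate(X, w)
X_next = mgt_step(X, V, beta, alpha=1.0)

# Same result written as an explicit erase term and write term.
assert np.allclose(X_next, (1.0 - beta) * X + beta * V)
```

Three special cases make the semantics concrete: beta = 0 leaves X unchanged, alpha = 0 reduces to a gated standard residual X + beta * V, and alpha = beta = 1 overwrites X with V entirely.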

This review was generated by AI and checked by a human editor.