QUICK REVIEW

[Paper Review] Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks

Binchuan Qi|arXiv (Cornell University)|Feb 18, 2026

Stochastic Gradient Optimization Techniques|Cited by 0

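One-Sentence Summary

This paper proposes conjugate learning theory, which unifies the trainability and generalization of deep neural networks through convex conjugate duality, practical learnability under exponential-family models, and Fenchel–Young losses, and it supports the theoretical results with empirical validation.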

ABSTRACT

In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.

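Motivation and Objectives

Explain why a unified theory, going beyond classical optimization, is needed to account for the trainability and generalization of deep neural networks.
Propose a conjugate learning framework, grounded in convex conjugate duality, for modeling practical learnability and distribution estimation.
Characterize, within this framework, how architecture, data, and optimization interact to explain training dynamics and generalization.
Incorporate prior knowledge through convex constraints that restrict the hypothesis space and improve learning efficiency.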

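Proposed Method

Model the conditional distribution Y|X as an exponential-family distribution parameterized by a function of X.
Show that, in this setting, maximum likelihood estimation is equivalent to minimizing a Fenchel–Young loss.
Define a conjugate learning objective comprising a convex generating function and a convex constraint set that encodes prior knowledge.
Introduce a structure matrix and the gradient energy to reinterpret non-convex empirical risk minimization as constrained gradient dynamics.
Derive deterministic and probabilistic generalization bounds via generalized conditional entropy.
Validate the predictions with extensive deep learning experiments, which agree with the theory.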
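
The Fenchel–Young loss named in the method can be checked numerically on two familiar cases. The sketch below is illustrative and is not taken from the paper: it assumes the standard definition L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>, and all function names (fenchel_young_loss, neg_entropy, and so on) are hypothetical. With the Shannon negentropy on the simplex as generator (conjugate: log-sum-exp) the loss should coincide with softmax cross-entropy, and with the half squared norm (its own conjugate) it should coincide with the squared error.

```python
import math


def logsumexp(theta):
    """Numerically stable log(sum_i exp(theta_i)); conjugate of negentropy on the simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(p):
    """Shannon negentropy sum_i p_i log p_i, using the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)


def half_sq_norm(v):
    """0.5 * ||v||^2, a generator that is its own convex conjugate."""
    return 0.5 * sum(x * x for x in v)


def fenchel_young_loss(theta, y, omega, omega_conj):
    """Assumed standard form: omega_conj(theta) + omega(y) - <theta, y>."""
    inner = sum(t * yi for t, yi in zip(theta, y))
    return omega_conj(theta) + omega(y) - inner


def softmax_cross_entropy(theta, label):
    """Negative log-probability of `label` under softmax(theta)."""
    return logsumexp(theta) - theta[label]


# Illustrative inputs (not from the paper): three logits and a one-hot target.
theta = [2.0, -1.0, 0.5]
one_hot = [0.0, 1.0, 0.0]

# Classification case: negentropy generator versus cross-entropy.
fy_ce = fenchel_young_loss(theta, one_hot, neg_entropy, logsumexp)
ce = softmax_cross_entropy(theta, 1)

# Regression case: half squared norm generator versus squared error.
fy_sq = fenchel_young_loss(theta, one_hot, half_sq_norm, half_sq_norm)
sq = 0.5 * sum((t - yi) ** 2 for t, yi in zip(theta, one_hot))

print(math.isclose(fy_ce, ce), math.isclose(fy_sq, sq))
```

Swapping the generator is the only change between the two cases, which is the sense in which a single convex-duality template can cover both classification and regression losses.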

Figure 1: Schematic illustration of the conjugate learning framework. The diagram outlines the complete processing pipeline from raw input to learning target approximation, emphasizing the interplay among model output, conjugate transformation, and distance measurement.

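Experimental Results

Research Questions

RQ1: How can practical learnability be characterized in finite-sample deep learning settings?
RQ2: When learning is viewed as conditional distribution estimation, what mechanisms govern trainability and generalization?
RQ3: Within the conjugate learning framework, how do batch size, architecture, and priors affect convergence and generalization?
RQ4: Can Fenchel–Young losses and convex duality unify losses across tasks (classification, regression, generative modeling)?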

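Key Findings

Conjugate learning theory provides a unified framework, built on convex duality, that links trainability and generalization.
Under the exponential-family assumption, maximum likelihood is equivalent to minimizing a Fenchel–Young loss, which enables principled loss design under convex constraints.
The newly introduced structure matrix and gradient-related factors quantify how architecture and data affect the convergence of mini-batch SGD.
Generalization bounds derived via generalized conditional entropy capture the effects of information loss, loss magnitude, and data characteristics.
The framework accommodates non-i.i.d. data, explicit integration of priors, and conjugate prediction maps, yielding insights into regularization, irreversibility, and depth.
Experimental results show that the theoretical predictions closely match empirical behavior in standard deep neural network settings.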
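
The equivalence between exponential-family maximum likelihood and Fenchel–Young loss minimization can be sketched in a few lines. The derivation below uses textbook exponential-family notation (natural parameter \theta, sufficient statistic T, base measure h, log-partition function A), which is an assumption here and may differ from the paper's own symbols. It also assumes A is closed and convex, so that A^{**} = A, and that A^{*}(T(y)) is finite.

```latex
% Assumed notation, not necessarily the paper's:
% \theta = f(x) is the model output, A is the log-partition function.
\begin{align*}
p(y \mid \theta) &= h(y)\,\exp\bigl(\langle \theta, T(y)\rangle - A(\theta)\bigr), \\
-\log p(y \mid \theta) &= A(\theta) - \langle \theta, T(y)\rangle - \log h(y). \\
\intertext{The Fenchel--Young loss generated by a convex function $\Omega$ is}
L_{\Omega}(\theta; \mu) &= \Omega^{*}(\theta) + \Omega(\mu) - \langle \theta, \mu\rangle . \\
\intertext{Choosing $\Omega = A^{*}$ gives $\Omega^{*} = A^{**} = A$, hence}
L_{A^{*}}(\theta; T(y)) &= A(\theta) + A^{*}(T(y)) - \langle \theta, T(y)\rangle, \\
-\log p(y \mid \theta) &= L_{A^{*}}(\theta; T(y)) - A^{*}(T(y)) - \log h(y).
\end{align*}
```

The last two terms do not depend on \theta, so under these assumptions the negative log-likelihood and the Fenchel–Young loss share the same minimizers and the same gradients with respect to \theta.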

Figure 3: Custom-designed model architectures and configuration parameters. Gray blocks represent components where the number of repetitions can be adjusted via the parameter $n_{d}$ , and model width can be tuned via the parameter $n_{w}$ . Model B is a modified variant of Model A with additional s

This review was generated by AI and reviewed by a human editor.