[Paper Explained] GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks
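GradNorm automatically balances multitask training by dynamically adjusting gradient magnitudes through a gradient-based loss term. It reduces overfitting and matches grid-search performance while requiring only a single asymmetry hyperparameter α.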
Deep multitask networks, in which one neural network produces multiple predictive outputs, can offer better speed and performance than their single-task counterparts but are challenging to train properly. We present a gradient normalization (GradNorm) algorithm that automatically balances training in deep multitask models by dynamically tuning gradient magnitudes. We show that for various network architectures, for both regression and classification tasks, and on both synthetic and real datasets, GradNorm improves accuracy and reduces overfitting across multiple tasks when compared to single-task networks, static baselines, and other adaptive multitask loss balancing techniques. GradNorm also matches or surpasses the performance of exhaustive grid search methods, despite only involving a single asymmetry hyperparameter $α$. Thus, what was once a tedious search process that incurred exponentially more compute for each task added can now be accomplished within a few training runs, irrespective of the number of tasks. Ultimately, we will demonstrate that gradient manipulation affords us great control over the training dynamics of multitask networks and may be one of the keys to unlocking the potential of multitask learning.
Research Motivation and Objectives
- Explain why deep multitask networks are hard to train: gradient magnitudes are imbalanced across tasks.
- Propose GradNorm to balance task training by tuning gradient magnitudes through loss weights.
- Show that GradNorm improves multitask performance across regression and classification tasks on synthetic and real data.
- Demonstrate that GradNorm can match or exceed grid-search based baselines with minimal hyperparameter tuning.
Proposed Method
- Define the per-task gradient norm $G_W^{(i)}(t) = \lVert \nabla_W\, w_i(t) L_i(t) \rVert_2$ and its average over tasks, $\bar{G}_W(t)$.
- Set the target gradient norm of each task to $\bar{G}_W(t)\,[r_i(t)]^{\alpha}$, where $r_i(t)$ is the relative inverse training rate of task $i$ (its loss ratio $L_i(t)/L_i(0)$ divided by the mean loss ratio across tasks) and $\alpha$ is a hyperparameter.
- Define a gradient loss $L_{\text{grad}}(t; w_i(t)) = \sum_i \left| G_W^{(i)}(t) - \bar{G}_W(t)\,[r_i(t)]^{\alpha} \right|_1$ and update $w_i(t)$ to minimize it, treating the target $\bar{G}_W(t)\,[r_i(t)]^{\alpha}$ as a constant during differentiation.
- Renormalize the weights $w_i(t)$ after each update so that $\sum_i w_i(t) = T$, where $T$ is the number of tasks.
- Update the network parameters with standard backpropagation on the total loss $L(t) = \sum_i w_i(t) L_i(t)$.
- Take $W$ to be the weights of the last shared layer when computing gradient norms, which reduces compute.
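The weight-update steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `gradnorm_weight_step`, the plain gradient-descent step with learning rate `lr`, and the small positive floor on the weights are all assumptions made for the example. It also assumes the caller has already computed the gradient norm of each unweighted task loss with respect to the shared weights W (for instance with an autodiff framework), so that the weighted norm is simply w_i times that value.

```python
def gradnorm_weight_step(weights, grad_norms, losses, initial_losses,
                         alpha=1.5, lr=0.025):
    """One GradNorm-style update of the loss weights w_i (hypothetical helper).

    weights        : current w_i(t), assumed positive
    grad_norms     : ||grad_W L_i(t)||_2 of each *unweighted* task loss
    losses         : current task losses L_i(t)
    initial_losses : task losses at the first step, L_i(0)
    """
    T = len(weights)

    # Weighted gradient norms G_i = w_i * ||grad_W L_i|| (valid for w_i > 0).
    G = [w * g for w, g in zip(weights, grad_norms)]
    G_avg = sum(G) / T

    # Loss ratios L_i(t)/L_i(0) and relative inverse training rates r_i(t).
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / T
    r = [x / mean_ratio for x in ratios]

    # Targets are treated as constants when differentiating L_grad.
    targets = [G_avg * ri ** alpha for ri in r]
    l_grad = sum(abs(Gi - ti) for Gi, ti in zip(G, targets))

    # d L_grad / d w_i = sign(G_i - target_i) * ||grad_W L_i||.
    sign = lambda x: (x > 0) - (x < 0)
    dw = [sign(Gi - ti) * g for Gi, ti, g in zip(G, targets, grad_norms)]

    # Assumption: plain gradient descent, with a floor to keep weights positive.
    new_w = [max(w - lr * d, 1e-8) for w, d in zip(weights, dw)]

    # Renormalize so that the weights sum to the number of tasks T.
    scale = T / sum(new_w)
    return [w * scale for w in new_w], l_grad


# Two tasks training at the same rate, but task 0 has a much larger gradient.
new_w, l_grad = gradnorm_weight_step(
    weights=[1.0, 1.0], grad_norms=[4.0, 1.0],
    losses=[1.0, 1.0], initial_losses=[2.0, 2.0], lr=0.1)
print(new_w, l_grad)
```

In this toy call both tasks have the same loss ratio, so both targets equal the average gradient norm. The update therefore lowers the weight of the task with the larger gradient and raises the other, while the renormalization keeps the weights summing to 2.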
Research Questions
- Can GradNorm balance training across diverse multitask objectives in deep networks?
- How does GradNorm compare to static weighting and uncertainty-based weighting approaches in terms of accuracy and overfitting?
- What is the impact of the asymmetry hyperparameter α on training dynamics and final performance?
- Can GradNorm recover near-optimal static loss weights without exhaustive grid search?
Key Findings
- GradNorm improves multitask test-time performance across regression and classification tasks on both synthetic and real datasets.
- GradNorm matches or surpasses single-task networks and outperforms static weighting and uncertainty-based baselines in several settings.
- The method requires tuning only a single hyperparameter α and can emulate optimal grid-search weights in a single training run.
- Time-averaged GradNorm weights E_t[w_i(t)] closely align with optimal static weights, so they can be used to estimate effective static weights.
- GradNorm can reduce overfitting by actively balancing gradient contributions across tasks during training.
- GradNorm is robust across different architectures (e.g., VGG SegNet and ResNet-based FCN) and tasks on the NYUv2 dataset.
This explainer was generated by AI and reviewed by a human editor.