QUICK REVIEW

[Paper Review] Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Chen Zhao, Jiquan Ngiam|arXiv (Cornell University)|Oct 14, 2020

Domain Adaptation and Few-Shot Learning | References: 51 | Citations: 64

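ONE-LINE SUMMARY

GradDrop introduces a gradient-masking layer that selects gradient signs based on a consistency score, which encourages joint minima across multiple losses and improves multitask and transfer learning performance.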

ABSTRACT

The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.

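Motivation and objectives

Naively summing multiple gradient signals can impede multitask training, because the signals may pull the shared weights in conflicting directions.
GradDrop encourages joint minima by selectively masking gradients according to their sign.
GradDrop's effectiveness is demonstrated on multitask learning, transfer learning, and complex single-task models.
The paper explores GradDrop's theoretical properties and its synergy with existing gradient balancing methods.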

Proposed method

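Define the Gradient Positive Sign Purity P = (1/2) (1 + (sum_i ∇L_i) / (sum_i |∇L_i|)), which lies in [0, 1]: P = 1 when all gradients at a position are positive and P = 0 when all are negative.
Compute a mask M_i for each gradient using a monotonically increasing function f and a uniform random variable U: where f(P) > U the positive-signed gradients are kept, otherwise the negative-signed ones. The new gradient is sum_i M_i ∇L_i.
Apply GradDrop as a modular layer placed before the prediction heads, with optional leak parameters ℓ_i that let part of a chosen loss's gradient bypass the mask where necessary.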
Extend GradDrop to batch-separated gradients by summing gradients across the batch with a virtual layer to compute P and M_i.
Provide a full algorithm for the backward pass of the GradDrop layer, including normalization and optional gradient leakage.
Prove that GradDrop ensures stable points only at joint minima and that gradient magnitude remains sensitive to each loss.
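
The masking steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `graddrop`, the choice of f as the identity, one draw of U per activation position, the fallback P = 0.5 when every gradient is zero, and the leak formula (ℓ_i + (1 - ℓ_i) M_i) ∇L_i are all assumptions of this sketch, and the normalization step mentioned for the backward pass is omitted.

```python
import random


def graddrop(task_grads, leaks=None, rng=random):
    """Combine per-task gradients elementwise with sign-based masking.

    task_grads: list of n equal-length lists; task_grads[i][k] is the
                gradient of loss i at activation position k.
    leaks:      optional list of n values in [0, 1]; 0 means full masking,
                1 lets that loss's gradient pass unchanged (assumed form).
    """
    n = len(task_grads)
    leaks = leaks if leaks is not None else [0.0] * n
    combined = []
    for k in range(len(task_grads[0])):
        column = [g[k] for g in task_grads]
        total_abs = sum(abs(g) for g in column)
        # Positive sign purity P in [0, 1]; the 0.5 fallback for an
        # all-zero column is an assumption of this sketch.
        purity = 0.5 * (1.0 + sum(column) / total_abs) if total_abs > 0 else 0.5
        # f is taken to be the identity; U is drawn once per position.
        keep_positive = purity > rng.random()
        value = 0.0
        for g, leak in zip(column, leaks):
            # Mask is 1 only for nonzero gradients whose sign was selected.
            mask = 1.0 if g != 0 and (g > 0) == keep_positive else 0.0
            value += (leak + (1.0 - leak) * mask) * g
        combined.append(value)
    return combined
```

For two losses whose gradients at a position are -2 and 2, plain summation gives 0, whereas `graddrop([[-2.0], [2.0]])` returns either `[-2.0]` or `[2.0]` depending on the draw, so the update does not stall where the losses disagree. When all gradients share a sign, as in `graddrop([[1.0], [3.0]])`, the result is the ordinary sum `[4.0]`.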

Experimental results

Research questions

RQ1: Can GradDrop reliably steer optimization toward joint minima across multiple losses in multitask settings?
RQ2: How does GradDrop compare to existing multitask gradient methods (MGDA, PCGrad, GradNorm) in various tasks and architectures?
RQ3: Does GradDrop interact beneficially with transfer learning and other gradient-based regularizers?
RQ4: What are the theoretical guarantees and statistical properties of GradDrop updates?

Key findings

Method	Error rate (%) ↓	Max F1 score ↑	Speed relative to baseline ↑
Baseline	8.71	29.35	1.00
Gradient clipping [50]	8.70	29.34	1.00
Gradient penalty [10]	8.63	29.43	0.35
MGDA [37]	10.82	26.00	0.25
PCGrad [47]	8.72	29.25	0.20
GradNorm [3]	8.68	29.32	0.41
Random GradDrop	8.60	29.42	0.45
GradDrop (ours)	8.52	29.57	0.45

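GradDrop outperforms state-of-the-art multitask methods on the primary metrics for CelebA, CIFAR-100 transfer learning, and Waymo 3D detection.
On CelebA, GradDrop achieves the lowest error rate (8.52%) and the highest max F1 (29.57). It runs at 0.45 of the baseline speed, which is faster than MGDA (0.25), PCGrad (0.20), and GradNorm (0.41).
GradDrop delivers notable gains in transfer learning (CIFAR-100) and on 3D detection metrics, and shows synergy with GradNorm.
Under GradDrop the total loss behaves as predicted, the update stays sensitive to each individual task's gradient, and joint minima are encouraged.
GradDrop adds minimal overhead at inference time, and its training time is generally lower than that of several alternative methods.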
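
The error and speed figures above can be put on a common footing with a little arithmetic. The sketch below copies a few rows from the table and reports the change against the baseline; the helper name `compare_to_baseline` is hypothetical, and reading the speed column as throughput relative to the baseline (so that 1 / speed is a slowdown factor) is an assumption.

```python
# Rows copied from the table: (error rate %, max F1, speed relative to baseline).
results = {
    "Baseline": (8.71, 29.35, 1.00),
    "MGDA": (10.82, 26.00, 0.25),
    "PCGrad": (8.72, 29.25, 0.20),
    "GradNorm": (8.68, 29.32, 0.41),
    "GradDrop": (8.52, 29.57, 0.45),
}


def compare_to_baseline(name):
    """Return differences between a method's row and the baseline row."""
    err, f1, speed = results[name]
    base_err, base_f1, _ = results["Baseline"]
    return {
        "error_change_pts": round(err - base_err, 2),
        "relative_error_change_pct": round(100 * (err - base_err) / base_err, 2),
        "f1_change": round(f1 - base_f1, 2),
        # Assumes speed is throughput relative to baseline.
        "slowdown_factor": round(1 / speed, 2),
    }


for method in ("GradDrop", "GradNorm", "PCGrad", "MGDA"):
    print(method, compare_to_baseline(method))
```

Under that reading, GradDrop lowers the error rate by 0.19 points (about 2.18% relative) at roughly a 2.22x slowdown, while PCGrad runs about 5x slower than the baseline without improving the error rate.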

This review was generated by AI and checked by a human editor.