QUICK REVIEW

[Paper Review] What Makes Training Multi-Modal Classification Networks Hard?

Wei‐Yao Wang, Du Tran|arXiv (Cornell University)|May 29, 2019

Human Pose and Action Recognition|References: 63|Cited by: 28

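One-Sentence Summary

This paper identifies overfitting and inconsistent generalization across modalities as the main causes of a counterintuitive performance drop in multi-modal neural networks, where the best single-modal model often outperforms the jointly trained model. It proposes Gradient Blending (G-Blend), which computes an optimal blend of supervision signals from each modality's overfitting behavior, and reports state-of-the-art accuracy on Kinetics, EPIC-Kitchens, and AudioSet, with clear gains over baselines and prior state-of-the-art methods.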

ABSTRACT

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.

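Motivation and Objectives

Investigate why multi-modal networks trained end to end often underperform the best single-modal counterpart, even though they receive more input information.
Diagnose the root causes of this performance drop, in particular overfitting and the different rates at which modalities generalize.
Develop a principled, architecture-agnostic method that balances supervision signals according to modality-specific overfitting behavior.
Show that standard regularization and feature-fusion techniques do not solve the problem, so a new training scheme is needed.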

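Proposed Method

Introduces the overfitting-to-generalization ratio (OGR) as a quantitative metric for measuring and comparing overfitting behavior across modalities.
Introduces the Gradient Blending (G-Blend) training scheme, which uses each modality's OGR to compute an optimal blend of modality-specific gradients that minimizes overall overfitting.
Uses per-modality blending weights, adjusted during training, that favor modalities that generalize better, effectively decoupling the optimization dynamics of each modality.
Applies G-Blend in a late-fusion setting, where modality features are concatenated at the final layer, enabling end-to-end training with modality-specific gradient weighting.
Uses a differentiable blending strategy that fits into standard backpropagation, allowing joint optimization while respecting each modality's overfitting characteristics.
Validates the method on multiple benchmarks (Kinetics, EPIC-Kitchens, AudioSet) with several backbones and fusion strategies, showing consistent gains without architectural changes.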
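
The OGR and blending steps listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that overfitting is measured as the growth of the validation-minus-training loss gap between two checkpoints, that generalization is the drop in validation loss over the same interval, and that blending weights are proportional to generalization divided by squared overfitting. The names `LossSnapshot`, `overfitting_and_generalization`, `ogr`, `blend_weights`, and `blended_loss`, along with all loss values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LossSnapshot:
    """Training and validation loss of one head at one checkpoint."""
    train: float
    val: float


def overfitting_and_generalization(start, end):
    """Return (delta_o, delta_g) between two checkpoints.

    delta_o: growth of the val-minus-train gap (assumed overfitting measure).
    delta_g: drop in validation loss (assumed generalization measure).
    """
    delta_o = (end.val - end.train) - (start.val - start.train)
    delta_g = start.val - end.val
    return delta_o, delta_g


def ogr(start, end, eps=1e-8):
    """Overfitting-to-generalization ratio |delta_o / delta_g|."""
    delta_o, delta_g = overfitting_and_generalization(start, end)
    return abs(delta_o) / max(abs(delta_g), eps)


def blend_weights(curves, eps=1e-8):
    """Map each head name to a weight proportional to delta_g / delta_o**2.

    `curves` maps a head name to a (start, end) pair of LossSnapshot.
    This weighting rule is an assumption made for illustration.
    """
    raw = {}
    for name, (start, end) in curves.items():
        delta_o, delta_g = overfitting_and_generalization(start, end)
        # A head whose validation loss did not improve gets zero weight.
        raw[name] = max(delta_g, 0.0) / (delta_o ** 2 + eps)
    total = sum(raw.values())
    if total == 0.0:
        # No head generalized: fall back to uniform weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def blended_loss(losses, weights):
    """Weighted sum of per-head losses; this is the quantity to backpropagate."""
    return sum(weights[name] * losses[name] for name in losses)


if __name__ == "__main__":
    # Hypothetical loss curves for two single-modal heads and the fused head.
    curves = {
        "rgb": (LossSnapshot(2.0, 2.2), LossSnapshot(1.0, 1.6)),
        "audio": (LossSnapshot(3.0, 3.1), LossSnapshot(1.5, 2.9)),
        "fused": (LossSnapshot(2.0, 2.3), LossSnapshot(0.5, 1.9)),
    }
    weights = blend_weights(curves)
    print(weights)
    print(blended_loss({"rgb": 1.2, "audio": 2.5, "fused": 1.0}, weights))
```

In this toy setting the RGB head improves its validation loss the most while its train/validation gap grows the least, so it receives the largest weight. In a deep-learning framework the same weighted sum would be applied to per-head loss tensors before calling the backward pass.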

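Experimental Results

Research Questions

RQ1: Why do multi-modal networks often underperform single-modal networks even though they receive more input information?
RQ2: How much do overfitting and differences in generalization rate across modalities contribute to the performance drop in joint multi-modal training?
RQ3: Can a unified optimization strategy effectively balance multiple modalities that overfit differently?
RQ4: Can a principled blend of supervision signals improve generalization and outperform standard regularization and fusion techniques in multi-modal learning?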

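Key Findings

On Kinetics, G-Blend reaches 72.6% top-1 accuracy, on par with the best single-modal RGB model, and improves on late-fusion baselines by up to 2.6 percentage points.
On EPIC-Kitchens, G-Blend ranked second in the unseen-kitchen challenge and fourth in the seen-kitchen challenge, outperforming ensemble models while using fewer modalities and a single model.
On AudioSet, G-Blend reaches 0.418 mAP and 0.975 mAUC, improving on state-of-the-art methods such as Multi-level Attn. and TAL-Net by 5.8% and 5.5%, respectively, despite using only 10 frames per video clip.
G-Blend improves on the naive late-fusion audio-visual (A/V) baseline on Kinetics by 1.4% and matches SlowFast in accuracy while being 2x faster.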
When fine-tuned from pre-trained features, G-Blend reaches 83.3% top-1 accuracy on Kinetics, a new state of the art, without using optical flow.
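The method is architecture- and task-agnostic and could be applied in other domains, such as 3D object detection that combines RGB and point-cloud inputs.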

This review was generated by AI and reviewed by human editors.