QUICK REVIEW

[Paper Review] What Makes Training Multi-Modal Networks Hard?

Wei‐Yao Wang, Du Tran | arXiv (Cornell University) | May 29, 2019

Human Pose and Action Recognition | References: 30 | Cited by: 25

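One-Sentence Summary

This paper shows that multi-modal networks can underperform single-modal networks because they overfit and because different modalities generalize at different rates. It proposes Gradient Blending, a technique that adaptively combines per-modality gradients according to each modality's overfitting behavior, improving accuracy and reaching state-of-the-art results on several multi-modal benchmarks.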

ABSTRACT

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including fine-grained sport classification, human action recognition, and acoustic event detection.

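Research Motivation and Objectives

Investigate why multi-modal networks often underperform single-modal networks despite receiving more input information.
Identify the root causes of this performance drop, in particular overfitting and modalities that generalize at different rates.
Develop a training strategy that adjusts gradient updates according to each modality's overfitting behavior in order to improve generalization.
Evaluate the proposed method on a range of multi-modal tasks and benchmarks, including fine-grained classification and action recognition.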

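Proposed Method

The method, Gradient Blending, computes a weighted combination of per-modality gradients during backpropagation, with each weight set according to that modality's tendency to overfit.
Overfitting behavior is estimated by monitoring validation loss during training, so the model can assign a larger or smaller gradient weight to each modality.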
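The blending weights are computed from these overfitting estimates rather than fixed by hand, and can be re-estimated as training progresses to track how each modality's loss evolves.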
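While remaining end-to-end trainable, the method decouples the optimization dynamics of the different modalities, reducing interference between them during learning.
It requires no changes to the multi-modal architecture, so it can be applied to a wide range of models.
It is evaluated against standard baselines and compared with state-of-the-art methods on several benchmarks, including fine-grained sport classification and acoustic event detection.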
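
To make the weighting idea concrete, here is a minimal Python sketch. It is not the authors' implementation. This review does not state the weighting formula, so the rule used here is an assumption: each weight is proportional to G / O^2, where G is the drop in validation loss between two checkpoints and O is the growth of the train/validation gap over the same interval. The names `blending_weights` and `blended_loss` and the `eps` parameter are hypothetical.

```python
from typing import Dict, Tuple

# (train_loss, val_loss) measured for one modality at one checkpoint
Losses = Tuple[float, float]


def blending_weights(before: Dict[str, Losses], after: Dict[str, Losses],
                     eps: float = 1e-8) -> Dict[str, float]:
    """Estimate one blending weight per modality from two checkpoints.

    Assumed rule (not stated in this review): weight proportional to G / O**2.
      G = drop in validation loss between the checkpoints (generalization gain)
      O = growth of the train/validation gap (increase in overfitting)
    """
    raw = {}
    for name in before:
        train_0, val_0 = before[name]
        train_1, val_1 = after[name]
        gain = max(val_0 - val_1, 0.0)  # G: no credit if validation loss rose
        overfit = (val_1 - train_1) - (val_0 - train_0)  # O: change in the gap
        raw[name] = gain / (overfit ** 2 + eps)  # eps avoids division by zero
    total = sum(raw.values())
    if total == 0.0:
        # No modality improved on validation: fall back to uniform weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def blended_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-modality losses; its gradient is the blended gradient."""
    return sum(weights[name] * losses[name] for name in losses)


if __name__ == "__main__":
    # Made-up numbers: video overfits faster than audio over the same interval.
    before = {"video": (2.0, 2.1), "audio": (2.0, 2.1)}
    after = {"video": (1.0, 1.6), "audio": (1.5, 1.7)}
    print(blending_weights(before, after))
```

In this toy example the video branch widens its train/validation gap much more than the audio branch, so it receives the smaller weight. In a real training loop the per-modality losses would come from separate classifier heads, and the weights would be recomputed whenever fresh checkpoint statistics are available.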

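Experimental Results

Research Questions

RQ1: Why do multi-modal networks underperform single-modal networks even though they receive more input information?
RQ2: How much do differences in overfitting rates across modalities contribute to the performance drop in multi-modal training?
RQ3: Can a dynamic gradient blending strategy that accounts for modality-specific overfitting improve generalization in multi-modal learning?
RQ4: How does Gradient Blending compare with standard optimization and regularization techniques at reducing overfitting and improving accuracy on multi-modal tasks?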

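Key Findings

Across multiple tasks and datasets, the multi-modal network underperforms the best single-modal network, even though it receives more input information.
The performance gap is caused mainly by overfitting from increased model capacity and by modalities generalizing at different rates.
Gradient Blending reduces overfitting by adjusting each modality's gradient contribution according to its overfitting behavior.
The method achieves state-of-the-art accuracy on benchmarks for fine-grained sport classification, human action recognition, and acoustic event detection.
Gradient Blending outperforms widely used regularization and optimization baselines at improving the generalization of multi-modal models.
The gains are consistent across modality combinations and tasks, indicating that the method is robust.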

This review was generated by AI and checked by human editors.