QUICK REVIEW

[Paper Review] Multimodal Prompting with Missing Modalities for Visual Recognition

Yi-Lun Lee, Yi‐Hsuan Tsai|arXiv (Cornell University)|Mar 6, 2023

Multimodal Machine Learning Applications | Cited by 10

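One-sentence summary

This paper proposes missing-aware prompts for multimodal transformers that handle a range of missing-modality scenarios at both training and test time, avoid full fine-tuning, and achieve strong robustness with far fewer trainable parameters.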

ABSTRACT

In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available.

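Research motivation and goals

Enable robust multimodal learning when modalities may be missing during training or testing.
Reduce computational cost by avoiding full fine-tuning of large multimodal transformers.
Propose prompts that condition the model's predictions on the specific missing-modality case.
Evaluate different prompt designs (input-level and attention-level) on multimodal datasets.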

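Proposed method

Define missing-modality scenarios as varying per sample and per phase (training/testing).
Attach learnable missing-aware prompts to a pretrained multimodal transformer (ViLT) while keeping the backbone frozen.
Explore two prompt designs, input-level and attention-level, and attach the prompts to selected transformer layers.
Train only the prompts, the pooler, and the classifier; leave the backbone unchanged so that trainable parameters fall below 1% of the model.
Use dummy inputs for the missing modality, and concatenate or route the prompts to guide the prediction.
Report performance on datasets under different missing-rate settings to assess robustness and efficiency.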
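The input-level variant of the steps above can be sketched roughly as follows. This is a minimal, framework-free illustration and not the authors' implementation: the function names (`missing_case`, `make_prompts`, `build_input`), the three case labels, the zero-valued dummy tokens, and the constant prompt initialisation are all assumptions made for the example. In a real setup the prompts would be learnable tensors and the resulting sequence would be fed to the frozen ViLT backbone.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Hypothetical labels for the missing-modality cases of an image-text model.
CASES = ("complete", "text_missing", "image_missing")


def missing_case(text: Optional[List[Vector]], image: Optional[List[Vector]]) -> str:
    """Return which modality, if any, is missing for this sample."""
    if text is not None and image is not None:
        return "complete"
    if text is None and image is not None:
        return "text_missing"
    if text is not None and image is None:
        return "image_missing"
    raise ValueError("at least one modality must be present")


def make_prompts(length: int, dim: int) -> Dict[str, List[Vector]]:
    """Create one prompt (length x dim) per case.

    Assumption: constant initial values stand in for learnable parameters.
    """
    return {
        case: [[0.01 * (i + 1)] * dim for _ in range(length)]
        for i, case in enumerate(CASES)
    }


def build_input(
    text: Optional[List[Vector]],
    image: Optional[List[Vector]],
    prompts: Dict[str, List[Vector]],
    dim: int,
    dummy_len: int = 1,
) -> List[Vector]:
    """Prepend the case-specific prompt to the token sequence.

    A missing modality is replaced by dummy tokens (zeros here, an assumption).
    """
    case = missing_case(text, image)
    dummy = [[0.0] * dim for _ in range(dummy_len)]
    text_tokens = text if text is not None else dummy
    image_tokens = image if image is not None else dummy
    return prompts[case] + text_tokens + image_tokens


if __name__ == "__main__":
    prompts = make_prompts(length=2, dim=4)
    image_only = [[1.0] * 4 for _ in range(3)]
    sequence = build_input(None, image_only, prompts, dim=4)
    print(missing_case(None, image_only), len(sequence))
```

Because the prompt is chosen per sample, the same frozen backbone can see different missing-modality cases within one batch, which is the per-sample behaviour the method describes.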

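Experimental results

Research questions

RQ1: When modalities are only partially observed during training and testing, can missing-aware prompts achieve robust multimodal recognition?
RQ2: How do input-level and attention-level prompt designs compare in effectiveness and stability across missing-modality scenarios?
RQ3: How do the trade-offs among prompt length, layer position, and the number of prompted layers affect performance and efficiency?

Key findings

Attention-level prompts consistently improve on the baseline's robustness under missing-modality scenarios.
Input-level prompts usually achieve the best performance but are more sensitive to dataset characteristics; attention-level prompts are more stable.
Relative to the 113M-parameter backbone, the method adds fewer than 0.2% extra parameters (about 221k) and achieves competitive results without full-model fine-tuning.
Attaching prompts starting from the early transformer layers generally has more impact than prompting only the later layers.
The performance drop caused by missing modalities is mitigated, with robustness demonstrated on the MM-IMDb, UPMC Food-101, and Hateful Memes datasets.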
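The parameter figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the prompt length (16), hidden size (768), number of prompted layers (6), and number of missing-modality cases (3) are assumed values that this review does not state; they are used because their product lands close to the reported figure of about 221k. The helper names are hypothetical.

```python
def prompt_param_count(length: int, dim: int, layers: int, cases: int) -> int:
    """Number of prompt parameters: one (length x dim) prompt per layer per case."""
    return length * dim * layers * cases


def fraction_of_backbone(extra_params: int, backbone_params: int) -> float:
    """Extra trainable parameters as a fraction of the frozen backbone."""
    return extra_params / backbone_params


if __name__ == "__main__":
    # Assumed settings (not given in this review): 16 tokens, 768 dims, 6 layers, 3 cases.
    extra = prompt_param_count(length=16, dim=768, layers=6, cases=3)
    frac = fraction_of_backbone(extra, 113_000_000)  # 113M backbone, as reported above
    print(extra, f"{frac:.4%}")
```

Under these assumptions the prompt count is 221,184, which stays below 0.2% of a 113M-parameter backbone. The formula also shows why prompt length and the number of prompted layers trade off linearly against parameter cost.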


This review was generated by AI and checked by a human editor.