QUICK REVIEW

[Paper Review] Factorized Multimodal Transformer for Multimodal Sequential Learning

Amir Zadeh, Chengfeng Mao|arXiv (Cornell University)|Nov 22, 2019

Speech and dialogue systems | References: 47 | Cited by: 37

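One-Sentence Summary

FMT introduces a factorized multimodal self-attention mechanism for modeling intramodal and intermodal dynamics in asynchronous multimodal sequences, and reports state-of-the-art results on the CMU-MOSI, IEMOCAP, and POM datasets.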

ABSTRACT

The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without encountering difficulties during training (e.g. overfitting) even on relatively low-resource setups. All the attention mechanisms within FMT have a full time-domain receptive field which allows them to asynchronously capture long-range multimodal dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities of language, vision and acoustic. We perform a wide range of experiments, spanning across 3 well-studied datasets and 21 distinct labels. FMT shows superior performance over previously proposed models, setting new state of the art in the studied datasets.

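Motivation and Objectives

Motivate and address the challenge of modeling asynchronous spatio-temporal interactions across the language, vision, and acoustic modalities.
Propose a single transformer architecture (FMT) that uses Factorized Multimodal Self-attention (FMS) to capture unimodal, bimodal, and trimodal interactions.
Enable scalable modeling of long-range multimodal dynamics while avoiding overfitting in low-resource settings.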

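Proposed Method

Embed each modality separately and add positional information.
Use Multimodal Transformer Layers (MTL), each containing multiple Factorized Multimodal Self-attentions (FMS), to capture factorized intramodal and intermodal dynamics.
Within each FMS, compute seven attentions, one for each of the factors L, V, A, LV, LA, VA, and LVA, each attending over the full sequence length.
Apply one-dimensional convolutional summarization networks (S1 and S2) to reduce the high-dimensional FMS output to a tractable representation.
Feed the final MTL output to a GRU-based predictor for timestamp-level supervision and final sequence labeling.
Compare FMT against strong baselines on CMU-MOSI, IEMOCAP, and POM using standard multimodal metrics.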
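
The seven factors used inside each FMS are the non-empty subsets of the three modalities (2^3 - 1 = 7). The sketch below enumerates them and runs one full-sequence self-attention per factor. It is a minimal illustration, not the authors' implementation: the function names (`factor_subsets`, `self_attention`, `fms_sketch`) are hypothetical, the attention has no learned query/key/value projections, and joining modalities by concatenating their features at each time step is an assumption made only for this example.

```python
import math
from itertools import combinations

MODALITIES = ("L", "V", "A")  # language, vision, acoustic


def factor_subsets(modalities=MODALITIES):
    """Return every non-empty subset of the modalities, smallest first."""
    return [combo
            for size in range(1, len(modalities) + 1)
            for combo in combinations(modalities, size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Unparameterized scaled dot-product self-attention.

    seq is a list of T feature vectors. Every position attends to all
    T positions, i.e. the receptive field covers the whole sequence.
    """
    dim = len(seq[0])
    attended = []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in seq]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, seq))
                         for j in range(dim)])
    return attended


def fms_sketch(inputs):
    """Run one attention per factor; inputs maps modality -> T vectors."""
    outputs = {}
    for combo in factor_subsets():
        steps = len(inputs[combo[0]])
        # Assumption: modalities in a factor are joined by feature concatenation.
        joint = [sum((inputs[m][t] for m in combo), []) for t in range(steps)]
        outputs["".join(combo)] = self_attention(joint)
    return outputs


# Toy input: 4 time steps, feature sizes 2 (L), 3 (V), 1 (A).
toy = {
    "L": [[0.1 * t, 1.0] for t in range(4)],
    "V": [[0.2 * t, 0.5, -0.5] for t in range(4)],
    "A": [[0.3 * t] for t in range(4)],
}
result = fms_sketch(toy)
print(list(result))
```

Each of the seven outputs keeps the input sequence length, so later stages (the summarization networks in the description above) are what reduce the combined representation to a manageable size.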

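Experimental Results

Research Questions

RQ1: Can factorized attention within a single transformer effectively model unimodal, bimodal, and trimodal interactions in asynchronous multimodal sequences?
RQ2: Does a compact architecture whose attentions span the full time domain outperform prior multimodal sequence models on tasks such as sentiment, emotion, and personality trait recognition?
RQ3: How does varying the number of FMS units in an MTL affect performance and training efficiency?
RQ4: How does removing the unimodal, bimodal, or trimodal factors affect overall performance?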

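Key Findings

FMT outperforms the baselines on multimodal sentiment analysis on CMU-MOSI (Table 1).
FMT outperforms the baselines on discrete emotion recognition on IEMOCAP, except for Happy (Table 2).
FMT outperforms the baselines on the 16 traits in POM (Table 3).
Ablation studies show that all factor types (UNI, BI, TRI) and the summarization components are needed for the best performance (Table 4).
In the authors' experiments, performance peaks as the number of FMS units in an MTL is increased to at most 6 (Table 5).
FMT uses fewer attentions in total than MulT, yet achieves better performance on the same tasks.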
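
To make the attention counts concrete, the short sketch below works out how many attentions one MTL would contain for a given number of FMS units. It assumes that each FMS contains exactly one attention per non-empty modality subset and nothing else; the helper names are hypothetical, and no count for MulT is given here because this summary does not state one.

```python
N_MODALITIES = 3  # language, vision, acoustic


def attentions_per_fms(n_modalities=N_MODALITIES):
    """One attention per non-empty subset of the modalities."""
    return 2 ** n_modalities - 1


def attentions_per_mtl(n_fms, n_modalities=N_MODALITIES):
    """Total attentions in one layer holding n_fms FMS units (assumed additive)."""
    return n_fms * attentions_per_fms(n_modalities)


print(attentions_per_fms())   # 7
print(attentions_per_mtl(6))  # 42
```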

This review was generated by AI and reviewed by a human editor.