[Paper Review] Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion
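This paper proposes Multimodal Graph, a graph neural network model that uses graph convolution and graph pooling to model unaligned multimodal sequences and capture both intra-modal and inter-modal dynamics. On the CMU-MOSI and CMU-MOSEI datasets it reaches state-of-the-art (SOTA) performance, outperforming prior methods such as MulT as well as RNN- and Transformer-based models, while using fewer parameters and running more efficiently.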
In this paper, we study the task of multimodal sequence analysis which aims to draw inferences from visual, language and acoustic sequences. A majority of existing works generally focus on aligned fusion, mostly at word level, of the three modalities to accomplish this task, which is impractical in real-world scenarios. To overcome this issue, we seek to address the task of multimodal sequence analysis on unaligned modality sequences which is still relatively underexplored and also more challenging. Recurrent neural network (RNN) and its variants are widely used in multimodal sequence analysis, but they are susceptible to the issues of gradient vanishing/explosion and high time complexity due to its recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data. The graph-based structure enables parallel computation in time dimension and can learn longer temporal dependency in long unaligned sequences. Specifically, our Multimodal Graph is hierarchically structured to cater to two stages, i.e., intra- and inter-modal dynamics learning. For the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, given that the multimodal sequences are unaligned, the commonly considered word-level fusion does not pertain. To this end, we devise a graph pooling fusion network to automatically learn the associations between various nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results suggest that our graph-based model reaches state-of-the-art performance on two benchmark datasets.
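Motivation and Objectives
- Address the real-world challenge of analyzing visual, language, and acoustic sequences that are not aligned with one another.
- Overcome the limitations of RNNs in modeling long-range temporal dependencies, such as vanishing gradients and high time complexity.
- Develop a graph-based framework that computes in parallel and fuses modalities effectively without requiring word-level alignment.
- Investigate how effective graph convolution and graph pooling are for modeling sequential data across multiple modalities.
- Compare different GCN architectures and graph pooling strategies to find the best-performing configuration for unaligned multimodal learning.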
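Proposed Method
- Build a unimodal graph for each modality (text, visual, acoustic), treating every time step as a node and defining the adjacency matrix with either non-parametric or learnable methods.
- Use GraphSAGE with mean pooling as the base GCN to learn intra-modal dynamics across time steps, which supports modeling of long-range dependencies.
- Design a graph pooling fusion network (GPFN) that learns inter-modal associations by dynamically aligning nodes across modalities, with no need for word-level alignment.
- Explore several strategies for constructing the adjacency matrix, including a learnable, instance-based attention method that outperforms the non-parametric alternatives.
- Apply graph pooling techniques such as max/mean pooling and link similarity pooling; ablation studies show these outperform DiffPool on most metrics.
- Integrate unimodal and inter-modal graph learning into a hierarchical framework that jointly models intra-modal and inter-modal dynamics.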
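The pipeline in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`window_adjacency`, `attention_adjacency`, `sage_mean_layer`, `pool_and_fuse`), the window-based non-parametric adjacency, the scaled dot-product form of the attention adjacency, and the random weights are all assumptions made for illustration. In particular, `pool_and_fuse` is a deliberately simplified stand-in for the GPFN, which learns cross-modal node associations rather than pooling a plain stack of nodes. The sketch also assumes every modality has already been projected to a common feature dimension.

```python
import numpy as np

def window_adjacency(num_steps, window=2):
    """Non-parametric adjacency (assumed form): link time steps at most `window` apart."""
    idx = np.arange(num_steps)
    adj = (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)
    np.fill_diagonal(adj, 0.0)  # the node's own features are handled separately
    return adj

def attention_adjacency(x, w_q, w_k):
    """Instance-specific adjacency (assumed form): row-softmax of scaled dot products."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def sage_mean_layer(x, adj, w_self, w_neigh):
    """GraphSAGE-style layer: combine each node with the weighted mean of its neighbours."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ x) / np.maximum(deg, 1e-8)
    return np.maximum(x @ w_self + neigh_mean @ w_neigh, 0.0)  # ReLU

def pool_and_fuse(node_sets):
    """Simplified fusion (not the GPFN): stack all nodes, then max- and mean-pool."""
    nodes = np.concatenate(node_sets, axis=0)
    return np.concatenate([nodes.max(axis=0), nodes.mean(axis=0)])

# Toy run on three unaligned sequences of different lengths.
rng = np.random.default_rng(0)
dim = 4
lengths = {"language": 5, "acoustic": 9, "visual": 7}
encoded = []
for name, steps in lengths.items():
    x = rng.normal(size=(steps, dim))  # one row per time step (graph node)
    adj = attention_adjacency(x, rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)))
    encoded.append(sage_mean_layer(x, adj, rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))))

fused = pool_and_fuse(encoded)
print(fused.shape)  # (8,)
```

Because each modality is encoded on its own graph and fusion operates on the pooled set of nodes, the three sequences never need to share a length, which is the property that removes the word-level alignment requirement. Swapping `attention_adjacency` for `window_adjacency(steps)` in the loop gives the non-parametric variant.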
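Experimental Results
Research Questions
- RQ1: Can graph neural networks effectively model unaligned multimodal sequences without relying on recurrent structures?
- RQ2: How does graph convolution compare with RNNs and TCNs at modeling long-range temporal dependencies in unaligned sequences?
- RQ3: How do different adjacency matrix construction methods affect performance on sequential data?
- RQ4: Can graph pooling fusion outperform word-level fusion at capturing complex, long-horizon cross-modal interactions?
- RQ5: How does the proposed Multimodal Graph compare with SOTA models such as MulT and TFN in performance and efficiency?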
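Key Findings
- Multimodal Graph achieves SOTA performance on both CMU-MOSI and CMU-MOSEI; on CMU-MOSI it outperforms MulT on every metric except 7-class accuracy.
- On CMU-MOSEI, the GraphSAGE-based model reaches 81.4% accuracy, ahead of GAT (80.3%) and GIN (81.1%) in the ablation study.
- The GraphSAGE-based GPFN performs best, with an F1 score of 81.7% and a correlation coefficient of 0.675 on CMU-MOSEI, beating DiffPool on every metric except 7-class accuracy.
- On CMU-MOSEI, the model uses only 1,225,400 parameters, 64.46% of MulT's parameter count, which demonstrates strong parameter efficiency.
- Learnable adjacency matrices significantly outperform non-parametric ones at capturing dynamic temporal relations in unaligned sequences.
- The graph-based approach outperforms RNN- and Transformer-based models in both performance and complexity, supporting GCNs as a viable alternative for sequence modeling.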
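As a quick sanity check on the parameter comparison above, the two reported figures (1,225,400 parameters, 64.46% of MulT) imply MulT's approximate size on CMU-MOSEI. The values below are derived arithmetic, not numbers quoted in this review, and they inherit the rounding of the reported percentage:

```latex
\[
N_{\text{MulT}} \approx \frac{1{,}225{,}400}{0.6446} \approx 1.90 \times 10^{6},
\qquad
N_{\text{MulT}} - N_{\text{Multimodal Graph}} \approx 6.8 \times 10^{5}.
\]
```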
This review was generated by AI and reviewed by a human editor.