QUICK REVIEW

[Paper Review] Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

Sijie Mai, Songlong Xing|arXiv (Cornell University)|Nov 27, 2020

Music and Audio Processing|References: 60|Cited by: 26

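One-Sentence Summary

This paper proposes Multimodal Graph, a graph neural network-based model that uses graph convolution and graph pooling to model unaligned multimodal sequences. The model captures both intra-modal and inter-modal dynamics, outperforms state-of-the-art methods such as MulT on CMU-MOSI and CMU-MOSEI, uses fewer parameters, and is more efficient than RNNs and Transformers.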

ABSTRACT

In this paper, we study the task of multimodal sequence analysis which aims to draw inferences from visual, language and acoustic sequences. A majority of existing works generally focus on aligned fusion, mostly at word level, of the three modalities to accomplish this task, which is impractical in real-world scenarios. To overcome this issue, we seek to address the task of multimodal sequence analysis on unaligned modality sequences which is still relatively underexplored and also more challenging. Recurrent neural network (RNN) and its variants are widely used in multimodal sequence analysis, but they are susceptible to the issues of gradient vanishing/explosion and high time complexity due to its recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data. The graph-based structure enables parallel computation in time dimension and can learn longer temporal dependency in long unaligned sequences. Specifically, our Multimodal Graph is hierarchically structured to cater to two stages, i.e., intra- and inter-modal dynamics learning. For the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, given that the multimodal sequences are unaligned, the commonly considered word-level fusion does not pertain. To this end, we devise a graph pooling fusion network to automatically learn the associations between various nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results suggest that our graph-based model reaches state-of-the-art performance on two benchmark datasets.

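Motivation and Objectives

Address multimodal sequence analysis in real-world scenarios where the visual, language, and acoustic sequences are unaligned.
Overcome the limitations of RNNs in modeling long-range temporal dependencies, such as vanishing gradients and high time complexity.
Develop a graph-based framework that needs no word-level alignment, supports parallel computation, and enables effective cross-modal fusion.
Investigate how effectively graph convolution and graph pooling model sequential data across multiple modalities.
Compare GCN architectures and graph pooling strategies to find the best-performing configuration for unaligned multimodal learning.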

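Proposed Method

Build a unimodal graph for each of the text, vision, and audio modalities, treating each time step as a node, and define the adjacency matrix with both non-parametric and learnable methods.
Apply GraphSAGE with mean pooling as the base GCN to learn intra-modal dynamics across time steps, which enables modeling of long-range dependencies.
Design a graph pooling fusion network (GPFN) that learns inter-modal associations by dynamically aligning nodes across modalities, with no word-level alignment required.
Adopt several adjacency matrix construction strategies, including a learnable attention-based method; the learnable variants outperformed the non-parametric alternatives.
Use graph pooling techniques such as max/mean pooling and link similarity pooling, which outperformed DiffPool on most metrics in ablation experiments.
Integrate unimodal and inter-modal graph learning into a hierarchical framework that models intra-modal and inter-modal dynamics jointly.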
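
The two stages described above (per-modality graph convolution, then pooling-based fusion) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the window radius, the feature sizes, and the random weights are all hypothetical; the attention adjacency is only one plausible reading of "learnable attention-based"; and `pool_and_fuse` stands in for the GPFN with a plain mean/max readout over the stacked nodes.

```python
import numpy as np


def window_adjacency(num_steps, radius):
    """Non-parametric adjacency: link time steps at most `radius` apart (no self-loops)."""
    idx = np.arange(num_steps)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= radius)).astype(float)


def attention_adjacency(x, w_query, w_key):
    """Learnable adjacency (assumed form): row-wise softmax of scaled dot-product scores."""
    q, k = x @ w_query, x @ w_key
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def graphsage_mean_layer(x, adj, w_self, w_neigh):
    """GraphSAGE-style update: mix each node with the weighted mean of its neighbours."""
    degree = np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)  # guard isolated nodes
    neigh_mean = (adj @ x) / degree
    return np.maximum(0.0, x @ w_self + neigh_mean @ w_neigh)  # ReLU


def pool_and_fuse(node_sets):
    """Simplified fusion: stack all modalities' nodes, concatenate mean and max readouts."""
    nodes = np.concatenate(node_sets, axis=0)
    return np.concatenate([nodes.mean(axis=0), nodes.max(axis=0)])


rng = np.random.default_rng(0)
hidden = 8
# Unaligned input: every modality has its own length and feature size (made-up values).
shapes = {"text": (5, 12), "audio": (9, 6), "vision": (7, 10)}

encoded = []
for name, (steps, feat) in shapes.items():
    x = rng.normal(size=(steps, feat))
    if name == "text":
        # Learnable variant, shown for one modality only.
        adj = attention_adjacency(x, rng.normal(size=(feat, 4)), rng.normal(size=(feat, 4)))
    else:
        adj = window_adjacency(steps, radius=2)
    w_self = 0.1 * rng.normal(size=(feat, hidden))
    w_neigh = 0.1 * rng.normal(size=(feat, hidden))
    encoded.append(graphsage_mean_layer(x, adj, w_self, w_neigh))

fused = pool_and_fuse(encoded)
print(fused.shape)  # (16,)
```

Because each modality builds a graph over its own time steps, the three sequences can have different lengths (5, 9, and 7 here) and still produce one fixed-size vector, which is the property that removes the need for word-level alignment. Every matrix product also covers all time steps at once, in contrast to the step-by-step updates of an RNN.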

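Experimental Results

Research Questions

RQ1: Can graph neural networks effectively model unaligned multimodal sequences without relying on recurrent structures?
RQ2: In what ways does graph convolution improve on RNNs and TCNs at learning long-range temporal dependencies in unaligned sequences?
RQ3: How do different adjacency matrix construction methods affect model performance on sequential data?
RQ4: Does graph pooling fusion outperform word-level fusion at capturing complex, long-range cross-modal interactions?
RQ5: How does the proposed Multimodal Graph compare with state-of-the-art models such as MulT and TFN in performance and efficiency?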

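Key Findings

Multimodal Graph achieved state-of-the-art performance on both CMU-MOSI and CMU-MOSEI, outperforming MulT on every metric except 7-class accuracy on CMU-MOSI.
In the ablation study, the GraphSAGE-based model reached 81.4% accuracy on CMU-MOSEI, ahead of GAT (80.3%) and GIN (81.1%).
The GraphSAGE-based GPFN performed best, with an F1 score of 81.7% and a correlation of 0.675 on CMU-MOSEI, outperforming DiffPool on every metric except 7-class accuracy.
On CMU-MOSEI the model uses only 1,225,400 parameters, 64.46% of MulT's parameter count, demonstrating its parameter efficiency.
Learnable adjacency matrices captured the dynamic temporal relations within unaligned sequences more effectively than non-parametric methods.
The graph-based approach achieved higher performance and lower computational complexity than RNNs and Transformers, supporting GCNs as a viable alternative for sequential data modeling.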
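
The parameter-efficiency figure above can be turned into absolute terms with a short calculation. The snippet below only rearranges the two reported numbers (1,225,400 parameters, 64.46% of MulT); the MulT size it prints is implied by that ratio and is an approximation, not a value quoted in this review. The variable names are illustrative.

```python
multimodal_graph_params = 1_225_400  # reported for CMU-MOSEI
ratio_to_mult = 0.6446               # reported: 64.46% of MulT's parameter count

# Implied baseline size and the resulting saving (both approximate).
implied_mult_params = multimodal_graph_params / ratio_to_mult
saved_params = implied_mult_params - multimodal_graph_params

print(f"implied MulT size: ~{implied_mult_params / 1e6:.2f}M parameters")
print(f"saving: ~{saved_params / 1e6:.2f}M parameters ({1 - ratio_to_mult:.2%} fewer)")
```

Under these figures the implied MulT model has roughly 1.90M parameters, so the saving is on the order of 0.68M parameters.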

This review was generated by AI and checked by a human editor.