QUICK REVIEW

[Paper Review] Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities

Hai Pham, Paul Pu Liang|arXiv (Cornell University)|Dec 19, 2018

Sentiment Analysis and Opinion Mining | References: 60 | Citations: 38

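Summary in Brief

This paper proposes the Multimodal Cyclic Translation Network (MCTN), a method that learns robust joint multimodal representations through cyclic sequence-to-sequence translation between the language, visual, and acoustic modalities. Because the model is trained on paired multimodal data with a cycle consistency constraint, sentiment prediction at test time requires only the source modality. MCTN achieves state-of-the-art results on the CMU-MOSI, ICT-MMMO, and YouTube datasets and remains robust to missing or noisy modalities.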

ABSTRACT

Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed from the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input and as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence to sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust from perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.

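Motivation and Objectives

Learn robust joint representations for multimodal sentiment analysis that remain effective when input modalities are noisy or missing at test time.
Overcome the limitation of prior methods, which require all modalities at inference time, and obtain a model that is less sensitive to perturbations in the data.
Apply insights from sequence-to-sequence models, which have been successful in machine translation, to learn joint representations through cross-modal translation.
Enforce cycle consistency on the translations so that the joint representation retains maximal information from all modalities.
Train end-to-end with a coupled loss that combines the cyclic translation loss with the sentiment prediction loss, improving task-specific discriminative power while preserving robustness.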

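Proposed Method

The paper proposes the Multimodal Cyclic Translation Network (MCTN), which learns joint representations through bidirectional sequence-to-sequence translation between a source modality and a target modality.
Training a forward translation (source → target) and a backward translation (predicted target → source) enforces cycle consistency, which encourages symmetry and information retention.
A single shared Seq2Seq architecture is used for both the forward and backward translations, which reduces overfitting and encourages a unified joint representation.
A hierarchical MCTN variant handles three modalities with two levels of translation: it first translates between the source modality and one target modality, then translates from the resulting intermediate representation into the second target modality.
The model is trained end-to-end with a coupled loss that combines the cyclic translation loss and the sentiment prediction loss, so that the learned representation is discriminative for the task.
After training, inference uses only the source modality, so the model stays robust when the target modalities are missing or perturbed at test time.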
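
The training objective and the source-only inference path described above can be illustrated with a toy sketch. This is not the authors' implementation: plain linear maps stand in for the Seq2Seq encoder and decoder, both modalities are assumed to be padded to the same feature dimension so that one shared encoder/decoder pair can serve both translation directions, and the class name `ToyMCTN`, the weights `lambda_t` and `lambda_c`, and all numeric values are hypothetical.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by gradient descent."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMCTN:
    """Toy stand-in for a bimodal cyclic translator (linear maps, not Seq2Seq)."""

    def __init__(self, dim, joint_dim, seed=0):
        rng = random.Random(seed)
        # One shared encoder/decoder pair is reused for both directions
        # (assumption: source and target are padded to the same dimension).
        self.enc = init_matrix(joint_dim, dim, rng)
        self.dec = init_matrix(dim, joint_dim, rng)
        self.head = init_matrix(1, joint_dim, rng)  # sentiment regressor

    def joint(self, x_src):
        """Joint representation, computed from the source modality only."""
        return linear(self.enc, x_src)

    def predict(self, x_src):
        """Test-time prediction: the target modality is never read."""
        return linear(self.head, self.joint(x_src))[0]

    def coupled_loss(self, x_src, x_tgt, y, lambda_t=0.1, lambda_c=0.1):
        z = self.joint(x_src)
        tgt_hat = linear(self.dec, z)                          # forward: source -> target
        src_hat = linear(self.dec, linear(self.enc, tgt_hat))  # backward: predicted target -> source
        loss_t = mse(tgt_hat, x_tgt)                  # translation loss
        loss_c = mse(src_hat, x_src)                  # cycle consistency loss
        loss_p = (linear(self.head, z)[0] - y) ** 2   # sentiment prediction loss
        return lambda_t * loss_t + lambda_c * loss_c + loss_p


model = ToyMCTN(dim=4, joint_dim=3)
language = [0.2, -0.1, 0.4, 0.0]           # hypothetical source features
visual = [0.5, 0.3, -0.2, 0.1]             # hypothetical target features
noisy_visual = [v + 1.0 for v in visual]   # perturbed target modality

loss = model.coupled_loss(language, visual, y=1.0)
loss_noisy = model.coupled_loss(language, noisy_visual, y=1.0)
prediction = model.predict(language)
```

In this sketch the target modality enters only through the translation term of the training loss, so perturbing it changes `loss` but cannot change `prediction`, which mirrors the robustness argument made in the review. How the three terms are actually weighted and optimized is left to the paper.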

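Experimental Results

Research Questions

RQ1: How does cyclic translation between modalities improve the robustness and discriminative quality of joint multimodal representations?
RQ2: Within the cyclic framework, how does using a shared Seq2Seq model for the forward and backward translations, rather than separate models, affect performance?
RQ3: How does the choice of source and target modalities affect the performance of joint representation learning?
RQ4: In the trimodal setting, what is the advantage of two-level (hierarchical) translation over direct translation?
RQ5: To what extent does increasing the number of input modalities during training improve the discriminative power of the learned joint representations?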

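Key Findings

Models that use cyclic translation (MCTN with cycle consistency) outperform all baselines in both the bimodal and trimodal settings, with a particularly notable margin in the trimodal case.
The hierarchical MCTN, which performs two levels of cyclic translation (Figure 4(e)), outperforms direct translation from concatenated modalities (Figure 4(h)), showing the benefit of learning representations hierarchically.
Using a shared Seq2Seq model for the forward and backward translations yields higher performance than using two separate models, presumably because of reduced overfitting and better parameter sharing.
The language modality consistently contributes most to the joint representation: models that use language as the source modality perform best, particularly when paired with the visual modality.
As the number of input modalities used during training increases, the model learns more discriminative joint representations while remaining robust to modalities that are missing or perturbed at test time.
MCTN achieves new state-of-the-art results on the CMU-MOSI, ICT-MMMO, and YouTube multimodal sentiment analysis datasets, supporting the effectiveness of the proposed framework.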

This review was generated by AI and checked by a human editor.