QUICK REVIEW

[Paper Review] SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization

Hao Dong, Ismail Nejjar|arXiv (Cornell University)|Oct 30, 2023

Multimodal Machine Learning Applications | 7 citations

TL;DR

SimMMDG introduces modality-specific and modality-shared feature splitting with supervised contrastive learning and a cross-modal translation module to improve multi-modal domain generalization and missing-modality robustness. It achieves strong results on EPIC-Kitchens and the HAC dataset.

ABSTRACT

In real-world scenarios, achieving domain generalization (DG) presents significant challenges as models are required to generalize to unknown target distributions. Generalizing to unseen multi-modal distributions poses even greater difficulties due to the distinct properties exhibited by different modalities. To overcome the challenges of achieving domain generalization in multi-modal scenarios, we propose SimMMDG, a simple yet effective multi-modal DG framework. We argue that mapping features from different modalities into the same embedding space impedes model generalization. To address this, we propose splitting the features within each modality into modality-specific and modality-shared components. We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties and impose distance constraints on modality-specific features to promote diversity. In addition, we introduce a cross-modal translation module to regularize the learned features, which can also be used for missing-modality generalization. We demonstrate that our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon (HAC) dataset introduced in this paper. Our source code and HAC dataset are available at https://github.com/donghao51/SimMMDG.

Motivation & Objective

Achieve robust generalization to unseen multi-modal target distributions.
Prevent loss of modality-specific information by avoiding naïve feature alignment across modalities.
Promote cross-modal sharing of label-consistent information while preserving modality diversity.
Provide a cross-modal translation mechanism to handle missing modalities during testing.
Introduce a novel HAC dataset to benchmark multi-modal DG.

Proposed method

Split each modality embedding into modality-specific and modality-shared components.
Apply supervised contrastive learning (L_con) to the modality-shared features so that instances with the same label cluster together across modalities.
Impose a distance-based loss (L_dis) that pushes the modality-specific features away from the modality-shared features within each modality.
Introduce a cross-modal translation module (MLP) to translate embeddings across modalities and regularize features (L_trans).
Combine losses into a final objective: L = L_cls + alpha_con L_con + alpha_dis L_dis + alpha_trans L_trans.
During missing-modality testing, predict missing embeddings via translation (E_i_t) and substitute them for robust predictions.
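
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the half-and-half split, the negative squared Euclidean distance used for L_dis, the temperature value, and all function names are assumptions made for the example. The contrastive term follows the standard supervised contrastive formulation.

```python
import math


def split_features(embedding):
    """Split one modality's embedding into (shared, specific) halves.

    Assumption: the first half is treated as modality-shared and the
    second half as modality-specific.
    """
    half = len(embedding) // 2
    return embedding[:half], embedding[half:]


def distance_loss(shared, specific):
    """Assumed form of L_dis: negative squared Euclidean distance.

    Minimizing this value pushes the two parts further apart.
    """
    return -sum((s - p) ** 2 for s, p in zip(shared, specific))


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sup_con_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over modality-shared features.

    `features` pools the shared parts from all modalities; entries with
    the same label act as positives regardless of their modality.
    """
    feats = [_normalize(f) for f in features]
    total, anchors = 0.0, 0
    for i, anchor in enumerate(feats):
        positives = [j for j in range(len(feats))
                     if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: sum(a * b for a, b in zip(anchor, feats[j])) / temperature
                  for j in range(len(feats)) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def total_loss(l_cls, l_con, l_dis, l_trans,
               alpha_con=1.0, alpha_dis=1.0, alpha_trans=1.0):
    """L = L_cls + alpha_con L_con + alpha_dis L_dis + alpha_trans L_trans."""
    return l_cls + alpha_con * l_con + alpha_dis * l_dis + alpha_trans * l_trans
```

In this sketch the shared halves from every modality would be pooled and passed to sup_con_loss together with their labels, while distance_loss would be evaluated separately for each modality. The default weights of 1.0 are placeholders, not values reported in the paper.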

Figure 1: (a) Different modalities possess shared information, while simultaneously containing unique information exclusive to each modality. Inspired by this, we propose to split the feature of each modality into modality-specific and modality-shared parts in our framework. (b) Our new multi-modal

Experimental results

Research questions

RQ1: How can multi-modal DG be improved without collapsing modalities into a single shared embedding space?
RQ2: Can modality-specific information be preserved while leveraging shared cross-modal information for DG?
RQ3: Does a cross-modal translation mechanism improve robustness to missing modalities?
RQ4: How does the approach generalize across standard multi-modal DG benchmarks and a new HAC dataset?

Key findings

SimMMDG consistently outperforms baselines on EPIC-Kitchens with improvements up to 9.58% when using all three modalities.
With SlowFast and ResNet-18 backbones, SimMMDG yields average improvements of up to 5.73% over baselines.
On the HAC dataset, SimMMDG improves results by up to 7.73% over baselines.
In multi-modal single-source DG, SimMMDG achieves up to 5.71% average improvement over competing methods.
For missing modalities, replacing zeros with cross-modal translation embeddings yields up to 10.47% accuracy gains compared to zero-filling, and often beats unimodal models.
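
The difference between zero-filling and translation-based filling at test time can be illustrated as follows. This is a hedged sketch under stated assumptions: the single linear layer stands in for the translation MLP, averaging the translations from all available modalities is a choice made for the example, and the names translate and fill_missing are hypothetical.

```python
def translate(embedding, weights):
    """Stand-in for the translation MLP: a single linear layer.

    `weights` is a list of rows, one row per output dimension.
    """
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def fill_missing(embeddings, translators, dims, mode="translate"):
    """Return a copy of `embeddings` with missing modalities filled in.

    embeddings:  dict modality -> list of floats, or None if missing
    translators: dict (source, target) -> weight matrix (hypothetical)
    dims:        dict modality -> embedding size, used for zero-filling
    mode:        "zero" for the zero-filling baseline, otherwise translate
    """
    present = {m: e for m, e in embeddings.items() if e is not None}
    filled = dict(present)
    for name, emb in embeddings.items():
        if emb is not None:
            continue
        if mode == "zero":
            filled[name] = [0.0] * dims[name]
        else:
            # Assumption: average the predictions from every available modality.
            candidates = [translate(e, translators[(src, name)])
                          for src, e in present.items()]
            filled[name] = [sum(vals) / len(candidates)
                            for vals in zip(*candidates)]
    return filled


# Usage: audio is missing at test time and is predicted from video.
embeddings = {"video": [1.0, 2.0], "audio": None}
translators = {("video", "audio"): [[0.0, 1.0], [1.0, 0.0]]}
dims = {"video": 2, "audio": 2}
translated = fill_missing(embeddings, translators, dims)
zeroed = fill_missing(embeddings, translators, dims, mode="zero")
```

The filled embeddings would then be passed to the classifier exactly as if every modality had been observed.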

Figure 2: Overview of SimMMDG. We split the features of each modality into modality-specific and modality-shared parts. For the modality-shared part, we use supervised contrastive learning to map the features with the same label to be as close as possible. For modality-specific features, we use a d

This review was created by AI and reviewed by human editors.