QUICK REVIEW

[Paper Digest] Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

Chunlei Meng, Ziyang Zhou | arXiv (Cornell University) | Jan 20, 2026

Emotion and Mood Recognition | Cited by: 0

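One-Sentence Summary

TSDA decouples each modality's temporal dynamics and spatial structure before any cross-modal interaction, aligns the modalities factor by factor, and adaptively recouples them, achieving state-of-the-art results on benchmark multimodal sentiment datasets.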

ABSTRACT

Multimodal Sentiment Analysis (MSA) integrates linguistic, visual, and acoustic modalities. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial bodies. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.

Research Motivation and Objectives

Argue that spatiotemporal heterogeneity causes information asymmetry and brittle predictions in MSA models.
Propose a two-branch modality disentanglement (temporal and spatial) prior to cross-modal interaction.
Develop Factor-Consistent Cross-Modal Alignment to align like factors across modalities.
Introduce a Gated Recouple module to adaptively fuse temporal and spatial summaries per instance.
Regularize to prevent cross-factor leakage while preserving complementarity.
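
As a rough illustration of the kind of regularizer the last objective refers to, the sketch below penalizes the squared cosine similarity between a temporal summary and a spatial summary of the same instance. This is a generic decorrelation penalty, not the paper's exact loss: the function names `cosine` and `decorrelation_penalty`, the pairing of one temporal with one spatial summary per instance, and the use of plain Python lists are all assumptions made for illustration. The digest's method section also lists an HSIC term, which is not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)  # small epsilon guards against zero vectors

def decorrelation_penalty(temporal_summaries, spatial_summaries):
    """Mean squared cosine similarity over paired (temporal, spatial) summaries.

    Hypothetical helper: the penalty is 0 when every pair is orthogonal and
    approaches 1 when every pair points in the same (or opposite) direction.
    """
    pairs = list(zip(temporal_summaries, spatial_summaries))
    return sum(cosine(t, s) ** 2 for t, s in pairs) / len(pairs)

# Orthogonal summaries incur no penalty; parallel ones are penalized heavily.
low = decorrelation_penalty([[1.0, 0.0]], [[0.0, 1.0]])
high = decorrelation_penalty([[1.0, 2.0]], [[2.0, 4.0]])
```

Squaring the cosine treats positively and negatively aligned summaries alike, so the penalty pushes the two factors toward orthogonality rather than toward opposition.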

Proposed Method

For each modality, pass the input through a temporal encoder that yields a temporal token sequence and a spatial encoder that yields a time-invariant structural set.
Apply Factor-Consistent Cross-Modal Alignment with block-diagonal masked attention to align temporal tokens across modalities and spatial tokens across modalities.
Impose token-level factor purity (discriminator-based) and summary-level decorrelation (cosine and HSIC) to suppress cross-factor leakage.
Recouple aligned temporal and spatial summaries via a gated mechanism that depends on disagreement and factor confidences, plus an orthogonality regularizer.
Train with a task loss plus purity, decorrelation, and orthogonality losses to enforce factor separation and stable fusion.
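
The block-diagonal masking in the alignment step can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: tokens from all modalities are assumed to be concatenated into one list, each tagged with a hypothetical factor label ("T" for temporal, "S" for spatial), and a token may attend only to tokens with the same label. Whether a token may also attend to same-factor tokens of its own modality is not stated in this digest; the sketch allows it. The names `factor_mask` and `masked_attention` are invented here, and learned query/key/value projections are omitted.

```python
import math

def factor_mask(factors):
    """allowed[i][j] is True iff tokens i and j carry the same factor label.

    When tokens are ordered by factor, this matrix is block-diagonal.
    """
    return [[fi == fj for fj in factors] for fi in factors]

def masked_attention(queries, keys, values, allowed):
    """Scaled dot-product attention restricted to the allowed pairs."""
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Only permitted keys are scored; masked pairs get zero weight.
        idx = [j for j in range(len(keys)) if allowed[i][j]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim) for j in idx
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, idx):
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs

# Two temporal tokens followed by two spatial tokens (toy 2-d features).
factors = ["T", "T", "S", "S"]
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.9]]
# Values mark the factor: temporal -> [1, 0], spatial -> [0, 1].
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = masked_attention(tokens, tokens, values, factor_mask(factors))
```

In the toy example the values encode the factor of each token, so the output for a temporal query contains no spatial component, which is the property the mask is meant to guarantee.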

Experimental Results

Research Questions

RQ1: Can explicit temporal and spatial disentanglement before interaction reduce spatiotemporal information asymmetry in multimodal sentiment analysis?
RQ2: Does factor-consistent alignment improve cross-modal fusion by preventing cross-factor interference and static dominance?
RQ3: Can instance-wise gated recoupling adaptively fuse temporal and spatial cues to improve robustness under aligned and unaligned conditions?
RQ4: What is the impact of purity, decorrelation, and orthogonality regularizers on model performance and stability?

Key Findings

TSDA achieves best performance on CMU-MOSI and CMU-MOSEI under both aligned and unaligned settings.
On CMU-MOSI, TSDA reduces MAE to 0.695 (aligned) and 0.680 (unaligned) and improves ACC7/ACC2/F1 by about 1 percentage point.
On CMU-MOSEI, TSDA achieves MAE 0.529 (aligned) and 0.527 (unaligned) with the highest accuracy and F1 scores.
Ablations show that removing the temporal components or the disentanglement harms performance more than removing any single modality, and that Factor-Consistent Cross-Modal Alignment (FCCA) is essential for preventing cross-factor interference.
The gated recouple module enhances performance by adaptively weighting factors based on reliability signals, outperforming simple fusion baselines.
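
To make the contrast between gated recoupling and simple fusion concrete, here is a toy gate that depends on disagreement and factor confidences. It is a hypothetical formulation, not the paper's: the digest only says the gate depends on disagreement between the two summaries and on per-factor confidences, so the specific logit, the `sharpness` parameter, and the function names are assumptions. In this sketch the gate reduces to a plain average when the summaries agree and leans toward the more confident factor as they diverge.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def gated_recouple(temporal, spatial, conf_t, conf_s, sharpness=4.0):
    """Fuse two summaries with a per-instance scalar gate (illustrative only).

    disagreement = 1 - cosine(temporal, spatial), in [0, 2].
    The gate is sigmoid(sharpness * disagreement * (conf_t - conf_s)), so it
    is 0.5 under full agreement and moves toward the more confident factor
    as disagreement grows.
    """
    disagreement = 1.0 - cosine(temporal, spatial)
    logit = sharpness * disagreement * (conf_t - conf_s)
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * t + (1.0 - gate) * s for t, s in zip(temporal, spatial)]
    return fused, gate

# Conflicting summaries with a more confident temporal factor:
# the gate exceeds 0.5, so the fused vector leans toward the temporal summary.
fused, gate = gated_recouple([1.0, 0.0], [0.0, 1.0], conf_t=0.9, conf_s=0.1)
```

A fixed average corresponds to holding the gate at 0.5 for every instance; the point of a learned or signal-driven gate is that the weighting can change per example.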

This digest was generated by AI and reviewed by a human editor.