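[Paper Review] DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
DLF proposes a disentangled multimodal framework centered on strengthening the language modality. It mitigates redundancy among language, visual, and audio features, achieves strong performance on MOSI and MOSEI, and validates its components through ablation studies.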
Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions. Our code is available at https://github.com/pwang322/DLF.
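Motivation and Objectives
- Reduce redundancy and conflict across modalities, and improve MSA by treating language as the dominant modality.
- Develop a disentangled representation learning framework that separates modality-shared information from modality-specific information.
- Strengthen language representations through a language-focused attractor that draws on complementary information from the other modalities.
- Fuse the enhanced features and apply hierarchical predictions to improve overall sentiment prediction performance.
Proposed Method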
- Feature extraction using unimodal encoders (language via BERT-base-uncased; vision via Facet; audio via COVAREP).
- Disentangle multimodal features into modality-shared and modality-specific spaces using a shared encoder and three modality-specific encoders.
- Regularize disentanglement with four geometric measures based on Euclidean distance and cosine similarity, plus reconstruction and triplet-based losses and a soft orthogonality loss.
- Introduce a Language-Focused Attractor (LFA) that uses language-centric multimodal cross-attention to pull complementary information from the other modalities (V and A) into language-specific features.
- Fuse enhanced shared and modality-specific features through a multimodal fusion layer, followed by hierarchical predictions (shared, specific, and final).
- Optimize with an objective combining decoupling loss and total MSA loss for end-to-end training.
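The language-guided cross-attention step in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `language_guided_cross_attention`, the feature sizes, the random projection matrices, the sharing of projections between vision and audio, and the residual sum at the end are all assumptions made for illustration. The only idea taken from the review is that language features act as queries while vision and audio features supply keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_cross_attention(lang, other, w_q, w_k, w_v):
    """Scaled dot-product cross-attention with language as the query.

    lang:  (T_lang, d) language-specific features (queries)
    other: (T_other, d) vision- or audio-specific features (keys and values)
    Returns the attended features (T_lang, d) and the attention weights.
    """
    q = lang @ w_q
    k = other @ w_k
    v = other @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each language step attends over `other`
    return weights @ v, weights


# Toy inputs; all shapes and values are hypothetical.
rng = np.random.default_rng(0)
d = 8
lang = rng.normal(size=(5, d))
vision = rng.normal(size=(7, d))
audio = rng.normal(size=(6, d))

# Assumption: one set of projections shared by both modalities, for brevity.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

from_vision, attn_v = language_guided_cross_attention(lang, vision, w_q, w_k, w_v)
from_audio, attn_a = language_guided_cross_attention(lang, audio, w_q, w_k, w_v)

# Assumption: a simple residual sum folds the attended information into language.
enhanced_lang = lang + from_vision + from_audio
```

Because only the language sequence supplies queries, information flows one way, from vision and audio into language. This matches the review's point that DLF avoids mutual transfer between every modality pair.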
Experimental Results
Research Questions
- RQ1: Can disentangling shared and modality-specific representations reduce redundancy and improve MSA performance?
- RQ2: Does focusing information transfer toward the dominant language modality via LFA improve sentiment prediction accuracy?
- RQ3: Do hierarchical predictions leveraging both pre-fused and post-fused features boost results over single-output baselines?
- RQ4: How do the proposed regularization terms influence the quality of disentanglement and model performance?
- RQ5: What is the contribution of each component (FDM, LFA, HP) to overall performance?
Key Results
- DLF achieves superior performance on MOSI and MOSEI against multiple baselines and shows notable gains from the Language-Focused Attractor.
- Ablation studies confirm the effectiveness of the feature disentanglement module, LFA, and hierarchical predictions in improving accuracy.
- Removing LFA or FDM substantially degrades performance, validating their roles in reducing redundancy and enhancing modality-specific language features.
- Regularization terms (Lr, Ls, Lm, Lo) collectively contribute to robust disentanglement and prediction quality.
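To give a feel for what a soft orthogonality regularizer does, here is a small NumPy sketch. The review does not state the exact formula DLF uses, so the form below is an assumption: the mean squared cosine similarity between each sample's shared and modality-specific feature vectors. The function name `soft_orthogonality_loss` and the toy inputs are hypothetical.

```python
import numpy as np


def soft_orthogonality_loss(shared, specific, eps=1e-8):
    """Mean squared cosine similarity between paired rows.

    shared, specific: (batch, d) arrays. The loss is near 0 when every shared
    vector is orthogonal to its specific counterpart and near 1 when the two
    point in the same (or opposite) direction.
    """
    s = shared / (np.linalg.norm(shared, axis=1, keepdims=True) + eps)
    p = specific / (np.linalg.norm(specific, axis=1, keepdims=True) + eps)
    cos = np.sum(s * p, axis=1)
    return float(np.mean(cos ** 2))


# Shared features live in the first two dimensions, specific ones in the last two.
shared = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0]])
specific = np.array([[0.0, 0.0, 3.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

loss_disentangled = soft_orthogonality_loss(shared, specific)  # close to 0
loss_redundant = soft_orthogonality_loss(shared, shared)       # close to 1
```

The penalty is "soft" in the sense that it only discourages overlap between the two feature spaces rather than enforcing exact orthogonality as a hard constraint.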
This review was generated by AI and reviewed by a human editor.