[Paper Review] GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
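GeminiFusion introduces a pixel-wise multimodal fusion module with linear complexity. By applying cross-modal attention at corresponding pixel positions, it tightly integrates aligned modalities (such as RGB, depth, LiDAR, and event data) and outperforms exchange-based and full cross-attention methods on segmentation, image-to-image translation, and 3D detection tasks.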
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods, which replace less informative tokens with inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation tasks, covering RGB, depth, LiDAR, event data, etc., demonstrate the superior performance of our GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion
Research Motivation and Objectives
- Motivate and analyze limitations of existing multimodal fusion methods (interaction-based and exchange-based) in vision transformers.
- Propose GeminiFusion, a pixel-wise fusion module with linear complexity that preserves unimodal information while enabling cross-modal interaction.
- Demonstrate GeminiFusion's effectiveness across multimodal segmentation, image-to-image translation, and 3D object detection tasks.
Proposed Method
- Critique exchange-based token pruning and full cross-attention; show trade-offs in information preservation and efficiency.
- Introduce GeminiFusion: pixel-wise fusion that combines the tokens X1[i] and X2[i] at each corresponding spatial position, using cross-attention restricted to that position while preserving each modality's own (unimodal) features.
- Use a relation discriminator and layer-adaptive noise to stabilize cross-modal attention and balance self- and cross-modal cues.
- Achieve linear complexity with respect to input tokens, reducing FLOPs dramatically compared to full attention (from ~17G to ~0.14G per fusion step).
- Integrate GeminiFusion into a SegFormer-like encoder-decoder with shared parameters across modalities (RGB, depth, event, LiDAR) and an MLP-based decoder for segmentation.
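The pixel-wise fusion step listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' PyTorch implementation: the function name `pixelwise_fuse`, the projection matrices `Wq`, `Wk`, `Wv`, and the vectors `noise_k` and `noise_v` (standing in for the layer-adaptive noise) are all hypothetical, and the relation discriminator is omitted for brevity. Each token of modality 1 attends to only two candidates, its own noise-perturbed token and the modality-2 token at the same position, so the work grows linearly with the number of tokens.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_fuse(x1, x2, Wq, Wk, Wv, noise_k, noise_v):
    """Fuse modality-1 tokens with spatially aligned modality-2 tokens.

    x1, x2: lists of N token vectors; x1[i] and x2[i] describe the same pixel.
    noise_k, noise_v: assumed per-layer vectors added to the self key/value.
    Wv must be square so the residual connection is well defined.
    """
    fused_tokens = []
    for t1, t2 in zip(x1, x2):  # one independent step per pixel: O(N) overall
        q = matvec(Wq, t1)
        # Two candidates per pixel: the token itself (plus noise) and its counterpart.
        keys = [matvec(Wk, [a + n for a, n in zip(t1, noise_k)]), matvec(Wk, t2)]
        values = [matvec(Wv, [a + n for a, n in zip(t1, noise_v)]), matvec(Wv, t2)]
        scale = math.sqrt(len(q))
        weights = softmax([dot(q, k) / scale for k in keys])
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        # Residual connection keeps the unimodal signal in the output.
        fused_tokens.append([a + b for a, b in zip(t1, attended)])
    return fused_tokens


# Toy usage: two aligned pixels, two channels, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
rgb = [[1.0, 2.0], [0.5, -1.0]]
depth = [[0.0, 1.0], [2.0, 0.0]]
fused = pixelwise_fuse(rgb, depth, identity, identity, identity, [0.1, 0.1], [0.0, 0.0])
```

Because the loop never looks at a token from another position, changing `depth[1]` cannot affect `fused[0]`; that locality is what separates this scheme from full cross-attention, where every token attends to every token of the other modality.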

Experimental Results
Research Questions
- RQ1: Can pixel-wise, spatially aligned fusion outperform prune-then-substitute exchange methods and full cross-attention in multimodal vision transformers?
- RQ2: What is the achievable trade-off between fusion accuracy and computational efficiency when using GeminiFusion in segmentation, translation, and 3D detection tasks?
- RQ3: How do the proposed relation discriminator and layer-adaptive noise affect cross-modal interaction and learning dynamics?
- RQ4: To what extent can unimodal pre-training be leveraged in a multimodal GeminiFusion framework without performance loss?
Key Findings
| Dataset | Method | Backbone | Input | Pixel Acc. (%) | mAcc (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| NYUDv2 | TokenFusion | MiT-B3 | RGB+D | 79.0 | 66.9 | 54.2 |
| NYUDv2 | GeminiFusion | MiT-B3 | RGB+D | 79.9 (+0.9) | 69.9 (+3.0) | 56.8 (+2.6) |
| NYUDv2 | TokenFusion | MiT-B5 | RGB+D | 79.1 | 67.5 | 55.1 |
| NYUDv2 | GeminiFusion | MiT-B5 | RGB+D | 80.3 (+1.2) | 70.4 (+2.9) | 57.7 (+2.6) |
| SUN RGB-D | TokenFusion | MiT-B3 | RGB+D | 82.8 | 63.6 | 51.4 |
| SUN RGB-D | GeminiFusion | MiT-B3 | RGB+D | 83.3 (+0.5) | 64.6 (+1.0) | 52.7 (+1.3) |
| SUN RGB-D | TokenFusion | MiT-B5 | RGB+D | 83.1 | 63.9 | 51.8 |
| SUN RGB-D | GeminiFusion | MiT-B5 | RGB+D | 83.8 (+0.7) | 65.3 (+1.4) | 53.3 (+1.5) |
| DeLiVER | TokenFusion | MiT-B2 | RGB+D | - | - | 63.7 |
| DeLiVER | GeminiFusion | MiT-B2 | RGB+D | - | - | 66.4 (+2.7) |
| DeLiVER | TokenFusion | MiT-B2 | RGB+E | - | - | 55.7 |
| DeLiVER | GeminiFusion | MiT-B2 | RGB+E | - | - | 58.5 (+2.8) |
| DeLiVER | TokenFusion | MiT-B2 | RGB+L | - | - | 55.5 |
| DeLiVER | GeminiFusion | MiT-B2 | RGB+L | - | - | 58.6 (+3.1) |
| DeLiVER | TokenFusion | MiT-B2 | RGB+D+E+L | - | - | 63.5 |
| DeLiVER | GeminiFusion | MiT-B2 | RGB+D+E+L | - | - | 66.9 (+3.4) |

Values in parentheses are gains over TokenFusion with the same backbone and inputs. D = depth, E = event, L = LiDAR.
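For readers unfamiliar with the three segmentation metrics in the table, the sketch below shows how pixel accuracy, mean class accuracy (mAcc), and mean intersection-over-union (mIoU) are commonly derived from a confusion matrix. It assumes the standard definitions; the review does not state the exact evaluation protocol used in the paper, and the function name `segmentation_metrics` and the choice to skip classes absent from the ground truth are illustrative assumptions.

```python
def segmentation_metrics(conf):
    """Compute (pixel accuracy, mAcc, mIoU) from a confusion matrix.

    conf[g][p] = number of pixels whose ground-truth class is g and whose
    predicted class is p. Classes with no ground-truth pixels are skipped,
    which is one common convention (an assumption here).
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n))
    pixel_acc = correct / total

    class_accs, class_ious = [], []
    for c in range(n):
        gt = sum(conf[c])                         # pixels labeled c
        pred = sum(conf[g][c] for g in range(n))  # pixels predicted as c
        if gt == 0:
            continue
        tp = conf[c][c]
        class_accs.append(tp / gt)
        class_ious.append(tp / (gt + pred - tp))  # intersection over union

    m_acc = sum(class_accs) / len(class_accs)
    m_iou = sum(class_ious) / len(class_ious)
    return pixel_acc, m_acc, m_iou


# Toy two-class example.
pixel_acc, m_acc, m_iou = segmentation_metrics([[3, 1], [1, 5]])
```

mIoU penalizes both missed pixels and false positives for every class equally, so it is usually the strictest of the three numbers and the one on which fusion methods are most often compared.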
- GeminiFusion consistently improves over TokenFusion on NYUDv2, SUN RGB-D, and DeLiVER in multimodal semantic segmentation, with mIoU gains of up to 3.4 points (four-modality RGB+D+E+L input).
- On NYUDv2 and SUN RGB-D, GeminiFusion achieves higher Pixel Acc., mAcc, and mIoU than TokenFusion when fusing RGB+D, with notable mIoU gains (e.g., +2.6 points on NYUDv2 and +1.3 points on SUN RGB-D with MiT-B3).
- In image-to-image translation on Taskonomy, GeminiFusion yields better FID/KID and MAE/MSE metrics than TokenFusion in multiple modality-pair tasks (e.g., Shade+Texture→RGB: 41.32 FID vs 47.31; Depth+Normal→RGB: 96.98 vs 103.87).
- In 3D object detection on KITTI, adding GeminiFusion to MVX-Net gives small but consistent improvements in 3D AP_R11 / AP_R40 across the easy, medium, and hard settings.
- Ablation studies show an effective relation discriminator (1x1 conv + Softmax) and learnable, layer-specific noise improve cross-modal attention balance and performance.
- GeminiFusion enables near-unimodal efficiency by preserving unimodal skip connections and achieving linear complexity relative to token count, with substantial FLOPs reductions over full cross-attention.
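The efficiency claim can be made concrete with a rough operation count. The sketch below compares multiply-accumulate counts for the attention step only (score computation plus weighted sum of values), ignoring projections, multi-head splitting, and everything else in the network. The function names and the simplified cost model are assumptions for illustration and are not meant to reproduce the FLOPs figures quoted earlier.

```python
def full_cross_attention_macs(n_tokens, dim):
    """Every query attends to all n_tokens keys of the other modality."""
    scores = n_tokens * n_tokens * dim        # Q K^T
    weighted_sum = n_tokens * n_tokens * dim  # attention weights times V
    return scores + weighted_sum


def pixelwise_attention_macs(n_tokens, dim, keys_per_pixel=2):
    """Every query attends to a fixed, small set of keys at its own position."""
    scores = n_tokens * keys_per_pixel * dim
    weighted_sum = n_tokens * keys_per_pixel * dim
    return scores + weighted_sum


# Doubling the token count doubles the pixel-wise cost but quadruples the full cost.
for n in (1024, 2048, 4096):
    full = full_cross_attention_macs(n, 64)
    pixel = pixelwise_attention_macs(n, 64)
    print(n, full, pixel, full // pixel)
```

Under this model the ratio between the two costs is `n_tokens / keys_per_pixel`, so the advantage of pixel-wise fusion widens as feature maps get larger.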

This review was generated by AI and reviewed by human editors.