QUICK REVIEW

[Paper Digest] Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation

Wonju Lee, Matteo Grimaldi|arXiv (Cornell University)|Feb 14, 2026

Advanced Sensor and Energy Harvesting Materials | Cited by 0

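One-Sentence Summary

This paper proposes a Cross-Modal Transformer (CMT) with a physics-informed bilateral force regularizer that fuses vision and touch for robotic insertion, approaching the performance of privileged sensing.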

ABSTRACT

Insertion tasks in robotic manipulation demand precise, contact-rich interactions that vision alone cannot resolve. While tactile feedback is intuitively valuable, existing studies have shown that naïve visuo-tactile fusion often fails to deliver consistent improvements. In this work, we propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion that integrates wrist-camera observations with tactile signals through structured self- and cross-attention. To stabilize tactile embeddings, we further introduce a physics-informed regularization that encourages bilateral force balance, reflecting principles of human motor control. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate, surpassing naïve and gated fusion baselines and closely matching the privileged "wrist + contact force" configuration (96.09%). These results highlight two central insights: (i) tactile sensing is indispensable for precise alignment, and (ii) principled multimodal fusion, further strengthened by physics-informed regularization, unlocks complementary strengths of vision and touch, approaching privileged performance under realistic sensing.

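Motivation and Objectives

Enable robust robotic insertion, which requires both global visual alignment and local tactile feedback.
Propose structured visuo-tactile fusion that exploits the complementary strengths of the two modalities.
Use physics-informed bilateral force symmetry as a regularizer to stabilize tactile embeddings.
Show that symmetry-aware fusion approaches privileged performance under realistic sensing.
Provide a reproducible methodology and code for benchmarking on TacSL-style tasks.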

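Proposed Method

Develop a Cross-Modal Transformer that fuses visual and tactile features through layered self-attention and cross-attention.
Encode residual tactile signals under a bilateral symmetry regularizer that aligns the forces between the left and right fingers.
In cross-attention, use vision as the query and touch as the key/value to obtain structured visuo-tactile fusion.
Introduce a physics-informed auxiliary loss that encourages bilateral force balance between the left and right tactile channels.
Train the policy with PPO, combining the PPO objective with the symmetry regularization term.
Evaluate on TacSL-style insertion tasks, comparing naïve, gated, and CMT fusion variants with and without the symmetry prior.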
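
The method steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the single-head attention omits learned projections, and `symmetry_loss` (a mean squared difference between left- and right-finger force estimates) is a hypothetical form of the bilateral penalty, since the digest does not give the exact formula. The names `cross_attention`, `symmetry_loss`, `total_loss`, and `lambda_sym` are assumptions chosen for illustration; `lambda_sym` mirrors the λ_sym setting reported in the results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(vision_tokens, tactile_tokens):
    """Single-head scaled dot-product attention without learned projections.

    Vision tokens act as queries; tactile tokens act as keys and values,
    so each output row is a weighted mix of tactile tokens.
    """
    dim = len(tactile_tokens[0])
    fused = []
    for query in vision_tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tactile_tokens
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * value[i] for w, value in zip(weights, tactile_tokens))
            for i in range(dim)
        ])
    return fused


def symmetry_loss(left_force, right_force):
    """Hypothetical bilateral penalty: mean squared left/right force gap."""
    gaps = [(l - r) ** 2 for l, r in zip(left_force, right_force)]
    return sum(gaps) / len(gaps)


def total_loss(ppo_loss, left_force, right_force, lambda_sym=1.0):
    """PPO objective plus the weighted symmetry term (assumed additive)."""
    return ppo_loss + lambda_sym * symmetry_loss(left_force, right_force)
```

Setting `lambda_sym=0.0` recovers the plain PPO loss, which corresponds to the λ_sym=0 variants in the comparison.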

Figure 1: Comparison of observation modalities for robotic insertion policies. Left : Vision-only input provides global alignment cues but lacks local precision. Center : Tactile-only input encodes fine-grained force signals critical for corrective actions. Right : Visuo-tactile fusion integrates co

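Experimental Results

Research Questions

RQ1: Can a Cross-Modal Transformer effectively fuse visual and tactile data for robust robotic insertion?
RQ2: Does physics-informed bilateral force symmetry regularization stabilize tactile embeddings and improve insertion?
RQ3: How close does visuo-tactile fusion come to privileged wrist + force sensing on insertion tasks?
RQ4: How does symmetry regularization affect training stability and generalization across seeds?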

Key Findings

Method	Privileged	Simplified	Contact force	Wrist	Tactile	Success rate (%)
Privileged	✓	-	-	-	-	96.74 ± 1.63
+ Contact force	✓	-	✓	-	-	98.96 ± 0.83 (+2.22)
Tactile	-	✓	-	-	✓	91.41 ± 5.51
Wrist	-	✓	-	✓	-	93.23 ± 2.00
Wrist + Contact force	-	✓	✓	✓	-	96.09 ± 1.41 (+2.86)
Fusion - Naïve [12]	-	✓	-	✓	✓	92.97 ± 1.41
Fusion - Gated (λ_sym=0)	-	✓	-	✓	✓	94.53 ± 2.73 (+1.56)
Fusion - CMT (λ_sym=0)	-	✓	-	✓	✓	96.22 ± 0.98 (+3.25)
Fusion - Gated + symmetry regularization (λ_sym=1)	-	✓	-	✓	✓	95.05 ± 1.76 (+2.08)
Fusion - CMT + symmetry regularization (λ_sym=1)	-	✓	-	✓	✓	96.59 ± 2.11 (+3.62)
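
The parenthesized gains in the table are not explained in this digest. A short arithmetic check suggests they are differences from a reference row: the privileged row for "+ contact force", the wrist row for "wrist + contact force", and naïve fusion for the other fusion variants. The `reference` mapping below is therefore an inference from the numbers, not something the source states, and the dictionary keys are illustrative labels.

```python
# Mean success rates (%) copied from the table above.
success = {
    "privileged": 96.74,
    "privileged_contact": 98.96,
    "wrist": 93.23,
    "wrist_contact": 96.09,
    "fusion_naive": 92.97,
    "fusion_gated": 94.53,
    "fusion_cmt": 96.22,
    "fusion_gated_sym": 95.05,
    "fusion_cmt_sym": 96.59,
}

# Assumed reference row for each reported gain (inferred, not stated).
reference = {
    "privileged_contact": "privileged",
    "wrist_contact": "wrist",
    "fusion_gated": "fusion_naive",
    "fusion_cmt": "fusion_naive",
    "fusion_gated_sym": "fusion_naive",
    "fusion_cmt_sym": "fusion_naive",
}


def gain(method):
    """Difference in mean success rate versus the assumed reference row."""
    return round(success[method] - success[reference[method]], 2)


for method in reference:
    print(f"{method}: +{gain(method):.2f}")
```

Under these assumed references, every computed difference matches the value printed in the table.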

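In the simplified setting, visuo-tactile fusion with CMT reaches a 96.22% success rate, closely matching the privileged wrist + contact force configuration (96.09%).
Symmetry regularization further improves both the gated architecture (94.53% to 95.05%) and CMT (96.22% to 96.59%).
Adding contact force improves performance wherever it is available (privileged: +2.22; wrist: +2.86), and the tactile-only policy is also strong on its own (91.41%).
A gap remains between naïve fusion (92.97%) and the best configurations, whereas structured CMT fusion substantially narrows the gap to privileged sensing.
CMT offers a favorable trade-off between computation and performance, supporting real-time operation while outperforming the naïve and gated fusion baselines.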

Figure 2: Overview of visuo-tactile fusion architectures. (a) Naïve concatenation of embeddings, which risks diluting modality-specific signals. (b) Gated fusion with linear layers that adaptively weight neuronal contributions. (c) The proposed Cross-Modal Transformer (CMT), which embeds symmetry-aw
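
To make the two baseline architectures in the caption concrete, here is a minimal sketch of naïve concatenation and of one common form of gated fusion. The gate formula (a sigmoid over a linear layer applied to the concatenated embeddings, used as a per-dimension blend weight) is an assumption; the digest does not specify the exact gating used in the paper, and `naive_fusion`, `gated_fusion`, `weights`, and `bias` are hypothetical names.

```python
import math


def naive_fusion(vision, tactile):
    """Naive fusion: concatenate the two embeddings unchanged."""
    return list(vision) + list(tactile)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(vision, tactile, weights, bias):
    """Assumed gated fusion: per-dimension blend of vision and touch.

    weights: one row per output dimension, each of length
             len(vision) + len(tactile); bias: one value per dimension.
    A gate near 1 favors vision; a gate near 0 favors touch.
    """
    joint = naive_fusion(vision, tactile)
    gates = [
        sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]
    return [g * v + (1.0 - g) * t for g, v, t in zip(gates, vision, tactile)]
```

With all-zero weights and biases every gate equals 0.5, so the output is the plain average of the two embeddings; a trained linear layer would instead shift each gate toward whichever modality is more informative for that dimension.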


This digest was generated by AI and reviewed by a human editor.