QUICK REVIEW

[Paper Review] A Low-Cost Vision-Based Tactile Gripper with Pretraining Learning for Contact-Rich Manipulation

Yaohua Liu, Binkai Ou|arXiv (Cornell University)|Jan 31, 2026

Advanced Sensor and Energy Harvesting Materials|Cited by 0

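One-sentence summary

The paper presents LVTG, a low-cost visuo-tactile gripper with a modular skin that combines CLIP-inspired cross-modal pretraining with an ACT-based policy to improve contact stability, durability, and learning efficiency in contact-rich manipulation, outperforming vision-only baselines.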

ABSTRACT

Robotic manipulation in contact-rich environments remains challenging, particularly when relying on conventional tactile sensors that suffer from limited sensing range, reliability, and cost-effectiveness. In this work, we present LVTG, a low-cost visuo-tactile gripper designed for stable, robust, and efficient physical interaction. Unlike existing visuo-tactile sensors, LVTG enables more effective and stable grasping of larger and heavier everyday objects, thanks to its enhanced tactile sensing area and greater opening angle. Its surface skin is made of highly wear-resistant material, significantly improving durability and extending operational lifespan. The integration of vision and tactile feedback allows LVTG to provide rich, high-fidelity sensory data, facilitating reliable perception during complex manipulation tasks. Furthermore, LVTG features a modular design that supports rapid maintenance and replacement. To effectively fuse vision and touch, we adopt a CLIP-inspired contrastive learning objective to align tactile embeddings with their corresponding visual observations, enabling a shared cross-modal representation space for visuo-tactile perception. This alignment improves the performance of an Action Chunking Transformer (ACT) policy in contact-rich manipulation, leading to more efficient data collection and more effective policy learning. Compared to the original ACT method, the proposed LVTG with pretraining achieves significantly higher success rates in manipulation tasks.

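Research motivation and goals

Develop a low-cost visuo-tactile gripper with an enlarged sensing area and a modular, replaceable design for robust manipulation in harsh environments.
Align tactile and visual embeddings through CLIP-inspired contrastive learning.
Combine pretrained tactile representations with an ACT policy to improve data efficiency and policy learning on contact-rich tasks.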

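Proposed method

Design LVTG as a two-finger parallel gripper with modular, replaceable tactile skins, at a cost of about $12 per finger.
Build a robust optical tactile skin by molding transparent silicone directly onto a treated acrylic plate, giving a one-piece, wear-resistant surface.
Process tactile images in three steps: fisheye distortion correction, ROI extraction, and illumination/contrast enhancement of the tactile signal.
Align tactile embeddings with visual observations through CLIP-style contrastive learning, using a shared backbone and a memory-bank negative sampling strategy.
Pretrain the tactile encoder on 5,000 visuo-tactile trajectories, then learn the policy with an Action Chunking Transformer (ACT) that takes fused visuo-tactile features.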
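
To make the contrastive alignment step more concrete, here is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired tactile and visual embeddings. It is an illustration only, not the authors' implementation: it uses in-batch negatives rather than the memory-bank sampling the paper describes, the function names are hypothetical, and the temperature of 0.07 is an assumed value that this review does not report.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against zero vectors
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(tactile_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss; temperature is an assumed hyperparameter."""
    t = [l2_normalize(e) for e in tactile_emb]
    v = [l2_normalize(e) for e in visual_emb]
    # logits[i][j]: scaled cosine similarity between tactile i and visual j.
    logits = [[sum(a * b for a, b in zip(ti, vj)) / temperature for vj in v]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # visual-to-tactile direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs: correctly paired embeddings vs. swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each tactile embedding is closest to its own visual observation and high when the pairs are mismatched, which is the property that pulls the two modalities into a shared representation space.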

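Experimental results

Research questions

RQ1: Does LVTG improve grasp stability and reliability over existing visuo-tactile sensors?
RQ2: Is LVTG durable over long-term use and easy to replace?
RQ3: Does tactile feedback improve policy learning for contact-rich manipulation, and how does cross-modal learning affect performance?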

Key findings

Vision-based tactile sensor	Wine bottle grasping (%)	Plate grasping (%)	USB insertion and removal (%)	Average (%)
GelSlim	85	81	76	81
DIGIT	80	73	75	76
LVTG	92	89	73	85
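
As a quick sanity check on the table above, the sketch below recomputes the average column from the three per-task scores and finds the best sensor per task. The numbers are copied from the table; the assumption that the average column is the rounded arithmetic mean is ours, and the variable names are illustrative.

```python
# Per-task success rates (%) copied from the table above, in column order.
tasks = ["wine bottle", "plate", "USB insertion/removal"]
results = {
    "GelSlim": [85, 81, 76],
    "DIGIT": [80, 73, 75],
    "LVTG": [92, 89, 73],
}

# Assumed: the reported average is the arithmetic mean rounded to an integer.
averages = {name: round(sum(s) / len(s)) for name, s in results.items()}

# Sensor with the highest success rate on each task.
best_per_task = {
    task: max(results, key=lambda name: results[name][i])
    for i, task in enumerate(tasks)
}

print(averages)
print(best_per_task)
```

Under that assumption the recomputed averages match the table, and the per-task comparison shows LVTG leading on the two grasping tasks while GelSlim leads on USB insertion and removal.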

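LVTG achieves the highest success rates on grasping tasks that involve large contact areas and heavier objects: 92% for wine bottle grasping and 89% for plate grasping. On USB insertion and removal it scores 73%, slightly below GelSlim (76%) and DIGIT (75%), for an average of 85% across the three tasks.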
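Durability tests show that LVTG lasts more than twice as long as 9DTact, and its modular design allows quick replacement (<30 seconds).
Policy experiments show that ACT with tactile input and pretraining achieves higher average success rates across several tasks (55-63%) than the vision-only baseline (29-31%), with pretraining providing a further gain.
LVTG's larger sensing area (80 x 30 mm, 2400 mm^2) and one-piece skin give it advantages in stability and durability over fragile gel designs.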


This review was generated by AI and reviewed by human editors.