QUICK REVIEW

[Paper Review] The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?

Roberto Calandra, Andrew Owens|arXiv (Cornell University)|Oct 16, 2017

Robot Manipulation and Learning|References: 34|Citations: 109

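One-sentence summary

An end-to-end visuo-tactile deep neural network predicts grasp outcomes; combining vision and touch substantially improves both outcome prediction accuracy and real-world grasping performance.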

ABSTRACT

A successful grasp requires careful balancing of the contact forces. Deducing whether a particular grasp will be successful from indirect measurements, such as vision, is therefore quite challenging, and direct sensing of contacts through touch sensing provides an appealing avenue toward more successful and consistent robotic grasping. However, in order to fully evaluate the value of touch sensing for grasp outcome prediction, we must understand how touch sensing can influence outcome prediction accuracy when combined with other modalities. Doing so using conventional model-based techniques is exceptionally difficult. In this work, we investigate the question of whether touch sensing aids in predicting grasp outcomes within a multimodal sensing framework that combines vision and touch. To that end, we collected more than 9,000 grasping trials using a two-finger gripper equipped with GelSight high-resolution tactile sensors on each finger, and evaluated visuo-tactile deep neural network models to directly predict grasp outcomes from either modality individually, and from both modalities together. Our experimental results indicate that incorporating tactile readings substantially improves grasping performance.

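Motivation and objectives

Motivate multimodal perception for robotic grasping that combines vision and touch.
Evaluate whether touch sensing improves grasp outcome prediction beyond what vision alone provides.
Develop an end-to-end neural network that processes visual and tactile inputs to predict grasp success.
Quantitatively compare single-modality and multimodal models on outcome prediction and on real-world grasping performance.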

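Proposed method

Collect more than 9,000 grasping trials with a two-finger gripper equipped with GelSight sensors.
Train an end-to-end CNN model that predicts grasp success from RGB images and GelSight images.
Fuse the visual and tactile features late in the network and feed them to a fully connected classifier.
Use two time points for vision, before and during the grasp, and use the GelSight temporal difference (I_Tb - I_Ta) as the tactile input.
Pretrain the vision and tactile CNNs on ImageNet and fine-tune them during training.
Evaluate the models on a cross-object split and compare single-modality and multimodal performance.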
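
The cross-object evaluation in the last step can be illustrated with a minimal sketch. It assumes each trial is a dict carrying an `object_id` field; that field name, the `cross_object_split` helper, the 20% test fraction, and the toy data are all hypothetical and are not taken from the paper. The point is only that whole objects, rather than individual trials, are assigned to the training or test set, so test accuracy reflects generalization to unseen objects.

```python
import random
from collections import defaultdict


def cross_object_split(trials, test_fraction=0.2, seed=0):
    """Split grasp trials so that no object appears in both train and test.

    `trials` is a list of dicts with a hypothetical "object_id" key.
    """
    # Group trials by the object that was grasped.
    by_object = defaultdict(list)
    for trial in trials:
        by_object[trial["object_id"]].append(trial)

    # Shuffle the object IDs (not the trials) reproducibly.
    object_ids = sorted(by_object)
    random.Random(seed).shuffle(object_ids)

    # Hold out a fraction of the objects, at least one.
    n_test = max(1, round(len(object_ids) * test_fraction))
    test = [t for oid in object_ids[:n_test] for t in by_object[oid]]
    train = [t for oid in object_ids[n_test:] for t in by_object[oid]]
    return train, test


# Toy data (made up): 10 objects, 5 trials each, arbitrary success labels.
trials = [
    {"object_id": f"obj{i}", "trial": j, "success": (i + j) % 2 == 0}
    for i in range(10)
    for j in range(5)
]

train, test = cross_object_split(trials, test_fraction=0.2, seed=0)
train_objects = {t["object_id"] for t in train}
test_objects = {t["object_id"] for t in test}
print(len(train), len(test), train_objects.isdisjoint(test_objects))
```

A random split over individual trials would instead let the same object appear on both sides, which can inflate test accuracy when the model memorizes object appearance.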

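Experimental results

Research questions

RQ1: Does touch sensing improve grasp outcome prediction compared with vision alone?
RQ2: Does the multimodal visuo-tactile model outperform single-modality models at outcome prediction?
RQ3: How does the visuo-tactile model perform at real-world grasp selection on unseen objects?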

Key findings

Model	Test accuracy (%)
Tactile + vision	77.8±0.3
Vision only	68.8±1.0
Vision + gripper pose	68.8±1.3
Depth	73.2±0.7
Tactile (both)	75.6±0.8
Tactile (GelSight L)	75.3±1.4
Tactile (GelSight R)	73.8±1.7
Indentation features	72.7±0.8
Chance	61.8±1.9

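Tactile models outperform vision models at grasp outcome prediction.
The multimodal visuo-tactile model achieves the highest test accuracy, 77.8±0.3%.
The vision-only, depth, and single-sensor tactile models reach lower accuracy (e.g., vision only 68.8±1.0%, depth 73.2±0.7%).
Using both GelSight sensors (Tactile (both)) gives 75.6±0.8%, while GelSight L and R alone differ slightly (75.3±1.4% and 73.8±1.7%).
Hand-engineered indentation features reach 72.7±0.8%, below the end-to-end tactile model with both sensors (75.6±0.8%), which shows the benefit of end-to-end learning, although hand-engineered features remain competitive on small datasets.
In real-world grasping on unseen objects, the visuo-tactile model achieves a success rate 14 percentage points higher than the vision-only model (94% vs. 80%).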
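
To make the comparisons easier to check, here is a small sketch that recomputes each model's gain over the vision-only baseline from the mean test accuracies in the table above. The `gain_over` helper and the English model labels are illustrative choices, not from the paper. The ± values are left out on purpose, because the review does not say whether they are standard deviations or standard errors.

```python
# Mean test accuracies (%) copied from the results table in this review.
accuracy = {
    "Tactile + vision": 77.8,
    "Vision only": 68.8,
    "Vision + gripper pose": 68.8,
    "Depth": 73.2,
    "Tactile (both)": 75.6,
    "Tactile (GelSight L)": 75.3,
    "Tactile (GelSight R)": 73.8,
    "Indentation features": 72.7,
    "Chance": 61.8,
}


def gain_over(baseline, table=accuracy):
    """Return each model's gain over `baseline` in percentage points."""
    base = table[baseline]
    return {
        name: round(acc - base, 1)
        for name, acc in table.items()
        if name != baseline
    }


gains = gain_over("Vision only")

# Print models from largest to smallest gain over vision only.
for name, delta in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {delta:+.1f}")

# Real-world grasp success on unseen objects: 94% vs. 80%.
real_world_gain = 94 - 80
print("Real-world gain (points):", real_world_gain)
```

Read this way, adding touch to vision is worth 9.0 points of prediction accuracy, while adding vision to the two-sensor tactile model is worth 2.2 points (77.8 vs. 75.6).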


This review was generated by AI and checked by a human editor.