QUICK REVIEW

[Paper Review] PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Soroush Nasiriany, Fei Xia | arXiv (Cornell University) | Feb 12, 2024

Elevator Systems and Control | Citations: 5

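One-line summary

PIVOT recasts spatial problems as iterative visual question answering, enabling zero-shot robot control with vision-language models. It annotates the image with candidate actions and refines them over multiple iterations, with no task-specific training.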

ABSTRACT

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require outputting continuous coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data? In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights potentials and limitations of this new regime and shows a promising approach for Internet-Scale VLMs in robotic and spatial reasoning domains. Website: pivot-prompt.github.io and HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.

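Motivation and objectives

Explore zero-shot spatial reasoning and control with vision-language models (VLMs), without task-specific fine-tuning.
Frame spatial tasks (robot control, localization) as visual QA problems that can be solved with iterative visual prompting.
Assess the limitations and potential of VLMs on real-world robotics and spatial reasoning tasks.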

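Proposed method

Represent continuous actions and spatial concepts as visual annotations in image space, each tagged with a text label.
Iteratively sample action proposals, annotate them on the image, and query the VLM to rank the promising options.
Update the action proposal distribution by fitting it to the best actions, and repeat until convergence.
Use a parallel-call strategy that aggregates multiple PIVOT runs to improve robustness.
Ground the VLM's output in continuous actions through visual prompts and the VLM's text reasoning (e.g., GPT-4V), without fine-tuning.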
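
The sample, annotate, query, and refit loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian proposal distribution, the 2D action space, every parameter value, and the averaging used to aggregate parallel runs are choices made here for clarity. The function `mock_vlm_select` is a hypothetical stand-in that picks the candidates nearest a hidden target; a real system would send the annotated image and a text prompt to a VLM and parse the labels it returns.

```python
import random
import statistics


def mock_vlm_select(candidates, target, k):
    """Hypothetical stand-in for the VLM query.

    Returns the indices of the k candidates closest to a hidden target.
    A real system would show the annotated image to a VLM instead.
    """
    def sq_dist(i):
        return ((candidates[i][0] - target[0]) ** 2
                + (candidates[i][1] - target[1]) ** 2)

    return sorted(range(len(candidates)), key=sq_dist)[:k]


def pivot_once(target, rng, iterations=4, num_samples=20, num_keep=5):
    """One PIVOT-style run over 2D actions (assumed Gaussian proposals)."""
    mean = [0.0, 0.0]
    std = [1.0, 1.0]
    for _ in range(iterations):
        # 1. Sample candidate actions from the current proposal distribution.
        candidates = [
            (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
            for _ in range(num_samples)
        ]
        # 2. A real system would draw labeled markers for these candidates
        #    on the camera image here; the mock skips the drawing step.
        # 3. Ask the (mock) VLM which labeled candidates look best.
        picked = mock_vlm_select(candidates, target, num_keep)
        chosen = [candidates[i] for i in picked]
        # 4. Refit the proposal distribution to the selected candidates.
        for d in range(2):
            values = [c[d] for c in chosen]
            mean[d] = statistics.fmean(values)
            # Floor the spread so later rounds can still sample.
            std[d] = max(statistics.pstdev(values), 1e-3)
    return (mean[0], mean[1])


def pivot_parallel(target, runs=4, seed=0):
    """Aggregate several independent runs by averaging (an assumption)."""
    rng = random.Random(seed)
    results = [pivot_once(target, rng) for _ in range(runs)]
    return tuple(
        statistics.fmean([r[d] for r in results]) for d in range(2)
    )


if __name__ == "__main__":
    print(pivot_parallel(target=(0.5, -0.3)))
```

Averaging is only one way to combine runs; a second VLM query that chooses among the per-run results would be an equally plausible aggregation step.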

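Experimental results

Research questions

RQ1: How well can PIVOT perform zero-shot robot navigation and manipulation across different embodiments?
RQ2: Can PIVOT achieve accurate visual grounding and spatial reasoning on tasks such as RefCOCO in a zero-shot setting?
RQ3: How do text prompts, visual prompts, and iterative optimization each affect performance?
RQ4: What are the limitations of current VLMs in 3D understanding, interaction, and fine-grained control within PIVOT?
RQ5: How does PIVOT scale with larger, more capable VLMs?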

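Key findings

PIVOT achieves non-zero success on real-world navigation and manipulation without any robot training data.
Iterative optimization and parallel calls improve success rate and efficiency.
Zero-shot visual grounding on RefCOCO performs well even in early iterations.
Offline ablations show that text prompts, visual prompts, and iteration each contribute to performance, and that combining them gives the best results.
Scaling to larger Gemini models consistently improves performance on the manipulation and navigation tasks.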

This review was generated by AI and checked by a human editor.