QUICK REVIEW

[Paper Review] Vision-Language Models as Success Detectors

Yuqing Du, Ksenia Konyushkova|arXiv (Cornell University)|Mar 13, 2023

Multimodal Machine Learning Applications | Citations: 11

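One-line summary

Summary: This paper proposes SuccessVQA, which fine-tunes Flamingo (a vision-language model) and recasts success detection as a visual question answering task, enabling zero-shot generalisation across language and visual variations.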

ABSTRACT

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real world success detection and reward modelling.

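Research motivation and objectives

Motivate robust, generalisable success detectors as reward signals and evaluation metrics for agents.
Leverage large pretrained vision-language models to generalise across language and visual variations.
Unify success detection across diverse domains under a single training framework.
Demonstrate the benefits of SuccessVQA on the simulated IA Playroom, robotic manipulation, and the Ego4D dataset.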

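Proposed method

Formulate success detection as a visual question answering (VQA) task, called SuccessVQA.
Fine-tune Flamingo (3B), updating the vision modules while keeping the language modules frozen.
Build the SuccessVQA dataset by splitting trajectories into clips and annotating the point of success from human labels.
Generate questions from task templates or narrations, and label each answer Yes/No according to the success frame.
Evaluate in-distribution and out-of-distribution language and visual variations across the three domains.
Compare against bespoke, domain-specific success detectors as baselines.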
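
The dataset construction steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `SuccessVQAExample`, `make_question`, and `build_examples`, the question template, the clip length and stride, and the rule that a clip is labelled "yes" once its last frame reaches the annotated success frame are all assumptions made for the example. Frames are represented by integer indices in place of real images.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SuccessVQAExample:
    frames: List[int]  # frame indices of the clip (stand-in for real images)
    question: str      # natural-language success question
    answer: str        # "yes" or "no"


def make_question(task: str) -> str:
    # Hypothetical template; the wording used in the paper may differ.
    return f"Did the agent successfully {task}?"


def build_examples(
    num_frames: int,
    task: str,
    success_frame: Optional[int],
    clip_len: int = 4,
    stride: int = 4,
) -> List[SuccessVQAExample]:
    """Split one trajectory into clips and attach a Yes/No answer to each."""
    question = make_question(task)
    examples = []
    for start in range(0, num_frames - clip_len + 1, stride):
        clip = list(range(start, start + clip_len))
        # Assumed labelling rule: the clip is positive if its last frame is at
        # or after the human-annotated success frame. A trajectory with no
        # annotated success (success_frame is None) yields only negatives.
        succeeded = success_frame is not None and clip[-1] >= success_frame
        examples.append(
            SuccessVQAExample(clip, question, "yes" if succeeded else "no")
        )
    return examples


examples = build_examples(12, "lift the red block", success_frame=6)
for ex in examples:
    print(ex.frames, ex.question, ex.answer)
```

Each resulting (clip, question, answer) triple has the shape a VQA model could be fine-tuned on; changing the task string or swapping in a narration changes only the question, which is what lets one framework cover several domains.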

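Experimental results

Research questions

RQ1: Can a Flamingo-based success detector generalise to unseen language describing the task?
RQ2: How robust is SuccessVQA to unseen visual variations (camera viewpoint, distractors) in robotics and real-world settings?
RQ3: Does SuccessVQA outperform bespoke reward models in out-of-distribution scenarios?
RQ4: Can SuccessVQA handle success detection on in-the-wild egocentric video data?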

Key findings

Model	Test 1 (unseen episodes)	Test 2 (unseen behaviour)	Test 3 (unseen tasks)
Bespoke SD	80.6%	85.4%	49.9%
FT Flamingo 3B	83.4%	85.0%	59.3%

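In the IA Playroom, fine-tuned Flamingo is on par with the bespoke SD on unseen episodes (83.4% vs 80.6%) and unseen behaviour (85.0% vs 85.4%).
On unseen tasks, FT Flamingo 3B outperforms the bespoke model by about 10 points in episode-level accuracy (59.3% vs 49.9%).
The Flamingo-based detector is more robust than the bespoke model to viewpoint changes and distractors, in many cases staying within a few percentage points of its Test 1 accuracy.
Initial Ego4D experiments show that the task is very difficult, but they point to a promising direction towards more real-world success detection.
Across domains, a single multimodal backbone achieves performance competitive with domain-specific reward models, with minimal domain-specific changes.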
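
As a quick check on the comparison above, the per-split gap between the two models can be recomputed from the table. This is only illustrative arithmetic on the reported numbers; the helper name `accuracy_gaps` and the dictionary layout are assumptions made for the example.

```python
# Accuracies (%) copied from the table above.
results = {
    "Bespoke SD": {"Test 1": 80.6, "Test 2": 85.4, "Test 3": 49.9},
    "FT Flamingo 3B": {"Test 1": 83.4, "Test 2": 85.0, "Test 3": 59.3},
}


def accuracy_gaps(results, model, baseline):
    """Return model accuracy minus baseline accuracy, in points, per split."""
    return {
        split: round(results[model][split] - results[baseline][split], 1)
        for split in results[model]
    }


gaps = accuracy_gaps(results, "FT Flamingo 3B", "Bespoke SD")
print(gaps)  # {'Test 1': 2.8, 'Test 2': -0.4, 'Test 3': 9.4}
```

The two in-distribution splits differ by under 3 points in either direction, while the unseen-task split shows the 9.4-point advantage summarised as "about 10 points".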

This review was generated by AI and checked by a human editor.